Add deduplication logic #1
base: main
Conversation
Thanks! Mind sharing some more stats on the real dataset?
The code is quite long to review, but at least the above should give us some more confidence before merge.
```python
# For leaderboard mode with successful runs, prefer higher scores
if run_mode == 'leaderboard' and row.get('run_passed') == True:
    if row.get('run_score', 0) > existing_row.get('run_score', 0):
        unique_entries[content_hash] = row
```
I think scores are still lower = better; run.duration is the end-to-end wallclock time for the entire run, including, e.g., testing code, whereas score is the geomean of all benchmarks
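If lower scores are indeed better, the comparison in the snippet above would presumably need to flip. A minimal sketch, assuming the same `row` / `existing_row` dictionaries and `unique_entries` map from the diff (not the PR's actual fix):

```python
# For leaderboard mode with successful runs, prefer LOWER scores
# (lower run_score is better: it is the geomean across benchmarks).
if run_mode == 'leaderboard' and row.get('run_passed') is True:
    new_score = row.get('run_score', float('inf'))
    old_score = existing_row.get('run_score', float('inf'))
    if new_score < old_score:
        unique_entries[content_hash] = row
```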
oops, I see. I'll rerun things and reupload
@msaroufim for the first two questions we have these:
✓ Loaded submissions.parquet: 40,095 entries
@msaroufim https://www.diffchecker.com/KamzTAeT/ (I think this one is more similar) and https://www.diffchecker.com/MtK5pbWL/ show an entry that was removed due to deduplication (on the right), compared to two other entries that remained in the dataset. We can be a bit less aggressive with the deduping, as they look sort of different.
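One way to make the deduplication less aggressive would be to gate it on a similarity threshold rather than only an exact content hash. The actual `dedup` module isn't shown in this thread, so this is just a hypothetical sketch using `difflib`:

```python
from difflib import SequenceMatcher


def is_near_duplicate(code_a: str, code_b: str, threshold: float = 0.95) -> bool:
    """Treat two submissions as duplicates only when their similarity ratio
    meets the threshold; raising the threshold keeps more borderline entries."""
    return SequenceMatcher(None, code_a, code_b).ratio() >= threshold
```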
This pull request introduces deduplication functionality to the `export.py` script and updates the documentation to include testing instructions. The key changes include integrating a deduplication module, handling deduplicated datasets, and providing a detailed test setup for verifying the deduplication logic. Specifically, the deduplication functionality does three things.

Deduplication Integration:
Integrated the `dedup_df` function from the `dedup` module to deduplicate submissions in the dataset. The deduplicated data is saved as a separate Parquet file, and the process is logged with details about the number of records before and after deduplication. (`export.py` [1] [2], `export.py` R235-R243)
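As a rough illustration of how such an integration might look (not the exact code from `export.py`; `dedup_df` is assumed here to take and return a pandas DataFrame):

```python
import logging

import pandas as pd

from dedup import dedup_df  # assumed import path for the deduplication module


def export_with_dedup(in_path: str = "submissions.parquet",
                      out_path: str = "submissions_deduped.parquet") -> pd.DataFrame:
    """Load submissions, deduplicate them, and save the result as a separate Parquet file."""
    df = pd.read_parquet(in_path)
    before = len(df)

    deduped = dedup_df(df)            # drop duplicate submissions
    deduped.to_parquet(out_path)      # deduplicated data saved separately

    logging.info("Deduplication: %d records before, %d after (%d removed)",
                 before, len(deduped), before - len(deduped))
    return deduped
```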
Documentation Updates:
Added testing instructions to the `README.md` file, describing how to test the deduplication scripts using a fake dataset with various features such as exact duplicates, fuzzy duplicates, and realistic structured data. (`README.md`, `README.md` R37-R51)
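For context, here is a hypothetical sketch of the kind of fake test dataset those instructions describe (exact duplicates, fuzzy duplicates, and realistically structured rows). The field names `user_name` and `code` are made up for illustration, while `run_mode`, `run_passed`, and `run_score` come from the diff above; the real fixture may differ:

```python
import pandas as pd

base = {
    "user_name": "alice",                          # hypothetical field
    "run_mode": "leaderboard",
    "run_passed": True,
    "run_score": 1.23,
    "code": "def kernel(x):\n    return x * 2\n",  # hypothetical field
}

rows = [
    base,                                                      # original entry
    dict(base),                                                # exact duplicate
    {**base, "code": base["code"] + "# trailing comment\n"},   # fuzzy duplicate
    {**base, "user_name": "bob", "run_score": 0.98,
     "code": "def kernel(x):\n    return x + x\n"},            # genuinely different entry
]

pd.DataFrame(rows).to_parquet("fake_submissions.parquet")
```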