Add deduplication logic #1

Open · wants to merge 3 commits into main

Conversation

@PaliC (Collaborator) commented Jun 23, 2025

This pull request introduces deduplication functionality to the export.py script and updates the documentation to include testing instructions. The key changes include integrating a deduplication module, handling deduplicated datasets, and providing a detailed test setup for verifying the deduplication logic.

The deduplication functionality specifically does three things:

  1. It bins the submissions by success, run mode, and score (if available) or duration of the run. We then deduplicate within these bins, as minhash takes a while (deduping this dataset took an hour).
  2. Within each bin, we first dedup using an exact content hash.
  3. We then dedup using minhash LSH (this is what takes a while). You can find an explanation of this process here: https://medium.com/@omkarsoak/from-min-hashing-to-locality-sensitive-hashing-the-complete-process-b88b298d71a1

Deduplication Integration:

  • Added the dedup_df function from the dedup module to deduplicate submissions in the dataset. The deduplicated data is saved as a separate Parquet file, and the process is logged with details about the number of records before and after deduplication. (export.py)
  • Extended the script to create and save a deduplicated version of successful submissions, including logging the count of records pre- and post-deduplication. (export.py, R235–R243)

Documentation Updates:

  • Added a "Tests" section in the README.md file, describing how to test the deduplication scripts using a fake dataset with various features such as exact duplicates, fuzzy duplicates, and realistic structured data. (README.md, R37–R51)

@PaliC PaliC marked this pull request as ready for review June 24, 2025 00:01
@msaroufim (Member)

Thanks! Mind sharing some more stats on the real dataset?

  1. How much data was around before your change
  2. How much data is around after
  3. Some manual vibe checks of kernels that were filtered out would also be nice as a sanity check

The code is quite long to review, but the above should at least give us more confidence before merging.

# For leaderboard mode with successful runs, prefer higher scores
if run_mode == 'leaderboard' and row.get('run_passed') == True:
    if row.get('run_score', 0) > existing_row.get('run_score', 0):
        unique_entries[content_hash] = row

I think scores are still lower = better; run.duration is the end-to-end wallclock time for the entire run, including, e.g., testing code, whereas score is the geomean of all benchmarks

Collaborator Author


oops, I see. I'll rerun things and reupload

PaliC commented Jun 24, 2025

@msaroufim for the first two questions, here are the numbers:

✓ Loaded submissions.parquet: 40,095 entries
✓ Loaded deduplicated_submissions.parquet: 15,238 entries

============================================================
COUNT COMPARISON

Original submissions: 40,095
Deduplicated submissions: 15,238
Removed entries: 24,857
Percentage removed: 62.00%

So it seems like a lot of the kernels are actually similar. I'll rerun things and grab some files for a sanity check

PaliC commented Jun 24, 2025

@msaroufim https://www.diffchecker.com/KamzTAeT/ (I think this one is more similar) and https://www.diffchecker.com/MtK5pbWL/ show an entry that was removed due to deduplication (on the right) compared to two other entries that remained in the dataset. We could be a bit less aggressive with deduping, as they look sort of different.

@PaliC PaliC requested a review from ngc92 June 24, 2025 15:33
3 participants