Add deduplication logic #1

Open · wants to merge 3 commits into main

Conversation

@PaliC (Collaborator) commented Jun 23, 2025

This pull request introduces deduplication functionality to the export.py script and updates the documentation to include testing instructions. The key changes include integrating a deduplication module, handling deduplicated datasets, and providing a detailed test setup for verifying the deduplication logic.

The deduplication functionality specifically does three things:

  1. It bins the submissions by success, run mode, and score (if available) or duration of the run. We then deduplicate within these bins, as minhash takes a while (deduping this dataset took an hour).
  2. Within each bin, we first dedup using an exact content hash.
  3. We then dedup using minhash LSH (this is what takes a while). You can find an explanation of this process here: https://medium.com/@omkarsoak/from-min-hashing-to-locality-sensitive-hashing-the-complete-process-b88b298d71a1

Deduplication Integration:

  • Added the dedup_df function from the dedup module to deduplicate submissions in the dataset. The deduplicated data is saved as a separate Parquet file, and the process is logged with details about the number of records before and after deduplication. (export.py)
  • Extended the script to create and save a deduplicated version of successful submissions, including logging the count of records pre- and post-deduplication. (export.py, R235–R243)

Documentation Updates:

  • Added a "Tests" section in the README.md file, describing how to test the deduplication scripts using a fake dataset with various features such as exact duplicates, fuzzy duplicates, and realistic structured data. (README.md, R37–R51)

@PaliC PaliC marked this pull request as ready for review June 24, 2025 00:01
@msaroufim (Member)

Thanks! Mind sharing some more stats on the real dataset?

  1. How much data was around before your change
  2. How much data is around after
  3. Some manual vibe checks of kernels that were filtered out would also be nice as a sanity check

The code is quite long to review, but the above should at least give us more confidence before merging.

# For leaderboard mode with successful runs, prefer higher scores
if run_mode == 'leaderboard' and row.get('run_passed') == True:
    if row.get('run_score', 0) > existing_row.get('run_score', 0):
        unique_entries[content_hash] = row

I think scores are still lower = better; run.duration is the end-to-end wallclock time for the entire run, including, e.g., testing code, whereas score is the geomean of all benchmarks

Collaborator Author


oops, I see. I'll rerun things and reupload

PaliC commented Jun 24, 2025

@msaroufim for the first two questions, here are the numbers:

✓ Loaded submissions.parquet: 40,095 entries
✓ Loaded deduplicated_submissions.parquet: 15,238 entries

============================================================
COUNT COMPARISON

Original submissions: 40,095
Deduplicated submissions: 15,238
Removed entries: 24,857
Percentage removed: 62.00%

So it seems like a lot of the kernels are actually similar. I'll rerun things and grab some files for a sanity check

PaliC commented Jun 24, 2025

@msaroufim https://www.diffchecker.com/KamzTAeT/ (I think this one is more similar) and https://www.diffchecker.com/MtK5pbWL/ show an entry that was removed due to deduplication (on the right) compared to two other entries that remained in the dataset. We could be a bit less aggressive with deduping, as they look sort of different.

@PaliC PaliC requested a review from ngc92 June 24, 2025 15:33
3 participants