
Now, dpo.py matches dpo_tune_cache.py almost perfectly on the single GPU experiments #1451

Merged
finbarrtimbers merged 59 commits into main from finbarr/dpo-match-single-gpu on Feb 18, 2026
Conversation


@finbarrtimbers commented Feb 1, 2026

Colab with plots, particularly loss curve:

[Screenshot: loss curves, 2026-02-01 9:06 AM]

Changes:

  1. In HFDataLoader, matched the dpo_tune_cache.py iteration order by shuffling with torch.randperm and a seeded torch.Generator.
  2. Fixed attention masking in the dpo.py forward pass by converting padded batches to the packed format. We now pass doc_lens and max_doc_lens, enabling proper intra-document attention masking.
  3. Fixed the weight-loading order by loading weights after parallelization (parallelization was previously re-initializing the weights, whoops).
  4. Used 0-based epoch numbers to match dpo_tune_cache.py.
  5. Used Duration.steps instead of Duration.epochs in dpo.py so we run exactly max_train_steps training steps.
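The seeded-permutation change in (1) can be sketched as follows. `epoch_order` is a hypothetical stand-in for illustration, not the actual HFDataLoader code:

```python
import torch

def epoch_order(num_samples: int, seed: int, epoch: int) -> list[int]:
    # A fresh Generator seeded per epoch gives a reproducible permutation
    # that is independent of any global RNG state consumed elsewhere
    # (e.g. by model initialization).
    g = torch.Generator()
    g.manual_seed(seed + epoch)
    return torch.randperm(num_samples, generator=g).tolist()

order = epoch_order(8, seed=42, epoch=0)
assert sorted(order) == list(range(8))                # a true permutation
assert order == epoch_order(8, seed=42, epoch=0)      # reproducible
```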

Changes not related to making DPO match:

  1. Fixed the token_count metric in dpo_tune_cache.py to undo the mean reduction.
  2. Make data ordering reproducible by using a seeded RandomSampler with torch.Generator().manual_seed(seed).
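The reproducible-sampling change in (2) amounts to giving the sampler its own seeded generator; a minimal sketch (dataset and seed are illustrative):

```python
import torch
from torch.utils.data import DataLoader, RandomSampler

data = list(range(10))

def make_loader(seed: int) -> DataLoader:
    # Seeding a dedicated Generator makes the sampler's permutation
    # independent of the global torch RNG state.
    g = torch.Generator()
    g.manual_seed(seed)
    return DataLoader(data, batch_size=4, sampler=RandomSampler(data, generator=g))

first = [b.tolist() for b in make_loader(42)]
second = [b.tolist() for b in make_loader(42)]
assert first == second  # same seed, same batch order across runs
```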

Known areas of deviation between the two:

  1. dpo.py sets drop_last=True while dpo_tune_cache.py doesn't.
  2. dpo.py accumulates gradients using a token-weighted mean, while dpo_tune_cache.py uses uniform weighting.

We believe dpo.py's behavior is correct on both counts.
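The second deviation can be made concrete with a small sketch (the numbers and variable names are made up for illustration, not taken from either script):

```python
# Two micro-batches with different token counts but known mean losses.
losses = [2.0, 4.0]   # per-micro-batch mean losses
tokens = [10, 30]     # non-padding tokens in each micro-batch

# dpo_tune_cache.py-style: uniform mean over micro-batches.
uniform = sum(losses) / len(losses)

# dpo.py-style: token-weighted mean, equivalent to averaging the loss
# over all tokens of the full batch in a single pass.
weighted = sum(l * n for l, n in zip(losses, tokens)) / sum(tokens)

assert uniform == 3.0    # (2 + 4) / 2
assert weighted == 3.5   # (2*10 + 4*30) / 40
```

The two schemes agree only when every micro-batch contains the same number of tokens.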


Note

Medium Risk
Changes core DPO training semantics (data order, masking, checkpoint weight loading, and step scheduling), which can materially affect convergence and reproducibility; scope is contained to DPO paths and covered by updated GPU tests.

Overview
Makes dpo.py’s single-GPU behavior match dpo_tune_cache.py more closely by (1) switching HFDataLoader shuffling to seeded torch.randperm for reproducible ordering, (2) fixing OLMo-core attention masking by repacking padded batches into doc_lens/max_doc_lens format in dpo_utils forwards, and (3) reloading HF checkpoint weights after TransformerTrainModule parallelization to avoid accidental re-init.

Also updates training control to be step-based (max_train_steps via Duration.steps + scheduler step count), forces drop_last=True for DPO data/cache loaders, makes dpo_tune_cache.py sampling reproducible with a seeded RandomSampler, and corrects its token_count aggregation after mean reduction; tests and debug script are adjusted accordingly.

Written by Cursor Bugbot for commit c5cb1bb. This will update automatically on new commits.

finbarrtimbers and others added 30 commits January 30, 2026 09:34
Add a version of the single GPU DPO script that calls dpo_tune_cache.py
instead of dpo.py, to compare metrics between the two implementations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move beaker_config initialization outside conditional blocks so it's
always defined when needed for experiment config updates.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds missing metrics that are tracked in dpo_tune_cache.py:
- train/rewards_average: Average of chosen and rejected rewards
- train/token_count: Sum of non-padded tokens in chosen + rejected

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Avoid port 29500 conflicts on single GPU jobs by disabling host networking.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove invalid main_process_only argument from logger.info().

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update the function call to match the current signature in dpo_utils.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Correctly describe it as using accelerate (dpo_tune_cache.py), not OLMo-core.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This makes dpo.py count optimizer updates as steps (like dpo_tune_cache.py)
instead of counting micro-batches as steps.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Keep incomplete last batch to match accelerate's default behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…and dpo_tune_cache.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
OLMo-core Transformer doesn't support attention_mask parameter but uses
cu_doc_lens for intra-document attention masking. This change adds a
pack_padded_sequences helper function that converts padded batches to
packed format with cumulative document lengths.

Both concatenated_forward_olmo and separate_forward_olmo now properly
handle padding by packing sequences on-the-fly.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
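The padded-to-packed conversion described above can be sketched as follows; this is a minimal illustration of the idea, and the real pack_padded_sequences helper in dpo_utils may differ in detail:

```python
import torch

def pack_padded(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    # Per-sequence document lengths (number of non-padding tokens).
    doc_lens = attention_mask.sum(dim=1)
    # Drop padding and flatten all sequences into one packed stream.
    packed = input_ids[attention_mask.bool()]
    # Cumulative document boundaries, as used for intra-document masking.
    cu_doc_lens = torch.cat([torch.zeros(1, dtype=torch.long),
                             doc_lens.cumsum(0)])
    return packed, cu_doc_lens, int(doc_lens.max())

ids = torch.tensor([[5, 6, 7, 0], [8, 9, 0, 0]])
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
packed, cu, max_len = pack_padded(ids, mask)
assert packed.tolist() == [5, 6, 7, 8, 9]
assert cu.tolist() == [0, 3, 5]
assert max_len == 3
```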
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…e_cache.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When packing=False, OLMo-core now unpacks logits back to padded format
and uses _get_batch_logps (same as HuggingFace) instead of pf_get_batch_logps.
This ensures consistent logprob computation between dpo.py and dpo_tune_cache.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…he.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Split large batches into micro-batches of size per_device_train_batch_size
and process them one at a time with gradient accumulation. This ensures
dpo.py (OLMo-core) and dpo_tune_cache.py (HuggingFace) process the same
number of samples per forward pass.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
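The splitting-plus-accumulation pattern above can be sketched like this (illustrative only, not the actual split_batch_dpo code); dividing each micro-batch loss by the number of micro-batches makes the accumulated gradient match a single full-batch backward under uniform weighting:

```python
import torch

def split_into_microbatches(batch: torch.Tensor, micro_bs: int):
    # Slice the batch along dim 0 into chunks of at most micro_bs samples.
    return [batch[i:i + micro_bs] for i in range(0, len(batch), micro_bs)]

batch = torch.arange(8, dtype=torch.float32).reshape(8, 1)
micros = split_into_microbatches(batch, micro_bs=4)
assert len(micros) == 2

w = torch.zeros(1, requires_grad=True)
for m in micros:
    # Scale each micro-batch loss so the accumulated gradient equals
    # the gradient of the mean loss over the full batch.
    loss = (m * w).mean() / len(micros)
    loss.backward()

full_grad = batch.mean().reshape(1)   # gradient of the full-batch mean loss
assert torch.allclose(w.grad, full_grad)
```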
Log input_ids, attention_mask/cu_doc_lens, labels, logits, and logprobs
for both HuggingFace and OLMo-core forward functions to diagnose
the logprob differences between dpo.py and dpo_tune_cache.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Log the first 5 values and mean of embedding weights to verify
whether the model weights are identical between implementations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use .detach().float().cpu() to convert DTensor to regular tensor
before calling .tolist() for logging.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
load_hf_model() loads weights into the provided state_dict, but
model.state_dict() returns a copy, not a reference. The modified
state_dict was never loaded back into the model, leaving it with
randomly initialized weights.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
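The pitfall described above can be reproduced in a few lines: replacing entries in the dict returned by state_dict() does not touch the model's parameters, so a modified dict must be pushed back with load_state_dict (a toy nn.Linear stands in for the real model here):

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 2, bias=False)
sd = model.state_dict()

# Replacing a dict entry leaves the model's parameters untouched.
sd["weight"] = torch.ones(2, 2)
assert not torch.equal(model.weight.data, torch.ones(2, 2))

# Only load_state_dict pushes the modified dict back into the model.
model.load_state_dict(sd)
assert torch.equal(model.weight.data, torch.ones(2, 2))
```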
- Change HFDataLoader to use NumPy RNG (np.random.default_rng) instead of
  PyTorch RNG to match HuggingFace Dataset.shuffle() behavior
- Remove shuffle=True from DataLoader in dpo_tune_cache.py to avoid double
  shuffling (dataset is already shuffled)
- Add debug logging to verify data indices match

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The dataset was being shuffled twice:
1. Via HF Dataset.shuffle() before passing to HFDataLoader
2. Via numpy permutation inside HFDataLoader._reshard()

Now only HFDataLoader._reshard() shuffles, matching dpo_tune_cache.py behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Keep the original double-shuffling behavior as the DataLoader needs to
shuffle during iteration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
dpo_tune_cache.py does:
1. dataset.shuffle(seed) - HF Dataset shuffle
2. DataLoader(shuffle=True) - PyTorch DataLoader shuffle

Now dpo.py does:
1. dataset.shuffle(seed) - HF Dataset shuffle (restored)
2. HFDataLoader._reshard() with PyTorch RNG (torch.randperm)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The torch RNG state gets consumed by model loading between set_seed()
and DataLoader creation. Reseeding ensures the DataLoader shuffle
uses a fresh RNG state matching HFDataLoader's behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Cannot modify dpo_tune_cache.py - need to match its behavior from dpo.py side.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Makes the DataLoader shuffle reproducible by using a Generator seeded
with args.seed, matching HFDataLoader's behavior in dpo.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Log logits at first label position for chosen and rejected to help
debug differences between HF and OLMo-core implementations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
finbarrtimbers and others added 3 commits February 11, 2026 07:57
Previously compared mean chosen vs mean rejected rewards (always 0 or 1).
Now computes per-sample accuracy and averages, matching dpo_tune_cache.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@finbarrtimbers force-pushed the finbarr/dpo-match-single-gpu branch from e2f6ffb to e01505b on February 11, 2026 19:24
finbarrtimbers and others added 5 commits February 12, 2026 08:25
Resolve conflicts by adapting branch's micro-batching and rewards_accuracy
fix to main's TransformerTrainModule structure. Keep GRPOTrainModule from
main. Drop debug logging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nModule merge

TransformerTrainModule.__init__ calls parallelize_model which calls
init_weights, reinitializing all model weights from scratch. This
destroyed the HF checkpoint loaded in _setup_model. Fix by reloading
HF weights after parallelization.

Also fix micro-batch splitting: use sample_microbatch_size (in samples)
for split_batch_dpo instead of rank_microbatch_size (in tokens), matching
main branch's DPOTrainModule pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…gle-gpu

# Conflicts:
#	open_instruct/dpo.py
#	open_instruct/olmo_core_train_modules.py
@github-actions

Documentation Changes Detected

📄 sitemap.xml
--- site-base/sitemap.xml	2026-02-18 19:40:01.684494357 +0000
+++ site-pr/sitemap.xml	2026-02-18 19:39:59.221033571 +0000
@@ -13,6 +13,10 @@
          <lastmod>2026-02-18</lastmod>
     </url>
     <url>
+         <loc>https://github.com/allenai/open-instruct/dpo_divergence_investigation/</loc>
+         <lastmod>2026-02-18</lastmod>
+    </url>
+    <url>
📄 sitemap.xml.gz
Binary files site-base/sitemap.xml.gz and site-pr/sitemap.xml.gz differ

Showing first 10 lines of diff for each changed file (up to 5 files, excluding search indices).

Both scripts keep effective batch size=4 but use per_device_train_batch_size=4
with gradient_accumulation_steps=1, so micro-batch averaging differences
between OLMo-core (token-weighted) and HF (uniform) don't matter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

finbarrtimbers and others added 6 commits February 18, 2026 13:51
- Fix mock model parameter names to match production code (doc_lens/max_doc_lens)
- Use torch RNG instead of numpy in test to match HFDataLoader implementation
- Use effective_steps (max_train_steps when set) for scheduler warmup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@hamishivi left a comment


generally lgtm but one comment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@finbarrtimbers added this pull request to the merge queue on Feb 18, 2026

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


      collator=collator,
      device=device,
-     drop_last=False,
+     drop_last=True,

Cache data loader drops examples, causing RuntimeError

High Severity

The cache_data_loader was changed from drop_last=False to drop_last=True. This causes build_reference_logprobs_cache to skip caching some dataset indices (the remainder that doesn't fill a full batch). Since the cache function allocates tensors of size full_dataset_size=len(dataset) and then validates that every index was populated (raising RuntimeError for any -inf entries), this will crash whenever len(dataset) % cache_batch_size != 0.
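The failure mode can be reproduced with a toy loop; the cache and validation below are a hypothetical mirror of what build_reference_logprobs_cache does, not its actual code:

```python
import torch
from torch.utils.data import DataLoader

dataset = list(range(10))   # 10 examples; batch size 4 leaves a remainder of 2
cache = torch.full((len(dataset),), float("-inf"))

# With drop_last=True, the final partial batch (indices 8 and 9) is skipped.
for batch in DataLoader(dataset, batch_size=4, drop_last=True):
    cache[batch] = 0.0      # pretend these are the cached logprobs

# Two cache slots were never populated, which the real validation
# step would surface as a RuntimeError.
assert torch.isinf(cache).sum().item() == 2
```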


Merged via the queue into main with commit c4b10fc Feb 18, 2026
7 of 8 checks passed
@finbarrtimbers deleted the finbarr/dpo-match-single-gpu branch February 18, 2026 23:40