
Now, dpo.py matches dpo_tune_cache.py almost perfectly on the single GPU experiments #1451

Merged
finbarrtimbers merged 59 commits into main from finbarr/dpo-match-single-gpu on Feb 18, 2026
Conversation


@finbarrtimbers commented Feb 1, 2026

Colab with plots, particularly loss curve:

[Screenshot: loss curves, 2026-02-01 9:06 AM]

Changes:

  1. In HFDataLoader, matched the dpo_tune_cache.py iteration order by shuffling with torch.randperm and a seeded torch.Generator.
  2. Fixed attention masking in the dpo.py forward pass by converting padded batches to the packed format. We now pass doc_lens and max_doc_lens, enabling proper intra-document attention masking.
  3. Fixed the weight-loading order by loading weights after parallelization (parallelization was previously re-initializing the weights, whoops).
  4. Used 0-based epoch numbers to match dpo_tune_cache.py.
  5. Used Duration.steps instead of Duration.epochs in dpo.py so we run exactly max_train_steps training steps.
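The seeded-permutation change in (1) can be sketched as follows. `epoch_order` is a hypothetical stand-in for illustration, not the actual HFDataLoader code:

```python
import torch

def epoch_order(num_samples: int, seed: int, epoch: int) -> list[int]:
    # A fresh Generator seeded per epoch gives a reproducible permutation
    # that is independent of any global RNG state consumed elsewhere
    # (e.g. by model initialization).
    g = torch.Generator()
    g.manual_seed(seed + epoch)
    return torch.randperm(num_samples, generator=g).tolist()

order = epoch_order(8, seed=42, epoch=0)
assert sorted(order) == list(range(8))                # a true permutation
assert order == epoch_order(8, seed=42, epoch=0)      # reproducible
```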

Changes not related to making DPO match:

  1. Fixed the token_count metric in dpo_tune_cache.py to undo the mean reduction.
  2. Make data ordering reproducible by using a seeded RandomSampler with torch.Generator().manual_seed(seed).
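The reproducible-sampling change in (2) amounts to giving the sampler its own seeded generator; a minimal sketch (dataset and seed are illustrative):

```python
import torch
from torch.utils.data import DataLoader, RandomSampler

data = list(range(10))

def make_loader(seed: int) -> DataLoader:
    # Seeding a dedicated Generator makes the sampler's permutation
    # independent of the global torch RNG state.
    g = torch.Generator()
    g.manual_seed(seed)
    return DataLoader(data, batch_size=4, sampler=RandomSampler(data, generator=g))

first = [b.tolist() for b in make_loader(42)]
second = [b.tolist() for b in make_loader(42)]
assert first == second  # same seed, same batch order across runs
```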

Known areas of deviation between the two:

  1. dpo.py sets drop_last=True while dpo_tune_cache.py doesn't.
  2. dpo.py accumulates gradients using a token-weighted mean, while dpo_tune_cache.py uses uniform weighting.

We believe dpo.py's behavior is correct on both counts.
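The second deviation can be made concrete with a small sketch (the numbers and variable names are made up for illustration, not taken from either script):

```python
# Two micro-batches with different token counts but known mean losses.
losses = [2.0, 4.0]   # per-micro-batch mean losses
tokens = [10, 30]     # non-padding tokens in each micro-batch

# dpo_tune_cache.py-style: uniform mean over micro-batches.
uniform = sum(losses) / len(losses)

# dpo.py-style: token-weighted mean, equivalent to averaging the loss
# over all tokens of the full batch in a single pass.
weighted = sum(l * n for l, n in zip(losses, tokens)) / sum(tokens)

assert uniform == 3.0    # (2 + 4) / 2
assert weighted == 3.5   # (2*10 + 4*30) / 40
```

The two schemes agree only when every micro-batch contains the same number of tokens.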


Note

Medium Risk
Changes core DPO training semantics (data order, masking, checkpoint weight loading, and step scheduling), which can materially affect convergence and reproducibility; scope is contained to DPO paths and covered by updated GPU tests.

Overview
Makes dpo.py’s single-GPU behavior match dpo_tune_cache.py more closely by (1) switching HFDataLoader shuffling to seeded torch.randperm for reproducible ordering, (2) fixing OLMo-core attention masking by repacking padded batches into doc_lens/max_doc_lens format in dpo_utils forwards, and (3) reloading HF checkpoint weights after TransformerTrainModule parallelization to avoid accidental re-init.

Also updates training control to be step-based (max_train_steps via Duration.steps + scheduler step count), forces drop_last=True for DPO data/cache loaders, makes dpo_tune_cache.py sampling reproducible with a seeded RandomSampler, and corrects its token_count aggregation after mean reduction; tests and debug script are adjusted accordingly.

Written by Cursor Bugbot for commit c5cb1bb. This will update automatically on new commits.

finbarrtimbers and others added 30 commits January 30, 2026 09:34
Add a version of the single GPU DPO script that calls dpo_tune_cache.py
instead of dpo.py, to compare metrics between the two implementations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move beaker_config initialization outside conditional blocks so it's
always defined when needed for experiment config updates.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds missing metrics that are tracked in dpo_tune_cache.py:
- train/rewards_average: Average of chosen and rejected rewards
- train/token_count: Sum of non-padded tokens in chosen + rejected

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Avoid port 29500 conflicts on single GPU jobs by disabling host networking.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove invalid main_process_only argument from logger.info().

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update the function call to match the current signature in dpo_utils.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Correctly describe it as using accelerate (dpo_tune_cache.py), not OLMo-core.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This makes dpo.py count optimizer updates as steps (like dpo_tune_cache.py)
instead of counting micro-batches as steps.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Keep incomplete last batch to match accelerate's default behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…and dpo_tune_cache.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
OLMo-core Transformer doesn't support attention_mask parameter but uses
cu_doc_lens for intra-document attention masking. This change adds a
pack_padded_sequences helper function that converts padded batches to
packed format with cumulative document lengths.

Both concatenated_forward_olmo and separate_forward_olmo now properly
handle padding by packing sequences on-the-fly.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
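The padded-to-packed conversion described above can be sketched as follows; this is a minimal illustration of the idea, and the real pack_padded_sequences helper in dpo_utils may differ in detail:

```python
import torch

def pack_padded(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    # Per-sequence document lengths (number of non-padding tokens).
    doc_lens = attention_mask.sum(dim=1)
    # Drop padding and flatten all sequences into one packed stream.
    packed = input_ids[attention_mask.bool()]
    # Cumulative document boundaries, as used for intra-document masking.
    cu_doc_lens = torch.cat([torch.zeros(1, dtype=torch.long),
                             doc_lens.cumsum(0)])
    return packed, cu_doc_lens, int(doc_lens.max())

ids = torch.tensor([[5, 6, 7, 0], [8, 9, 0, 0]])
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
packed, cu, max_len = pack_padded(ids, mask)
assert packed.tolist() == [5, 6, 7, 8, 9]
assert cu.tolist() == [0, 3, 5]
assert max_len == 3
```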
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…e_cache.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When packing=False, OLMo-core now unpacks logits back to padded format
and uses _get_batch_logps (same as HuggingFace) instead of pf_get_batch_logps.
This ensures consistent logprob computation between dpo.py and dpo_tune_cache.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…he.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Split large batches into micro-batches of size per_device_train_batch_size
and process them one at a time with gradient accumulation. This ensures
dpo.py (OLMo-core) and dpo_tune_cache.py (HuggingFace) process the same
number of samples per forward pass.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
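The splitting-plus-accumulation pattern above can be sketched like this (illustrative only, not the actual split_batch_dpo code); dividing each micro-batch loss by the number of micro-batches makes the accumulated gradient match a single full-batch backward under uniform weighting:

```python
import torch

def split_into_microbatches(batch: torch.Tensor, micro_bs: int):
    # Slice the batch along dim 0 into chunks of at most micro_bs samples.
    return [batch[i:i + micro_bs] for i in range(0, len(batch), micro_bs)]

batch = torch.arange(8, dtype=torch.float32).reshape(8, 1)
micros = split_into_microbatches(batch, micro_bs=4)
assert len(micros) == 2

w = torch.zeros(1, requires_grad=True)
for m in micros:
    # Scale each micro-batch loss so the accumulated gradient equals
    # the gradient of the mean loss over the full batch.
    loss = (m * w).mean() / len(micros)
    loss.backward()

full_grad = batch.mean().reshape(1)   # gradient of the full-batch mean loss
assert torch.allclose(w.grad, full_grad)
```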
Log input_ids, attention_mask/cu_doc_lens, labels, logits, and logprobs
for both HuggingFace and OLMo-core forward functions to diagnose
the logprob differences between dpo.py and dpo_tune_cache.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Log the first 5 values and mean of embedding weights to verify
whether the model weights are identical between implementations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use .detach().float().cpu() to convert DTensor to regular tensor
before calling .tolist() for logging.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
load_hf_model() loads weights into the provided state_dict, but
model.state_dict() returns a copy, not a reference. The modified
state_dict was never loaded back into the model, leaving it with
randomly initialized weights.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
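The pitfall described above can be reproduced in a few lines: replacing entries in the dict returned by state_dict() does not touch the model's parameters, so a modified dict must be pushed back with load_state_dict (a toy nn.Linear stands in for the real model here):

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 2, bias=False)
sd = model.state_dict()

# Replacing a dict entry leaves the model's parameters untouched.
sd["weight"] = torch.ones(2, 2)
assert not torch.equal(model.weight.data, torch.ones(2, 2))

# Only load_state_dict pushes the modified dict back into the model.
model.load_state_dict(sd)
assert torch.equal(model.weight.data, torch.ones(2, 2))
```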
- Change HFDataLoader to use NumPy RNG (np.random.default_rng) instead of
  PyTorch RNG to match HuggingFace Dataset.shuffle() behavior
- Remove shuffle=True from DataLoader in dpo_tune_cache.py to avoid double
  shuffling (dataset is already shuffled)
- Add debug logging to verify data indices match

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The dataset was being shuffled twice:
1. Via HF Dataset.shuffle() before passing to HFDataLoader
2. Via numpy permutation inside HFDataLoader._reshard()

Now only HFDataLoader._reshard() shuffles, matching dpo_tune_cache.py behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Keep the original double-shuffling behavior as the DataLoader needs to
shuffle during iteration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
dpo_tune_cache.py does:
1. dataset.shuffle(seed) - HF Dataset shuffle
2. DataLoader(shuffle=True) - PyTorch DataLoader shuffle

Now dpo.py does:
1. dataset.shuffle(seed) - HF Dataset shuffle (restored)
2. HFDataLoader._reshard() with PyTorch RNG (torch.randperm)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The torch RNG state gets consumed by model loading between set_seed()
and DataLoader creation. Reseeding ensures the DataLoader shuffle
uses a fresh RNG state matching HFDataLoader's behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Cannot modify dpo_tune_cache.py - need to match its behavior from dpo.py side.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Makes the DataLoader shuffle reproducible by using a Generator seeded
with args.seed, matching HFDataLoader's behavior in dpo.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Log logits at first label position for chosen and rejected to help
debug differences between HF and OLMo-core implementations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
finbarrtimbers and others added 3 commits February 11, 2026 07:57
Previously compared mean chosen vs mean rejected rewards (always 0 or 1).
Now computes per-sample accuracy and averages, matching dpo_tune_cache.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@finbarrtimbers force-pushed the finbarr/dpo-match-single-gpu branch from e2f6ffb to e01505b on February 11, 2026 19:24
finbarrtimbers and others added 5 commits February 12, 2026 08:25
Resolve conflicts by adapting branch's micro-batching and rewards_accuracy
fix to main's TransformerTrainModule structure. Keep GRPOTrainModule from
main. Drop debug logging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nModule merge

TransformerTrainModule.__init__ calls parallelize_model which calls
init_weights, reinitializing all model weights from scratch. This
destroyed the HF checkpoint loaded in _setup_model. Fix by reloading
HF weights after parallelization.

Also fix micro-batch splitting: use sample_microbatch_size (in samples)
for split_batch_dpo instead of rank_microbatch_size (in tokens), matching
main branch's DPOTrainModule pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…gle-gpu

# Conflicts:
#	open_instruct/dpo.py
#	open_instruct/olmo_core_train_modules.py
@github-actions

Documentation Changes Detected

📄 sitemap.xml
--- site-base/sitemap.xml	2026-02-18 19:40:01.684494357 +0000
+++ site-pr/sitemap.xml	2026-02-18 19:39:59.221033571 +0000
@@ -13,6 +13,10 @@
          <lastmod>2026-02-18</lastmod>
     </url>
     <url>
+         <loc>https://github.com/allenai/open-instruct/dpo_divergence_investigation/</loc>
+         <lastmod>2026-02-18</lastmod>
+    </url>
+    <url>
📄 sitemap.xml.gz
Binary files site-base/sitemap.xml.gz and site-pr/sitemap.xml.gz differ

Showing first 10 lines of diff for each changed file (up to 5 files, excluding search indices).

Both scripts keep effective batch size=4 but use per_device_train_batch_size=4
with gradient_accumulation_steps=1, so micro-batch averaging differences
between OLMo-core (token-weighted) and HF (uniform) don't matter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

finbarrtimbers and others added 6 commits February 18, 2026 13:51
- Fix mock model parameter names to match production code (doc_lens/max_doc_lens)
- Use torch RNG instead of numpy in test to match HFDataLoader implementation
- Use effective_steps (max_train_steps when set) for scheduler warmup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@hamishivi left a comment


generally lgtm but one comment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@finbarrtimbers added this pull request to the merge queue on Feb 18, 2026

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


      collator=collator,
      device=device,
-     drop_last=False,
+     drop_last=True,

Cache data loader drops examples, causing RuntimeError

High Severity

The cache_data_loader was changed from drop_last=False to drop_last=True. This causes build_reference_logprobs_cache to skip caching some dataset indices (the remainder that doesn't fill a full batch). Since the cache function allocates tensors of size full_dataset_size=len(dataset) and then validates that every index was populated (raising RuntimeError for any -inf entries), this will crash whenever len(dataset) % cache_batch_size != 0.
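The failure mode can be reproduced with a toy loop; the cache and validation below are a hypothetical mirror of what build_reference_logprobs_cache does, not its actual code:

```python
import torch
from torch.utils.data import DataLoader

dataset = list(range(10))   # 10 examples; batch size 4 leaves a remainder of 2
cache = torch.full((len(dataset),), float("-inf"))

# With drop_last=True, the final partial batch (indices 8 and 9) is skipped.
for batch in DataLoader(dataset, batch_size=4, drop_last=True):
    cache[batch] = 0.0      # pretend these are the cached logprobs

# Two cache slots were never populated, which the real validation
# step would surface as a RuntimeError.
assert torch.isinf(cache).sum().item() == 2
```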


Merged via the queue into main with commit c4b10fc Feb 18, 2026
7 of 8 checks passed
@finbarrtimbers deleted the finbarr/dpo-match-single-gpu branch February 18, 2026 23:40