Now, dpo.py matches dpo_tune_cache.py almost perfectly on the single GPU experiments #1451

finbarrtimbers merged 59 commits into main

Conversation
Add a version of the single GPU DPO script that calls dpo_tune_cache.py instead of dpo.py, to compare metrics between the two implementations. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move beaker_config initialization outside conditional blocks so it's always defined when needed for experiment config updates.
Adds missing metrics that are tracked in dpo_tune_cache.py:
- train/rewards_average: Average of chosen and rejected rewards
- train/token_count: Sum of non-padded tokens in chosen + rejected
Avoid port 29500 conflicts on single GPU jobs by disabling host networking.
Remove invalid main_process_only argument from logger.info().
Update the function call to match the current signature in dpo_utils.py.
Correctly describe it as using accelerate (dpo_tune_cache.py), not OLMo-core.
This makes dpo.py count optimizer updates as steps (like dpo_tune_cache.py) instead of counting micro-batches as steps.
Keep incomplete last batch to match accelerate's default behavior.
…and dpo_tune_cache.py
OLMo-core's Transformer doesn't support an attention_mask parameter; it uses cu_doc_lens for intra-document attention masking instead. This change adds a pack_padded_sequences helper function that converts padded batches to packed format with cumulative document lengths. Both concatenated_forward_olmo and separate_forward_olmo now properly handle padding by packing sequences on the fly.
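The padding-to-packing conversion can be sketched as follows. This is an illustrative re-implementation, not the PR's actual pack_padded_sequences; the tensor shapes and the exact cu_doc_lens convention are assumptions.

```python
import torch

def pack_padded(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    """Convert a padded (batch, seq) batch into one packed 1-D sequence plus
    cumulative document lengths, the kind of format cu_doc_lens-based
    intra-document attention masking expects."""
    lengths = attention_mask.sum(dim=1).tolist()  # real (non-pad) tokens per row
    packed = torch.cat([row[:n] for row, n in zip(input_ids, lengths)])
    # cu_doc_lens marks document boundaries: [0, len0, len0+len1, ...]
    cu_doc_lens = torch.tensor([0] + lengths).cumsum(0)
    return packed, cu_doc_lens

input_ids = torch.tensor([[5, 6, 7, 0], [8, 9, 0, 0]])  # 0 = pad token
attention_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
packed, cu_doc_lens = pack_padded(input_ids, attention_mask)
# packed -> [5, 6, 7, 8, 9]; cu_doc_lens -> [0, 3, 5]
```

With the boundaries in cu_doc_lens, attention can be restricted so tokens only attend within their own document, which is what replaces the padded attention_mask.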
…e_cache.py
When packing=False, OLMo-core now unpacks logits back to padded format and uses _get_batch_logps (same as HuggingFace) instead of pf_get_batch_logps. This ensures consistent logprob computation between dpo.py and dpo_tune_cache.py.
…he.py
Split large batches into micro-batches of size per_device_train_batch_size and process them one at a time with gradient accumulation. This ensures dpo.py (OLMo-core) and dpo_tune_cache.py (HuggingFace) process the same number of samples per forward pass.
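The splitting described above can be sketched like this (hypothetical helper operating on a plain list; the real code splits tensor batches):

```python
def split_microbatches(batch, per_device_train_batch_size):
    # Return consecutive micro-batches of at most per_device_train_batch_size
    # samples; the training loop runs one forward/backward per micro-batch and
    # steps the optimizer once after all of them (gradient accumulation).
    return [batch[i:i + per_device_train_batch_size]
            for i in range(0, len(batch), per_device_train_batch_size)]

micro_batches = split_microbatches(list(range(8)), 4)
# -> [[0, 1, 2, 3], [4, 5, 6, 7]]
```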
Log input_ids, attention_mask/cu_doc_lens, labels, logits, and logprobs for both the HuggingFace and OLMo-core forward functions to diagnose the logprob differences between dpo.py and dpo_tune_cache.py.
Log the first 5 values and the mean of the embedding weights to verify whether the model weights are identical between implementations.
Use .detach().float().cpu() to convert a DTensor to a regular tensor before calling .tolist() for logging.
load_hf_model() loads weights into the provided state_dict, but assigning new tensors into that dict does not update the module's parameters. The modified state_dict was never loaded back into the model via load_state_dict(), leaving it with randomly initialized weights.
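A minimal repro of this failure mode, using a hypothetical one-layer module (load_hf_model itself is not reproduced here):

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 2, bias=False)
sd = model.state_dict()
# Assigning a fresh tensor into the dict only replaces the dict entry;
# the module still holds its original, randomly initialized parameter.
sd["weight"] = torch.zeros(2, 2)
assert not torch.equal(model.weight.data, torch.zeros(2, 2))
# The fix: explicitly load the modified state_dict back into the module.
model.load_state_dict(sd)
assert torch.equal(model.weight.data, torch.zeros(2, 2))
```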
- Change HFDataLoader to use NumPy RNG (np.random.default_rng) instead of PyTorch RNG to match HuggingFace Dataset.shuffle() behavior
- Remove shuffle=True from the DataLoader in dpo_tune_cache.py to avoid double shuffling (the dataset is already shuffled)
- Add debug logging to verify that the data indices match
The dataset was being shuffled twice:
1. Via HF Dataset.shuffle() before passing to HFDataLoader
2. Via a numpy permutation inside HFDataLoader._reshard()
Now only HFDataLoader._reshard() shuffles, matching dpo_tune_cache.py behavior.
Keep the original double-shuffling behavior, as the DataLoader needs to shuffle during iteration.
dpo_tune_cache.py does:
1. dataset.shuffle(seed) - HF Dataset shuffle
2. DataLoader(shuffle=True) - PyTorch DataLoader shuffle
Now dpo.py does:
1. dataset.shuffle(seed) - HF Dataset shuffle (restored)
2. HFDataLoader._reshard() with PyTorch RNG (torch.randperm)
The torch RNG state gets consumed by model loading between set_seed() and DataLoader creation. Reseeding ensures the DataLoader shuffle uses a fresh RNG state matching HFDataLoader's behavior.
Cannot modify dpo_tune_cache.py - need to match its behavior from dpo.py's side.
Makes the DataLoader shuffle reproducible by using a Generator seeded with args.seed, matching HFDataLoader's behavior in dpo.py.
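The reproducible-shuffle setup can be sketched as follows (toy dataset and seed value are illustrative):

```python
import torch
from torch.utils.data import DataLoader

def epoch_order(seed):
    data = list(range(10))
    # Passing an explicitly seeded Generator pins the shuffle order,
    # regardless of how much global torch RNG state was consumed earlier
    # (e.g. by model loading).
    loader = DataLoader(data, batch_size=4, shuffle=True,
                        generator=torch.Generator().manual_seed(seed))
    return [x for batch in loader for x in batch.tolist()]

# Same seed -> identical iteration order across runs.
assert epoch_order(42) == epoch_order(42)
assert sorted(epoch_order(42)) == list(range(10))
```

Without the generator argument, the shuffle draws from global RNG state, which is exactly the non-determinism the fix avoids.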
Log logits at the first label position for chosen and rejected to help debug differences between the HF and OLMo-core implementations.
Force-pushed from e2f6ffb to e01505b (compare)
Resolve conflicts by adapting the branch's micro-batching and rewards_accuracy fix to main's TransformerTrainModule structure. Keep GRPOTrainModule from main. Drop debug logging.
…nModule merge
TransformerTrainModule.__init__ calls parallelize_model, which calls init_weights, reinitializing all model weights from scratch. This destroyed the HF checkpoint loaded in _setup_model. Fix by reloading the HF weights after parallelization. Also fix micro-batch splitting: use sample_microbatch_size (in samples) for split_batch_dpo instead of rank_microbatch_size (in tokens), matching the main branch's DPOTrainModule pattern.
…gle-gpu

# Conflicts:
#	open_instruct/dpo.py
#	open_instruct/olmo_core_train_modules.py
Documentation Changes Detected 📄
Both scripts keep the effective batch size at 4 but use per_device_train_batch_size=4 with gradient_accumulation_steps=1, so micro-batch averaging differences between OLMo-core (token-weighted) and HF (uniform) don't matter.
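A toy calculation of the two reductions (illustrative numbers, not from the runs) shows why they only agree when each optimizer step sees a single micro-batch:

```python
# Mean loss and token count per micro-batch (made-up values).
mb_losses = [2.0, 4.0]
mb_tokens = [100, 300]

# HF-style uniform weighting: plain average of the micro-batch means.
uniform = sum(mb_losses) / len(mb_losses)  # 3.0

# OLMo-core-style token weighting: each mean weighted by its token count.
token_weighted = (sum(l * t for l, t in zip(mb_losses, mb_tokens))
                  / sum(mb_tokens))  # (200 + 1200) / 400 = 3.5

# With gradient_accumulation_steps=1 there is only one micro-batch, so both
# reductions collapse to that micro-batch's mean and the scripts agree.
assert uniform != token_weighted
```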
- Fix mock model parameter names to match production code (doc_lens/max_doc_lens)
- Use torch RNG instead of numpy in the test to match the HFDataLoader implementation
- Use effective_steps (max_train_steps when set) for scheduler warmup
hamishivi left a comment:
generally lgtm but one comment
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```diff
  collator=collator,
  device=device,
- drop_last=False,
+ drop_last=True,
```
Cache data loader drops examples, causing RuntimeError
High Severity
The cache_data_loader was changed from drop_last=False to drop_last=True. This causes build_reference_logprobs_cache to skip caching some dataset indices (the remainder that doesn't fill a full batch). Since the cache function allocates tensors of size full_dataset_size=len(dataset) and then validates that every index was populated (raising RuntimeError for any -inf entries), this will crash whenever len(dataset) % cache_batch_size != 0.
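The failure condition can be sketched numerically with a hypothetical helper that mirrors the index math (this is not the actual build_reference_logprobs_cache):

```python
def indices_visited(dataset_size, batch_size, drop_last):
    # With drop_last=True the final partial batch is skipped entirely,
    # so its examples never receive a cached reference logprob.
    limit = (dataset_size // batch_size) * batch_size if drop_last else dataset_size
    return set(range(limit))

dataset_size, cache_batch_size = 10, 4
missing = set(range(dataset_size)) - indices_visited(
    dataset_size, cache_batch_size, drop_last=True)
# Entries 8 and 9 stay at their -inf sentinel, tripping the RuntimeError
# validation, since 10 % 4 != 0.
assert missing == {8, 9}
```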


Colab with plots, particularly loss curve:
Changes:
- `HFDataLoader`: switched to match the `dpo_tune_cache.py` iteration order by using `torch.randperm` with a seeded `torch.Generator`.
- Fixed the `dpo.py` forward pass by converting padded batches to the packed format. We now pass `doc_lens` and `max_doc_lens`, enabling proper intra-document attention masking.
- … `dpo_tune_cache.py`.
- Use `Duration.steps` instead of `Duration.epochs` in `dpo.py` so we do `max_train_steps` training steps.

Changes not related to making DPO match:
- Corrected the `token_count` metric in `dpo_tune_cache.py` to undo the mean reduction.
- Seeded the `RandomSampler` with `torch.Generator().manual_seed(seed)`.

Known areas of deviation between the two:
- `dpo.py` sets `drop_last=True` while `dpo_tune_cache.py` doesn't.
- `dpo.py` accumulates gradients using a token-weighted mean, while `dpo_tune_cache.py` uses uniform weighting. We think that `dpo.py` does things correctly.

Note
Medium Risk
Changes core DPO training semantics (data order, masking, checkpoint weight loading, and step scheduling), which can materially affect convergence and reproducibility; scope is contained to DPO paths and covered by updated GPU tests.
Overview
Makes `dpo.py`'s single-GPU behavior match `dpo_tune_cache.py` more closely by (1) switching `HFDataLoader` shuffling to seeded `torch.randperm` for reproducible ordering, (2) fixing OLMo-core attention masking by repacking padded batches into `doc_lens`/`max_doc_lens` format in the `dpo_utils` forwards, and (3) reloading HF checkpoint weights after `TransformerTrainModule` parallelization to avoid accidental re-init.
max_train_stepsviaDuration.steps+ scheduler step count), forcesdrop_last=Truefor DPO data/cache loaders, makesdpo_tune_cache.pysampling reproducible with a seededRandomSampler, and corrects itstoken_countaggregation after mean reduction; tests and debug script are adjusted accordingly.Written by Cursor Bugbot for commit c5cb1bb. This will update automatically on new commits. Configure here.