
Commit c4b10fc

Now, dpo.py matches dpo_tune_cache.py almost perfectly on the single GPU experiments (#1451)
* metrics fix
* Add single_gpu_cache.sh for DPO cache comparison
  Add a version of the single GPU DPO script that calls dpo_tune_cache.py instead of dpo.py, to compare metrics between the two implementations.
* Fix beaker_config UnboundLocalError in dpo_tune_cache.py
  Move beaker_config initialization outside conditional blocks so it's always defined when needed for experiment config updates.
* Add rewards_average and token_count metrics to DPO
  Adds missing metrics that are tracked in dpo_tune_cache.py:
  - train/rewards_average: average of chosen and rejected rewards
  - train/token_count: sum of non-padded tokens in chosen + rejected
* Add --no-host-networking to single GPU DPO scripts
  Avoid port 29500 conflicts on single GPU jobs by disabling host networking.
* Fix logger.info call in dpo_tune_cache.py
  Remove invalid main_process_only argument from logger.info().
* Sync build_reference_logprobs_cache call with dpo_utils.py
  Update the function call to match the current signature in dpo_utils.py.
* Fix description in single_gpu_cache.sh
  Correctly describe it as using accelerate (dpo_tune_cache.py), not OLMo-core.
* Include gradient_accumulation_steps in global_batch_size for dpo.py
  This makes dpo.py count optimizer updates as steps (like dpo_tune_cache.py) instead of counting micro-batches as steps.
* Set drop_last=False in dpo.py to match dpo_tune_cache.py
  Keep the incomplete last batch to match accelerate's default behavior.
* Add debug logging to investigate logprobs discrepancy between dpo.py and dpo_tune_cache.py
* Add attention masking support for OLMo-core DPO
  The OLMo-core Transformer doesn't support an attention_mask parameter; it uses cu_doc_lens for intra-document attention masking instead. This change adds a pack_padded_sequences helper function that converts padded batches to packed format with cumulative document lengths. Both concatenated_forward_olmo and separate_forward_olmo now properly handle padding by packing sequences on the fly.
* Use PyTorch RNG in HFDataLoader to match dpo_tune_cache.py data ordering
* Add debug logging to compare data ordering between dpo.py and dpo_tune_cache.py
* Make OLMo-core DPO use the same logprob computation as HuggingFace
  When packing=False, OLMo-core now unpacks logits back to padded format and uses _get_batch_logps (same as HuggingFace) instead of pf_get_batch_logps. This ensures consistent logprob computation between dpo.py and dpo_tune_cache.py.
* Add detailed logprob debug logging to compare dpo.py and dpo_tune_cache.py
* Add micro-batching to DPO to match dpo_tune_cache.py batch structure
  Split large batches into micro-batches of size per_device_train_batch_size and process them one at a time with gradient accumulation. This ensures dpo.py (OLMo-core) and dpo_tune_cache.py (HuggingFace) process the same number of samples per forward pass.
* Add debug logging to compare HF and OLMo-core forward passes
  Log input_ids, attention_mask/cu_doc_lens, labels, logits, and logprobs for both HuggingFace and OLMo-core forward functions to diagnose the logprob differences between dpo.py and dpo_tune_cache.py.
* Add embedding weight logging to compare HF and OLMo-core models
  Log the first 5 values and the mean of embedding weights to verify whether the model weights are identical between implementations.
* Fix embed weight logging to handle DTensor (FSDP)
  Use .detach().float().cpu() to convert a DTensor to a regular tensor before calling .tolist() for logging.
* Use full_tensor() for FSDP sharded weights
* Fix: Actually load HF weights into OLMo-core model
  load_hf_model() loads weights into the provided state_dict, but model.state_dict() returns a copy, not a reference. The modified state_dict was never loaded back into the model, leaving it with randomly initialized weights.
* Align data ordering between dpo.py and dpo_tune_cache.py
  - Change HFDataLoader to use NumPy RNG (np.random.default_rng) instead of PyTorch RNG to match HuggingFace Dataset.shuffle() behavior
  - Remove shuffle=True from DataLoader in dpo_tune_cache.py to avoid double shuffling (the dataset is already shuffled)
  - Add debug logging to verify that data indices match
* Remove double shuffling from dpo.py
  The dataset was being shuffled twice:
  1. via HF Dataset.shuffle() before passing to HFDataLoader
  2. via a NumPy permutation inside HFDataLoader._reshard()
  Now only HFDataLoader._reshard() shuffles, matching dpo_tune_cache.py behavior.
* Revert changes to dpo_tune_cache.py
  Keep the original double-shuffling behavior, as the DataLoader needs to shuffle during iteration.
* Implement double-shuffle in dpo.py to match dpo_tune_cache.py
  dpo_tune_cache.py does:
  1. dataset.shuffle(seed): HF Dataset shuffle
  2. DataLoader(shuffle=True): PyTorch DataLoader shuffle
  Now dpo.py does:
  1. dataset.shuffle(seed): HF Dataset shuffle (restored)
  2. HFDataLoader._reshard() with PyTorch RNG (torch.randperm)
* Reseed torch RNG before DataLoader creation in dpo_tune_cache.py
  The torch RNG state gets consumed by model loading between set_seed() and DataLoader creation. Reseeding ensures the DataLoader shuffle uses a fresh RNG state matching HFDataLoader's behavior.
* Revert torch.manual_seed change to dpo_tune_cache.py
  Cannot modify dpo_tune_cache.py; we need to match its behavior from the dpo.py side.
* Use seeded generator for DataLoader shuffle in dpo_tune_cache.py
  Makes the DataLoader shuffle reproducible by using a Generator seeded with args.seed, matching HFDataLoader's behavior in dpo.py.
* Add detailed logits logging at label positions
  Log logits at the first label position for chosen and rejected to help debug differences between the HF and OLMo-core implementations.
* Add input token logging at positions 445-450 for debugging
  Compare chosen vs rejected input tokens at the same positions to verify they have identical prompt content.
* Add packed logits logging at position 447 for debugging
  Compare packed_logits[447] (chosen) vs packed_logits[rejected_start+447] to debug attention masking between documents.
* Fix OLMo-core attention masking by using correct argument names
  The model expects doc_lens (individual lengths) and max_doc_lens (a list), not cu_doc_lens (cumulative) and max_doc_len (an int). This fix enables proper document boundary masking in concatenated_forward_olmo.
* Remove debug logging that causes index errors for short sequences
* Fix DataLoader shuffle to match HFDataLoader's randperm order
  Use an explicit RandomSampler instead of shuffle=True to avoid the RNG state consumption that causes different iteration order between DataLoader and HFDataLoader.
* Add debug logging to verify randperm behavior on Beaker
* Add dataset_len to debug logging for randperm verification
* Fix dpo.py epoch alignment with dpo_tune_cache.py
  The OLMo-core Trainer starts at epoch=1 by default, which causes data_loader.reshuffle(1) to be called with seed=seed+1=124 instead of seed=123. This resulted in different data ordering between dpo.py and dpo_tune_cache.py after the first 4 samples. Setting trainer.epoch=0 before fit() ensures both implementations use the same seed=123 for data shuffling.
* Apply H17, H25, H16 fixes from DPO divergence investigation
  - H17: Swap optim.step() before scheduler.set_lr() so the optimizer uses the correct LR (was advancing the LR before the optimizer step)
  - H25: Initialize optimizer LR to 0.0 (warmup start) to match HF behavior (was using the full learning_rate for the first step)
  - H16: Use Duration.steps(max_train_steps) when set instead of always Duration.epochs() (fixes step count mismatch)
  - Copy investigation doc from the old branch
* Revert H17/H25, keep H16: use Duration.steps() instead of Duration.epochs()
  The step count mismatch (96 vs 72) was caused by Duration.epochs() counting more steps than expected due to dataloader padding. Always use Duration.steps(num_training_steps) to match dpo_tune_cache.py. Reverted H17 (scheduler ordering swap) and H25 (lr=0 init) as they were incorrect: the original set_lr-before-optim.step order is correct given OLMo-core's pre-incremented global_step. Also adds a compare_wandb_runs.py script with per-step output and step-count-mismatch tolerance.
* Add validation notebook for DPO comparison
* Remove incorrect trainer.epoch = 0 (default is 1-based)
  OLMo-core uses 1-based epochs. Setting epoch=0 caused Duration.epochs(N) to compute N+1 epochs' worth of steps.
* Fix rewards_accuracy to use per-sample comparison instead of scalar
  Previously compared mean chosen vs mean rejected rewards (always 0 or 1). Now computes per-sample accuracy and averages, matching dpo_tune_cache.py. Also set drop_last=True in dpo.py data loaders.
* Revert "Fix rewards_accuracy to use per-sample comparison instead of scalar"
  This reverts commit 6143215.
* Revert "Remove incorrect trainer.epoch = 0 (default is 1-based)"
  This reverts commit 31024f1.
* Fix rewards_accuracy to use per-sample comparison instead of scalar
  Previously compared mean chosen vs mean rejected rewards (always 0 or 1). Now computes per-sample accuracy and averages, matching dpo_tune_cache.py.
* Fix HF weight loading and micro-batch splitting after TransformerTrainModule merge
  TransformerTrainModule.__init__ calls parallelize_model, which calls init_weights, reinitializing all model weights from scratch. This destroyed the HF checkpoint loaded in _setup_model. Fix by reloading HF weights after parallelization. Also fix micro-batch splitting: use sample_microbatch_size (in samples) for split_batch_dpo instead of rank_microbatch_size (in tokens), matching the main branch's DPOTrainModule pattern.
* Remove DEBUG logging from DPO forward passes and data loader
* cleaned up PR and merged to head
* Set gradient_accumulation_steps=1 in debug DPO scripts
  Both scripts keep an effective batch size of 4 but use per_device_train_batch_size=4 with gradient_accumulation_steps=1, so micro-batch averaging differences between OLMo-core (token-weighted) and HF (uniform) don't matter.
* cleaned up PR
* cleaned up PR
* cleaned up PR
* Fix PR #1451 review comments
  - Fix mock model parameter names to match production code (doc_lens/max_doc_lens)
  - Use torch RNG instead of numpy in the test to match the HFDataLoader implementation
  - Use effective_steps (max_train_steps when set) for scheduler warmup
* set drop_last
* updated code
* Remove del statement for unused params in mock model forward

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent cf4a3f5 commit c4b10fc

File tree: 9 files changed (+148, -39 lines)

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -28,6 +28,7 @@ All notable changes to this project will be documented in this file.
 - Increased vLLM health check timeout from 30s to 600s (10 minutes) (https://github.com/allenai/open-instruct/pull/1452).
 - Updated vllm version to 0.14.1 (https://github.com/allenai/open-instruct/pull/1433).
 - Changed default wandb x-axis from `episode` to `training_step` for grpo_fast (https://github.com/allenai/open-instruct/pull/1437).
+- Made a bunch of changes to `dpo.py` so it matches `dpo_tune_cache.py` perfectly (https://github.com/allenai/open-instruct/pull/1451).

 ### Fixed
 - Fixed test `single_example_collator` returning raw int for index, causing `TypeError` in `_iter_batches` (https://github.com/allenai/open-instruct/pull/1477).

open_instruct/data_loader.py

Lines changed: 4 additions & 3 deletions
@@ -231,12 +231,13 @@ def _reshard(self, epoch: int) -> None:

        Uses index-based shuffling to avoid copying the dataset.
        """
-        rng = np.random.default_rng(self.seed + epoch)
-        all_indices = np.arange(len(self._full_dataset))
+        generator = torch.Generator()
+        generator.manual_seed(self.seed + epoch)
+        dataset_len = len(self._full_dataset)
+        all_indices = torch.randperm(dataset_len, generator=generator).numpy()
        if self._excluded_indices:
            mask = np.isin(all_indices, list(self._excluded_indices), invert=True)
            all_indices = all_indices[mask]
-        rng.shuffle(all_indices)

        global_size = len(all_indices)
        total_batches = global_size // self._batch_size
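The switch from NumPy to PyTorch RNG above is what lets `HFDataLoader._reshard` reproduce the order of the seeded `RandomSampler` used on the `dpo_tune_cache.py` side. A minimal standalone sketch of why the two now agree (toy dataset, not the repository's code; it relies on `RandomSampler` without replacement drawing its permutation from `torch.randperm` with the supplied generator):

```python
import torch
from torch.utils.data import RandomSampler, TensorDataset

seed, n = 123, 8
ds = TensorDataset(torch.arange(n))

# HFDataLoader._reshard-style shuffle for epoch 0, per the diff above.
gen = torch.Generator()
gen.manual_seed(seed + 0)  # self.seed + epoch
reshard_order = torch.randperm(n, generator=gen).tolist()

# dpo_tune_cache.py-style sampler: RandomSampler (without replacement)
# draws its order from torch.randperm using the given generator.
sampler = RandomSampler(ds, generator=torch.Generator().manual_seed(seed))
sampler_order = list(iter(sampler))

assert reshard_order == sampler_order  # identical data ordering
```

With the old `np.random.default_rng` path, the permutation came from a different RNG algorithm and could not match the sampler's order for any seed.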

open_instruct/dpo.py

Lines changed: 21 additions & 12 deletions
@@ -85,7 +85,7 @@ def _load():


 def _setup_model(args: dpo_utils.ExperimentConfig, device: torch.device):
-    """Load and configure OLMo-core model."""
+    """Build OLMo-core model architecture (weights loaded after parallelization)."""
     hf_config = transformers.AutoConfig.from_pretrained(args.model_name_or_path)
     vocab_size = hf_config.vocab_size
     logger.info(f"Building OLMo-core model with vocab_size={vocab_size}")
@@ -103,10 +103,6 @@ def _setup_model(args: dpo_utils.ExperimentConfig, device: torch.device):
     )
     model = model_config.build(init_device="cpu")

-    logger.info(f"Loading HuggingFace weights from {args.model_name_or_path}")
-    load_hf_model(args.model_name_or_path, model.state_dict(), work_dir=args.output_dir)
-    model = model.to(device=device, dtype=torch.bfloat16)
-
     return model, model_config


@@ -271,7 +267,7 @@ def main(args: dpo_utils.ExperimentConfig, tc: dataset_transformation.TokenizerC

     dataset = _load_dataset_distributed(args, tc, transform_fn_args, is_main_process)
     dataset = dataset.shuffle(seed=args.seed)
-    dataset.set_format(type="pt")  # Must be after shuffle (shuffle resets format)
+    dataset.set_format(type="pt")

     world_size = distributed_utils.get_world_size() if distributed_utils.is_distributed() else 1
     dp_world_size = world_size // args.tensor_parallel_degree
@@ -308,6 +304,7 @@ def main(args: dpo_utils.ExperimentConfig, tc: dataset_transformation.TokenizerC
         work_dir=args.output_dir,
         collator=collator,
         device=device,
+        drop_last=True,
     )
     # 4x batch size: forward-only (no backward), so no activation storage needed.
     # With packing, the collator's token budget controls the actual forward-pass size
@@ -325,7 +322,7 @@ def main(args: dpo_utils.ExperimentConfig, tc: dataset_transformation.TokenizerC
         work_dir=args.output_dir,
         collator=collator,
         device=device,
-        drop_last=False,
+        drop_last=True,
     )

     forward_fn = dpo_utils.concatenated_forward_olmo if args.concatenated_forward else dpo_utils.separate_forward_olmo
@@ -350,8 +347,9 @@ def main(args: dpo_utils.ExperimentConfig, tc: dataset_transformation.TokenizerC

     data_loader.reshuffle(epoch=0)
     num_training_steps = len(data_loader) * args.num_epochs
+    effective_steps = args.max_train_steps if args.max_train_steps is not None else num_training_steps
     optim_config = AdamWConfig(lr=args.learning_rate, weight_decay=args.weight_decay, fused=args.fused_optimizer)
-    scheduler = _setup_scheduler(args, num_training_steps)
+    scheduler = _setup_scheduler(args, effective_steps)
     max_grad_norm = args.max_grad_norm if args.max_grad_norm > 0 else None
     dp_config = transformer_config.TransformerDataParallelConfig(
         name=DataParallelType.hsdp,
@@ -384,9 +382,13 @@ def main(args: dpo_utils.ExperimentConfig, tc: dataset_transformation.TokenizerC
         device=device,
     )

-    # Build reference cache after train_module init because TransformerTrainModule applies
-    # FSDP parallelism to the model, and we need the parallelized model to calculate the
-    # logprobs in case the model is too big to fit in memory.
+    # TransformerTrainModule.__init__ calls parallelize_model which calls init_weights,
+    # reinitializing all model weights from scratch. We must reload the HF checkpoint.
+    logger.info("Reloading HuggingFace weights after parallelization...")
+    sd = train_module.model.state_dict()
+    load_hf_model(args.model_name_or_path, sd, work_dir=args.output_dir)
+    train_module.model.load_state_dict(sd)
+
     logger.info("Caching reference logprobs...")
     train_module.reference_cache = dpo_utils.build_reference_logprobs_cache(model=train_module.model, **cache_kwargs)

@@ -399,14 +401,21 @@ def main(args: dpo_utils.ExperimentConfig, tc: dataset_transformation.TokenizerC

     trainer_callbacks = _setup_callbacks(args, dp_world_size)

+    if args.max_train_steps is not None:
+        max_duration = train.Duration.steps(args.max_train_steps)
+    else:
+        max_duration = train.Duration.steps(num_training_steps)
+
     trainer = train.TrainerConfig(
         save_folder=args.output_dir,
-        max_duration=train.Duration.epochs(args.num_epochs),
+        max_duration=max_duration,
         metrics_collect_interval=args.logging_steps,
         callbacks=trainer_callbacks,
         save_overwrite=True,
     ).build(train_module, data_loader)

+    trainer.epoch = 0
+
     logger.info("Starting training...")
     trainer.fit()
     logger.info("Training complete.")

open_instruct/dpo_tune_cache.py

Lines changed: 8 additions & 3 deletions
@@ -43,7 +43,7 @@
 from huggingface_hub import HfApi
 from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
 from rich.pretty import pprint
-from torch.utils.data import DataLoader
+from torch.utils.data import DataLoader, RandomSampler
 from tqdm.auto import tqdm
 from transformers import AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig, get_scheduler

@@ -407,8 +407,9 @@ def load_model():
     else:
         collate_fn = dpo_utils.DataCollatorForSeq2SeqDPO(tokenizer=tokenizer, model=model, padding="longest")

+    train_sampler = RandomSampler(train_dataset, generator=torch.Generator().manual_seed(args.seed))
     train_dataloader = DataLoader(
-        train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=args.per_device_train_batch_size
+        train_dataset, sampler=train_sampler, collate_fn=collate_fn, batch_size=args.per_device_train_batch_size
     )

     # Optimizer
@@ -535,6 +536,7 @@ def load_model():
         is_main_process=accelerator.is_main_process,
         model_dims=model_dims,
         use_lora=args.use_lora,
+        disable_adapter_context=None,
     )
     logger.info("=============after cache logprobs")
     print_gpu_stats(init_gpu_memory)
@@ -573,6 +575,7 @@ def load_model():
                     average_log_prob=args.loss_type.is_average_loss,
                     output_router_logits=args.load_balancing_loss,
                 )  # `aux_loss` is only used when `args.load_balancing_loss = True`
+
                 losses, chosen_rewards, rejected_rewards = dpo_utils.compute_loss(
                     args,
                     batch,
@@ -621,7 +624,9 @@ def load_model():
                     # single all reduce to save time, avoiding per metric all reduce
                     global_metrics_tensor = accelerator.reduce(local_metrics.metrics, reduction="mean")
                     global_metrics_tensor /= args.gradient_accumulation_steps * args.logging_steps
-                    global_metrics_tensor[local_metrics.names2idx["token_count"]] *= accelerator.num_processes
+                    global_metrics_tensor[local_metrics.names2idx["token_count"]] *= (
+                        accelerator.num_processes * args.gradient_accumulation_steps * args.logging_steps
+                    )
                     global_metrics = {
                         name: global_metrics_tensor[index].item() for name, index in local_metrics.names2idx.items()
                     }
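The sampler change in the diff above matters because `shuffle=True` seeds its internal sampler from PyTorch's *global* RNG, whose state model loading has already consumed by this point, while an explicitly seeded `RandomSampler` is pinned to the seed. A standalone sketch with a toy dataset (illustrative only, not the training code):

```python
# Sketch: a RandomSampler built from a manually seeded torch.Generator
# produces the same iteration order regardless of how much global RNG
# state was consumed beforehand (e.g. by model initialization).
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(10))

def first_epoch_order(seed: int) -> list[int]:
    sampler = RandomSampler(dataset, generator=torch.Generator().manual_seed(seed))
    loader = DataLoader(dataset, sampler=sampler, batch_size=2)
    return [int(x) for batch in loader for x in batch[0]]

torch.randn(1000)  # consume global RNG state, as model loading would
order_a = first_epoch_order(123)
torch.randn(37)    # consume a different amount of global state
order_b = first_epoch_order(123)

assert order_a == order_b  # seeded sampler is unaffected by global RNG
```

This is why the earlier "reseed torch RNG before DataLoader creation" attempt was unnecessary once the sampler carried its own generator.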

open_instruct/dpo_utils.py

Lines changed: 106 additions & 11 deletions
@@ -873,6 +873,72 @@ def concatenated_inputs(batch: dict[str, list | torch.Tensor]) -> dict[str, torc
     return concatenated_batch


+def unpack_to_padded(
+    packed_logits: torch.Tensor, cu_doc_lens: torch.Tensor, batch_size: int, max_seq_len: int, pad_value: float = 0.0
+) -> torch.Tensor:
+    """Unpack packed logits back to padded format (batch_size, max_seq_len, vocab_size).
+
+    Args:
+        packed_logits: Packed logits of shape (1, total_tokens, vocab_size).
+        cu_doc_lens: Cumulative document lengths of shape (batch_size + 1,).
+        batch_size: Number of sequences in the batch.
+        max_seq_len: Maximum sequence length for padding.
+        pad_value: Value to use for padding (default 0.0).
+
+    Returns:
+        Padded logits of shape (batch_size, max_seq_len, vocab_size).
+    """
+    vocab_size = packed_logits.shape[-1]
+    padded = torch.full(
+        (batch_size, max_seq_len, vocab_size), pad_value, dtype=packed_logits.dtype, device=packed_logits.device
+    )
+    splits = cu_doc_lens.diff().tolist()
+    packed_list = torch.split(packed_logits.squeeze(0), splits, dim=0)
+    for i, doc_logits in enumerate(packed_list):
+        padded[i, : doc_logits.shape[0]] = doc_logits
+    return padded
+
+
+def pack_padded_sequences(
+    input_ids: torch.Tensor, labels: torch.Tensor, attention_mask: torch.Tensor
+) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, int]:
+    """Convert padded sequences to packed format with cumulative document lengths.
+
+    This is needed for OLMo-core models which don't support attention_mask but use
+    cu_doc_lens for intra-document attention masking.
+
+    Args:
+        input_ids: Padded input IDs of shape (batch_size, seq_len).
+        labels: Padded labels of shape (batch_size, seq_len).
+        attention_mask: Attention mask of shape (batch_size, seq_len), where 1 indicates
+            valid tokens and 0 indicates padding.
+
+    Returns:
+        Tuple of (packed_input_ids, packed_labels, cu_doc_lens, max_doc_len).
+        - packed_input_ids: Shape (1, total_tokens) with all sequences concatenated.
+        - packed_labels: Shape (1, total_tokens) with all labels concatenated.
+        - cu_doc_lens: Cumulative document lengths of shape (batch_size + 1,).
+        - max_doc_len: Maximum document length in the batch.
+    """
+    batch_size = input_ids.shape[0]
+    seq_lengths = attention_mask.sum(dim=1)
+    max_doc_len = int(seq_lengths.max().item())
+    cu_doc_lens = torch.zeros(batch_size + 1, dtype=torch.int32, device=input_ids.device)
+    cu_doc_lens[1:] = seq_lengths.cumsum(dim=0)
+
+    packed_input_ids_list = []
+    packed_labels_list = []
+    for i in range(batch_size):
+        length = seq_lengths[i].item()
+        packed_input_ids_list.append(input_ids[i, :length])
+        packed_labels_list.append(labels[i, :length])
+
+    packed_input_ids = torch.cat(packed_input_ids_list, dim=0).unsqueeze(0)
+    packed_labels = torch.cat(packed_labels_list, dim=0).unsqueeze(0)
+
+    return packed_input_ids, packed_labels, cu_doc_lens, max_doc_len
+
+
 def concatenated_forward(
     model: nn.Module,
     batch: dict[str, list | torch.Tensor],
@@ -905,6 +971,7 @@ def concatenated_forward(
         for k, v in concatenated_batch.items()
         if k.startswith("concatenated_") and not k.endswith("labels")
     }
+
     if output_router_logits:
         outputs = model(**inputs, output_router_logits=True)
         logits = outputs.logits.to(torch.float32)
@@ -1023,25 +1090,41 @@ def concatenated_forward_olmo(
         Tuple of (chosen_logps, rejected_logps, aux_loss). aux_loss is always None for OLMo-core.
     """
     del output_router_logits
+    bs = batch["chosen_input_ids"].shape[0]
+
     if not packing:
         concatenated_batch = concatenated_inputs(batch)
-    else:
-        concatenated_batch, bs = pf_concatenated_inputs(batch)
+        packed_input_ids, packed_labels, cu_doc_lens, max_doc_len = pack_padded_sequences(
+            concatenated_batch["concatenated_input_ids"],
+            concatenated_batch["concatenated_labels"],
+            concatenated_batch["concatenated_attention_mask"],
+        )

-    logits = model(concatenated_batch["concatenated_input_ids"]).to(torch.float32)
+        doc_lens = cu_doc_lens.diff()
+        packed_logits = model(packed_input_ids, doc_lens=doc_lens, max_doc_lens=[max_doc_len]).to(torch.float32)
+
+        batch_size = concatenated_batch["concatenated_input_ids"].shape[0]
+        max_seq_len = concatenated_batch["concatenated_input_ids"].shape[1]
+        logits = unpack_to_padded(packed_logits, cu_doc_lens, batch_size, max_seq_len)

-    if not packing:
         all_logps = _get_batch_logps(
             logits, concatenated_batch["concatenated_labels"], average_log_prob=average_log_prob
         )
-        bs = batch["chosen_input_ids"].shape[0]
     else:
+        concatenated_batch, bs = pf_concatenated_inputs(batch)
+        cu_doc_lens_packing = concatenated_batch["concatenated_cu_seq_lens_k"]
+        doc_lens_packing = cu_doc_lens_packing.diff()
+        max_doc_len_packing = concatenated_batch["concatenated_max_length_k"]
+        logits = model(
+            concatenated_batch["concatenated_input_ids"], doc_lens=doc_lens_packing, max_doc_lens=[max_doc_len_packing]
+        ).to(torch.float32)
         all_logps = pf_get_batch_logps(
             logits,
             concatenated_batch["concatenated_labels"],
             concatenated_batch["concatenated_cu_seq_lens_k"],
             average_log_prob=average_log_prob,
         )
+
     chosen_logps = all_logps[:bs]
     rejected_logps = all_logps[bs:]
     return chosen_logps, rejected_logps, None
@@ -1069,17 +1152,29 @@ def separate_forward_olmo(
     """
     del output_router_logits
     chosen_batch = process_batch(batch, "chosen")
-    chosen_logits = model(chosen_batch["input_ids"]).to(torch.float32)
-
+    packed_input_ids, _, cu_doc_lens, max_doc_len = pack_padded_sequences(
+        chosen_batch["input_ids"], chosen_batch["labels"], chosen_batch["attention_mask"]
+    )
+    doc_lens = cu_doc_lens.diff()
+    packed_logits = model(packed_input_ids, doc_lens=doc_lens, max_doc_lens=[max_doc_len]).to(torch.float32)
+    batch_size = chosen_batch["input_ids"].shape[0]
+    max_seq_len = chosen_batch["input_ids"].shape[1]
+    chosen_logits = unpack_to_padded(packed_logits, cu_doc_lens, batch_size, max_seq_len)
     chosen_logps = _get_batch_logps(chosen_logits, chosen_batch["labels"], average_log_prob=average_log_prob)
-    del chosen_batch, chosen_logits
+    del chosen_batch, chosen_logits, packed_input_ids, packed_logits
    torch.cuda.empty_cache()

     rejected_batch = process_batch(batch, "rejected")
-    rejected_logits = model(rejected_batch["input_ids"]).to(torch.float32)
-
+    packed_input_ids, _, cu_doc_lens, max_doc_len = pack_padded_sequences(
+        rejected_batch["input_ids"], rejected_batch["labels"], rejected_batch["attention_mask"]
+    )
+    doc_lens = cu_doc_lens.diff()
+    packed_logits = model(packed_input_ids, doc_lens=doc_lens, max_doc_lens=[max_doc_len]).to(torch.float32)
+    batch_size = rejected_batch["input_ids"].shape[0]
+    max_seq_len = rejected_batch["input_ids"].shape[1]
+    rejected_logits = unpack_to_padded(packed_logits, cu_doc_lens, batch_size, max_seq_len)
     rejected_logps = _get_batch_logps(rejected_logits, rejected_batch["labels"], average_log_prob=average_log_prob)
-    del rejected_batch, rejected_logits
+    del rejected_batch, rejected_logits, packed_input_ids, packed_logits
    torch.cuda.empty_cache()

     return chosen_logps, rejected_logps, None
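The two helpers added above form a round trip: pad → pack (for the OLMo-core forward, which wants one packed row plus document lengths) → unpack (to restore the padded layout that `_get_batch_logps` expects). The sketch below copies the helper bodies from the diff (docstrings trimmed) and runs them on a toy batch:

```python
# Round-trip check of the pack/unpack helpers from the diff above
# (bodies copied from the diff; the toy batch is illustrative).
import torch

def pack_padded_sequences(input_ids, labels, attention_mask):
    batch_size = input_ids.shape[0]
    seq_lengths = attention_mask.sum(dim=1)
    max_doc_len = int(seq_lengths.max().item())
    cu_doc_lens = torch.zeros(batch_size + 1, dtype=torch.int32, device=input_ids.device)
    cu_doc_lens[1:] = seq_lengths.cumsum(dim=0)
    packed_input_ids = torch.cat([input_ids[i, : seq_lengths[i]] for i in range(batch_size)]).unsqueeze(0)
    packed_labels = torch.cat([labels[i, : seq_lengths[i]] for i in range(batch_size)]).unsqueeze(0)
    return packed_input_ids, packed_labels, cu_doc_lens, max_doc_len

def unpack_to_padded(packed_logits, cu_doc_lens, batch_size, max_seq_len, pad_value=0.0):
    vocab_size = packed_logits.shape[-1]
    padded = torch.full((batch_size, max_seq_len, vocab_size), pad_value,
                        dtype=packed_logits.dtype, device=packed_logits.device)
    for i, doc in enumerate(torch.split(packed_logits.squeeze(0), cu_doc_lens.diff().tolist())):
        padded[i, : doc.shape[0]] = doc
    return padded

# Toy batch: two sequences of real lengths 3 and 2, padded to length 4.
input_ids = torch.tensor([[5, 6, 7, 0], [8, 9, 0, 0]])
labels = torch.tensor([[5, 6, 7, -100], [8, 9, -100, -100]])
attention_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])

packed_ids, packed_labels, cu_doc_lens, max_doc_len = pack_padded_sequences(input_ids, labels, attention_mask)
assert packed_ids.tolist() == [[5, 6, 7, 8, 9]]  # padding dropped, docs concatenated
assert cu_doc_lens.tolist() == [0, 3, 5]         # document boundaries for the model
assert max_doc_len == 3

# Pretend the model returned per-token logits for the packed row (vocab=2),
# then restore the (batch, seq, vocab) layout expected by _get_batch_logps.
packed_logits = torch.arange(10, dtype=torch.float32).reshape(1, 5, 2)
padded = unpack_to_padded(packed_logits, cu_doc_lens, batch_size=2, max_seq_len=4)
assert padded.shape == (2, 4, 2)
assert torch.equal(padded[1, :2], packed_logits[0, 3:5])  # second doc restored
assert torch.equal(padded[0, 3], torch.zeros(2))          # padding positions stay 0
```

The `cu_doc_lens` boundaries are what give OLMo-core intra-document attention masking: tokens never attend across a boundary, which is the same effect a padding `attention_mask` achieves in the HuggingFace path.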

open_instruct/olmo_core_train_modules.py

Lines changed: 0 additions & 3 deletions
@@ -107,9 +107,6 @@ def __init__(
        self._forward_kwargs["packing"] = True

    def pre_train(self):
-        # Override to skip batch size validation from TransformerTrainModule.
-        # DPO processes 2x sequences per batch (chosen + rejected), so the parent's
-        # validation (global_batch_size % rank_microbatch_size == 0) would fail.
        pass

    def _compute_microbatch_loss(self, micro_batch: dict[str, Any]) -> tuple[torch.Tensor, dict[str, torch.Tensor]]:

open_instruct/test_data_loader_gpu.py

Lines changed: 3 additions & 4 deletions
@@ -2,7 +2,6 @@
 import unittest

 import datasets
-import numpy as np
 import parameterized
 import torch

@@ -92,9 +91,9 @@ def test_multi_rank_sampling(self, name, dp_world_size):
            union |= indices
        total_batches = num_examples // batch_size
        usable_size = total_batches * batch_size
-        rng = np.random.default_rng(42)
-        shuffled = np.arange(num_examples)
-        rng.shuffle(shuffled)
+        generator = torch.Generator()
+        generator.manual_seed(42)
+        shuffled = torch.randperm(num_examples, generator=generator).numpy()
        expected_indices = set(shuffled[:usable_size].tolist())
        self.assertEqual(union, expected_indices)

open_instruct/test_dpo_utils_gpu.py

Lines changed: 3 additions & 1 deletion
@@ -80,7 +80,9 @@ def __init__(self, vocab_size: int = 1000):
        self.embed = torch.nn.Embedding(vocab_size, 64)
        self.linear = torch.nn.Linear(64, vocab_size)

-    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
+    def forward(
+        self, input_ids: torch.Tensor, doc_lens: torch.Tensor | None = None, max_doc_lens: list[int] | None = None
+    ) -> torch.Tensor:
        return self.linear(self.embed(input_ids))

scripts/train/debug/dpo/single_gpu.sh

Lines changed: 2 additions & 2 deletions
@@ -19,8 +19,8 @@ uv run python mason.py \
    --model_name_or_path allenai/OLMo-2-0425-1B \
    --tokenizer_name_or_path allenai/OLMo-2-0425-1B \
    --max_seq_length 1024 \
-    --per_device_train_batch_size 1 \
-    --gradient_accumulation_steps 4 \
+    --per_device_train_batch_size 4 \
+    --gradient_accumulation_steps 1 \
    --learning_rate 5e-07 \
    --lr_scheduler_type linear \
    --warmup_ratio 0.1 \

0 commit comments