Notable changelog items:
- Fixed a major bug where ZeRO-2 was discarding gradients.
- We now support training with an `RLEnvironment` abstraction, including code-execution sandboxing.
- The OLMo-core DPO implementation is ready to go, with support for TP, packing, and `torch.compile`. It should be much faster than the previous HF implementation.
- We have a model merging implementation that runs on Beaker (for internal Ai2 users).
- GRPO now prevents the actors from running too far ahead of the learner during training when the actors are much faster than the learner.
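Conceptually, the GRPO actor/learner bound is back-pressure: the learner hands out at most `async_steps` permits at a time. A minimal sketch with a counting semaphore (the names and structure here are assumptions for illustration, not the repo's implementation):

```python
import threading

class AsyncStepBound:
    """Illustrative sketch: keeps actors at most `async_steps` rollouts
    ahead of the learner via a counting semaphore."""

    def __init__(self, async_steps: int):
        self._sem = threading.Semaphore(async_steps)

    def before_rollout(self) -> bool:
        # Actors call this before generating; returns False when they are
        # already async_steps ahead and must wait for the learner.
        return self._sem.acquire(blocking=False)

    def after_train_step(self) -> None:
        # The learner releases one permit per completed training step,
        # letting one more rollout proceed.
        self._sem.release()
```

With `async_steps=2`, two rollouts can start immediately; a third is held back until the learner completes a training step.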
### Fixed
- Fix ZeRO-2 discarding gradients during manual gradient accumulation by using `set_gradient_accumulation_boundary()` (#1498).
### Added
- Clean up OLMo 3.X tokenizer docs: clarify the think SFT tokenization workaround, add a dev/release tokenizer matrix, create `allenai/olmo-3-tokenizer-instruct-release` (#1487).
- Add Docker sandbox backend and `GenericSandboxEnv` environment for code execution during RL training. `DockerBackend` comes with command timeout, configurable memory limits, `put_archive`/`get_archive` file I/O, and `remove=True` auto-cleanup. `GenericSandboxEnv` provides `execute_bash` (stateful bash with env/cwd persistence) and `str_replace_editor` (view/create/str_replace/insert with correct line numbering). Penalty, image, and memory are configurable via `GenericSandboxEnvConfig`. Includes a 1-GPU debug script (#1490).
- `TextRLEnvironment` base class for text-based RL environments: model output is passed as a plain string instead of parsed tool calls, with the response formatted using the parser's `role_template`. Includes a `WordleTextEnv` example, role-aware `format_tool_outputs` in all parsers, shadow tool call dispatch in `process_request`, and a 1-GPU debug script (#1489).
- Wire RL environments into the vLLM generation loop and preprocessing: unified tool/env system with a single `TOOL_REGISTRY`, pooled actors via a shared `EnvironmentPool` Ray actor (async acquire/release, auto-sized to rollout concurrency), `RolloutState` tracking all per-rollout state, `PassthroughVerifier` + `RewardAggregator` for per-turn rewards (verifier score folded into the last turn before aggregation), `BaseEnvConfig` in `environments/base.py`, a unified `--max_steps`, a configurable `--pool_size`, auto-discovery of tools from datasets, and 1-GPU debug scripts for the counter/guess_number envs (#1479).
- RL environment abstraction: `RLEnvironment` base class with `Tool` as a subclass, unifying tools and environments under a single `step(EnvCall) -> StepResult` interface. Removes the `Executable`/`EnvOutput`/`_execute`/`safe_execute` indirection. Moves tools under `open_instruct/environments/tools/`. Includes example environments (`CounterEnv`, `GuessNumberEnv`) (#1478).
- Enable packing with torch.compile for DPO training, fix a `cu_seq_lens` offset bug for padded chosen/rejected sequences, and add a `tokens_per_second_per_gpu` metric (#1466).
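The unified `RLEnvironment` interface can be sketched as follows. Only the `step(EnvCall) -> StepResult` shape and the `CounterEnv` name come from the changelog; every field and behavior below is a hypothetical illustration, not the repo's actual code:

```python
from dataclasses import dataclass, field

@dataclass
class EnvCall:
    # Hypothetical shape: a named call with keyword arguments.
    name: str
    args: dict = field(default_factory=dict)

@dataclass
class StepResult:
    # Hypothetical shape: observation text plus reward/termination signal.
    observation: str
    reward: float = 0.0
    done: bool = False

class RLEnvironment:
    def step(self, call: EnvCall) -> StepResult:
        raise NotImplementedError

class CounterEnv(RLEnvironment):
    """Toy environment: reward 1.0 once the counter reaches a target."""

    def __init__(self, target: int = 3):
        self.count = 0
        self.target = target

    def step(self, call: EnvCall) -> StepResult:
        if call.name == "increment":
            self.count += 1
        done = self.count >= self.target
        return StepResult(
            observation=f"count={self.count}",
            reward=1.0 if done else 0.0,
            done=done,
        )
```

Because tools subclass the same base, a rollout loop only ever dispatches through `step`, which is what removes the `Executable`/`EnvOutput` indirection.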
- Model merging scripts for Beaker: `mergekit_merge.sh` (mergekit) and `direct_merge.sh` (direct safetensors averaging for hybrid models) (#1459).
- Production DPO script for OLMo3-7B hybrid (#1449).
- Gradient accumulation/microbatching support for OLMo-core DPO training (#1447).
- Evolving rubrics support with RubricVerifier and utility functions for GRPO training (#1460).
- New perf metrics in PerfCallback: total_tokens, data_loading_seconds, data_loading_pct, wall_clock_per_step, step_overhead_pct (#1457).
- Warning when eval prompts are queuing up (new eval round starts before the previous one completes) (#1461).
- OLMo 3 tokenizer settings documentation covering chat template decisions for Instruct and Think models (#1455).
- torch.compile support for OLMo-core DPO training (#1445).
- Adds a GRPOTrainModule as part of the OLMo-core migration (#1412).
- FSDP shard_degree and num_replicas configuration for OLMo-core DPO training (#1446).
- Budget mode gradient checkpointing support for OLMo-core DPO training (#1444).
- PerfCallback for MFU metrics in OLMo-core DPO training (#1442).
- NVIDIA H200 GPU support in `GPU_SPECS` (#1441).
- Documentation and runtime warning for the `dataset_mixer_list` format (float = proportion, int = count) (#1434).
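The float-vs-int convention can be made concrete with a toy resolver. This is a hypothetical helper for illustration, not the repo's code; in particular, whether counts above the dataset size cap or upsample is an assumption here:

```python
def resolve_mix_size(dataset_size: int, weight) -> int:
    """Interpret a dataset_mixer_list weight:
    float -> proportion of the dataset, int -> absolute example count."""
    if isinstance(weight, bool):
        # bool is a subclass of int; reject it explicitly.
        raise TypeError("weight must be a float proportion or an int count")
    if isinstance(weight, float):
        return int(dataset_size * weight)   # e.g. 0.5 -> half the dataset
    if isinstance(weight, int):
        # Assumption: counts larger than the dataset are capped, not upsampled.
        return min(weight, dataset_size)    # e.g. 1000 -> 1000 examples
    raise TypeError(f"unsupported weight type: {type(weight)!r}")
```

The subtle failure mode the runtime warning guards against: `1.0` (all of the dataset) and `1` (a single example) look similar in a config but mean very different things.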
### Changed
- Bound async data preparation to stay within `async_steps` of training, preventing training data from getting too far out of sync with the trainer (#1496).
- Refactor Legacy and DRTulu tool parsers to use OpenAI-format `tool_definitions` instead of Ray `tool_actors`. Removes `import ray` from `parsers.py`, fixes the DRTulu parser (broken after the pool refactor), and fixes a `--tool_parser_type` typo in the dr_tulu debug script (#1491).
- Replace lambda collators with a `single_example_collator` (#1472).
- Clarified `activation_memory_budget` guidance in DPO utils with a practical default (0.5) and memory/speed tradeoff notes (#1460).
- Let TransformerTrainModule handle FSDP parallelism instead of manual application in DPO (#1458).
- Refactored DPOTrainModule to inherit from TransformerTrainModule (#1456).
- Increased vLLM health check timeout from 30s to 600s (10 minutes) (#1452).
- Updated vLLM version to 0.14.1 (#1433).
- Changed the default wandb x-axis from `episode` to `training_step` for grpo_fast (#1437).
- Updated `dpo.py` to match `dpo_tune_cache.py` exactly (#1451).
### Fixed
- Updated the previous weight sync thread fix to only be active when `inflight_updates=False`, removing an issue with weight sync updates stalling (#1499).
- Fixed weight sync thread hang when `inflight_updates=False`: wait for all vLLM `engine.update_weight` RPCs to complete before unpausing actors, preventing `health_check_fn` from blocking indefinitely (#1480).
- Fixed the `nodes_needed` calculation in the `grpo_fast` `kv_cache_max_concurrency` warning by using `math.ceil()` instead of floor division, to avoid undercounting required inference nodes (#1474).
- Fixed `eval_on_step_0` never triggering in `grpo_fast` because it was gated behind the `training_step % local_eval_every == 0` modulo check; also guard `local_eval_every <= 0` to prevent accidental every-step eval or a `ZeroDivisionError` (#1485).
- Fixed a `TypeError` in `pack_padded_sequences` when `attention_mask` is a float tensor, and vectorized the packing to avoid per-sequence host-device synchronizations (#1486).
- Fixed a silent prompt/ground-truth mismatch in RLVR caused by a redundant dataset shuffle desyncing the `"index"` column from positional indices, leading to wrong rewards and wrong `exclude_index` exclusions (#1484).
- Fixed the test `single_example_collator` returning a raw int for the index, causing a `TypeError` in `_iter_batches` (#1477).
- Fixed the SFT integration test failing due to a missing `--try_launch_beaker_eval_jobs false` flag (#1470).
- Fixed a checkpoint cleanup race condition on shared filesystems by using `ignore_errors=True` and restricting cleanup to global rank 0 (#1468).
- Fixed checkpoint resume failing on Beaker retries by removing the non-deterministic timestamp from `exp_name` (#1468).
- Fixed the MFU calculation to count LM head FLOPs per token (#1457).
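The LM head fix matters because MFU divides achieved FLOPs by hardware peak, so dropping the head undercounts the numerator. A back-of-the-envelope sketch using the common 6-FLOPs-per-parameter-per-token rule of thumb (illustrative only; this is not the repo's exact formula):

```python
def train_flops_per_token(n_body_params: int, hidden: int, vocab: int) -> int:
    """Approximate training FLOPs per token (forward + backward).

    Rule of thumb: ~6 FLOPs per parameter per token. The LM head is a
    hidden x vocab matmul, so omitting it undercounts by 6 * hidden * vocab
    FLOPs per token.
    """
    body = 6 * n_body_params
    lm_head = 6 * hidden * vocab
    return body + lm_head
```

For example, with `hidden=4096` and a 100k vocabulary the head contributes 6 * 4096 * 100000 ≈ 2.5 GFLOPs per token, a non-trivial slice of the roughly 42 GFLOPs per token of a ~7B body, so ignoring it noticeably skews reported MFU.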
- Fixed a training hang when `inflight_updates` is disabled by waiting for weight sync to complete before the health check (#1454).
- Fixed evaluation responses being lost on timeout in grpo_fast by requeuing partial results (#1439).
- Beaker Experiment Launch now passes (#1424 (review)).