Releases: allenai/open-instruct
v0.2.0
Notable changelog items:
- Fixed a major bug where ZeRO-2 was discarding gradients.
- We now support training with an `RLEnvironment` abstraction, including code-execution sandboxing.
- The OLMo-core DPO implementation is ready to go, with support for TP, packing, and `torch.compile`. It should be much faster than the previous HF implementation.
- We have a model merging implementation that runs on Beaker (for internal Ai2 users).
- GRPO now prevents the actors from running too far ahead of the learner during training when the actors are much faster than the learner.
Fixed
- Fix ZeRO-2 discarding gradients during manual gradient accumulation by using `set_gradient_accumulation_boundary()` (#1498).
Added
- Clean up OLMo 3.X tokenizer docs: clarify the think SFT tokenization workaround, add a dev/release tokenizer matrix, create `allenai/olmo-3-tokenizer-instruct-release` (#1487).
- Add a Docker sandbox backend and `GenericSandboxEnv` environment for code execution during RL training. `DockerBackend` with command timeout, configurable memory limits, `put_archive`/`get_archive` file I/O, and `remove=True` auto-cleanup. `GenericSandboxEnv` provides `execute_bash` (stateful bash with env/cwd persistence) and `str_replace_editor` (view/create/str_replace/insert with correct line numbering). Configurable penalty, image, and memory via `GenericSandboxEnvConfig`. Includes a 1-GPU debug script (#1490).
- `TextRLEnvironment` base class for text-based RL environments: model output is passed as a plain string instead of parsed tool calls, with the response formatted using the parser's `role_template`. Includes a `WordleTextEnv` example, role-aware `format_tool_outputs` in all parsers, shadow tool call dispatch in `process_request`, and a 1-GPU debug script (#1489).
- Wire RL environments into the vLLM generation loop and preprocessing: unified tool/env system with a single `TOOL_REGISTRY`, pooled actors via a shared `EnvironmentPool` Ray actor (async acquire/release, auto-sized to rollout concurrency), `RolloutState` tracks all per-rollout state, `PassthroughVerifier` + `RewardAggregator` for per-turn rewards (verifier score folded into the last turn before aggregation), `BaseEnvConfig` in `environments/base.py`, `--max_steps` unified, `--pool_size` configurable, auto-discovery of tools from datasets, 1-GPU debug scripts for counter/guess_number envs (#1479).
- RL environment abstraction: `RLEnvironment` base class with `Tool` as a subclass, unifying tools and environments under a single `step(EnvCall) -> StepResult` interface. Removes the `Executable`/`EnvOutput`/`_execute`/`safe_execute` indirection. Moves tools under `open_instruct/environments/tools/`. Includes example environments (`CounterEnv`, `GuessNumberEnv`) (#1478).
- Enable packing with `torch.compile` for DPO training, fix a `cu_seq_lens` offset bug for padded chosen/rejected sequences, add a `tokens_per_second_per_gpu` metric (#1466).
- Model merging scripts for Beaker: `mergekit_merge.sh` (mergekit) and `direct_merge.sh` (direct safetensors averaging for hybrid models) (#1459).
- Production DPO script for the OLMo3-7B hybrid (#1449).
- Gradient accumulation/microbatching support for OLMo-core DPO training (#1447).
- Evolving rubrics support with RubricVerifier and utility functions for GRPO training (#1460).
- New perf metrics in PerfCallback: total_tokens, data_loading_seconds, data_loading_pct, wall_clock_per_step, step_overhead_pct (#1457).
- Warning when eval prompts are queuing up (new eval round starts before the previous one completes) (#1461).
- OLMo 3 tokenizer settings documentation covering chat template decisions for Instruct and Think models (#1455).
- torch.compile support for OLMo-core DPO training (#1445).
- Adds a `GRPOTrainModule` as part of the OLMo-core migration (#1412).
- FSDP shard_degree and num_replicas configuration for OLMo-core DPO training (#1446).
- Budget mode gradient checkpointing support for OLMo-core DPO training (#1444).
- PerfCallback for MFU metrics in OLMo-core DPO training (#1442).
- NVIDIA H200 GPU support in `GPU_SPECS` (#1441).
- Documentation and a runtime warning for the `dataset_mixer_list` format (float = proportion, int = count) (#1434).
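The `dataset_mixer_list` convention above (float weight = proportion of a dataset, int weight = absolute sample count) can be sketched roughly as follows. This is an illustrative helper, not the actual open-instruct parsing code; the function and variable names are hypothetical.

```python
def resolve_mixer_counts(mixer_list, dataset_sizes):
    """mixer_list alternates [name, weight, name, weight, ...];
    weights may be given as strings or numbers (hypothetical helper)."""
    counts = {}
    for name, weight in zip(mixer_list[::2], mixer_list[1::2]):
        s = str(weight)
        if "." in s:
            # float => proportion of the dataset's total size
            counts[name] = int(float(s) * dataset_sizes[name])
        else:
            # int => absolute sample count
            counts[name] = int(s)
    return counts
```

With this reading, `["ds_a", "0.5", "ds_b", "100"]` takes half of `ds_a` and exactly 100 samples of `ds_b`, which is the ambiguity the runtime warning in #1434 is meant to surface.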
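The unified `step(EnvCall) -> StepResult` interface from the RL environment abstraction (#1478) can be illustrated with a minimal sketch. These are simplified stand-in classes, not the actual open-instruct definitions; only the `step` signature and the `Tool`-as-environment relationship follow the release notes.

```python
from dataclasses import dataclass

@dataclass
class EnvCall:
    """A single call into an environment (simplified stand-in)."""
    name: str
    args: dict

@dataclass
class StepResult:
    """What an environment returns per step (simplified stand-in)."""
    output: str
    reward: float = 0.0
    done: bool = False

class RLEnvironment:
    def step(self, call: EnvCall) -> StepResult:
        raise NotImplementedError

class Tool(RLEnvironment):
    """A tool is just a stateless environment: one step, always done."""

class CounterEnv(RLEnvironment):
    """Toy stateful env: reward 1.0 once the counter reaches a target."""
    def __init__(self, target: int = 3):
        self.count, self.target = 0, target

    def step(self, call: EnvCall) -> StepResult:
        if call.name == "increment":
            self.count += 1
        done = self.count >= self.target
        return StepResult(output=f"count={self.count}",
                          reward=float(done), done=done)
```

The point of the unification is that the rollout loop only ever calls `step`, regardless of whether the callee is a one-shot tool or a multi-turn environment.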
Changed
- Bound async data preparation to stay within `async_steps` of training, preventing training data from getting too far out of sync with the trainer (#1496).
- Refactor the Legacy and DRTulu tool parsers to use OpenAI-format `tool_definitions` instead of Ray `tool_actors`. Removes `import ray` from `parsers.py`, fixes the DRTulu parser, which was broken after the pool refactor, and fixes a `--tool_parser_type` typo in the dr_tulu debug script (#1491).
- Replaces lambda collators with a `single_example_collator` (#1472).
- Clarified the `activation_memory_budget` guidance in DPO utils with a practical default (0.5) and memory/speed tradeoff notes (#1460).
- Let `TransformerTrainModule` handle FSDP parallelism instead of applying it manually in DPO (#1458).
- Refactored `DPOTrainModule` to inherit from `TransformerTrainModule` (#1456).
- Increased vLLM health check timeout from 30s to 600s (10 minutes) (#1452).
- Updated the vLLM version to 0.14.1 (#1433).
- Changed the default wandb x-axis from `episode` to `training_step` for grpo_fast (#1437).
- Made a number of changes to `dpo.py` so it matches `dpo_tune_cache.py` exactly (#1451).
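Bounding async data preparation to `async_steps` of training (#1496) is, in essence, a bounded producer/consumer handoff: the preparation thread blocks once it gets a fixed number of batches ahead. A minimal sketch of that pattern, using only the standard library (this is the general technique, not open-instruct's actual data loader):

```python
import queue
import threading

ASYNC_STEPS = 2  # producer may run at most this many batches ahead

def prepare_batches(out_q: queue.Queue, n_batches: int) -> None:
    """Producer: put() blocks once the queue holds ASYNC_STEPS items,
    so data prep can never outrun the trainer by more than that."""
    for i in range(n_batches):
        out_q.put(f"batch-{i}")

q = queue.Queue(maxsize=ASYNC_STEPS)
t = threading.Thread(target=prepare_batches, args=(q, 5), daemon=True)
t.start()

# Trainer: each get() frees a slot, letting the producer advance one step.
consumed = [q.get() for _ in range(5)]
t.join()
```

The same blocking-queue idea also explains the notable-changes item about GRPO actors no longer running excessively ahead of the learner.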
Fixed
- Updated the previous weight sync thread fix to only be active when `inflight_updates=False`, removing an issue with weight sync updates stalling (#1499).
- Fixed a weight sync thread hang when `inflight_updates=False`: wait for all vLLM `engine.update_weight` RPCs to complete before unpausing actors, preventing `health_check_fn` from blocking indefinitely (#1480).
- Fixed the `nodes_needed` calculation in the `grpo_fast` `kv_cache_max_concurrency` warning by using `math.ceil()` instead of floor division to avoid undercounting required inference nodes (#1474).
- Fixed `eval_on_step_0` never triggering in `grpo_fast` because it was gated behind the `training_step % local_eval_every == 0` modulo check; also guard `local_eval_every <= 0` to prevent accidental every-step eval or `ZeroDivisionError` (#1485).
- Fixed a `TypeError` in `pack_padded_sequences` when `attention_mask` is a float tensor, and vectorized the packing to avoid per-sequence host-device synchronizations (#1486).
- Fixed a silent prompt/ground-truth mismatch in RLVR caused by a redundant dataset shuffle desyncing the `"index"` column from positional indices, leading to wrong rewards and wrong `exclude_index` exclusions (#1484).
- Fixed the test `single_example_collator` returning a raw int for index, causing a `TypeError` in `_iter_batches` (#1477).
- Fixed an SFT integration test failing due to a missing `--try_launch_beaker_eval_jobs false` flag (#1470).
- Fixed a checkpoint cleanup race condition on shared filesystems by using `ignore_errors=True` and restricting cleanup to global rank 0 (#1468).
- Fixed checkpoint resume failing on Beaker retries by removing the non-deterministic timestamp from `exp_name` (#1468).
- Fixed the MFU calculation to count LM head FLOPs per token (#1457).
- Fixed a training hang when `inflight_updates` is disabled by waiting for weight sync to complete before the health check (#1454).
- Fixed evaluation responses being lost on timeout in grpo_fast by requeuing partial results (#1439).
- Beaker Experiment Launch now passes (#1424).
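The `nodes_needed` fix (#1474) comes down to rounding direction: floor division undercounts whenever the required GPU count is not an exact multiple of the GPUs per node. A small illustration (the numbers here are hypothetical, not taken from the actual warning):

```python
import math

gpus_needed, gpus_per_node = 9, 8

# Buggy: floor division drops the partial node, suggesting 1 node
# even though 9 GPUs cannot fit on a single 8-GPU node.
buggy_nodes = gpus_needed // gpus_per_node

# Fixed: round up so any remainder claims one more node.
fixed_nodes = math.ceil(gpus_needed / gpus_per_node)
```

Any capacity calculation of the form "items / items-per-container" needs the ceiling for the same reason.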
v0.1.0
We have made the following changes to open-instruct:
Added
- Added OLMo-core based DPO training script (#1391).
- Added SLURM scripts for OLMo SFT training with checkpoint resume support and a configurable shuffle seed (#1368).
- Added retry logic with exponential backoff to `make_api_request` for tool API calls (retries on timeouts, connection errors, 429, and 5xx). Also added a configurable `max_concurrency` parameter to tool configs for controlling Ray actor concurrency per tool (#1388).
- Added support for generic MCP tools during training, with some limitations (no changing tools, no tool discovery during training). For details, see #1384.
- Added the ability to set active tools on a per-sample basis. See the PR for more details: #1382
- Added a new changelog GitHub Action that makes sure you contribute to the changelog! (#1276)
- Now, we type check `open_instruct/dataset_transformation.py` (#1390).
- Added a linter rule that imports go at the top of the file (#1394).
- Refactors the GRPO config into a `grpo_utils.py` file in preparation for the OLMo-core implementation (#1396).
- Now, we save the generated rollouts to disk during RL when the `--save_traces` flag is passed (#1406).
- Pulls out the weight sync code from GRPO into a more generic function (#1411).
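The retry-with-exponential-backoff behavior described for `make_api_request` (#1388) follows a standard pattern. Here is a hedged sketch of that pattern; the function name, retry policy, and delays below are illustrative assumptions, not open-instruct's actual implementation:

```python
import time

# Statuses treated as retryable per the changelog: 429 and 5xx.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}

def request_with_backoff(send, max_retries=4, base_delay=0.01):
    """Call `send()` (returns (status, body) or raises), retrying on
    timeouts, connection errors, 429, and 5xx with doubling delays."""
    for attempt in range(max_retries + 1):
        try:
            status, body = send()
            if status not in RETRYABLE_STATUSES:
                return status, body
        except (TimeoutError, ConnectionError):
            pass  # treated like a retryable failure
        if attempt < max_retries:
            time.sleep(base_delay * (2 ** attempt))  # 10ms, 20ms, 40ms, ...
    raise RuntimeError("all retries exhausted")
```

Production variants usually add jitter to the delay so many concurrent callers do not retry in lockstep.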
Changed
- Updated library versions via `uv lock --upgrade` (#1400).
- Now, `large_test_script.sh` exercises the `tp > 1` code path (#1413).
Fixed
- Added automatic type coercion for tool arguments via `safe_execute()`, preventing crashes when models send wrong types (e.g., bool instead of string) (#1418).
- Fixed an argparse conflict error for `--save_traces` by removing duplicate field definitions from `StreamingDataLoaderConfig` (#1416).
- Increased `MetricsTracker` max_metrics from 64 to 512 to fix `ValueError: Exceeded maximum number of metrics` when training with many tools or verifier functions (#1415).
- Fixed a JSON serialization error in `LocalDatasetTransformationCache.save_config` when caching datasets locally (#1402).
- Now, we can support PRs from external contributors while still maintaining security for internal tokens (#1408).
- Improved error handling for tool calls with missing/invalid arguments - now returns a clear error message instead of crashing (#1404).
- Fixed a `GenerationConfig` validation error when saving OLMo-3 models: the config is now set after unwrapping the model, and OLMo-3 is detected from both `chat_template_name` and the model name (#1404).
- Fixed the benchmark so that it runs (#1401).
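The automatic type coercion for tool arguments (#1418) can be sketched as a small normalization pass run before dispatching a tool call. This is an illustrative stand-in, not the actual `safe_execute()` logic; the helper name and coercion rules are assumptions:

```python
def coerce_args(args: dict, expected_types: dict) -> dict:
    """Coerce model-supplied argument values toward the types a tool
    expects, instead of crashing on a mismatch (hypothetical helper)."""
    coerced = {}
    for key, value in args.items():
        want = expected_types.get(key)
        if want is str and not isinstance(value, str):
            coerced[key] = str(value)  # e.g. a model sent bool for a string arg
        elif want is bool and isinstance(value, str):
            coerced[key] = value.lower() in ("true", "1", "yes")
        elif want is not None and not isinstance(value, want):
            coerced[key] = want(value)  # e.g. "3" -> 3 for an int arg
        else:
            coerced[key] = value  # already the right type, or no schema info
    return coerced
```

Coercing at the boundary keeps the individual tools free of defensive type checks, which matches the changelog's goal of returning useful output instead of crashing on malformed model calls.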