
Releases: allenai/open-instruct

v0.2.0

02 Mar 16:16
aa29e21


Notable changelog items:

  1. Fixed a major bug where ZeRO-2 was discarding gradients.
  2. We now support training with a RLEnvironment abstraction, including code execution sandboxing.
  3. The Olmo-core DPO is ready to go, with support for TP, packing, and torch.compile. This should be much faster than the previous HF implementation.
  4. We have a model merging implementation that runs on Beaker (for internal Ai2 users).
  5. GRPO now prevents the actors from running excessively far ahead of the learner during training when the actors generate rollouts faster than the learner can consume them.

Fixed

  • Fix ZeRO-2 discarding gradients during manual gradient accumulation by using set_gradient_accumulation_boundary() (#1498).
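The fix above hinges on telling the engine which microbatch ends an accumulation window. Below is a minimal sketch of that pattern with a toy engine standing in for DeepSpeed; `ToyEngine` and its fields are hypothetical, and only the method name `set_gradient_accumulation_boundary()` comes from the changelog item.

```python
class ToyEngine:
    """Toy stand-in for a DeepSpeed engine (hypothetical class, for
    illustration only): gradients from non-boundary microbatches must be
    accumulated, not discarded."""

    def __init__(self):
        self.accumulated = 0.0   # running gradient sum
        self.applied = []        # gradients actually applied at boundaries
        self._boundary = True

    def set_gradient_accumulation_boundary(self, is_boundary: bool):
        # The real DeepSpeed method with this name marks whether the next
        # step closes an accumulation window.
        self._boundary = is_boundary

    def backward(self, grad: float):
        # Accumulate between boundaries instead of overwriting/discarding.
        self.accumulated += grad

    def step(self):
        # Only apply (and reset) the accumulated gradient at a boundary.
        if self._boundary:
            self.applied.append(self.accumulated)
            self.accumulated = 0.0


def train_microbatches(engine, grads):
    # Manual gradient accumulation: only the last microbatch is a boundary.
    for i, grad in enumerate(grads):
        engine.set_gradient_accumulation_boundary(i == len(grads) - 1)
        engine.backward(grad)
        engine.step()
```

With three microbatch gradients, only one optimizer update happens, carrying the full sum rather than just the last microbatch's gradient.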

Added

  • Clean up OLMo 3.X tokenizer docs: clarify think SFT tokenization workaround, add dev/release tokenizer matrix, create allenai/olmo-3-tokenizer-instruct-release (#1487).
  • Add Docker sandbox backend and GenericSandboxEnv environment for code execution during RL training. DockerBackend with command timeout, configurable memory limits, put_archive/get_archive file I/O, and remove=True auto-cleanup. GenericSandboxEnv provides execute_bash (stateful bash with env/cwd persistence) and str_replace_editor (view/create/str_replace/insert with correct line numbering). Configurable penalty, image, and memory via GenericSandboxEnvConfig. Includes 1-GPU debug script (#1490).
  • TextRLEnvironment base class for text-based RL environments: model output is passed as a plain string instead of parsed tool calls, with response formatted using the parser's role_template. Includes WordleTextEnv example, role-aware format_tool_outputs in all parsers, shadow tool call dispatch in process_request, and 1-GPU debug script (#1489).
  • Wire RL environments into vLLM generation loop and preprocessing: unified tool/env system with single TOOL_REGISTRY, pooled actors via shared EnvironmentPool Ray actor (async acquire/release, auto-sized to rollout concurrency), RolloutState tracks all per-rollout state, PassthroughVerifier + RewardAggregator for per-turn rewards (verifier score folded into last turn before aggregation), BaseEnvConfig in environments/base.py, --max_steps unified, --pool_size configurable, auto-discovery of tools from datasets, 1-GPU debug scripts for counter/guess_number envs (#1479).
  • RL environment abstraction: RLEnvironment base class with Tool as a subclass, unifying tools and environments under a single step(EnvCall) -> StepResult interface. Removes Executable/EnvOutput/_execute/safe_execute indirection. Moves tools under open_instruct/environments/tools/. Includes example environments (CounterEnv, GuessNumberEnv) (#1478).
  • Enable packing with torch.compile for DPO training, fix cu_seq_lens offset bug for padded chosen/rejected sequences, add tokens_per_second_per_gpu metric (#1466).
  • Model merging scripts for Beaker: mergekit_merge.sh (mergekit) and direct_merge.sh (direct safetensors averaging for hybrid models) (#1459).
  • Production DPO script for OLMo3-7B hybrid (#1449).
  • Gradient accumulation/microbatching support for OLMo-core DPO training (#1447).
  • Evolving rubrics support with RubricVerifier and utility functions for GRPO training (#1460).
  • New perf metrics in PerfCallback: total_tokens, data_loading_seconds, data_loading_pct, wall_clock_per_step, step_overhead_pct (#1457).
  • Warning when eval prompts are queuing up (new eval round starts before the previous one completes) (#1461).
  • OLMo 3 tokenizer settings documentation covering chat template decisions for Instruct and Think models (#1455).
  • torch.compile support for OLMo-core DPO training (#1445).
  • Adds a GRPOTrainModule as part of the Olmo-core migration (#1412).
  • FSDP shard_degree and num_replicas configuration for OLMo-core DPO training (#1446).
  • Budget mode gradient checkpointing support for OLMo-core DPO training (#1444).
  • PerfCallback for MFU metrics in OLMo-core DPO training (#1442).
  • NVIDIA H200 GPU support in GPU_SPECS (#1441).
  • Documentation and runtime warning for dataset_mixer_list format (float=proportion, int=count) (#1434).

Changed

  • Bound async data preparation to stay within async_steps of training, preventing the prepared data from getting too far out of sync with the trainer (#1496).
  • Refactor Legacy and DRTulu tool parsers to use OpenAI-format tool_definitions instead of Ray tool_actors. Removes import ray from parsers.py, fixes DRTulu parser which was broken after the pool refactor, and fixes --tool_parser_type typo in dr_tulu debug script (#1491).
  • Replaces lambda collators with a "single_example_collator" (#1472).
  • Clarified activation_memory_budget guidance in DPO utils with a practical default (0.5) and memory/speed tradeoff notes (#1460).
  • Let TransformerTrainModule handle FSDP parallelism instead of manual application in DPO (#1458).
  • Refactored DPOTrainModule to inherit from TransformerTrainModule (#1456).
  • Increased vLLM health check timeout from 30s to 600s (10 minutes) (#1452).
  • Updated vllm version to 0.14.1 (#1433).
  • Changed default wandb x-axis from episode to training_step for grpo_fast (#1437).
  • Aligned dpo.py with dpo_tune_cache.py so the two scripts match exactly (#1451).
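One item above replaces lambda collators with a "single_example_collator". A plausible minimal sketch is below; only the collator's name comes from the changelog, and the exact field handling is an assumption (chosen so that scalar fields like index come back as length-1 lists, consistent with the related index fix in this release).

```python
def single_example_collator(batch):
    # Hypothetical sketch: collate a batch of exactly one example into a
    # dict of length-1 lists, so downstream code can uniformly iterate
    # over every field (rather than special-casing raw scalars).
    assert len(batch) == 1, "expected a single example per batch"
    example = batch[0]
    return {key: [value] for key, value in example.items()}
```

Replacing anonymous lambdas with a named top-level function like this also makes the collator picklable, which matters for multiprocessing data loaders.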

Fixed

  • Updated previous fix of weight sync thread to only be active when inflight_updates=False, removing an issue with weight sync updates stalling (#1499).
  • Fixed weight sync thread hang when inflight_updates=False: wait for all vLLM engine.update_weight RPCs to complete before unpausing actors, preventing health_check_fn from blocking indefinitely (#1480).
  • Fixed nodes_needed calculation in grpo_fast kv_cache_max_concurrency warning using math.ceil() instead of floor division to avoid undercounting required inference nodes (#1474).
  • Fixed eval_on_step_0 never triggering in grpo_fast because it was gated behind the training_step % local_eval_every == 0 modulo check; also guard local_eval_every <= 0 to prevent accidental every-step eval or ZeroDivisionError (#1485).
  • Fixed TypeError in pack_padded_sequences when attention_mask is a float tensor, and vectorized the packing to avoid per-sequence host-device synchronizations (#1486).
  • Fixed silent prompt/ground-truth mismatch in RLVR caused by redundant dataset shuffle desyncing the "index" column from positional indices, leading to wrong rewards and wrong exclude_index exclusions (#1484).
  • Fixed test single_example_collator returning raw int for index, causing TypeError in _iter_batches (#1477).
  • Fixed SFT integration test failing due to missing --try_launch_beaker_eval_jobs false flag (#1470).
  • Fixed checkpoint cleanup race condition on shared filesystems by using ignore_errors=True and restricting cleanup to global rank 0 (#1468).
  • Fixed checkpoint resume failing on Beaker retries by removing non-deterministic timestamp from exp_name (#1468).
  • Fixed MFU calculation to count LM head FLOPs per token (#1457).
  • Fixed training hang when inflight_updates is disabled by waiting for weight sync to complete before health check (#1454).
  • Fixed evaluation responses being lost on timeout in grpo_fast by requeuing partial results (#1439).
  • Beaker Experiment Launch now passes (#1424 (review)).
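The eval_on_step_0 fix above amounts to reordering the gating checks. A hedged sketch of the corrected logic, with a hypothetical helper name (the flag and variable names come from the changelog item):

```python
def should_run_eval(training_step: int, local_eval_every: int,
                    eval_on_step_0: bool) -> bool:
    # Step 0 is decided before the modulo check, so eval_on_step_0 can
    # actually trigger (previously it was gated behind the modulo test).
    if training_step == 0:
        return eval_on_step_0
    # Guard non-positive intervals: disable periodic eval instead of
    # evaluating every step or raising ZeroDivisionError.
    if local_eval_every <= 0:
        return False
    return training_step % local_eval_every == 0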

v0.1.0

26 Jan 18:15
8befd55


We have made the following changes to open-instruct:

Added

  • Added OLMo-core based DPO training script (#1391).
  • Added SLURM scripts for OLMo SFT training with checkpoint resume support and configurable shuffle seed. #1368
  • Added retry logic with exponential backoff to make_api_request for tool API calls (retries on timeouts, connection errors, 429, and 5xx). Also added configurable max_concurrency parameter to tool configs for controlling Ray actor concurrency per-tool. #1388
  • Added support for generic MCP tools during training, with some limitations (no changing tools, no tool discovery during training). For details: #1384
  • Added the ability to set active tools on a per-sample basis. See the PR for more details: #1382
  • Added a new changelog Github Action that makes sure you contribute to the changelog! #1276
  • Now, we type check open_instruct/dataset_transformation.py (#1390).
  • Added a linter rule that imports go at the top of the file (#1394).
  • Refactors GRPO config into a grpo_utils.py file in preparation for Olmo-core implementation (#1396).
  • Now, we save the generated rollouts to disk during RL when the --save_traces flag is passed (#1406).
  • Pulls out weight sync code from GRPO into a more generic function (#1411 (review)).
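The retry item above describes the classic exponential-backoff pattern: retry on timeouts, connection errors, 429, and 5xx, doubling the delay each attempt. A self-contained sketch under those assumptions (the function name, signature, and `(status, body)` convention are hypothetical, not make_api_request's real API):

```python
import time


def request_with_backoff(send_request, max_retries=4, base_delay=0.5,
                         retryable_statuses=frozenset({429, 500, 502, 503, 504}),
                         sleep=time.sleep):
    # `send_request` is assumed to return (status, body) or raise a
    # transport error. Retries on timeouts/connection errors and on
    # 429/5xx responses, with exponentially growing delays.
    for attempt in range(max_retries + 1):
        try:
            status, body = send_request()
            if status not in retryable_statuses:
                return status, body
        except (TimeoutError, ConnectionError):
            pass  # treat transport errors as retryable
        if attempt < max_retries:
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, 4s, ...
    raise RuntimeError(f"request failed after {max_retries + 1} attempts")
```

The injectable `sleep` parameter keeps the sketch testable without real delays; in production code the jittered variant (randomizing each delay) avoids synchronized retry storms across Ray actors.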

Changed

  • Updated library versions via uv lock --upgrade (#1400).
  • Now, large_test_script.sh exercises the tp > 1 code path (#1413).

Fixed

  • Added automatic type coercion for tool arguments via safe_execute() - prevents crashes when models send wrong types (e.g., bool instead of string) (#1418).
  • Fixed argparse conflict error for --save_traces by removing duplicate field definitions from StreamingDataLoaderConfig (#1416).
  • Increased MetricsTracker max_metrics from 64 to 512 to fix ValueError: Exceeded maximum number of metrics when training with many tools or verifier functions (#1415).
  • Fixed JSON serialization error in LocalDatasetTransformationCache.save_config when caching datasets locally (#1402).
  • Now, we can support PRs from external contributors while still maintaining security for internal tokens (#1408).
  • Improved error handling for tool calls with missing/invalid arguments - now returns a clear error message instead of crashing (#1404).
  • Fixed GenerationConfig validation error when saving OLMo-3 models - config is now set after unwrapping the model, and OLMo-3 is detected from both chat_template_name and model name (#1404).
  • Fixed the benchmark so that it runs (#1401).
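The first fix above adds automatic type coercion for tool arguments. A hedged sketch of that idea, assuming a mapping from argument names to expected Python types (the function name and signature are hypothetical; only the behavior, e.g. a bool sent where a string is expected, comes from the changelog):

```python
def coerce_tool_args(args, expected_types):
    # Cast each argument to its expected type where possible, so a model
    # sending the wrong type (e.g. True instead of "True") does not crash
    # the tool. Values that cannot be coerced are passed through so the
    # tool can return a clear error message instead of raising.
    coerced = {}
    for name, value in args.items():
        expected = expected_types.get(name)
        if expected is not None and not isinstance(value, expected):
            try:
                value = expected(value)
            except (TypeError, ValueError):
                pass  # leave as-is; the tool reports the bad argument
        coerced[name] = value
    return coerced
```

Passing un-coercible values through rather than raising matches the companion fix above that replaces crashes on missing/invalid arguments with a clear error message.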

Removed

  • Removed open_instruct/ppo.py and related PPO training scripts (#1395).
  • Removed scripts/train/debug/tool_grpo_fast.sh; use scripts/train/debug/tools/olmo_3_parser_multigpu.sh for tool use experiments (#1404).