feat(gateway): cumulative token mode for the in-process (Tinker) path#658
Merged
Merged
Conversation
`cumulative_token_mode` (drift-free multi-turn token forwarding: turn 2+ is rewritten to a pre-tokenized prompt built by renderers.bridge_to_next_turn, avoiding decode→re-encode drift) previously only worked for the HTTP-proxy (vLLM/verl) path — the cumulative handlers always routed to an HTTP worker's /v1/completions, so backends that run in-process via `local_handler` (Tinker) fell back to re-rendering + re-tokenizing the full conversation every turn. This wires the same feature through the local_handler path, reusing the existing backend-agnostic TokenAccumulator + Prime Intellect `renderers` bridge: - tinker_adapter: the handler now detects a pre-tokenized `prompt` (list[int]) and samples straight from it via the engine's existing get_token_output_from_token_input + assemble_model_output, returning a completions-style body (prompt_token_ids + choices[].token_ids) — the shape the gateway's cumulative handler already extracts token IDs from. The chat (messages) path is unchanged. - proxy: _handle_cumulative_non_streaming routes to local_handler when present (no HTTP worker); added _handle_cumulative_streaming_local to synthesize an SSE stream from the single in-process completion (mirrors _handle_streaming_local), with the same token ingest. Net effect: Tinker rollouts get the same prefix-extension guarantee verl has — turn N's prompt tokens are byte-for-byte prior turns' prompt+completion tokens, so the sequence the optimizer trains on matches what was generated. Tests (no Tinker service / model required): - tests/test_tinker_adapter_cumulative.py — token-prompt path samples from tokens; chat path unchanged; non-int prompt falls through. - rllm-model-gateway/tests/unit/test_cumulative_token_mode_local.py — proxy local cumulative non-streaming + streaming ingest + chat translation. Existing cumulative/accumulator suites (28) still pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jeffreysijuntan
added a commit
that referenced
this pull request
Jun 16, 2026
Enables the in-process cumulative-token path (cherry-picked from #658) in the tinker recipe for testing: rllm.gateway.cumulative_token_mode=true + renderer_family=qwen3.5. Also raises training.max_length=65536 / data.max_prompt_length=57344 so long mini-swe-agent trajectories fit, and save_freq=10. NOTE: the cumulative feature commit on this branch is a cherry-pick of #658 for local testing — drop it (rebase onto main) once #658 merges to avoid duplication. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
4 tasks
jeffreysijuntan
added a commit
that referenced
this pull request
Jun 17, 2026
* feat(cookbooks): add swe-rl recipe (rllm-swesmith → SWE-bench Verified)
End-to-end SWE-RL cookbook that pairs rLLM's native `rllm-swesmith`
training set with `harbor:swebench-verified` for eval, driving the
in-tree `mini-swe-agent` harness inside per-task sandboxes. Default
model is Qwen/Qwen3.5-9B + LoRA-32, GRPO + async + compact filtering,
64 parallel Daytona sandboxes.
No custom AgentFlow or evaluator: the harness owns the action loop,
each task's `tests/test.sh` is the verifier (pytest for swesmith, the
official SWE-bench harness for Verified), and the gateway captures
trajectories transparently.
Files: prepare_data.py (dataset pull), train.py + train_{tinker,verl}.sh
(recipes), test.py (catalog + harness smoke tests), README.md, pyproject.toml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cookbooks/swe-rl): use rllm.rollout.* keys; drop stale gateway overrides
The unified trainer config exposes sampling params at
`rllm.rollout.{train,val}.{temperature,top_p}`. The `sampling.*` paths
copied from `examples/harbor_swe/train_harbor.sh` predate the unified
config and break Hydra struct validation. Same template also passed
`rllm.gateway.public_url` / `rllm.gateway.sampling_params_priority`,
neither of which exist in the current schema — drop them.
Verified by resolving the config with --cfg job: no validation errors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(swe-rl): run on rLLM-native SandboxedAgentFlow path; unblock async training
Switch the cookbook off the remote Harbor runtime onto rLLM's own
SandboxedAgentFlow path (AgentFlowEngine) and fix the bugs that surfaced
once real rollouts ran end-to-end on Modal sandboxes.
train.py / train_*.sh:
- Pass MiniSweAgentHarness as agent_flow so AgentTrainer auto-wires
SandboxTaskHooks + per-task verifiers via AgentFlowEngine (not the
remote_runtime/RemoteAgentFlowEngine path). Sandbox backend selected by
SWE_SANDBOX_BACKEND (default modal).
- Load the val split as "default" (Harbor-pulled name), not "test".
- Async training requires train_batch_size=1 and raise_on_error=false; set
both. Effective batch is mini_batch_size(16) groups x group_size(8).
- Add SWE_VAL_MAX to cap the 500-task SWE-bench-Verified val set.
- Drop stale rllm.remote_runtime.* overrides; doc the sandbox backends.
rllm/data/utils.py:
- task_from_row now roots the Task at the row's task_path and merges
task.toml/Dockerfile metadata, so per-task verifier + image resolution
work on the training path (fixes "No verifier configured").
rllm/gateway/tinker_adapter.py:
- Translate TerminationEvent into an OpenAI-standard 400
context_length_exceeded instead of a 500. litellm maps it to a
non-retryable ContextWindowExceededError, so an over-length in-sandbox
agent stops immediately instead of retrying until the run-timeout
SIGKILL — which was stalling group completion and wedging async steps.
rllm/harnesses/mini_swe_agent.py:
- Retry the in-sandbox uv install (Modal->GitHub egress resets the
connection intermittently and aborted the install under set -e).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(gateway): cumulative token mode for the in-process (Tinker) path
`cumulative_token_mode` (drift-free multi-turn token forwarding: turn 2+ is
rewritten to a pre-tokenized prompt built by renderers.bridge_to_next_turn,
avoiding decode→re-encode drift) previously only worked for the HTTP-proxy
(vLLM/verl) path — the cumulative handlers always routed to an HTTP worker's
/v1/completions, so backends that run in-process via `local_handler` (Tinker)
fell back to re-rendering + re-tokenizing the full conversation every turn.
This wires the same feature through the local_handler path, reusing the
existing backend-agnostic TokenAccumulator + Prime Intellect `renderers`
bridge:
- tinker_adapter: the handler now detects a pre-tokenized `prompt` (list[int])
and samples straight from it via the engine's existing
get_token_output_from_token_input + assemble_model_output, returning a
completions-style body (prompt_token_ids + choices[].token_ids) — the shape
the gateway's cumulative handler already extracts token IDs from. The chat
(messages) path is unchanged.
- proxy: _handle_cumulative_non_streaming routes to local_handler when present
(no HTTP worker); added _handle_cumulative_streaming_local to synthesize an
SSE stream from the single in-process completion (mirrors
_handle_streaming_local), with the same token ingest.
Net effect: Tinker rollouts get the same prefix-extension guarantee verl has —
turn N's prompt tokens are byte-for-byte prior turns' prompt+completion tokens,
so the sequence the optimizer trains on matches what was generated.
Tests (no Tinker service / model required):
- tests/test_tinker_adapter_cumulative.py — token-prompt path samples from
tokens; chat path unchanged; non-int prompt falls through.
- rllm-model-gateway/tests/unit/test_cumulative_token_mode_local.py — proxy
local cumulative non-streaming + streaming ingest + chat translation.
Existing cumulative/accumulator suites (28) still pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(swe-rl): enable cumulative token mode + raise context window
Enables the in-process cumulative-token path (cherry-picked from #658) in the
tinker recipe for testing: rllm.gateway.cumulative_token_mode=true +
renderer_family=qwen3.5. Also raises training.max_length=65536 /
data.max_prompt_length=57344 so long mini-swe-agent trajectories fit, and
save_freq=10.
NOTE: the cumulative feature commit on this branch is a cherry-pick of #658 for
local testing — drop it (rebase onto main) once #658 merges to avoid duplication.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(swe-rl): add synchronous tinker training script
train_tinker_sync.sh: on-policy variant of train_tinker.sh for testing —
drops async_training (synchronous generate→train) and uses a real
data.train_batch_size (default 4; effective batch = train_batch_size x
group_size). Same model/sandbox/context/cumulative settings otherwise.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(swe-rl): disable cumulative token mode (incompatible with thinking model)
Qwen3.5 is a thinking model (<think>...</think> per assistant turn). The normal
re-render path strips prior-turn reasoning from history (mini-swe-agent stores
the think-stripped message.content), but cumulative mode carries forward the RAW
completion tokens (tinker_engine completion_ids=response_tokens, incl. <think>),
which renderers.bridge_to_next_turn concatenates verbatim. The model then sees an
ever-growing stack of its own prior reasoning — out-of-distribution for multi-turn
— degrading into short, submit-early, zero-reward trajectories.
Renderer mismatch was ruled out: the bridge (PrimeIntellect qwen3.5) and the
engine renderer are token-identical; the bridged prompt is well-formed. The
issue is purely reasoning carry-forward. Cumulative mode is fine for
non-thinking models / single-turn; for thinking models, correct multi-turn
rollout requires stripping prior reasoning (i.e. re-tokenization), which is
fundamentally at odds with exact-token reuse.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(swe-rl): switch to Qwen3.5-4B + terminus2 harness
- agent_flow -> Terminus2Harness (was MiniSweAgentHarness) in train.py;
docstring/script prose updated accordingly.
- model.name / MODEL_PATH -> Qwen/Qwen3.5-4B; renderer_family -> qwen3.5;
experiment names -> r2egym-terminus2-qwen3.5-4b[-sync/-verl].
- re-enable rllm.gateway.cumulative_token_mode in the tinker scripts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(swe-rl): bump n_parallel_tasks to 128 (tinker + sync)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Extends
cumulative_token_mode— the drift-free multi-turn token-forwarding feature — to the in-processlocal_handler(Tinker) path. Until now it only worked for the HTTP-proxy (vLLM/verl) backend; Tinker rollouts re-rendered + re-tokenized the full conversation every turn.Why
In multi-turn RL, re-tokenizing a rendered string each turn can split tokens differently than the concatenation of the tokens actually sampled in prior turns, so the sequence the optimizer trains on can drift from what was generated. The HTTP path already avoids this by rewriting turn 2+ to
/v1/completionswith raw token IDs fromrenderers.bridge_to_next_turn(the Prime Intellectrendererspackage). The cumulative handlers, however, always routed to an HTTP worker — so Tinker (in-process, no vLLM worker) never got it.How
The accumulator (
TokenAccumulator) and therenderersbridge are already backend-agnostic; the only gaps were the handler and the routing:rllm/gateway/tinker_adapter.py— the handler now detects a pre-tokenizedprompt(list[int]) and samples straight from it via the engine's existingget_token_output_from_token_input+assemble_model_output, returning a completions-style body (prompt_token_ids+choices[].token_ids) — exactly the shape the gateway's cumulative handler extracts token IDs from. The chat (messages) path is untouched.rllm-model-gateway/.../proxy.py—_handle_cumulative_non_streamingroutes tolocal_handlerwhen present (no HTTP worker); new_handle_cumulative_streaming_localsynthesizes an SSE stream from the single in-process completion (mirrors the existing_handle_streaming_local), with the same token ingest.Net effect: Tinker rollouts get the same prefix-extension guarantee verl has — turn N's prompt tokens are byte-for-byte the prior turns' prompt+completion tokens.
Tests (no Tinker service / model needed)
tests/test_tinker_adapter_cumulative.py— token-prompt path samples from tokens (bypassing rendering); chat path unchanged; non-intpromptfalls through.rllm-model-gateway/tests/unit/test_cumulative_token_mode_local.py— proxy local cumulative non-streaming + streaming: ingests bridged prompt + completion, translates back to chat.Notes / follow-ups
cumulative_token_modeflag (defaultfalse); the default chat path is unchanged.tool_calls/reasoningstructurally through the cumulative path is a possible follow-up.🤖 Generated with Claude Code