feat(gateway): cumulative token mode for the in-process (Tinker) path by jeffreysijuntan · Pull Request #658 · rllm-org/rllm

jeffreysijuntan · 2026-06-16T02:50:37Z

What

Extends cumulative_token_mode — the drift-free multi-turn token-forwarding feature — to the in-process local_handler (Tinker) path. Until now it only worked for the HTTP-proxy (vLLM/verl) backend; Tinker rollouts re-rendered + re-tokenized the full conversation every turn.

Why

In multi-turn RL, re-tokenizing a rendered string each turn can split tokens differently than the concatenation of the tokens actually sampled in prior turns, so the sequence the optimizer trains on can drift from what was generated. The HTTP path already avoids this by rewriting turn 2+ to /v1/completions with raw token IDs from renderers.bridge_to_next_turn (the Prime Intellect renderers package). The cumulative handlers, however, always routed to an HTTP worker — so Tinker (in-process, no vLLM worker) never got it.

How

The accumulator (TokenAccumulator) and the renderers bridge are already backend-agnostic; the only gaps were the handler and the routing:

rllm/gateway/tinker_adapter.py — the handler now detects a pre-tokenized prompt (list[int]) and samples straight from it via the engine's existing get_token_output_from_token_input + assemble_model_output, returning a completions-style body (prompt_token_ids + choices[].token_ids) — exactly the shape the gateway's cumulative handler extracts token IDs from. The chat (messages) path is untouched.
rllm-model-gateway/.../proxy.py — _handle_cumulative_non_streaming routes to local_handler when present (no HTTP worker); new _handle_cumulative_streaming_local synthesizes an SSE stream from the single in-process completion (mirrors the existing _handle_streaming_local), with the same token ingest.

Net effect: Tinker rollouts get the same prefix-extension guarantee verl has — turn N's prompt tokens are byte-for-byte the prior turns' prompt+completion tokens.

Tests (no Tinker service / model needed)

tests/test_tinker_adapter_cumulative.py — token-prompt path samples from tokens (bypassing rendering); chat path unchanged; non-int prompt falls through.
rllm-model-gateway/tests/unit/test_cumulative_token_mode_local.py — proxy local cumulative non-streaming + streaming: ingests bridged prompt + completion, translates back to chat.
Existing cumulative/accumulator suites (28 tests) still pass.

Notes / follow-ups

Gated entirely by the existing cumulative_token_mode flag (default false); the default chat path is unchanged.
The cumulative→chat translation forwards the assistant text content (sufficient for text-action agents like mini-swe-agent). Preserving tool_calls/reasoning structurally through the cumulative path is a possible follow-up.
Not yet validated against a live Tinker run — verified at the unit/integration level (the prefix-extension invariant and ingest). A short live run to confirm turn-2 prompt IDs equal turn-1 prompt+completion IDs is the recommended next step before relying on it in training.

🤖 Generated with Claude Code

`cumulative_token_mode` (drift-free multi-turn token forwarding: turn 2+ is rewritten to a pre-tokenized prompt built by renderers.bridge_to_next_turn, avoiding decode→re-encode drift) previously only worked for the HTTP-proxy (vLLM/verl) path — the cumulative handlers always routed to an HTTP worker's /v1/completions, so backends that run in-process via `local_handler` (Tinker) fell back to re-rendering + re-tokenizing the full conversation every turn. This wires the same feature through the local_handler path, reusing the existing backend-agnostic TokenAccumulator + Prime Intellect `renderers` bridge: - tinker_adapter: the handler now detects a pre-tokenized `prompt` (list[int]) and samples straight from it via the engine's existing get_token_output_from_token_input + assemble_model_output, returning a completions-style body (prompt_token_ids + choices[].token_ids) — the shape the gateway's cumulative handler already extracts token IDs from. The chat (messages) path is unchanged. - proxy: _handle_cumulative_non_streaming routes to local_handler when present (no HTTP worker); added _handle_cumulative_streaming_local to synthesize an SSE stream from the single in-process completion (mirrors _handle_streaming_local), with the same token ingest. Net effect: Tinker rollouts get the same prefix-extension guarantee verl has — turn N's prompt tokens are byte-for-byte prior turns' prompt+completion tokens, so the sequence the optimizer trains on matches what was generated. Tests (no Tinker service / model required): - tests/test_tinker_adapter_cumulative.py — token-prompt path samples from tokens; chat path unchanged; non-int prompt falls through. - rllm-model-gateway/tests/unit/test_cumulative_token_mode_local.py — proxy local cumulative non-streaming + streaming ingest + chat translation. Existing cumulative/accumulator suites (28) still pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Enables the in-process cumulative-token path (cherry-picked from #658) in the tinker recipe for testing: rllm.gateway.cumulative_token_mode=true + renderer_family=qwen3.5. Also raises training.max_length=65536 / data.max_prompt_length=57344 so long mini-swe-agent trajectories fit, and save_freq=10. NOTE: the cumulative feature commit on this branch is a cherry-pick of #658 for local testing — drop it (rebase onto main) once #658 merges to avoid duplication. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(cookbooks): add swe-rl recipe (rllm-swesmith → SWE-bench Verified) End-to-end SWE-RL cookbook that pairs rLLM's native `rllm-swesmith` training set with `harbor:swebench-verified` for eval, driving the in-tree `mini-swe-agent` harness inside per-task sandboxes. Default model is Qwen/Qwen3.5-9B + LoRA-32, GRPO + async + compact filtering, 64 parallel Daytona sandboxes. No custom AgentFlow or evaluator: the harness owns the action loop, each task's `tests/test.sh` is the verifier (pytest for swesmith, the official SWE-bench harness for Verified), and the gateway captures trajectories transparently. Files: prepare_data.py (dataset pull), train.py + train_{tinker,verl}.sh (recipes), test.py (catalog + harness smoke tests), README.md, pyproject.toml. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cookbooks/swe-rl): use rllm.rollout.* keys; drop stale gateway overrides The unified trainer config exposes sampling params at `rllm.rollout.{train,val}.{temperature,top_p}`. The `sampling.*` paths copied from `examples/harbor_swe/train_harbor.sh` predate the unified config and break Hydra struct validation. Same template also passed `rllm.gateway.public_url` / `rllm.gateway.sampling_params_priority`, neither of which exist in the current schema — drop them. Verified by resolving the config with --cfg job: no validation errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(swe-rl): run on rLLM-native SandboxedAgentFlow path; unblock async training Switch the cookbook off the remote Harbor runtime onto rLLM's own SandboxedAgentFlow path (AgentFlowEngine) and fix the bugs that surfaced once real rollouts ran end-to-end on Modal sandboxes. train.py / train_*.sh: - Pass MiniSweAgentHarness as agent_flow so AgentTrainer auto-wires SandboxTaskHooks + per-task verifiers via AgentFlowEngine (not the remote_runtime/RemoteAgentFlowEngine path). Sandbox backend selected by SWE_SANDBOX_BACKEND (default modal). - Load the val split as "default" (Harbor-pulled name), not "test". - Async training requires train_batch_size=1 and raise_on_error=false; set both. Effective batch is mini_batch_size(16) groups x group_size(8). - Add SWE_VAL_MAX to cap the 500-task SWE-bench-Verified val set. - Drop stale rllm.remote_runtime.* overrides; doc the sandbox backends. rllm/data/utils.py: - task_from_row now roots the Task at the row's task_path and merges task.toml/Dockerfile metadata, so per-task verifier + image resolution work on the training path (fixes "No verifier configured"). rllm/gateway/tinker_adapter.py: - Translate TerminationEvent into an OpenAI-standard 400 context_length_exceeded instead of a 500. litellm maps it to a non-retryable ContextWindowExceededError, so an over-length in-sandbox agent stops immediately instead of retrying until the run-timeout SIGKILL — which was stalling group completion and wedging async steps. rllm/harnesses/mini_swe_agent.py: - Retry the in-sandbox uv install (Modal->GitHub egress resets the connection intermittently and aborted the install under set -e). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(gateway): cumulative token mode for the in-process (Tinker) path `cumulative_token_mode` (drift-free multi-turn token forwarding: turn 2+ is rewritten to a pre-tokenized prompt built by renderers.bridge_to_next_turn, avoiding decode→re-encode drift) previously only worked for the HTTP-proxy (vLLM/verl) path — the cumulative handlers always routed to an HTTP worker's /v1/completions, so backends that run in-process via `local_handler` (Tinker) fell back to re-rendering + re-tokenizing the full conversation every turn. This wires the same feature through the local_handler path, reusing the existing backend-agnostic TokenAccumulator + Prime Intellect `renderers` bridge: - tinker_adapter: the handler now detects a pre-tokenized `prompt` (list[int]) and samples straight from it via the engine's existing get_token_output_from_token_input + assemble_model_output, returning a completions-style body (prompt_token_ids + choices[].token_ids) — the shape the gateway's cumulative handler already extracts token IDs from. The chat (messages) path is unchanged. - proxy: _handle_cumulative_non_streaming routes to local_handler when present (no HTTP worker); added _handle_cumulative_streaming_local to synthesize an SSE stream from the single in-process completion (mirrors _handle_streaming_local), with the same token ingest. Net effect: Tinker rollouts get the same prefix-extension guarantee verl has — turn N's prompt tokens are byte-for-byte prior turns' prompt+completion tokens, so the sequence the optimizer trains on matches what was generated. Tests (no Tinker service / model required): - tests/test_tinker_adapter_cumulative.py — token-prompt path samples from tokens; chat path unchanged; non-int prompt falls through. - rllm-model-gateway/tests/unit/test_cumulative_token_mode_local.py — proxy local cumulative non-streaming + streaming ingest + chat translation. Existing cumulative/accumulator suites (28) still pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(swe-rl): enable cumulative token mode + raise context window Enables the in-process cumulative-token path (cherry-picked from #658) in the tinker recipe for testing: rllm.gateway.cumulative_token_mode=true + renderer_family=qwen3.5. Also raises training.max_length=65536 / data.max_prompt_length=57344 so long mini-swe-agent trajectories fit, and save_freq=10. NOTE: the cumulative feature commit on this branch is a cherry-pick of #658 for local testing — drop it (rebase onto main) once #658 merges to avoid duplication. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(swe-rl): add synchronous tinker training script train_tinker_sync.sh: on-policy variant of train_tinker.sh for testing — drops async_training (synchronous generate→train) and uses a real data.train_batch_size (default 4; effective batch = train_batch_size x group_size). Same model/sandbox/context/cumulative settings otherwise. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(swe-rl): disable cumulative token mode (incompatible with thinking model) Qwen3.5 is a thinking model (<think>...</think> per assistant turn). The normal re-render path strips prior-turn reasoning from history (mini-swe-agent stores the think-stripped message.content), but cumulative mode carries forward the RAW completion tokens (tinker_engine completion_ids=response_tokens, incl. <think>), which renderers.bridge_to_next_turn concatenates verbatim. The model then sees an ever-growing stack of its own prior reasoning — out-of-distribution for multi-turn — degrading into short, submit-early, zero-reward trajectories. Renderer mismatch was ruled out: the bridge (PrimeIntellect qwen3.5) and the engine renderer are token-identical; the bridged prompt is well-formed. The issue is purely reasoning carry-forward. Cumulative mode is fine for non-thinking models / single-turn; for thinking models, correct multi-turn rollout requires stripping prior reasoning (i.e. re-tokenization), which is fundamentally at odds with exact-token reuse. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(swe-rl): switch to Qwen3.5-4B + terminus2 harness - agent_flow -> Terminus2Harness (was MiniSweAgentHarness) in train.py; docstring/script prose updated accordingly. - model.name / MODEL_PATH -> Qwen/Qwen3.5-4B; renderer_family -> qwen3.5; experiment names -> r2egym-terminus2-qwen3.5-4b[-sync/-verl]. - re-enable rllm.gateway.cumulative_token_mode in the tinker scripts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(swe-rl): bump n_parallel_tasks to 128 (tinker + sync) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jeffreysijuntan marked this pull request as draft June 16, 2026 03:03

style(gateway): wrap long SSE chunk lines to satisfy ruff E501

905f476

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jeffreysijuntan marked this pull request as ready for review June 16, 2026 19:07

jeffreysijuntan changed the base branch from main to terminal-rl June 16, 2026 19:11

jeffreysijuntan merged commit ae517ab into terminal-rl Jun 16, 2026
5 checks passed

jeffreysijuntan deleted the feat/tinker-cumulative-token-mode branch June 16, 2026 19:13

jeffreysijuntan mentioned this pull request Jun 16, 2026

feat(cookbooks): add swe-rl recipe (R2E-Gym → SWE-bench Verified) #653

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(gateway): cumulative token mode for the in-process (Tinker) path#658

feat(gateway): cumulative token mode for the in-process (Tinker) path#658
jeffreysijuntan merged 2 commits into
terminal-rlfrom
feat/tinker-cumulative-token-mode

jeffreysijuntan commented Jun 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jeffreysijuntan commented Jun 16, 2026

What

Why

How

Tests (no Tinker service / model needed)

Notes / follow-ups

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant