Skip to content

feat(gateway): cumulative token mode for the in-process (Tinker) path#658

Merged
jeffreysijuntan merged 2 commits into
terminal-rlfrom
feat/tinker-cumulative-token-mode
Jun 16, 2026
Merged

feat(gateway): cumulative token mode for the in-process (Tinker) path#658
jeffreysijuntan merged 2 commits into
terminal-rlfrom
feat/tinker-cumulative-token-mode

Conversation

@jeffreysijuntan

Copy link
Copy Markdown
Contributor

What

Extends cumulative_token_mode — the drift-free multi-turn token-forwarding feature — to the in-process local_handler (Tinker) path. Until now it only worked for the HTTP-proxy (vLLM/verl) backend; Tinker rollouts re-rendered + re-tokenized the full conversation every turn.

Why

In multi-turn RL, re-tokenizing a rendered string each turn can split tokens differently than the concatenation of the tokens actually sampled in prior turns, so the sequence the optimizer trains on can drift from what was generated. The HTTP path already avoids this by rewriting turn 2+ to /v1/completions with raw token IDs from renderers.bridge_to_next_turn (the Prime Intellect renderers package). The cumulative handlers, however, always routed to an HTTP worker — so Tinker (in-process, no vLLM worker) never got it.

How

The accumulator (TokenAccumulator) and the renderers bridge are already backend-agnostic; the only gaps were the handler and the routing:

  • rllm/gateway/tinker_adapter.py — the handler now detects a pre-tokenized prompt (list[int]) and samples straight from it via the engine's existing get_token_output_from_token_input + assemble_model_output, returning a completions-style body (prompt_token_ids + choices[].token_ids) — exactly the shape the gateway's cumulative handler extracts token IDs from. The chat (messages) path is untouched.
  • rllm-model-gateway/.../proxy.py_handle_cumulative_non_streaming routes to local_handler when present (no HTTP worker); new _handle_cumulative_streaming_local synthesizes an SSE stream from the single in-process completion (mirrors the existing _handle_streaming_local), with the same token ingest.

Net effect: Tinker rollouts get the same prefix-extension guarantee verl has — turn N's prompt tokens are byte-for-byte the prior turns' prompt+completion tokens.

Tests (no Tinker service / model needed)

  • tests/test_tinker_adapter_cumulative.py — token-prompt path samples from tokens (bypassing rendering); chat path unchanged; non-int prompt falls through.
  • rllm-model-gateway/tests/unit/test_cumulative_token_mode_local.py — proxy local cumulative non-streaming + streaming: ingests bridged prompt + completion, translates back to chat.
  • Existing cumulative/accumulator suites (28 tests) still pass.

Notes / follow-ups

  • Gated entirely by the existing cumulative_token_mode flag (default false); the default chat path is unchanged.
  • The cumulative→chat translation forwards the assistant text content (sufficient for text-action agents like mini-swe-agent). Preserving tool_calls/reasoning structurally through the cumulative path is a possible follow-up.
  • Not yet validated against a live Tinker run — verified at the unit/integration level (the prefix-extension invariant and ingest). A short live run to confirm turn-2 prompt IDs equal turn-1 prompt+completion IDs is the recommended next step before relying on it in training.

🤖 Generated with Claude Code

`cumulative_token_mode` (drift-free multi-turn token forwarding: turn 2+ is
rewritten to a pre-tokenized prompt built by renderers.bridge_to_next_turn,
avoiding decode→re-encode drift) previously only worked for the HTTP-proxy
(vLLM/verl) path — the cumulative handlers always routed to an HTTP worker's
/v1/completions, so backends that run in-process via `local_handler` (Tinker)
fell back to re-rendering + re-tokenizing the full conversation every turn.

This wires the same feature through the local_handler path, reusing the
existing backend-agnostic TokenAccumulator + Prime Intellect `renderers`
bridge:

- tinker_adapter: the handler now detects a pre-tokenized `prompt` (list[int])
  and samples straight from it via the engine's existing
  get_token_output_from_token_input + assemble_model_output, returning a
  completions-style body (prompt_token_ids + choices[].token_ids) — the shape
  the gateway's cumulative handler already extracts token IDs from. The chat
  (messages) path is unchanged.
- proxy: _handle_cumulative_non_streaming routes to local_handler when present
  (no HTTP worker); added _handle_cumulative_streaming_local to synthesize an
  SSE stream from the single in-process completion (mirrors
  _handle_streaming_local), with the same token ingest.

Net effect: Tinker rollouts get the same prefix-extension guarantee verl has —
turn N's prompt tokens are byte-for-byte prior turns' prompt+completion tokens,
so the sequence the optimizer trains on matches what was generated.

Tests (no Tinker service / model required):
- tests/test_tinker_adapter_cumulative.py — token-prompt path samples from
  tokens; chat path unchanged; non-int prompt falls through.
- rllm-model-gateway/tests/unit/test_cumulative_token_mode_local.py — proxy
  local cumulative non-streaming + streaming ingest + chat translation.
Existing cumulative/accumulator suites (28) still pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jeffreysijuntan jeffreysijuntan marked this pull request as draft June 16, 2026 03:03
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jeffreysijuntan jeffreysijuntan marked this pull request as ready for review June 16, 2026 19:07
@jeffreysijuntan jeffreysijuntan changed the base branch from main to terminal-rl June 16, 2026 19:11
@jeffreysijuntan jeffreysijuntan merged commit ae517ab into terminal-rl Jun 16, 2026
5 checks passed
@jeffreysijuntan jeffreysijuntan deleted the feat/tinker-cumulative-token-mode branch June 16, 2026 19:13
jeffreysijuntan added a commit that referenced this pull request Jun 16, 2026
Enables the in-process cumulative-token path (cherry-picked from #658) in the
tinker recipe for testing: rllm.gateway.cumulative_token_mode=true +
renderer_family=qwen3.5. Also raises training.max_length=65536 /
data.max_prompt_length=57344 so long mini-swe-agent trajectories fit, and
save_freq=10.

NOTE: the cumulative feature commit on this branch is a cherry-pick of #658 for
local testing — drop it (rebase onto main) once #658 merges to avoid duplication.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jeffreysijuntan added a commit that referenced this pull request Jun 17, 2026
* feat(cookbooks): add swe-rl recipe (rllm-swesmith → SWE-bench Verified)

End-to-end SWE-RL cookbook that pairs rLLM's native `rllm-swesmith`
training set with `harbor:swebench-verified` for eval, driving the
in-tree `mini-swe-agent` harness inside per-task sandboxes. Default
model is Qwen/Qwen3.5-9B + LoRA-32, GRPO + async + compact filtering,
64 parallel Daytona sandboxes.

No custom AgentFlow or evaluator: the harness owns the action loop,
each task's `tests/test.sh` is the verifier (pytest for swesmith, the
official SWE-bench harness for Verified), and the gateway captures
trajectories transparently.

Files: prepare_data.py (dataset pull), train.py + train_{tinker,verl}.sh
(recipes), test.py (catalog + harness smoke tests), README.md, pyproject.toml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cookbooks/swe-rl): use rllm.rollout.* keys; drop stale gateway overrides

The unified trainer config exposes sampling params at
`rllm.rollout.{train,val}.{temperature,top_p}`. The `sampling.*` paths
copied from `examples/harbor_swe/train_harbor.sh` predate the unified
config and break Hydra struct validation. Same template also passed
`rllm.gateway.public_url` / `rllm.gateway.sampling_params_priority`,
neither of which exist in the current schema — drop them.

Verified by resolving the config with --cfg job: no validation errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(swe-rl): run on rLLM-native SandboxedAgentFlow path; unblock async training

Switch the cookbook off the remote Harbor runtime onto rLLM's own
SandboxedAgentFlow path (AgentFlowEngine) and fix the bugs that surfaced
once real rollouts ran end-to-end on Modal sandboxes.

train.py / train_*.sh:
- Pass MiniSweAgentHarness as agent_flow so AgentTrainer auto-wires
  SandboxTaskHooks + per-task verifiers via AgentFlowEngine (not the
  remote_runtime/RemoteAgentFlowEngine path). Sandbox backend selected by
  SWE_SANDBOX_BACKEND (default modal).
- Load the val split as "default" (Harbor-pulled name), not "test".
- Async training requires train_batch_size=1 and raise_on_error=false; set
  both. Effective batch is mini_batch_size(16) groups x group_size(8).
- Add SWE_VAL_MAX to cap the 500-task SWE-bench-Verified val set.
- Drop stale rllm.remote_runtime.* overrides; doc the sandbox backends.

rllm/data/utils.py:
- task_from_row now roots the Task at the row's task_path and merges
  task.toml/Dockerfile metadata, so per-task verifier + image resolution
  work on the training path (fixes "No verifier configured").

rllm/gateway/tinker_adapter.py:
- Translate TerminationEvent into an OpenAI-standard 400
  context_length_exceeded instead of a 500. litellm maps it to a
  non-retryable ContextWindowExceededError, so an over-length in-sandbox
  agent stops immediately instead of retrying until the run-timeout
  SIGKILL — which was stalling group completion and wedging async steps.

rllm/harnesses/mini_swe_agent.py:
- Retry the in-sandbox uv install (Modal->GitHub egress resets the
  connection intermittently and aborted the install under set -e).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(gateway): cumulative token mode for the in-process (Tinker) path

`cumulative_token_mode` (drift-free multi-turn token forwarding: turn 2+ is
rewritten to a pre-tokenized prompt built by renderers.bridge_to_next_turn,
avoiding decode→re-encode drift) previously only worked for the HTTP-proxy
(vLLM/verl) path — the cumulative handlers always routed to an HTTP worker's
/v1/completions, so backends that run in-process via `local_handler` (Tinker)
fell back to re-rendering + re-tokenizing the full conversation every turn.

This wires the same feature through the local_handler path, reusing the
existing backend-agnostic TokenAccumulator + Prime Intellect `renderers`
bridge:

- tinker_adapter: the handler now detects a pre-tokenized `prompt` (list[int])
  and samples straight from it via the engine's existing
  get_token_output_from_token_input + assemble_model_output, returning a
  completions-style body (prompt_token_ids + choices[].token_ids) — the shape
  the gateway's cumulative handler already extracts token IDs from. The chat
  (messages) path is unchanged.
- proxy: _handle_cumulative_non_streaming routes to local_handler when present
  (no HTTP worker); added _handle_cumulative_streaming_local to synthesize an
  SSE stream from the single in-process completion (mirrors
  _handle_streaming_local), with the same token ingest.

Net effect: Tinker rollouts get the same prefix-extension guarantee verl has —
turn N's prompt tokens are byte-for-byte prior turns' prompt+completion tokens,
so the sequence the optimizer trains on matches what was generated.

Tests (no Tinker service / model required):
- tests/test_tinker_adapter_cumulative.py — token-prompt path samples from
  tokens; chat path unchanged; non-int prompt falls through.
- rllm-model-gateway/tests/unit/test_cumulative_token_mode_local.py — proxy
  local cumulative non-streaming + streaming ingest + chat translation.
Existing cumulative/accumulator suites (28) still pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(swe-rl): enable cumulative token mode + raise context window

Enables the in-process cumulative-token path (cherry-picked from #658) in the
tinker recipe for testing: rllm.gateway.cumulative_token_mode=true +
renderer_family=qwen3.5. Also raises training.max_length=65536 /
data.max_prompt_length=57344 so long mini-swe-agent trajectories fit, and
save_freq=10.

NOTE: the cumulative feature commit on this branch is a cherry-pick of #658 for
local testing — drop it (rebase onto main) once #658 merges to avoid duplication.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(swe-rl): add synchronous tinker training script

train_tinker_sync.sh: on-policy variant of train_tinker.sh for testing —
drops async_training (synchronous generate→train) and uses a real
data.train_batch_size (default 4; effective batch = train_batch_size x
group_size). Same model/sandbox/context/cumulative settings otherwise.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(swe-rl): disable cumulative token mode (incompatible with thinking model)

Qwen3.5 is a thinking model (<think>...</think> per assistant turn). The normal
re-render path strips prior-turn reasoning from history (mini-swe-agent stores
the think-stripped message.content), but cumulative mode carries forward the RAW
completion tokens (tinker_engine completion_ids=response_tokens, incl. <think>),
which renderers.bridge_to_next_turn concatenates verbatim. The model then sees an
ever-growing stack of its own prior reasoning — out-of-distribution for multi-turn
— degrading into short, submit-early, zero-reward trajectories.

Renderer mismatch was ruled out: the bridge (PrimeIntellect qwen3.5) and the
engine renderer are token-identical; the bridged prompt is well-formed. The
issue is purely reasoning carry-forward. Cumulative mode is fine for
non-thinking models / single-turn; for thinking models, correct multi-turn
rollout requires stripping prior reasoning (i.e. re-tokenization), which is
fundamentally at odds with exact-token reuse.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(swe-rl): switch to Qwen3.5-4B + terminus2 harness

- agent_flow -> Terminus2Harness (was MiniSweAgentHarness) in train.py;
  docstring/script prose updated accordingly.
- model.name / MODEL_PATH -> Qwen/Qwen3.5-4B; renderer_family -> qwen3.5;
  experiment names -> r2egym-terminus2-qwen3.5-4b[-sync/-verl].
- re-enable rllm.gateway.cumulative_token_mode in the tinker scripts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(swe-rl): bump n_parallel_tasks to 128 (tinker + sync)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant