Summary
On non-docker sandbox backends (modal, daytona), the cold-path "replay the Dockerfile RUN steps" mechanism (_replay_dockerfile → _dockerfile_run_commands in rllm/eval/_resolution.py) corrupts Harbor / terminal-bench task setups in two distinct ways. Combined with the fact that these tasks boot from a pre-built per-task image, the replay also double-applies the build steps. The result is a flood of Command failed in sandbox … setup warnings and an unknown number of spuriously-failed (reward 0) tasks during eval.
Discovered while running rllm eval <local harbor dir> --agent terminus2 --sandbox-backend modal.
Affected path
- rllm/eval/_resolution.py:199 _replay_dockerfile (cold path; called by _create_sandbox_for_task:213)
- rllm/eval/_resolution.py:251 _dockerfile_run_commands
- rllm/eval/_resolution.py:298 _resolve_image (returns the configured pre-built image for non-docker backends)
- Snapshot builds are also affected: build_modal_snapshot / daytona snapshot build call the same _replay_dockerfile, so a snapshot bakes the same broken replay.
- --sandbox-backend docker is not affected: it builds the Dockerfile natively (_replay_dockerfile early-returns for backend == "docker").
Defect 1 — line-continuation mangling produces invalid shell
_dockerfile_run_commands strips the trailing \ from each continued line and rejoins with \n (rllm/eval/_resolution.py:278):
cmd = "\n".join(parts).strip()
A \-continuation in a Dockerfile RUN is a shell line continuation — the lines form one logical command. Joining with \n instead of a space turns a valid multi-line && chain into invalid shell. _replay_dockerfile then execs the raw multi-line string via _safe_exec → bash -c (:209-210); note it does not use the existing _as_single_run_line:285 helper.
Reproduction
from rllm.tasks.loader import BenchmarkLoader
from rllm.eval._resolution import _dockerfile_run_commands
br = BenchmarkLoader.load("<harbor tasks dir>", sandbox_backend="modal", harness_name="oracle")
t = next(x for x in br.tasks if x.id == "ansible-debug-project")
for c in _dockerfile_run_commands(t):
    print(repr(c))
For a Dockerfile containing:
RUN apt-get update && apt-get install -y \
python3-pip \
&& rm -rf /var/lib/apt/lists/*
the extracted "replay command" is:
apt-get update && apt-get install -y
python3-pip
&& rm -rf /var/lib/apt/lists/*
which bash rejects with exactly the error seen in eval logs:
bash: -c: line 2: syntax error near unexpected token `&&'
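The difference between the two join strategies can be shown in isolation. The following is a minimal sketch, not the rllm implementation: `join_run_continuations` is a hypothetical helper, and it assumes the physical lines of one RUN instruction have already been collected with the leading `RUN ` removed. It does not handle comments inside a continuation, heredocs, or a custom escape character.

```python
def join_run_continuations(lines, sep=" "):
    """Collapse the backslash-continued lines of one RUN instruction.

    Hypothetical helper for illustration only; `lines` are the physical
    lines of a single RUN instruction without the leading "RUN ".
    """
    parts = []
    for line in lines:
        stripped = line.strip()
        # Drop the trailing continuation backslash, if present.
        if stripped.endswith("\\"):
            stripped = stripped[:-1].rstrip()
        if stripped:
            parts.append(stripped)
    return sep.join(parts)


run_lines = [
    "apt-get update && apt-get install -y \\",
    "    python3-pip \\",
    "    && rm -rf /var/lib/apt/lists/*",
]

# Newline join: the behaviour described in this report. The third line
# starts with "&&", which bash cannot parse as the start of a command.
broken = join_run_continuations(run_lines, sep="\n")

# Space join: equivalent to what the shell sees after line continuation.
fixed = join_run_continuations(run_lines, sep=" ")

print(repr(broken))
print(repr(fixed))
```

With a space separator the three physical lines become one logical `&&` chain, which is what the Dockerfile author wrote.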
Defect 2 — COPY/ADD are dropped, and replay double-applies on a pre-built image
_dockerfile_run_commands intentionally keeps only RUN (docstring: "only RUN is replayable on a live sandbox"), dropping COPY/ADD. So replayed RUN steps that depend on copied files fail, e.g.:
cc1: fatal error: /tmp/vault.c: No such file or directory
python3: can't open file '/setup/generate_data.py': [Errno 2] No such file or directory
Crucially, on modal/daytona _resolve_image returns the configured pre-built image (e.g. docker.io/<org>/optimbench-tb:<task>), which is the fully built Dockerfile (FROM base + all RUN + COPY). Replaying the RUN steps on top of it re-runs the build → double-apply. Smoking gun, from a task whose Dockerfile has RUN git clone https://github.com/JGCRI/hector.git /app/hector:
Command failed in sandbox rllm-hector-carbon-budget-…: git clone https://github.com/JGCRI/hector.git /app/hector
stderr: fatal: destination path '/app/hector' already exists and is not an empty directory.
/app/hector can only "already exist" if the booted image already contains it — i.e. the image is the built Dockerfile and the replay is redundant.
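To see which tasks are exposed to the dropped-COPY/ADD problem, a Dockerfile can be scanned for those instructions. This is a rough sketch under stated assumptions: `uses_copy_or_add` is a hypothetical name, the sample Dockerfiles are made up for illustration, and the scan only tracks backslash continuations (no heredocs, no custom escape directive).

```python
def uses_copy_or_add(dockerfile_text):
    """Return True if the Dockerfile has a COPY or ADD instruction.

    Hypothetical helper. Continuation lines are skipped so that a shell
    word such as "add" inside a multi-line RUN is not mistaken for an
    instruction.
    """
    in_continuation = False
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            # Comment lines do not end a continuation; leave state as is.
            continue
        if not in_continuation and line:
            keyword = line.split(None, 1)[0].upper()
            if keyword in ("COPY", "ADD"):
                return True
        in_continuation = line.endswith("\\")
    return False


# Made-up Dockerfiles for illustration only.
with_copy = (
    "FROM ubuntu:22.04\n"
    "COPY generate_data.py /setup/generate_data.py\n"
    "RUN python3 /setup/generate_data.py\n"
)
run_only = (
    "FROM ubuntu:22.04\n"
    "RUN echo start \\\n"
    "    && add something\n"
)

print(uses_copy_or_add(with_copy))
print(uses_copy_or_add(run_only))
```

A task flagged by such a check cannot be reproduced faithfully by RUN-only replay on a bare base image, because the files its RUN steps expect were never copied in.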
Impact
- Hundreds of Command failed in sandbox … warnings per eval run.
- Cold-path setup also balloons wall-clock (re-running the full build per task; observed ~15 min/task setup= on modal).
- An unknown subset of tasks score a spurious reward 0 when a non-idempotent re-run corrupts already-correct state. (Many tasks still pass because the pre-built image already has the right state, so the bug is often "just" noise — but it is not always harmless.)
Suggested fixes
- Defect 1: in _dockerfile_run_commands, join \-continuations with a single space (drop the \), matching shell semantics; or route every replay command through _as_single_run_line. This alone removes the syntax error near '&&' class.
- Defect 2 / double-apply: when _resolve_image returns a configured pre-built image (rather than building the Dockerfile), skip _replay_dockerfile entirely, since the image already contains the RUN+COPY result. Replay should only run when the booted image is the Dockerfile's FROM base. Alternatively, gate replay on a flag and prefer snapshots/native docker build for tasks whose Dockerfile uses COPY/ADD.
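One possible shape for the Defect 2 gate is sketched below. It is an assumption-laden illustration, not a patch against rllm: `dockerfile_base_image` and `should_replay` are hypothetical names, the image strings are placeholders, and the comparison is a plain string match on the last FROM instruction (it does not resolve multi-stage stage names, ARG substitution, or tag/digest aliases).

```python
def dockerfile_base_image(dockerfile_text):
    """Return the image named by the last FROM instruction, or None.

    Hypothetical helper; ignores FROM flags such as --platform=...
    """
    base = None
    for raw in dockerfile_text.splitlines():
        tokens = raw.strip().split()
        if tokens and tokens[0].upper() == "FROM":
            args = [t for t in tokens[1:] if not t.startswith("--")]
            if args:
                base = args[0]
    return base


def should_replay(booted_image, dockerfile_text):
    """Replay RUN steps only when the sandbox booted from the FROM base.

    If the booted image is anything else (for example a pre-built
    per-task image), the build steps are assumed to be baked in already.
    """
    base = dockerfile_base_image(dockerfile_text)
    return base is not None and booted_image == base


# Placeholder Dockerfile and image names for illustration only.
dockerfile = (
    "FROM --platform=linux/amd64 python:3.11-slim\n"
    "RUN git clone https://github.com/JGCRI/hector.git /app/hector\n"
)

print(should_replay("python:3.11-slim", dockerfile))
print(should_replay("example.invalid/prebuilt:task", dockerfile))
```

Defaulting to "do not replay" whenever the booted image differs from the FROM base errs on the side of trusting the pre-built image, which matches the double-apply evidence above.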
Environment
- repo: rllm-org/rllm, branch terminal-rl, commit 9833530
- backend: modal (also applies to daytona); docker backend unaffected
- task format: Harbor / terminal-bench v2 tasks with multi-line RUN and/or COPY in environment/Dockerfile