Dockerfile RUN-replay corrupts Harbor task setup on modal/daytona (line-continuation mangling + double-apply on prebuilt images) #655

@signalrush

Summary

On non-docker sandbox backends (modal, daytona), the cold-path "replay the Dockerfile RUN steps" mechanism (_replay_dockerfile and _dockerfile_run_commands in rllm/eval/_resolution.py) corrupts Harbor / terminal-bench task setups in two distinct ways. Because these tasks boot from a pre-built per-task image, the replay also double-applies the build steps. The result is a flood of Command failed in sandbox … setup warnings and an unknown number of spuriously failed (reward 0) tasks during eval.

Discovered while running rllm eval <local harbor dir> --agent terminus2 --sandbox-backend modal.

Affected path

  • rllm/eval/_resolution.py:199 _replay_dockerfile (cold path; called by _create_sandbox_for_task:213)
  • rllm/eval/_resolution.py:251 _dockerfile_run_commands
  • rllm/eval/_resolution.py:298 _resolve_image (returns the configured pre-built image for non-docker backends)
  • Snapshot builds are also affected: build_modal_snapshot / daytona snapshot build call the same _replay_dockerfile, so a snapshot bakes the same broken replay.
  • --sandbox-backend docker is not affected — it builds the Dockerfile natively (_replay_dockerfile early-returns for backend == "docker").

Defect 1 — line-continuation mangling produces invalid shell

_dockerfile_run_commands strips the trailing \ from each continued line and rejoins with \n (rllm/eval/_resolution.py:278):

cmd = "\n".join(parts).strip()

A \-continuation in a Dockerfile RUN is a shell line continuation: the lines form one logical command. Joining with \n instead of a space turns a valid multi-line && chain into invalid shell. _replay_dockerfile then executes the raw multi-line string via _safe_exec with bash -c (:209-210); note that it does not use the existing _as_single_run_line helper (:285).

Reproduction

from rllm.tasks.loader import BenchmarkLoader
from rllm.eval._resolution import _dockerfile_run_commands

br = BenchmarkLoader.load("<harbor tasks dir>", sandbox_backend="modal", harness_name="oracle")
t = next(x for x in br.tasks if x.id == "ansible-debug-project")
for c in _dockerfile_run_commands(t):
    print(repr(c))

For a Dockerfile containing:

RUN apt-get update && apt-get install -y \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

the extracted "replay command" is:

apt-get update && apt-get install -y
    python3-pip
    && rm -rf /var/lib/apt/lists/*

which bash rejects with exactly the error seen in eval logs:

bash: -c: line 2: syntax error near unexpected token `&&'
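
The intended behavior can be illustrated with a minimal, self-contained sketch. It is not the code in rllm/eval/_resolution.py: extract_run_commands is a hypothetical stand-in that takes raw Dockerfile text, keeps only shell-form RUN steps, and folds each \-continued step into one logical line by joining the fragments with a single space. It assumes the default \ escape character and does not handle heredoc-style RUN bodies, where a space join would be wrong.

```python
def extract_run_commands(dockerfile_text: str) -> list[str]:
    """Hypothetical stand-in for _dockerfile_run_commands.

    Collects shell-form RUN steps and folds backslash continuations
    into a single logical shell line (fragments joined with one space).
    """
    commands = []
    parts = None  # fragments of the RUN step currently being collected
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if parts is None:
            # Outside a RUN step: ignore everything except a new RUN.
            if not line.upper().startswith("RUN "):
                continue
            line = line[4:].strip()
            parts = []
        elif not line or line.startswith("#"):
            # Inside a continued step: skip blank and comment lines.
            continue
        continued = line.endswith("\\")
        if continued:
            line = line[:-1].rstrip()  # drop the trailing backslash
        if line:
            parts.append(line)
        if not continued:
            commands.append(" ".join(parts))
            parts = None
    if parts:
        # File ended in the middle of a continuation; keep what we have.
        commands.append(" ".join(parts))
    return commands
```

Applied to the three-line RUN above, this yields a single line, apt-get update && apt-get install -y python3-pip && rm -rf /var/lib/apt/lists/*, which bash -c accepts as one && chain.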

Defect 2 — COPY/ADD are dropped, and replay double-applies on a pre-built image

_dockerfile_run_commands intentionally keeps only RUN (docstring: "only RUN is replayable on a live sandbox"), dropping COPY/ADD. So replayed RUN steps that depend on copied files fail, e.g.:

cc1: fatal error: /tmp/vault.c: No such file or directory
python3: can't open file '/setup/generate_data.py': [Errno 2] No such file or directory
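
One way to surface this class of failure up front, rather than through per-command errors, is to list the file-copy instructions that a RUN-only replay skips. The sketch below is an illustration only: dropped_file_instructions is a hypothetical helper, not part of rllm, and it uses a naive first-token match that could misfire on a continued RUN fragment that happens to start with the word COPY or ADD.

```python
def dropped_file_instructions(dockerfile_text: str) -> list[str]:
    """Return the COPY/ADD lines that a RUN-only replay would skip.

    Hypothetical helper for illustration; matches on the first token
    of each line, case-insensitively.
    """
    dropped = []
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        tokens = line.split(None, 1)
        if tokens and tokens[0].upper() in {"COPY", "ADD"}:
            dropped.append(line)
    return dropped
```

A non-empty result means some replayed RUN steps may reference files that never reach the sandbox.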

Crucially, on modal/daytona _resolve_image returns the configured pre-built image (e.g. docker.io/<org>/optimbench-tb:<task>), which is the fully built Dockerfile (FROM base + all RUN + COPY). Replaying the RUN steps on top of it re-runs the build → double-apply. Smoking gun, from a task whose Dockerfile has RUN git clone https://github.com/JGCRI/hector.git /app/hector:

Command failed in sandbox rllm-hector-carbon-budget-…: git clone https://github.com/JGCRI/hector.git /app/hector
stderr: fatal: destination path '/app/hector' already exists and is not an empty directory.

/app/hector can only "already exist" if the booted image already contains it — i.e. the image is the built Dockerfile and the replay is redundant.

Impact

  • Hundreds of Command failed in sandbox … warnings per eval run.
  • Cold-path setup also inflates wall-clock time, because the full build is re-run for every task (observed setup= of ~15 min/task on modal).
  • An unknown subset of tasks scores a spurious reward 0 when a non-idempotent re-run corrupts already-correct state. (Many tasks still pass because the pre-built image already has the right state, so the bug is often "just" noise, but it is not always harmless.)

Suggested fixes

  1. Defect 1: in _dockerfile_run_commands, join \-continuations with a single space (drop the \), matching shell semantics; or route every replay command through _as_single_run_line. This alone eliminates the whole class of syntax error near `&&' failures.
  2. Defect 2 / double-apply: when _resolve_image returns a configured pre-built image (rather than building the Dockerfile), skip _replay_dockerfile entirely — the image already contains the RUN+COPY result. Replay should only run when the booted image is the Dockerfile's FROM base. Alternatively, gate replay on a flag and prefer snapshots/native docker build for tasks whose Dockerfile uses COPY/ADD.
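
The gating rule in fix 2 can be sketched as a small decision function. Everything here is hypothetical: should_replay, final_base_image, and ReplayDecision are illustrative names, not existing rllm APIs, and the image comparison is a plain string match that does not normalize registry prefixes or default tags (for example, ubuntu vs docker.io/library/ubuntu:latest).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReplayDecision:
    replay: bool
    reason: str


def final_base_image(dockerfile_text: str) -> Optional[str]:
    """Return the image named by the last FROM instruction, if any."""
    base = None
    for raw in dockerfile_text.splitlines():
        tokens = raw.split()
        if len(tokens) >= 2 and tokens[0].upper() == "FROM":
            # Skip flags such as --platform=linux/amd64 before the image name.
            images = [t for t in tokens[1:] if not t.startswith("--")]
            if images:
                base = images[0]
    return base


def should_replay(backend: str, booted_image: str, dockerfile_text: str) -> ReplayDecision:
    """Decide whether replaying RUN steps on the booted sandbox makes sense."""
    if backend == "docker":
        return ReplayDecision(False, "docker builds the Dockerfile natively")
    base = final_base_image(dockerfile_text)
    if base is None:
        return ReplayDecision(False, "no FROM line to compare against")
    if booted_image != base:
        # Assumption: any image other than the FROM base is a pre-built
        # per-task image that already contains the RUN + COPY results.
        return ReplayDecision(False, "booted image is not the FROM base")
    return ReplayDecision(True, "booted image is the FROM base")
```

Returning a reason alongside the boolean makes it straightforward to log why replay was skipped for a given task, which helps when auditing an eval run for the double-apply case.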

Environment

  • repo rllm-org/rllm, branch terminal-rl, commit 9833530
  • backend: modal (also applies to daytona); docker backend unaffected
  • task format: Harbor / terminal-bench v2 tasks with multi-line RUN and/or COPY in environment/Dockerfile
