Dockerfile RUN-replay corrupts Harbor task setup on modal/daytona (line-continuation mangling + double-apply on prebuilt images) #655

## Summary

On non-docker sandbox backends (`modal`, `daytona`), the cold-path "replay the Dockerfile `RUN` steps" mechanism (`_replay_dockerfile` → `_dockerfile_run_commands` in `rllm/eval/_resolution.py`) corrupts Harbor / terminal-bench task setups in two distinct ways: it mangles `\`-continued `RUN` lines into invalid shell, and it drops `COPY`/`ADD` so that dependent `RUN` steps fail. Because these tasks boot from a **pre-built** per-task image, the replay also **double-applies** the build steps. The result is a flood of `Command failed in sandbox …` setup warnings and an unknown number of spuriously failed (reward 0) tasks during eval.

Discovered while running `rllm eval <local harbor dir> --agent terminus2 --sandbox-backend modal`.

## Affected path

- `rllm/eval/_resolution.py:199` `_replay_dockerfile` (cold path; called by `_create_sandbox_for_task:213`)
- `rllm/eval/_resolution.py:251` `_dockerfile_run_commands`
- `rllm/eval/_resolution.py:298` `_resolve_image` (returns the configured pre-built image for non-docker backends)
- Snapshot builds are also affected: `build_modal_snapshot` and the daytona snapshot build both call the same `_replay_dockerfile`, so a snapshot bakes in the same broken replay.
- `--sandbox-backend docker` is **not** affected — it builds the Dockerfile natively (`_replay_dockerfile` early-returns for `backend == "docker"`).

## Defect 1 — line-continuation mangling produces invalid shell

`_dockerfile_run_commands` strips the trailing `\` from each continued line and rejoins with `\n` (`rllm/eval/_resolution.py:278`):

```python
cmd = "\n".join(parts).strip()
```

A `\`-continuation in a Dockerfile `RUN` is a *shell line continuation* — the lines form **one** logical command. Joining with `\n` instead of a space turns a valid multi-line `&&` chain into invalid shell. `_replay_dockerfile` then execs the raw multi-line string via `_safe_exec` → `bash -c` (`:209-210`); note it does **not** use the existing `_as_single_run_line:285` helper.

### Reproduction

```python
from rllm.tasks.loader import BenchmarkLoader
from rllm.eval._resolution import _dockerfile_run_commands

# Load the local Harbor tasks the same way the modal eval run does.
br = BenchmarkLoader.load("<harbor tasks dir>", sandbox_backend="modal", harness_name="oracle")
t = next(x for x in br.tasks if x.id == "ansible-debug-project")
# repr() makes the embedded "\n" in each extracted replay command visible.
for c in _dockerfile_run_commands(t):
    print(repr(c))
```

For a Dockerfile containing:

```dockerfile
RUN apt-get update && apt-get install -y \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*
```

the extracted "replay command" is:

```
apt-get update && apt-get install -y
    python3-pip
    && rm -rf /var/lib/apt/lists/*
```

which `bash` rejects with exactly the error seen in eval logs:

```
bash: -c: line 2: syntax error near unexpected token `&&'
```
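
For comparison, here is a minimal sketch of extraction that folds continuations into a single logical line. `join_run_continuations` is a hypothetical standalone helper, not the existing `_dockerfile_run_commands`; it takes raw Dockerfile text rather than a task object, and it assumes shell-form `RUN` only (no heredocs, no JSON exec form).

```python
def join_run_continuations(dockerfile_text: str) -> list:
    """Extract shell-form RUN commands, folding '\\' continuations into one line."""
    commands = []
    parts = None  # None means we are not inside a RUN instruction
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if parts is None:
            if not line.upper().startswith("RUN "):
                continue  # FROM, COPY, ENV, comments, blank lines, ...
            line = line[4:].strip()
            parts = []
        elif not line or line.startswith("#"):
            # Blank and comment lines inside a continued instruction are skipped.
            continue
        continued = line.endswith("\\")
        if continued:
            line = line[:-1].rstrip()  # drop the trailing backslash
        if line:
            parts.append(line)
        if not continued:
            # Join with a space so the pieces stay one logical shell command.
            commands.append(" ".join(parts))
            parts = None
    if parts:  # file ended in the middle of a continuation
        commands.append(" ".join(parts))
    return commands


example = r"""FROM ubuntu:22.04
RUN apt-get update && apt-get install -y \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*
"""
for cmd in join_run_continuations(example):
    print(repr(cmd))
```

With a space join, the example yields one line in which `&&` follows `python3-pip` directly, so `bash -c` sees a single valid chain instead of a line that starts with `&&`.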

## Defect 2 — `COPY`/`ADD` are dropped, and replay double-applies on a pre-built image

`_dockerfile_run_commands` intentionally keeps only `RUN` (docstring: *"only RUN is replayable on a live sandbox"*), dropping `COPY`/`ADD`. So replayed `RUN` steps that depend on copied files fail, e.g.:

```
cc1: fatal error: /tmp/vault.c: No such file or directory
python3: can't open file '/setup/generate_data.py': [Errno 2] No such file or directory
```

Crucially, on `modal`/`daytona` `_resolve_image` returns the **configured pre-built** image (e.g. `docker.io/<org>/optimbench-tb:<task>`), which is the *fully built* Dockerfile (FROM base + all RUN + COPY). Replaying the `RUN` steps on top of it re-runs the build → double-apply. Smoking gun, from a task whose Dockerfile has `RUN git clone https://github.com/JGCRI/hector.git /app/hector`:

```
Command failed in sandbox rllm-hector-carbon-budget-…: git clone https://github.com/JGCRI/hector.git /app/hector
stderr: fatal: destination path '/app/hector' already exists and is not an empty directory.
```

`/app/hector` can only "already exist" if the booted image already contains it — i.e. the image is the built Dockerfile and the replay is redundant.
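
The redundancy check implied here can be sketched as a comparison between the booted image and the Dockerfile's `FROM` base. Both function names below are hypothetical and not part of `rllm`; the comparison is a naive string match that assumes both references are written the same way (no registry-prefix or tag normalization), and it does not resolve multi-stage builds whose final `FROM` names an earlier stage.

```python
from typing import Optional


def dockerfile_base_image(dockerfile_text: str) -> Optional[str]:
    """Return the image named by the last FROM instruction, or None if absent."""
    base = None
    for raw in dockerfile_text.splitlines():
        tokens = raw.strip().split()
        if not tokens or tokens[0].upper() != "FROM":
            continue
        # Skip flags such as --platform=linux/amd64 that precede the image name.
        args = [t for t in tokens[1:] if not t.startswith("--")]
        if args:
            base = args[0]
    return base


def should_replay(booted_image: str, dockerfile_text: str) -> bool:
    """Replay RUN steps only when the sandbox booted from the bare FROM base."""
    return booted_image == dockerfile_base_image(dockerfile_text)


dockerfile = "FROM ubuntu:22.04\nRUN git clone https://github.com/JGCRI/hector.git /app/hector\n"
print(should_replay("ubuntu:22.04", dockerfile))
# "example.invalid/prebuilt:task" is a made-up stand-in for a configured pre-built image.
print(should_replay("example.invalid/prebuilt:task", dockerfile))
```

Under these assumptions, a sandbox booted from a configured pre-built image would skip replay, and the `git clone` above would not run a second time.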

## Impact

- Hundreds of `Command failed in sandbox …` warnings per eval run.
- Cold-path setup also balloons wall-clock (re-running the full build per task; observed ~15 min/task `setup=` on modal).
- An unknown subset of tasks score a spurious reward 0 when a non-idempotent re-run corrupts already-correct state. (Many tasks still pass because the pre-built image already has the right state, so the bug is often "just" noise — but it is not always harmless.)

## Suggested fixes

1. **Defect 1:** in `_dockerfile_run_commands`, join `\`-continuations with a single space (drop the `\`), matching shell semantics; or route every replay command through `_as_single_run_line`. This alone removes the `syntax error near '&&'` class.
2. **Defect 2 / double-apply:** when `_resolve_image` returns a configured pre-built image (rather than building the Dockerfile), skip `_replay_dockerfile` entirely — the image already contains the `RUN`+`COPY` result. Replay should only run when the booted image is the Dockerfile's `FROM` base. Alternatively, gate replay on a flag and prefer snapshots/native docker build for tasks whose Dockerfile uses `COPY`/`ADD`.
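
For the alternative in fix 2, one possible gate is a check for whether a Dockerfile uses `COPY` or `ADD` at all, since those steps cannot be replayed on a live sandbox. This is a rough sketch; `uses_copy_or_add` is a hypothetical helper operating on raw Dockerfile text, and it ignores heredoc bodies.

```python
def uses_copy_or_add(dockerfile_text: str) -> bool:
    """True if any instruction (not a continuation line) is COPY or ADD."""
    continuing = False  # True while inside a backslash-continued instruction
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank and comment lines do not end a continuation
        starts_instruction = not continuing
        continuing = line.endswith("\\")
        if starts_instruction and line.split()[0].upper() in {"COPY", "ADD"}:
            return True
    return False


print(uses_copy_or_add("FROM ubuntu:22.04\nCOPY vault.c /tmp/vault.c\nRUN cc /tmp/vault.c\n"))
```

Tracking continuations avoids a false positive when a continued `RUN` line happens to begin with the word `add` or `copy`.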

## Environment

- repo `rllm-org/rllm`, branch `terminal-rl`, commit `9833530`
- backend: `modal` (also applies to `daytona`); `docker` backend unaffected
- task format: Harbor / terminal-bench v2 tasks with multi-line `RUN` and/or `COPY` in `environment/Dockerfile`

