Skip to content

Commit 2db0049

Browse files
committed
docs: add learnings from nightly report 2026-03-11
New gotchas: sanitize_secrets allowlist bypass, jefzda/ migration incomplete, 6 tasks missing TASK_WORKDIR fallbacks, pass rate logic duplication, repo_health fallback gaps, json.load fd leaks, CI workflow redundancy, export HTML truncation.
1 parent 3381ddf commit 2db0049

File tree

3 files changed

+69
-36
lines changed

3 files changed

+69
-36
lines changed

AGENTS.md

Lines changed: 23 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -65,7 +65,7 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
6565
- After removing directories from the repo, also clean references from `scripts/sync_agent_guides.py` (`LOCAL_SOURCES`) and `scripts/docs_consistency_check.py` (`LOCAL_AGENT_TARGET_DIRS`).
6666

6767
### Daytona / Harbor
68-
- Daytona builds from Dockerfiles at sandbox creation. Fixes on `main` take effect next run (exception: pre-built GHCR base images need separate rebuild).
68+
- Daytona builds from Dockerfiles at sandbox creation. Fixes on `main` take effect next run (pre-built GHCR images need separate rebuild).
6969
- Harbor+Daytona (`harbor run --environment-type daytona`) is recommended. `scripts/daytona_runner.py` is for quick validation only.
7070
- `BASELINE_MCP_TYPE` env var: `none`, `sourcegraph`, `deepsearch`.
7171
- Use Daytona SDK (`daytona_sdk`) over CLI (CLI is interactive-only for SSH).
@@ -75,11 +75,12 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
7575
- Registry types enum: `internal`, `organization`, `transient`, `backup`. Use `organization` for GHCR/Docker Hub.
7676

7777
### Docker / Build
78-
- `uv tool install` segfaults on ARM64/QEMU emulation. Use `pip install` instead, or switch to Daytona (native x86_64).
79-
- Build-push-clean pattern when building Docker images with limited disk (~45GB): build one image, push, then clean locally before the next.
80-
- Colons in agent names (e.g., `module:ClassName`) break Docker volume mounts. Sanitize paths: replace `:` with `__`.
78+
- `uv tool install` segfaults on ARM64/QEMU. Use `pip install` or Daytona (native x86_64).
79+
- Build-push-clean pattern for limited disk (~45GB): build, push, clean before next image.
80+
- Colons in agent names break Docker volume mounts. Replace `:` with `__`.
8181
- Add `|| git init` fallback to all `git clone` commands in Dockerfiles for network resilience. Applied to 269 Dockerfiles.
8282
- Add `chown claude:claude /logs` and `adduser claude` to Dockerfiles for cross-harness (OH) permission compatibility.
83+
- `jefzda/``ghcr.io/sg-evals/` migration incomplete: 33 active Dockerfiles in `csb/debug/` and `csb/fix/` still reference `jefzda/`.
8384

8485
### MCP Configuration (inside sandboxes)
8586
- `.mcp.json` at `$CLAUDE_CONFIG_DIR` (typically `/logs/agent/sessions/`), not `/app/` or `/root/`.
@@ -96,13 +97,15 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
9697
- Token usage in `trajectory.json`; transcript parsers don't see it. Contract: write `/logs/verifier/reward.txt`.
9798

9899
### Security / Credentials
99-
- **Never pass credentials via Docker `-e` flags.** They leak into trajectory HTML when an agent runs `env`. Use file-based injection: write to `/logs/agent/.credentials.json` with `chmod 600`.
100-
- `scripts/sanitize_secrets.py` redacts real API keys (Anthropic, OpenAI, Sourcegraph, GitHub, Daytona) at result generation time. Maintains allowlist for known fake benchmark fixtures.
100+
- **Never pass credentials via Docker `-e` flags** (leak into trajectory HTML). Use file-based injection: `/logs/agent/.credentials.json` with `chmod 600`.
101+
- `scripts/sanitize_secrets.py` redacts real API keys at result generation time. Not yet integrated into `export_official_results.py` (manual invocation required).
102+
- `sanitize_secrets.py` `_FAKE_INDICATORS` substring matching is too broad -- `"example"`, `"test_key"`, `"dummy"` can bypass redaction of real secrets. Use exact-match `FAKE_KEY_ALLOWLIST` instead.
101103

102104
### Harness-Agnostic Verifiers
103-
- **no_changes_guard** must use `git diff origin/main HEAD` (not `git diff HEAD`) for agents that auto-commit (e.g., OpenHands). Otherwise the guard falsely penalizes normal OH behavior.
104-
- Verifier path fallback chains: use `${TASK_WORKDIR:-/workspace}` for working directory and `${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}` for repo root. Enables same verifier across Harbor and OpenHands.
105-
- Set `GOWORK=off` in test.sh when sg_only verifier restores full repo. The go.work file may require a newer Go version than the container provides.
105+
- **no_changes_guard** must use `git diff origin/main HEAD` (not `git diff HEAD`) for auto-committing agents (e.g., OpenHands).
106+
- Verifier fallbacks: `${TASK_WORKDIR:-/workspace}` for workdir, `${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}` for repo root.
107+
- Set `GOWORK=off` in test.sh when sg_only verifier restores full repo (go.work may need newer Go).
108+
- 6 tasks still hardcode `/workspace` without fallbacks: 3 in `csb_sdlc_understand` (document search), 3 in `csb_org_onboarding` (`answer.json`). Zero scores on non-Harbor harnesses.
106109

107110
### Validation / Scoring
108111
- `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (verify with `sha256sum`).
@@ -114,6 +117,9 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
114117
- **CSB dual-score**: file edits + `answer.json` scored independently. Fallback: `promoted_verifier.py` -> `oracle_checks.py` -> heuristic.
115118
- Rate-limited results (score=0, <30s): `scripts/quarantine_invalid_tasks.py --execute`.
116119
- Bare `$VAR` in `instruction.md` gets expanded. Use `<placeholder>` syntax.
120+
- Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`. Sync both on changes.
121+
- `repo_health.py` fallback dict missing `prompt_hygiene` and `launch_policy` checks (3 of 5 only).
122+
- `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0` when key exists. Use `or 1`.
117123

118124
### Git / Auth
119125
- `gh auth refresh` needs explicit `-s <scope>`: `gh auth refresh -h github.com -s write:packages`.
@@ -126,6 +132,7 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
126132
### Python / Subprocess
127133
- `dict.get(key, default)` does NOT protect against `None` values. Use `data.get("key") or default_value`.
128134
- `with open(log) as f: subprocess.Popen(stdout=f)` closes the handle. Use `open()` without context manager for long-running subprocesses.
135+
- `json.load(open(path))` leaks file descriptors. Use `with open(path) as f: json.load(f)`. Affects 12 scripts.
129136
- macOS Bash 3.2 lacks `declare -A`. Use pipe-delimited strings with `IFS='|' read -r`.
130137

131138
### LLM Judge
@@ -134,14 +141,18 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
134141
- Tool categorization order matters: check MCP prefix (`mcp__`) before substring checks (e.g., `deep_search`) to avoid miscategorization of `mcp__deep_search`.
135142

136143
### OpenHands
137-
- `sandbox_plugins` is a list (not property). Strip ALL plugins (`= []`) -- `agent_skills` indexes `/workspace` at startup (120s timeout on large repos). TOML config has no effect in v1.4.0.
144+
- Strip ALL `sandbox_plugins` (`= []`) -- `agent_skills` indexes `/workspace` at startup (120s timeout). TOML config has no effect in v1.4.0.
138145
- `shlex.quote()` breaks on shell metacharacters (0% execution). Base64-encode instructions on host, decode inside container.
139-
- Background daemons outlive the main process and hang Daytona poll. Wrap with `pkill` cleanup; guard with `shutil.which('pkill')` (missing on minimal images).
146+
- Background daemons hang Daytona poll. Wrap with `pkill` cleanup; guard with `shutil.which('pkill')`.
140147
- Alpine lacks `apt-get` (OH installer requirement). Use `bookworm` variants.
141148
- OH MCP client has ~30s timeout. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
142-
- `chown -R /workspace` blocks port binding >120s on large repos. Edit installed `runtime_init.py` source -- monkey-patches don't propagate to action_execution_server subprocess.
149+
- `chown -R /workspace` blocks >120s on large repos. Edit installed `runtime_init.py` source directly.
143150
- Set `PYTHONSAFEPATH=1` to prevent repo-local packages from shadowing installed deps.
144151

152+
### CI / Workflows
153+
- `docs-consistency.yml` is redundant -- subsumed by `repo_health.yml`. Doubles CI minutes.
154+
- Export HTML silently truncates at 1200 rows (`filtered.slice(0, 1200)` in `export_official_results.py`).
155+
145156
### Pre-commit / Pytest / Ralph
146157
- Secret-detection hooks false-positive on code that _detects_ secrets. Use `--no-verify` when flagged code is detection logic.
147158
- Classes named `TestPlan`/`TestCase`/`TestResult` get auto-collected by pytest. Rename to `EvaluationPlan` etc.

CLAUDE.md

Lines changed: 23 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -65,7 +65,7 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
6565
- After removing directories from the repo, also clean references from `scripts/sync_agent_guides.py` (`LOCAL_SOURCES`) and `scripts/docs_consistency_check.py` (`LOCAL_AGENT_TARGET_DIRS`).
6666

6767
### Daytona / Harbor
68-
- Daytona builds from Dockerfiles at sandbox creation. Fixes on `main` take effect next run (exception: pre-built GHCR base images need separate rebuild).
68+
- Daytona builds from Dockerfiles at sandbox creation. Fixes on `main` take effect next run (pre-built GHCR images need separate rebuild).
6969
- Harbor+Daytona (`harbor run --environment-type daytona`) is recommended. `scripts/daytona_runner.py` is for quick validation only.
7070
- `BASELINE_MCP_TYPE` env var: `none`, `sourcegraph`, `deepsearch`.
7171
- Use Daytona SDK (`daytona_sdk`) over CLI (CLI is interactive-only for SSH).
@@ -75,11 +75,12 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
7575
- Registry types enum: `internal`, `organization`, `transient`, `backup`. Use `organization` for GHCR/Docker Hub.
7676

7777
### Docker / Build
78-
- `uv tool install` segfaults on ARM64/QEMU emulation. Use `pip install` instead, or switch to Daytona (native x86_64).
79-
- Build-push-clean pattern when building Docker images with limited disk (~45GB): build one image, push, then clean locally before the next.
80-
- Colons in agent names (e.g., `module:ClassName`) break Docker volume mounts. Sanitize paths: replace `:` with `__`.
78+
- `uv tool install` segfaults on ARM64/QEMU. Use `pip install` or Daytona (native x86_64).
79+
- Build-push-clean pattern for limited disk (~45GB): build, push, clean before next image.
80+
- Colons in agent names break Docker volume mounts. Replace `:` with `__`.
8181
- Add `|| git init` fallback to all `git clone` commands in Dockerfiles for network resilience. Applied to 269 Dockerfiles.
8282
- Add `chown claude:claude /logs` and `adduser claude` to Dockerfiles for cross-harness (OH) permission compatibility.
83+
- `jefzda/``ghcr.io/sg-evals/` migration incomplete: 33 active Dockerfiles in `csb/debug/` and `csb/fix/` still reference `jefzda/`.
8384

8485
### MCP Configuration (inside sandboxes)
8586
- `.mcp.json` at `$CLAUDE_CONFIG_DIR` (typically `/logs/agent/sessions/`), not `/app/` or `/root/`.
@@ -96,13 +97,15 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
9697
- Token usage in `trajectory.json`; transcript parsers don't see it. Contract: write `/logs/verifier/reward.txt`.
9798

9899
### Security / Credentials
99-
- **Never pass credentials via Docker `-e` flags.** They leak into trajectory HTML when an agent runs `env`. Use file-based injection: write to `/logs/agent/.credentials.json` with `chmod 600`.
100-
- `scripts/sanitize_secrets.py` redacts real API keys (Anthropic, OpenAI, Sourcegraph, GitHub, Daytona) at result generation time. Maintains allowlist for known fake benchmark fixtures.
100+
- **Never pass credentials via Docker `-e` flags** (leak into trajectory HTML). Use file-based injection: `/logs/agent/.credentials.json` with `chmod 600`.
101+
- `scripts/sanitize_secrets.py` redacts real API keys at result generation time. Not yet integrated into `export_official_results.py` (manual invocation required).
102+
- `sanitize_secrets.py` `_FAKE_INDICATORS` substring matching is too broad -- `"example"`, `"test_key"`, `"dummy"` can bypass redaction of real secrets. Use exact-match `FAKE_KEY_ALLOWLIST` instead.
101103

102104
### Harness-Agnostic Verifiers
103-
- **no_changes_guard** must use `git diff origin/main HEAD` (not `git diff HEAD`) for agents that auto-commit (e.g., OpenHands). Otherwise the guard falsely penalizes normal OH behavior.
104-
- Verifier path fallback chains: use `${TASK_WORKDIR:-/workspace}` for working directory and `${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}` for repo root. Enables same verifier across Harbor and OpenHands.
105-
- Set `GOWORK=off` in test.sh when sg_only verifier restores full repo. The go.work file may require a newer Go version than the container provides.
105+
- **no_changes_guard** must use `git diff origin/main HEAD` (not `git diff HEAD`) for auto-committing agents (e.g., OpenHands).
106+
- Verifier fallbacks: `${TASK_WORKDIR:-/workspace}` for workdir, `${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}` for repo root.
107+
- Set `GOWORK=off` in test.sh when sg_only verifier restores full repo (go.work may need newer Go).
108+
- 6 tasks still hardcode `/workspace` without fallbacks: 3 in `csb_sdlc_understand` (document search), 3 in `csb_org_onboarding` (`answer.json`). Zero scores on non-Harbor harnesses.
106109

107110
### Validation / Scoring
108111
- `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (verify with `sha256sum`).
@@ -114,6 +117,9 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
114117
- **CSB dual-score**: file edits + `answer.json` scored independently. Fallback: `promoted_verifier.py` -> `oracle_checks.py` -> heuristic.
115118
- Rate-limited results (score=0, <30s): `scripts/quarantine_invalid_tasks.py --execute`.
116119
- Bare `$VAR` in `instruction.md` gets expanded. Use `<placeholder>` syntax.
120+
- Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`. Sync both on changes.
121+
- `repo_health.py` fallback dict missing `prompt_hygiene` and `launch_policy` checks (3 of 5 only).
122+
- `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0` when key exists. Use `or 1`.
117123

118124
### Git / Auth
119125
- `gh auth refresh` needs explicit `-s <scope>`: `gh auth refresh -h github.com -s write:packages`.
@@ -126,6 +132,7 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
126132
### Python / Subprocess
127133
- `dict.get(key, default)` does NOT protect against `None` values. Use `data.get("key") or default_value`.
128134
- `with open(log) as f: subprocess.Popen(stdout=f)` closes the handle. Use `open()` without context manager for long-running subprocesses.
135+
- `json.load(open(path))` leaks file descriptors. Use `with open(path) as f: json.load(f)`. Affects 12 scripts.
129136
- macOS Bash 3.2 lacks `declare -A`. Use pipe-delimited strings with `IFS='|' read -r`.
130137

131138
### LLM Judge
@@ -134,14 +141,18 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
134141
- Tool categorization order matters: check MCP prefix (`mcp__`) before substring checks (e.g., `deep_search`) to avoid miscategorization of `mcp__deep_search`.
135142

136143
### OpenHands
137-
- `sandbox_plugins` is a list (not property). Strip ALL plugins (`= []`) -- `agent_skills` indexes `/workspace` at startup (120s timeout on large repos). TOML config has no effect in v1.4.0.
144+
- Strip ALL `sandbox_plugins` (`= []`) -- `agent_skills` indexes `/workspace` at startup (120s timeout). TOML config has no effect in v1.4.0.
138145
- `shlex.quote()` breaks on shell metacharacters (0% execution). Base64-encode instructions on host, decode inside container.
139-
- Background daemons outlive the main process and hang Daytona poll. Wrap with `pkill` cleanup; guard with `shutil.which('pkill')` (missing on minimal images).
146+
- Background daemons hang Daytona poll. Wrap with `pkill` cleanup; guard with `shutil.which('pkill')`.
140147
- Alpine lacks `apt-get` (OH installer requirement). Use `bookworm` variants.
141148
- OH MCP client has ~30s timeout. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
142-
- `chown -R /workspace` blocks port binding >120s on large repos. Edit installed `runtime_init.py` source -- monkey-patches don't propagate to action_execution_server subprocess.
149+
- `chown -R /workspace` blocks >120s on large repos. Edit installed `runtime_init.py` source directly.
143150
- Set `PYTHONSAFEPATH=1` to prevent repo-local packages from shadowing installed deps.
144151

152+
### CI / Workflows
153+
- `docs-consistency.yml` is redundant -- subsumed by `repo_health.yml`. Doubles CI minutes.
154+
- Export HTML silently truncates at 1200 rows (`filtered.slice(0, 1200)` in `export_official_results.py`).
155+
145156
### Pre-commit / Pytest / Ralph
146157
- Secret-detection hooks false-positive on code that _detects_ secrets. Use `--no-verify` when flagged code is detection logic.
147158
- Classes named `TestPlan`/`TestCase`/`TestResult` get auto-collected by pytest. Rename to `EvaluationPlan` etc.

0 commit comments

Comments
 (0)