- After removing directories from the repo, also clean references from `scripts/sync_agent_guides.py` (`LOCAL_SOURCES`) and `scripts/docs_consistency_check.py` (`LOCAL_AGENT_TARGET_DIRS`).

### Daytona / Harbor
-
- Daytona builds from Dockerfiles at sandbox creation. Fixes on `main` take effect next run (exception: pre-built GHCR base images need separate rebuild).
+
- Daytona builds from Dockerfiles at sandbox creation. Fixes on `main` take effect next run (pre-built GHCR images need separate rebuild).
- Harbor+Daytona (`harbor run --environment-type daytona`) is recommended. `scripts/daytona_runner.py` is for quick validation only.
- Token usage in `trajectory.json`; transcript parsers don't see it. Contract: write `/logs/verifier/reward.txt`.

### Security / Credentials
-
- **Never pass credentials via Docker `-e` flags.** They leak into trajectory HTML when an agent runs `env`. Use file-based injection: write to `/logs/agent/.credentials.json` with `chmod 600`.
-
- `scripts/sanitize_secrets.py` redacts real API keys (Anthropic, OpenAI, Sourcegraph, GitHub, Daytona) at result generation time. Maintains allowlist for known fake benchmark fixtures.
+
- **Never pass credentials via Docker `-e` flags** (leak into trajectory HTML). Use file-based injection: `/logs/agent/.credentials.json` with `chmod 600`.
+
- `scripts/sanitize_secrets.py` redacts real API keys at result generation time. Not yet integrated into `export_official_results.py` (manual invocation required).
+
- `sanitize_secrets.py` `_FAKE_INDICATORS` substring matching is too broad -- `"example"`, `"test_key"`, `"dummy"` can bypass redaction of real secrets. Use exact-match `FAKE_KEY_ALLOWLIST` instead.
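The substring-versus-exact-match point is easier to see with a concrete case. The sketch below is illustrative only: `_FAKE_INDICATORS` and `FAKE_KEY_ALLOWLIST` are names from the notes, but their contents here, both helper functions, and the key strings are made up and do not reflect the actual `sanitize_secrets.py` code.

```python
# Hypothetical values; the real lists in sanitize_secrets.py may differ.
_FAKE_INDICATORS = ("example", "test_key", "dummy")
FAKE_KEY_ALLOWLIST = frozenset({"FAKE-FIXTURE-KEY-0000"})


def is_fake_by_substring(key: str) -> bool:
    """Too broad: any key that merely contains an indicator counts as fake."""
    lowered = key.lower()
    return any(indicator in lowered for indicator in _FAKE_INDICATORS)


def is_fake_by_allowlist(key: str) -> bool:
    """Exact match: only keys listed verbatim count as fake."""
    return key in FAKE_KEY_ALLOWLIST


# A made-up string standing in for a real secret that happens to contain "example".
real_looking_key = "prod-example-9f8e7d6c"

print(is_fake_by_substring(real_looking_key))         # True: redaction would be skipped
print(is_fake_by_allowlist(real_looking_key))         # False: the key would be redacted
print(is_fake_by_allowlist("FAKE-FIXTURE-KEY-0000"))  # True: known fixture is left alone
```

With exact matching, a new fixture has to be added to the allowlist explicitly, which is the safer failure mode: an unlisted fake key gets redacted unnecessarily, rather than a real key slipping through.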
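For the file-based injection recommended under Security / Credentials, one possible shape is sketched here. The function name, the demo path, and the credential key are assumptions for illustration; only the target path `/logs/agent/.credentials.json` and the `600` mode come from the notes.

```python
import json
import os
import stat
import tempfile


def write_credentials(path: str, creds: dict) -> None:
    """Write creds as JSON to a file that only the owner can read or write."""
    # Create with 0o600 from the start so the file is never briefly world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        json.dump(creds, fh)
    # The mode passed with O_CREAT is ignored for a pre-existing file, so enforce it.
    os.chmod(path, 0o600)


# Demo in a temp dir; inside the sandbox the notes use /logs/agent/.credentials.json.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, ".credentials.json")
write_credentials(demo_path, {"EXAMPLE_API_KEY": "placeholder-value"})
mode = stat.S_IMODE(os.stat(demo_path).st_mode)
print(oct(mode))  # 0o600 on POSIX systems
```

Opening with the restrictive mode, rather than writing first and running `chmod` afterwards, avoids a short window in which the file is readable under the default umask.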
### Harness-Agnostic Verifiers
-
- **no_changes_guard** must use `git diff origin/main HEAD` (not `git diff HEAD`) for agents that auto-commit (e.g., OpenHands). Otherwise the guard falsely penalizes normal OH behavior.
-
- Verifier path fallback chains: use `${TASK_WORKDIR:-/workspace}` for working directory and `${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}` for repo root. Enables same verifier across Harbor and OpenHands.
-
- Set `GOWORK=off` in test.sh when sg_only verifier restores full repo. The go.work file may require a newer Go version than the container provides.
+
- **no_changes_guard** must use `git diff origin/main HEAD` (not `git diff HEAD`) for auto-committing agents (e.g., OpenHands).
+
- Verifier fallbacks: `${TASK_WORKDIR:-/workspace}` for workdir, `${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}` for repo root.
+
- Set `GOWORK=off` in test.sh when sg_only verifier restores full repo (go.work may need newer Go).
+
- 6 tasks still hardcode `/workspace` without fallbacks: 3 in `csb_sdlc_understand` (document search), 3 in `csb_org_onboarding` (`answer.json`). Zero scores on non-Harbor harnesses.
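For verifiers written in Python rather than shell, the same fallback chain might look like the sketch below. The function name and the sample paths are hypothetical; the environment variable names and the `/workspace` default are the ones listed above.

```python
import os
from typing import Mapping


def resolve_paths(env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Return (workdir, repo_root) using the same fallbacks as the shell expansions."""
    # `${VAR:-default}` falls back when VAR is unset *or* empty, hence `or`
    # instead of passing a default to env.get().
    workdir = env.get("TASK_WORKDIR") or "/workspace"
    repo_root = env.get("TASK_REPO_ROOT") or env.get("VERIFY_REPO") or "/workspace"
    return workdir, repo_root


print(resolve_paths({}))
# ('/workspace', '/workspace')
print(resolve_paths({"TASK_REPO_ROOT": "", "VERIFY_REPO": "/repo"}))
# ('/workspace', '/repo')
print(resolve_paths({"TASK_WORKDIR": "/work", "TASK_REPO_ROOT": "/src"}))
# ('/work', '/src')
```

A task that hardcodes `/workspace` corresponds to skipping this lookup entirely, which is why it only works on a harness that happens to mount the repo there.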
### Validation / Scoring
- `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (verify with `sha256sum`).
- Tool categorization order matters: check MCP prefix (`mcp__`) before substring checks (e.g., `deep_search`) to avoid miscategorization of `mcp__deep_search`.
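The ordering issue in tool categorization can be shown in a few lines. This is a minimal sketch: the category labels and both function names are assumptions, not the project's actual categorizer.

```python
def categorize_tool(name: str) -> str:
    """Prefix check first, so MCP tools are never caught by a substring rule."""
    if name.startswith("mcp__"):
        return "mcp"
    if "deep_search" in name:
        return "deep_search"
    return "other"


def categorize_tool_wrong_order(name: str) -> str:
    """Same rules in the wrong order, kept only to show the miscategorization."""
    if "deep_search" in name:
        return "deep_search"
    if name.startswith("mcp__"):
        return "mcp"
    return "other"


print(categorize_tool("mcp__deep_search"))              # mcp
print(categorize_tool_wrong_order("mcp__deep_search"))  # deep_search
print(categorize_tool("deep_search"))                   # deep_search
```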
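For the duplicated `validators.py` copies mentioned under Validation / Scoring, a scripted equivalent of eyeballing `sha256sum` output could look like this. The helper name, the directory layout, and the file contents are invented for the demo; more than one hash group means at least one copy has drifted.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def group_by_sha256(root: Path, filename: str = "validators.py") -> dict[str, list[Path]]:
    """Map each distinct SHA-256 digest to the copies of `filename` that have it."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob(filename)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path)
    return dict(groups)


# Demo tree with three hypothetical tasks; task_c carries a stale copy.
root = Path(tempfile.mkdtemp())
for task, body in [("task_a", "X = 1\n"), ("task_b", "X = 1\n"), ("task_c", "X = 2\n")]:
    (root / task).mkdir()
    (root / task / "validators.py").write_text(body)

groups = group_by_sha256(root)
print(len(groups))  # 2: one group for task_a/task_b, one for task_c
```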
### OpenHands
-
- `sandbox_plugins` is a list (not property). Strip ALL plugins (`= []`) -- `agent_skills` indexes `/workspace` at startup (120s timeout on large repos). TOML config has no effect in v1.4.0.
+
- Strip ALL `sandbox_plugins` (`= []`) -- `agent_skills` indexes `/workspace` at startup (120s timeout). TOML config has no effect in v1.4.0.
- `shlex.quote()` breaks on shell metacharacters (0% execution). Base64-encode instructions on host, decode inside container.
-
- Background daemons outlive the main process and hang Daytona poll. Wrap with `pkill` cleanup; guard with `shutil.which('pkill')` (missing on minimal images).
+
- Background daemons hang Daytona poll. Wrap with `pkill` cleanup; guard with `shutil.which('pkill')`.
- Alpine lacks `apt-get` (OH installer requirement). Use `bookworm` variants.
- OH MCP client has ~30s timeout. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
-
- `chown -R /workspace` blocks port binding >120s on large repos. Edit installed `runtime_init.py` source -- monkey-patches don't propagate to action_execution_server subprocess.
+
- `chown -R /workspace` blocks >120s on large repos. Edit installed `runtime_init.py` source directly.
- Set `PYTHONSAFEPATH=1` to prevent repo-local packages from shadowing installed deps.
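The Base64 workaround for instruction passing can be sketched as a round trip. The function names and the sample instruction are hypothetical; how the payload is actually transported into the container is outside this sketch.

```python
import base64


def encode_instruction(text: str) -> str:
    """Host side: turn arbitrary instruction text into a shell-inert token."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_instruction(payload: str) -> str:
    """Container side: recover the original instruction text."""
    return base64.b64decode(payload).decode("utf-8")


# Sample text full of characters a shell would otherwise interpret.
instruction = "Fix `foo`; then run $(make test) && echo \"done\" | tee 'out.log'"
payload = encode_instruction(instruction)

# Standard Base64 output contains only A-Z, a-z, 0-9, '+', '/' and '=',
# so it has no quotes, spaces, backticks, or '$' to escape.
print(payload.isascii())                          # True
print(decode_instruction(payload) == instruction)  # True
```

Because the encoded token contains no quoting characters, it can be passed as a single argument or written to a file without any escaping step.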
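The `pkill` guard from the OpenHands notes might be wrapped as below. This is a sketch under assumptions: the function name is invented, and the match pattern a caller passes is whatever identifies the daemon in that setup.

```python
import shutil
import subprocess


def stop_background_daemons(pattern: str) -> bool:
    """Best-effort cleanup; returns False when pkill is not installed."""
    pkill = shutil.which("pkill")
    if pkill is None:
        # Skip quietly instead of raising FileNotFoundError on images without pkill.
        return False
    # `pkill -f` matches against the full command line. It exits non-zero when
    # nothing matched, so that is deliberately not treated as an error.
    subprocess.run([pkill, "-f", pattern], check=False)
    return True
```

Calling this from a `finally` block around the main agent process is one way to make sure the cleanup runs even when the agent exits with an error.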
+
### CI / Workflows
+
- `docs-consistency.yml` is redundant -- subsumed by `repo_health.yml`. Doubles CI minutes.
+
- Export HTML silently truncates at 1200 rows (`filtered.slice(0, 1200)` in `export_official_results.py`).
+

### Pre-commit / Pytest / Ralph
- Secret-detection hooks false-positive on code that _detects_ secrets. Use `--no-verify` when flagged code is detection logic.
- Classes named `TestPlan`/`TestCase`/`TestResult` get auto-collected by pytest. Rename to `EvaluationPlan` etc.