Comprehensive evaluation of benchmark run traces: data integrity validation, output quality assessment, and efficiency analysis across configurations and benchmarks.
- evaluate traces, analyze traces, review benchmark results, audit traces, check eval results, analyze run, benchmark evaluation, evaluate run, trace quality, run quality
When triggered, perform a comprehensive trace evaluation covering three dimensions: (1) data integrity, (2) output quality, and (3) efficiency. Accept optional arguments to scope the analysis (e.g., specific suite, config, or run).
If the user provides arguments (e.g., evaluate traces for SWE-bench Pro SG_full), use those to scope. Otherwise, evaluate ALL official runs.
Key paths and tools:
- MANIFEST: `runs/official/MANIFEST.json` (symlink to evals dir, not in git)
- Audit script: `python3 scripts/audit_traces.py [--json] [--suite X] [--config X]`
- Manifest generator: `python3 scripts/generate_manifest.py`
- Run directory: `runs/official/`
- Configs: `baseline`, `sourcegraph_full`
Critical technical details:
- `runs/official` is a symlink; use the real path for `find` commands
- MCP tool names in traces may have the `sg_` prefix (`mcp__sourcegraph__sg_keyword_search`) OR not (`mcp__sourcegraph__keyword_search`) depending on batch vintage. Always check for both patterns.
- Batch timestamp dirs use pattern `YYYY-MM-DD__HH-MM-SS` with `__` separator; don't confuse with task dirs (`task_name__hash`)
- Task-level result.json has full data; batch-level result.json only has aggregate stats
- Transcript vs trajectory: `claude-code.txt` includes ALL tool calls (including Task subagent MCP calls). `trajectory.json` only has main-agent calls and UNDERCOUNTS MCP usage. Always prefer transcript for tool counting.
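The naming rules above can be sketched as two small helpers. This is a minimal illustration, not the logic used by `scripts/audit_traces.py`; the function names `is_batch_dir` and `normalize_mcp_tool` are hypothetical.

```python
import re
from typing import Optional

# Batch dirs look like 2025-01-31__14-05-09; task dirs look like task_name__hash.
BATCH_DIR_RE = re.compile(r"^\d{4}-\d{2}-\d{2}__\d{2}-\d{2}-\d{2}$")
MCP_PREFIX = "mcp__sourcegraph__"


def is_batch_dir(name: str) -> bool:
    """True if a directory name matches the YYYY-MM-DD__HH-MM-SS batch pattern."""
    return BATCH_DIR_RE.match(name) is not None


def normalize_mcp_tool(name: str) -> Optional[str]:
    """Return the bare tool name for either prefix variant, or None for non-MCP tools."""
    if not name.startswith(MCP_PREFIX):
        return None
    bare = name[len(MCP_PREFIX):]
    # Strip the sg_ prefix when present so both batch vintages map to one name.
    return bare[3:] if bare.startswith("sg_") else bare
```

For the symlink, `os.path.realpath("runs/official")` resolves the real directory before it is handed to `find` or a directory walk.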
Launch parallel subagents to check data integrity:
For each SG config task in MANIFEST:
- Read `agent/claude-code.txt` (NOT trajectory.json) and count `mcp__sourcegraph` tool_use invocations
- Check both `sg_`-prefixed and non-prefixed tool name variants
- Verify SG_full tasks use MCP tools (should be ~100% of scored tasks) and check Deep Search (`deepsearch`) adoption
- Classify tasks into used-MCP vs zero-MCP groups:
  - Used-MCP: At least 1 MCP tool call in transcript. Further classify by intensity:
    - Light (1-5 calls): Minimal MCP usage, likely spot checks
    - Moderate (6-20 calls): Regular MCP usage during exploration
    - Heavy (>20 calls): MCP-centric workflow
  - Zero-MCP: MCP available but 0 calls. Classify reason:
    - Trivially local (DependEval dependency_recognition: all data in local files)
    - Explicit file list (CodeReview: instructions name exact files)
    - Full local codebase (SWE-Perf: complete repo in container)
    - Both configs failed (neither baseline nor SG_full produced useful output)
    - Agent confusion (needs transcript investigation)
- Report zero-MCP rate per benchmark. High zero-MCP (>30%) suggests MCP isn't valuable for that task type.
- Important: Compute reward/time deltas SEPARATELY for used-MCP and zero-MCP groups. Mixing them dilutes the signal (zero-MCP tasks add preamble overhead with no MCP benefit).
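A minimal sketch of the transcript-first count and intensity bucketing described above. It assumes each line of `claude-code.txt` is a JSON object and that tool calls appear somewhere inside it as objects with `"type": "tool_use"` and a `"name"` field; the exact nesting is not specified here, so the walk is recursive. Treating exactly 20 calls as moderate is also an assumption.

```python
import json


def iter_tool_uses(node):
    """Recursively yield every dict that looks like a tool_use event."""
    if isinstance(node, dict):
        if node.get("type") == "tool_use" and isinstance(node.get("name"), str):
            yield node
        for value in node.values():
            yield from iter_tool_uses(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_tool_uses(item)


def count_mcp_calls(transcript_lines):
    """Count mcp__sourcegraph tool_use events across JSONL transcript lines."""
    total = 0
    for line in transcript_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise rather than abort the audit
        # The shared prefix matches both sg_-prefixed and non-prefixed tool names.
        total += sum(
            1 for use in iter_tool_uses(event)
            if use["name"].startswith("mcp__sourcegraph")
        )
    return total


def intensity_bucket(n_calls):
    """Map a call count to the zero/light/moderate/heavy groups."""
    if n_calls == 0:
        return "zero"
    if n_calls <= 5:
        return "light"
    if n_calls <= 20:
        return "moderate"
    return "heavy"
```

If a transcript repeats the same tool call in more than one event, deduplicating on the call's id before counting would be needed; that is not handled here.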
- Scan ALL baseline `claude-code.txt` for any `mcp__sourcegraph` tool calls (should be 0)
- Check baseline `instruction.txt` for Sourcegraph/MCP references (cosmetic, not functional)
- Zero-token tasks: `n_input_tokens=0, n_output_tokens=0` → auth failures (agent never ran)
- Crash failures: `n_input/n_output=null`, no trajectory, `<=5 claude-code.txt lines` → Docker/Node.js crash
- Null-token H3 bug: `null` tokens but agent ran fine (50+ cc_lines, valid rewards); these are NOT failures
- Exceptions: `AgentSetupTimeoutError`, `RuntimeError` in result.json
- Setup failures: Non-zero return code in `agent/setup/stdout.txt`
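The token-based rules above can be expressed as a small decision function. This is a sketch under assumptions: the argument names are illustrative rather than actual result.json keys, and the `needs_review` fallback for ambiguous null-token runs is an addition for safety, not a category from the checks above.

```python
def classify_run(n_input, n_output, has_trajectory, cc_lines, reward):
    """Map token fields and trace size to a failure category.

    cc_lines is the line count of claude-code.txt; reward is None when missing.
    """
    if n_input == 0 and n_output == 0:
        return "auth_failure"        # integer zeros: agent never ran
    if n_input is None and n_output is None:
        if not has_trajectory and cc_lines <= 5:
            return "crash"           # Docker/Node.js crash before real work
        if cc_lines >= 50 and reward is not None:
            return "ok_null_tokens"  # H3 logging bug: the run itself is valid
        return "needs_review"        # null tokens with ambiguous evidence
    return "ok"
```

The distinction between integer `0` and `None` carries the whole check, so token fields should be read as-is rather than coerced with something like `tokens or 0`.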
- Verify MANIFEST dedup prefers non-zero-token results over zero-token results
- Check for auth-failed runs that may corrupt scores via timestamp-based dedup
- Regenerate MANIFEST if issues found: `python3 scripts/generate_manifest.py`
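To make the dedup expectation concrete, here is a minimal sketch of a selection rule that prefers non-zero-token results and only then falls back to the newest timestamp. It is not the logic in `scripts/generate_manifest.py`; the record keys (`suite`, `config`, `task`, `timestamp`, `n_input_tokens`, `n_output_tokens`) are assumed for illustration.

```python
def pick_canonical(results):
    """Choose one result per (suite, config, task), preferring real token counts."""
    def rank(r):
        zero_token = r["n_input_tokens"] == 0 and r["n_output_tokens"] == 0
        # Non-zero-token runs outrank zero-token ones; ties go to the newest timestamp.
        # YYYY-MM-DD__HH-MM-SS strings sort chronologically as plain strings.
        return (not zero_token, r["timestamp"])

    best = {}
    for r in results:
        key = (r["suite"], r["config"], r["task"])
        if key not in best or rank(r) > rank(best[key]):
            best[key] = r
    return best
```

A purely timestamp-based rule would let a later auth-failed rerun replace an earlier valid result, which is the corruption the check above looks for.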
Read the MANIFEST and compute quality metrics:
For each benchmark suite, report:
| Suite | Config | Tasks | Scored | Errored | Mean Reward | Pass Rate | Delta vs BL |
|---|---|---|---|---|---|---|---|
Where:
- Scored = tasks - errored (infra failures excluded from mean)
- Pass Rate = passed / scored
- Delta vs BL = suite mean under config minus suite mean under baseline
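The column definitions above translate directly into a per-suite summary. A minimal sketch, assuming each task is a dict with hypothetical keys `reward`, `passed`, and `errored`:

```python
def summarize(tasks):
    """Compute the table columns for one suite under one config."""
    scored = [t for t in tasks if not t["errored"]]  # infra failures excluded
    n_scored = len(scored)
    return {
        "tasks": len(tasks),
        "scored": n_scored,
        "errored": len(tasks) - n_scored,
        "mean_reward": sum(t["reward"] for t in scored) / n_scored if n_scored else None,
        "pass_rate": sum(t["passed"] for t in scored) / n_scored if n_scored else None,
    }


def delta_vs_baseline(config_tasks, baseline_tasks):
    """Suite mean under the config minus suite mean under baseline."""
    cfg = summarize(config_tasks)["mean_reward"]
    bl = summarize(baseline_tasks)["mean_reward"]
    return None if cfg is None or bl is None else cfg - bl
```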
For fair comparison, compute metrics only on tasks that ran successfully across ALL configs:
- Identify the intersection of scored tasks per suite across baseline and SG_full
- Compute matched-task means for each config
- Report which tasks flipped outcome (pass→fail or fail→pass) between configs
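The matched-task steps can be sketched as follows. The inputs are assumed to be dicts mapping task id to reward for scored tasks only, and a task is treated as passing when its reward is greater than 0; that threshold is an assumption and should be replaced with the benchmark's real pass criterion.

```python
def matched_comparison(baseline, sg_full):
    """Compare configs on the tasks scored under both, and list outcome flips."""
    common = sorted(set(baseline) & set(sg_full))
    if not common:
        return None
    bl_mean = sum(baseline[t] for t in common) / len(common)
    sg_mean = sum(sg_full[t] for t in common) / len(common)
    # A flip is any task whose pass/fail outcome differs between the two configs.
    flips = {
        t: ("fail->pass" if sg_full[t] > 0 else "pass->fail")
        for t in common
        if (baseline[t] > 0) != (sg_full[t] > 0)
    }
    return {
        "n": len(common),
        "baseline_mean": bl_mean,
        "sg_full_mean": sg_mean,
        "flips": flips,
    }
```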
Identify and report:
- MCP helps: Tasks where SG configs improve reward over baseline
- MCP hurts: Tasks where SG configs decrease reward
- MCP neutral: No change
- Full helps only: SG_full improves over baseline (richer context tooling value)
- Persistent failures: Tasks that fail across ALL configs (task difficulty, not config issue)
- Config-specific failures: Tasks that fail only in one config (investigate MCP distraction effect)
Group findings by benchmark type:
- Search-heavy (K8s Docs, LargeRepo, LoCoBench): MCP should help with efficiency
- Implementation-heavy (TAC, SWE-Perf, PyTorch): MCP may distract from coding
- Mixed (SWE-bench Pro, CrossRepo): Variable MCP impact
- Local-only (DependEval, DIBench, RepoQA): MCP provides little value
Extract efficiency metrics from result.json and traces:
For each suite × config, compute:
- Mean input tokens, output tokens, cache tokens
- Total cost estimate (input × $15/M + output × $75/M for Opus)
- Token ratio: cache_tokens / input_tokens (cache efficiency)
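The cost and cache formulas above, written out as a minimal sketch. The rates are the Opus figures quoted in the list; like that formula, this ignores any separate pricing for cache tokens.

```python
INPUT_USD_PER_M = 15.0   # Opus input rate quoted above, USD per million tokens
OUTPUT_USD_PER_M = 75.0  # Opus output rate quoted above, USD per million tokens


def estimate_cost(input_tokens, output_tokens):
    """Estimated USD cost from raw token counts."""
    return (input_tokens / 1e6) * INPUT_USD_PER_M + (output_tokens / 1e6) * OUTPUT_USD_PER_M


def cache_ratio(cache_tokens, input_tokens):
    """cache_tokens / input_tokens, or None when no input tokens were recorded."""
    return cache_tokens / input_tokens if input_tokens else None
```

For example, 2,000,000 input tokens and 1,000,000 output tokens come to 2 × $15 + 1 × $75 = $105.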
From started_at / finished_at in result.json:
- Mean wall clock seconds per task
- Wall clock delta: SG configs vs baseline (positive = slower)
- Identify suites where MCP is faster (LargeRepo, K8s Docs typically)
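A minimal sketch of the wall clock computation, assuming `started_at` and `finished_at` are ISO 8601 strings (the exact timestamp format in result.json is an assumption). Both timestamps must be either timezone-aware or naive; mixing the two raises a `TypeError`.

```python
from datetime import datetime


def wall_clock_seconds(started_at, finished_at):
    """Elapsed seconds between two ISO 8601 timestamps (trailing 'Z' accepted)."""
    def parse(ts):
        # Older Python versions reject a bare 'Z' suffix, so rewrite it as an offset.
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return (parse(finished_at) - parse(started_at)).total_seconds()


def mean_wall_clock_delta(baseline_secs, sg_secs):
    """Mean SG time minus mean baseline time; positive means the SG config is slower."""
    return sum(sg_secs) / len(sg_secs) - sum(baseline_secs) / len(baseline_secs)
```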
For SG configs, report tool usage breakdown:
- Top tools by call count and task coverage
  - `keyword_search` typically dominates (~40-50% of calls)
  - `read_file` and `list_files` are 2nd/3rd
- Deep Search actual usage (tool_use events, not init listings)
- Unused tools: `go_to_definition`, `get_contributor_repos` typically unused
- Preamble overhead: Zero-MCP tasks still incur ~26% time and ~40% cost overhead from preamble injection. Factor this into cost-effectiveness calculations.
For deeper MCP-conditioned analysis, use /mcp-audit which pairs tasks and computes deltas separately for used-MCP vs zero-MCP groups.
Compute cost per unit of reward:
- Cost per scored task = total_cost / scored_tasks
- Cost per reward point = total_cost / total_reward
- MCP overhead = (SG_cost - BL_cost) / BL_cost
- Value ratio: reward_delta / cost_delta (is the MCP cost justified?)
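The four ratios above can be computed together. This is a sketch with hypothetical argument names; it returns `None` wherever a denominator is zero (for example, when the SG config costs exactly the same as baseline, the value ratio is undefined).

```python
def cost_effectiveness(bl_cost, bl_reward, bl_scored, sg_cost, sg_reward, sg_scored):
    """Cost per task, cost per reward point, MCP overhead, and value ratio."""
    def safe_div(a, b):
        return a / b if b else None

    cost_delta = sg_cost - bl_cost
    reward_delta = sg_reward - bl_reward
    return {
        "bl_cost_per_task": safe_div(bl_cost, bl_scored),
        "sg_cost_per_task": safe_div(sg_cost, sg_scored),
        "bl_cost_per_reward": safe_div(bl_cost, bl_reward),
        "sg_cost_per_reward": safe_div(sg_cost, sg_reward),
        "mcp_overhead": safe_div(cost_delta, bl_cost),
        "value_ratio": safe_div(reward_delta, cost_delta),
    }
```

A negative value ratio means cost and reward moved in opposite directions, so the sign of each delta should be reported alongside the ratio.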
Produce a structured report with:
- Executive Summary: 3-5 bullet points on key findings
- Data Quality: Pass/fail status for each integrity check
- Corrected Scores: Per-suite × per-config table with errored tasks excluded
- Weighted Averages: Overall mean across all suites per config
- MCP Value Assessment: Where MCP helps (efficiency), where it hurts (distraction), where neutral
- Efficiency Comparison: Cost and speed table by config
- Recommendations: Actionable items (reruns needed, config changes, investigation items)
Write the report to docs/TRACE_AUDIT_<date>.md.
After presenting results, offer:
- Create beads issues for any rerun needs or investigation items
- Regenerate MANIFEST if data corrections were applied
- Update MEMORY.md with key findings for future sessions
| File | Path | Contents |
|---|---|---|
| MANIFEST | `runs/official/MANIFEST.json` | Canonical run tracking (suite/config → task results) |
| Batch result | `<config>/<datetime>/result.json` | Aggregate timing and counts |
| Task result | `<config>/<datetime>/<task__hash>/result.json` | Reward, tokens, exceptions |
| Agent trace | `<task__hash>/agent/claude-code.txt` | Full Claude Code JSONL transcript |
| Trajectory | `<task__hash>/agent/trajectory.json` | Structured step log |
| Instructions | `<task__hash>/agent/instruction.txt` | Instructions given to agent |
| CLAUDE.md | `<task__hash>/agent/CLAUDE.md` | Preamble + workspace config |
| Script | Usage |
|---|---|
| `scripts/audit_traces.py` | Trace audit: tool counts, MCP adoption, errors, compliance |
| `scripts/mcp_audit.py` | MCP-conditioned paired analysis: used vs zero-MCP, intensity buckets |
| `scripts/generate_manifest.py` | Rebuild MANIFEST from on-disk results |
| `scripts/aggregate_status.py` | Run scanner with error fingerprinting |
| `scripts/compare_configs.py` | Cross-config divergence analysis |
| `scripts/cost_report.py` | Token usage and cost aggregation |
| `scripts/reextract_all_metrics.py` | Batch re-extract task_metrics.json after bug fixes |
- Zero-token (int 0): Auth failures — agent started but auth failed. Exactly 3 claude-code.txt lines.
- Null-token + no trajectory + <=5 lines: Crash failures (protonmail Node v16, openlibrary gpg)
- Null-token + valid rewards: H3 token-logging bug — agent ran fine, just tokens not recorded
- MCP distraction on TAC: MCP overuse on implementation tasks can reduce scores
- Deep Search unused: Only ~1% of SG_full tasks actually invoke deepsearch (agent prefers sync tools)
- SWE-Perf regression: MCP can hurt SWE-Perf (performance tasks need focused coding, not search)
- Subagent MCP calls hidden: Task subagent MCP calls only appear in `claude-code.txt`, NOT `trajectory.json`. ~11 tasks had hidden MCP calls (142 calls total) reclassified from zero-MCP to used-MCP after transcript-first extraction fix (commit 59cdf7db).
- Zero-MCP is mostly rational: ~80% of zero-MCP tasks are trivially local (DependEval), have explicit file lists (CodeReview), or have full local codebases (SWE-Perf). Only ~20% warrant investigation.
- Monotonic MCP intensity-reward: Light users +2.2%, Moderate +3.6%, Heavy +6.1% reward improvement. More MCP = more benefit, on tasks where MCP is used at all.
runs/official/
  MANIFEST.json
  <benchmark>_<variant>_opus_<timestamp>/
    baseline/
      <YYYY-MM-DD__HH-MM-SS>/          # Batch timestamp
        <task_name>__<hash>/           # Task directory
          result.json
          agent/
            claude-code.txt            # JSONL transcript
            trajectory.json            # ATIF structured trace
            instruction.txt            # Task instructions
            CLAUDE.md                  # Preamble config
          verifier/
            reward.txt
    sourcegraph_full/
      [same structure]
  archive/                             # Archived broken/superseded runs