Skill: Evaluate Benchmark Traces

Comprehensive evaluation of benchmark run traces: data integrity validation, output quality assessment, and efficiency analysis across configurations and benchmarks.

Trigger Phrases

evaluate traces, analyze traces, review benchmark results, audit traces, check eval results, analyze run, benchmark evaluation, evaluate run, trace quality, run quality

Instructions

When triggered, perform a comprehensive trace evaluation covering three dimensions: (1) data integrity, (2) output quality, and (3) efficiency. Accept optional arguments to scope the analysis (e.g., specific suite, config, or run).

Phase 1: Scope Selection

If the user provides arguments (e.g., evaluate traces for SWE-bench Pro SG_full), use those to scope. Otherwise, evaluate ALL official runs.

Key paths and tools:

MANIFEST: runs/official/MANIFEST.json (symlink to evals dir — not in git)
Audit script: python3 scripts/audit_traces.py [--json] [--suite X] [--config X]
Manifest generator: python3 scripts/generate_manifest.py
Run directory: runs/official/
Configs: baseline, sourcegraph_full

Critical technical details:

runs/official is a symlink — use the real path for find commands
MCP tool names in traces may have sg_ prefix (mcp__sourcegraph__sg_keyword_search) OR not (mcp__sourcegraph__keyword_search) depending on batch vintage. Always check for both patterns.
Batch timestamp dirs use pattern YYYY-MM-DD__HH-MM-SS with __ separator — don't confuse with task dirs (task_name__hash)
Task-level result.json has full data; batch-level result.json only has aggregate stats
Transcript vs trajectory: claude-code.txt includes ALL tool calls (including Task subagent MCP calls). trajectory.json only has main-agent calls and UNDERCOUNTS MCP usage. Always prefer transcript for tool counting.
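
The directory-naming and symlink notes above can be sketched as a small helper. This is a minimal illustration, not code from the repo's scripts; the function names `classify_dir` and `real_runs_root` are hypothetical.

```python
import os
import re

# Batch dirs look like 2025-01-31__14-05-09; task dirs look like task_name__hash.
BATCH_DIR_RE = re.compile(r"^\d{4}-\d{2}-\d{2}__\d{2}-\d{2}-\d{2}$")


def classify_dir(name: str) -> str:
    """Return 'batch', 'task', or 'other' for a directory name (hypothetical helper)."""
    if BATCH_DIR_RE.match(name):
        return "batch"
    if "__" in name:
        return "task"
    return "other"


def real_runs_root(path: str = "runs/official") -> str:
    """Resolve the symlink so find/os.walk traverse the real directory."""
    return os.path.realpath(path)
```

Checking the timestamp pattern first matters because both kinds of directory contain the `__` separator.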

Phase 2: Data Integrity Audit

Launch parallel subagents to check data integrity:

2a. MCP Adoption Validation

For each SG config task in MANIFEST:

Read agent/claude-code.txt (NOT trajectory.json) and count mcp__sourcegraph tool_use invocations
Check both sg_ prefix and non-prefix tool name variants
Verify SG_full tasks use MCP tools (should be ~100% of scored tasks) AND check Deep Search (deepsearch) adoption
Classify tasks into used-MCP vs zero-MCP groups:
- Used-MCP: At least 1 MCP tool call in transcript. Further classify by intensity:
  - Light (1-5 calls): Minimal MCP usage, likely spot checks
  - Moderate (6-20 calls): Regular MCP usage during exploration
  - Heavy (21+ calls): MCP-centric workflow
- Zero-MCP: MCP available but 0 calls. Classify reason:
  - Trivially local (DependEval dependency_recognition — all data in local files)
  - Explicit file list (CodeReview — instructions name exact files)
  - Full local codebase (SWE-Perf — complete repo in container)
  - Both configs failed (neither BL nor SF produced useful output)
  - Agent confusion (needs transcript investigation)
Report the zero-MCP rate per benchmark. A high zero-MCP rate (>30%) suggests MCP isn't valuable for that task type.
Important: Compute reward/time deltas SEPARATELY for used-MCP and zero-MCP groups. Mixing them dilutes the signal (zero-MCP tasks add preamble overhead with no MCP benefit).
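
The transcript-first counting and intensity bucketing described above might look like the sketch below. It assumes each line of claude-code.txt is a JSON object and that tool calls appear somewhere inside it as objects with `"type": "tool_use"` and a `"name"` field; the exact transcript schema is not specified here, so the walker searches at any nesting depth. All function names are hypothetical, and 20 calls is treated as Moderate.

```python
import json
from collections import Counter

MCP_PREFIX = "mcp__sourcegraph__"


def iter_tool_uses(node):
    """Yield every dict shaped like a tool_use event, at any nesting depth."""
    if isinstance(node, dict):
        if node.get("type") == "tool_use" and isinstance(node.get("name"), str):
            yield node
        for value in node.values():
            yield from iter_tool_uses(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_tool_uses(item)


def count_mcp_calls(lines):
    """Count Sourcegraph MCP calls per tool, folding sg_ and non-sg_ names together."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines rather than abort the audit
        for use in iter_tool_uses(event):
            name = use["name"]
            if name.startswith(MCP_PREFIX):
                tool = name[len(MCP_PREFIX):]
                if tool.startswith("sg_"):
                    tool = tool[len("sg_"):]
                counts[tool] += 1
    return counts


def intensity_bucket(n_calls):
    """Map a total MCP call count to zero / light / moderate / heavy."""
    if n_calls == 0:
        return "zero"
    if n_calls <= 5:
        return "light"
    if n_calls <= 20:
        return "moderate"
    return "heavy"
```

Usage: `intensity_bucket(sum(count_mcp_calls(open(path)).values()))`, where `path` points at a task's agent/claude-code.txt.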

2b. Baseline Contamination Check

Scan ALL baseline claude-code.txt for any mcp__sourcegraph tool calls (should be 0)
Check baseline instruction.txt for Sourcegraph/MCP references (cosmetic, not functional)

2c. Infrastructure Failure Detection

Zero-token tasks: n_input_tokens=0, n_output_tokens=0 → auth failures (agent never ran)
Crash failures: n_input_tokens and n_output_tokens are null, no trajectory.json, and <=5 lines in claude-code.txt → Docker/Node.js crash
Null-token H3 bug: null tokens but the agent ran fine (50+ claude-code.txt lines, valid rewards). These are NOT failures
Exceptions: AgentSetupTimeoutError, RuntimeError in result.json
Setup failures: Non-zero return code in agent/setup/stdout.txt
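
The token-based rules above can be expressed as one decision function. This is a sketch under stated assumptions: the caller has already read the token counts from result.json, counted the claude-code.txt lines, and checked for trajectory.json and a valid reward. The function name, the labels it returns, and the "needs_review" fallback are hypothetical; exception and setup-return-code checks are left out.

```python
def classify_infra(n_input_tokens, n_output_tokens, cc_lines,
                   has_trajectory, has_valid_reward):
    """Classify a task result by the token/transcript rules (hypothetical helper)."""
    # Integer zero on both counters: the agent never ran (auth failure).
    if n_input_tokens == 0 and n_output_tokens == 0:
        return "auth_failure"
    if n_input_tokens is None and n_output_tokens is None:
        # Null tokens with almost no transcript and no trajectory: crash.
        if not has_trajectory and cc_lines <= 5:
            return "crash"
        # Null tokens but a long transcript and a valid reward: H3 logging bug.
        if cc_lines >= 50 and has_valid_reward:
            return "h3_null_tokens_ok"
        # Anything in between does not match a known pattern.
        return "needs_review"
    return "ok"
```

The integer-zero check comes first so that auth failures are never confused with null-token results.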

2d. Dedup Integrity

Verify MANIFEST dedup prefers non-zero-token results over zero-token results
Check for auth-failed runs that may corrupt scores via timestamp-based dedup
Regenerate MANIFEST if issues found: python3 scripts/generate_manifest.py
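
To illustrate the dedup rule being verified here, the sketch below picks one canonical run per task: non-zero-token results win over zero-token results, and the newest batch timestamp breaks ties. It is not the actual logic of scripts/generate_manifest.py; the record shape (dicts with `timestamp`, `n_input_tokens`, `n_output_tokens`) and the function name are assumptions.

```python
def pick_canonical(runs):
    """Choose the run to keep for one task (hypothetical helper).

    Each run is a dict with 'timestamp' (YYYY-MM-DD__HH-MM-SS batch dir name),
    'n_input_tokens', and 'n_output_tokens'.
    """
    def is_zero_token(run):
        return run["n_input_tokens"] == 0 and run["n_output_tokens"] == 0

    # Tuples compare left to right: non-zero-token runs (True) outrank
    # zero-token runs (False); the timestamp string then sorts chronologically.
    return max(runs, key=lambda run: (not is_zero_token(run), run["timestamp"]))
```

A purely timestamp-based choice would keep a later auth-failed rerun over an earlier valid result, which is the corruption this check looks for. Null-token runs are not treated as zero-token here, since they may be valid H3 results.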

Phase 3: Output Quality Assessment

Read the MANIFEST and compute quality metrics:

3a. Per-Suite Reward Analysis

For each benchmark suite, report:

Suite	Config	Tasks	Scored	Errored	Mean Reward	Pass Rate	Delta vs BL

Where:

Scored = tasks - errored (infra failures excluded from mean)
Pass Rate = passed / scored
Delta vs BL = suite mean under config minus suite mean under baseline
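
The column definitions above can be computed as in the following sketch. It assumes a task "passes" when its reward reaches a threshold (1.0 by default); the real pass criterion may differ per benchmark, so `pass_threshold` is a stated assumption, as are the function names.

```python
from statistics import mean


def suite_summary(tasks, pass_threshold=1.0):
    """Summarize one suite x config.

    tasks: list of (reward, errored) pairs; reward is ignored when errored is True.
    """
    scored = [reward for reward, errored in tasks if not errored]
    n_scored = len(scored)
    passed = sum(1 for reward in scored if reward >= pass_threshold)
    return {
        "tasks": len(tasks),
        "scored": n_scored,
        "errored": len(tasks) - n_scored,
        "mean_reward": mean(scored) if scored else None,
        "pass_rate": passed / n_scored if scored else None,
    }


def delta_vs_baseline(config_summary, baseline_summary):
    """Suite mean under the config minus suite mean under baseline."""
    if config_summary["mean_reward"] is None or baseline_summary["mean_reward"] is None:
        return None
    return config_summary["mean_reward"] - baseline_summary["mean_reward"]
```

Errored tasks stay in the Tasks count but are dropped from both the mean and the pass-rate denominator.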

3b. Cross-Config Comparison (matched tasks)

For fair comparison, compute metrics only on tasks that ran successfully across ALL configs:

Identify the intersection of scored tasks per suite across baseline and SG_full
Compute matched-task means for each config
Report which tasks flipped outcome (pass→fail or fail→pass) between configs
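
A minimal sketch of the matched-task comparison follows. Inputs are assumed to be dicts mapping task ID to reward, containing scored tasks only; the pass threshold and the function name are assumptions.

```python
def matched_comparison(baseline, sg_full, pass_threshold=1.0):
    """Compare two configs on the tasks scored in both (hypothetical helper)."""
    matched = sorted(baseline.keys() & sg_full.keys())
    flips = {}
    for task in matched:
        bl_pass = baseline[task] >= pass_threshold
        sg_pass = sg_full[task] >= pass_threshold
        if bl_pass != sg_pass:
            flips[task] = "fail->pass" if sg_pass else "pass->fail"
    n = len(matched)
    return {
        "matched": matched,
        "baseline_mean": sum(baseline[t] for t in matched) / n if n else None,
        "sg_full_mean": sum(sg_full[t] for t in matched) / n if n else None,
        "flips": flips,
    }
```

Tasks scored in only one config are excluded from both means, so an infra failure on one side cannot skew the comparison.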

3c. Task-Level Quality Patterns

Identify and report:

MCP helps: Tasks where SG configs improve reward over baseline
MCP hurts: Tasks where SG configs decrease reward
MCP neutral: No change
Full helps only: SG_full improves over baseline (richer context tooling value)
Persistent failures: Tasks that fail across ALL configs (task difficulty, not config issue)
Config-specific failures: Tasks that fail only in one config (investigate MCP distraction effect)
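
The pattern labels above can be assigned per matched task as in this sketch. The tolerance used to call a change "neutral", the pass threshold, and the label strings are assumptions for illustration.

```python
def classify_task(bl_reward, sg_reward, pass_threshold=1.0, tol=1e-9):
    """Return the pattern labels for one task run under baseline and SG_full."""
    labels = []
    delta = sg_reward - bl_reward
    if delta > tol:
        labels.append("mcp_helps")
    elif delta < -tol:
        labels.append("mcp_hurts")
    else:
        labels.append("mcp_neutral")

    bl_fail = bl_reward < pass_threshold
    sg_fail = sg_reward < pass_threshold
    if bl_fail and sg_fail:
        labels.append("persistent_failure")      # fails under every config
    elif bl_fail != sg_fail:
        labels.append("config_specific_failure")  # fails under exactly one config
    return labels
```

A task can carry two labels at once: for example, partial-credit rewards of 0.5 and 0.2 are both "mcp_hurts" and a persistent failure.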

3d. Benchmark Category Insights

Group findings by benchmark type:

Search-heavy (K8s Docs, LargeRepo, LoCoBench): MCP should help with efficiency
Implementation-heavy (TAC, SWE-Perf, PyTorch): MCP may distract from coding
Mixed (SWE-bench Pro, CrossRepo): Variable MCP impact
Local-only (DependEval, DIBench, RepoQA): MCP provides little value

Phase 4: Efficiency Analysis

Extract efficiency metrics from result.json and traces:

4a. Token Usage

For each suite × config, compute:

Mean input tokens, output tokens, cache tokens
Total cost estimate (input × $15/M + output × $75/M for Opus)
Token ratio: cache_tokens / input_tokens (cache efficiency)
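
As a worked example of the cost formula and cache ratio above, the sketch below uses the stated Opus rates ($15 per million input tokens, $75 per million output tokens). Like the formula, it does not price cache tokens separately. Function names are hypothetical.

```python
INPUT_USD_PER_MTOK = 15.0   # Opus input rate stated above
OUTPUT_USD_PER_MTOK = 75.0  # Opus output rate stated above


def estimate_cost_usd(input_tokens, output_tokens):
    """Estimate cost in USD as input x $15/M + output x $75/M."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


def cache_ratio(cache_tokens, input_tokens):
    """cache_tokens / input_tokens, or None when no input tokens were recorded."""
    return cache_tokens / input_tokens if input_tokens else None
```

For a task with 2,000,000 input tokens and 100,000 output tokens, the estimate is $30.00 + $7.50 = $37.50.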

4b. Wall Clock Time

From started_at / finished_at in result.json:

Mean wall clock seconds per task
Wall clock delta: SG configs vs baseline (positive = slower)
Identify suites where MCP is faster (LargeRepo, K8s Docs typically)
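
A sketch of the wall-clock computation, assuming `started_at` and `finished_at` are ISO 8601 strings (the actual timestamp format in result.json is not stated here). A trailing `Z` is normalized because older Python versions of `datetime.fromisoformat` reject it. Function names are hypothetical.

```python
from datetime import datetime
from statistics import mean


def parse_ts(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def wall_clock_seconds(result):
    """Seconds between started_at and finished_at for one task result dict."""
    return (parse_ts(result["finished_at"]) - parse_ts(result["started_at"])).total_seconds()


def wall_clock_delta(sg_results, bl_results):
    """Mean SG wall clock minus mean baseline wall clock (positive = slower)."""
    sg_mean = mean(wall_clock_seconds(r) for r in sg_results)
    bl_mean = mean(wall_clock_seconds(r) for r in bl_results)
    return sg_mean - bl_mean
```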

4c. MCP Tool Distribution

For SG configs, report tool usage breakdown:

Top tools by call count and task coverage
keyword_search typically dominates (~40-50% of calls)
read_file and list_files are 2nd/3rd
Deep Search actual usage (tool_use events, not init listings)
Unused tools: go_to_definition, get_contributor_repos typically unused
Preamble overhead: Zero-MCP tasks still incur ~26% time and ~40% cost overhead from preamble injection. Factor this into cost-effectiveness calculations.

For deeper MCP-conditioned analysis, use /mcp-audit which pairs tasks and computes deltas separately for used-MCP vs zero-MCP groups.
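
The "top tools by call count and task coverage" breakdown can be sketched as below. It assumes per-task tool counts have already been extracted from transcripts into a mapping of task ID to `{tool_name: call_count}`; that input shape and the function name are assumptions.

```python
from collections import Counter


def tool_distribution(per_task_counts):
    """Return (tool, calls, share_of_calls, tasks_using_tool) rows, most-called first."""
    calls = Counter()
    coverage = Counter()
    for counts in per_task_counts.values():
        for tool, n in counts.items():
            if n > 0:
                calls[tool] += n
                coverage[tool] += 1  # one task contributes at most 1 to coverage
    total = sum(calls.values())
    return [(tool, n, n / total, coverage[tool]) for tool, n in calls.most_common()]
```

Reporting coverage alongside call count separates a tool that is called heavily in a few tasks from one that is used lightly across many.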

4d. Cost-Effectiveness

Compute cost per unit of reward:

Cost per scored task = total_cost / scored_tasks
Cost per reward point = total_cost / total_reward
MCP overhead = (SG_cost - BL_cost) / BL_cost
Value ratio: reward_delta / cost_delta (is the MCP cost justified?)
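
The four ratios above, written out as a sketch. Totals are assumed to be pre-aggregated per config; function names are hypothetical, and zero denominators return None rather than raising.

```python
def config_costs(total_cost, total_reward, scored_tasks):
    """Cost per scored task and cost per reward point for one config."""
    return {
        "cost_per_scored_task": total_cost / scored_tasks if scored_tasks else None,
        "cost_per_reward_point": total_cost / total_reward if total_reward else None,
    }


def mcp_value(bl_cost, sg_cost, bl_reward, sg_reward):
    """MCP overhead and value ratio for SG_full relative to baseline."""
    cost_delta = sg_cost - bl_cost
    reward_delta = sg_reward - bl_reward
    return {
        "mcp_overhead": cost_delta / bl_cost if bl_cost else None,
        "value_ratio": reward_delta / cost_delta if cost_delta else None,
    }
```

For example, a baseline costing $100 for 40 reward points and an SG_full run costing $140 for 44 points gives an MCP overhead of 0.40 and a value ratio of 0.1 reward points per extra dollar.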

Phase 5: Synthesis and Report

Produce a structured report with:

Executive Summary: 3-5 bullet points on key findings
Data Quality: Pass/fail status for each integrity check
Corrected Scores: Per-suite × per-config table with errored tasks excluded
Weighted Averages: Overall mean across all suites per config
MCP Value Assessment: Where MCP helps (efficiency), where it hurts (distraction), where neutral
Efficiency Comparison: Cost and speed table by config
Recommendations: Actionable items (reruns needed, config changes, investigation items)

Write the report to docs/TRACE_AUDIT_<date>.md.

Phase 6: Follow-up

After presenting results, offer:

Create beads issues for any rerun needs or investigation items
Regenerate MANIFEST if data corrections were applied
Update MEMORY.md with key findings for future sessions

Key Files

File	Path	Contents
MANIFEST	`runs/official/MANIFEST.json`	Canonical run tracking (suite/config → task results)
Batch result	`<config>/<datetime>/result.json`	Aggregate timing and counts
Task result	`<config>/<datetime>/<task__hash>/result.json`	Reward, tokens, exceptions
Agent trace	`<task__hash>/agent/claude-code.txt`	Full Claude Code JSONL transcript
Trajectory	`<task__hash>/agent/trajectory.json`	Structured step log
Instructions	`<task__hash>/agent/instruction.txt`	Instructions given to agent
CLAUDE.md	`<task__hash>/agent/CLAUDE.md`	Preamble + workspace config

Analysis Scripts

Script	Usage
`scripts/audit_traces.py`	Trace audit: tool counts, MCP adoption, errors, compliance
`scripts/mcp_audit.py`	MCP-conditioned paired analysis: used vs zero-MCP, intensity buckets
`scripts/generate_manifest.py`	Rebuild MANIFEST from on-disk results
`scripts/aggregate_status.py`	Run scanner with error fingerprinting
`scripts/compare_configs.py`	Cross-config divergence analysis
`scripts/cost_report.py`	Token usage and cost aggregation
`scripts/reextract_all_metrics.py`	Batch re-extract task_metrics.json after bug fixes

Known Patterns

Zero-token (int 0): Auth failures — agent started but auth failed. Exactly 3 claude-code.txt lines.
Null-token + no trajectory + <=5 lines: Crash failures (protonmail Node v16, openlibrary gpg)
Null-token + valid rewards: H3 token-logging bug — agent ran fine, just tokens not recorded
MCP distraction on TAC: MCP overuse on implementation tasks can reduce scores
Deep Search unused: Only ~1% of SG_full tasks actually invoke deepsearch (agent prefers sync tools)
SWE-Perf regression: MCP can hurt SWE-Perf (performance tasks need focused coding, not search)
Subagent MCP calls hidden: Task subagent MCP calls only appear in claude-code.txt, NOT trajectory.json. ~11 tasks had hidden MCP calls (142 calls total) reclassified from zero-MCP to used-MCP after transcript-first extraction fix (commit 59cdf7db).
Zero-MCP is mostly rational: ~80% of zero-MCP tasks are trivially local (DependEval), have explicit file lists (CodeReview), or have full local codebases (SWE-Perf). Only ~20% warrant investigation.
Monotonic MCP intensity-reward: Light users +2.2%, Moderate +3.6%, Heavy +6.1% reward improvement. More MCP = more benefit, on tasks where MCP is used at all.

Run Directory Layout

runs/official/
  MANIFEST.json
  <benchmark>_<variant>_opus_<timestamp>/
    baseline/
      <YYYY-MM-DD__HH-MM-SS>/          # Batch timestamp
        <task_name>__<hash>/            # Task directory
          result.json
          agent/
            claude-code.txt             # JSONL transcript
            trajectory.json             # ATIF structured trace
            instruction.txt             # Task instructions
            CLAUDE.md                   # Preamble config
          verifier/
            reward.txt
    sourcegraph_full/
      [same structure]
  archive/                              # Archived broken/superseded runs
