AI Agent Skills for CodeScaleBench

This directory contains reusable skill definitions for AI coding agents (Claude Code, Cursor, Copilot, etc.) that operate on this repository. Skills are structured instructions that tell an AI agent how to perform specific operational tasks — think of them as runbooks that an agent can follow autonomously.

Only project-specific (CSB) skills are kept here. General-purpose skills (coding standards, security review, TDD, agent delegation, etc.) live in ~/.claude/skills/ and, as separate .mdc files, in .cursor/rules/.

Why Skills?

Running a benchmark suite like CodeScaleBench involves many repetitive multi-step workflows: validating tasks, launching runs, triaging failures, comparing configs, and generating reports. Instead of requiring an operator to re-explain these workflows each session, skills encode the operational knowledge once so that any AI agent can execute them reliably.

Skills are particularly valuable for:

  • Onboarding — New operators (human or AI) can immediately operate the benchmark
  • Consistency — The same procedure runs the same way every time
  • Composability — Skills can be chained (e.g., check-infra → validate-tasks → run-benchmark)
  • Tool-agnostic — Works with any agent that reads markdown instructions
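As a rough illustration of the chaining idea above, the following sketch turns an ordered list of skill names into a numbered instruction prompt. The helper names (`skill_path`, `chain_prompt`) and the prompt wording are hypothetical; only the `skills/<name>/SKILL.md` layout and the example chain come from this README.

```python
from pathlib import PurePosixPath

# Example chain taken from the composability bullet above.
PRE_RUN_CHAIN = ["check-infra", "validate-tasks", "run-benchmark"]


def skill_path(skill: str, root: str = "skills") -> PurePosixPath:
    """Return the runbook path for a skill, assuming skills/<name>/SKILL.md."""
    return PurePosixPath(root) / skill / "SKILL.md"


def chain_prompt(skills: list[str], root: str = "skills") -> str:
    """Build a numbered prompt asking an agent to follow each runbook in order.

    The wording of each step is illustrative, not a format any agent requires.
    """
    steps = [
        f"{i}. Read {skill_path(name, root)} and follow it."
        for i, name in enumerate(skills, start=1)
    ]
    return "\n".join(steps)


if __name__ == "__main__":
    print(chain_prompt(PRE_RUN_CHAIN))
```

The resulting text can be pasted into any agent session, which keeps the chain independent of a particular tool.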

Directory Structure

skills/
├── README.md                  ← You are here
├── csb/                       ← Consolidated CSB skill guides (grouped by phase)
│   ├── pre-run.md             ← Infrastructure checks, task validation, launching runs
│   ├── monitoring.md          ← Run status, watching benchmark progress
│   ├── triage-rerun.md        ← Failure investigation, quick reruns to verify fixes
│   ├── analysis.md            ← Config comparison, MCP audit, IR metrics, cost reports
│   ├── maintenance.md         ← Metadata sync, metric re-extraction, archiving, reports
│   └── task-authoring.md      ← Scaffolding new tasks, quality scoring, ABC audits
│
├── archive-run/SKILL.md       ← Individual skill runbooks (one per skill)
├── benchmark-audit/SKILL.md
├── check-infra/SKILL.md
├── compare-configs/SKILL.md
├── cost-report/SKILL.md
├── evaluate-traces/SKILL.md
├── generate-report/SKILL.md
├── ir-analysis/SKILL.md
├── mcp-audit/SKILL.md
├── quick-rerun/SKILL.md
├── reextract-metrics/SKILL.md
├── repo-health/SKILL.md
├── run-benchmark/SKILL.md
├── run-status/SKILL.md
├── scaffold-task/SKILL.md
├── score-tasks/SKILL.md
├── sync-metadata/SKILL.md
├── triage-failure/SKILL.md
├── validate-tasks/SKILL.md
├── watch-benchmarks/SKILL.md
└── whats-next/SKILL.md
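The layout above implies a simple invariant: every directory under skills/ other than csb/ holds a SKILL.md. A minimal sketch of a check for that invariant follows; the function name `missing_runbooks` is hypothetical and is not a script shipped with this repository.

```python
from pathlib import Path


def missing_runbooks(root: str, consolidated_dir: str = "csb") -> list[str]:
    """List skill directories under `root` that lack a SKILL.md file.

    The consolidated guide directory is skipped because it holds
    per-phase summaries rather than a single SKILL.md.
    """
    missing = []
    for entry in sorted(Path(root).iterdir()):
        # Ignore plain files (e.g. README.md) and the consolidated guides.
        if not entry.is_dir() or entry.name == consolidated_dir:
            continue
        if not (entry / "SKILL.md").is_file():
            missing.append(entry.name)
    return missing


if __name__ == "__main__":
    # Run from the repository root; prints an empty list when the layout is intact.
    print(missing_runbooks("skills"))
```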

Skill Index

Individual Skills (per-directory SKILL.md)

| Skill | Directory | When to Use |
|---|---|---|
| Archive Run | archive-run | Clean up old completed runs to save disk |
| Benchmark Audit | benchmark-audit | ABC framework compliance audit |
| Check Infrastructure | check-infra | Before any benchmark run |
| Compare Configs | compare-configs | Finding signal between baseline and MCP configs |
| Cost Report | cost-report | Token usage and cost breakdown |
| Evaluate Traces | evaluate-traces | Comprehensive trace audit |
| Generate Report | generate-report | Producing evaluation reports |
| IR Analysis | ir-analysis | Measuring file retrieval quality |
| MCP Audit | mcp-audit | Analyzing MCP tool usage patterns |
| Quick Rerun | quick-rerun | Verifying a fix on a single task |
| Re-extract Metrics | reextract-metrics | After extraction bug fixes |
| Repo Health | repo-health | Before syncing changes, to reduce drift and keep repository checks green |
| Run Benchmark | run-benchmark | Launching paired or gap-fill benchmark runs |
| Run Status | run-status | Quick check on active runs |
| Scaffold Task | scaffold-task | Creating new benchmark tasks |
| Score Tasks | score-tasks | Quality-scoring task definitions |
| Sync Metadata | sync-metadata | Keeping task.toml in sync with registry |
| Triage Failure | triage-failure | Investigating why a task failed |
| Validate Tasks | validate-tasks | Before launching, after editing task definitions |
| Watch Benchmarks | watch-benchmarks | Full status dashboard for all runs |
| What's Next | whats-next | Deciding the highest-value next action |

Consolidated Guides (csb/)

The csb/ subdirectory groups the individual skills by workflow phase for quick reference. These guides are summaries; the individual SKILL.md files have the full detail.

| Guide | File | Covers |
|---|---|---|
| Pre-Run | csb/pre-run.md | check-infra, validate-tasks, run-benchmark |
| Monitoring | csb/monitoring.md | run-status, watch-benchmarks |
| Triage & Rerun | csb/triage-rerun.md | triage-failure, quick-rerun |
| Analysis | csb/analysis.md | compare-configs, mcp-audit, ir-analysis, cost-report, evaluate-traces |
| Maintenance | csb/maintenance.md | sync-metadata, reextract-metrics, archive-run, generate-report, whats-next |
| Task Authoring | csb/task-authoring.md | scaffold-task, score-tasks, benchmark-audit |
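Because the guides and the individual skills are maintained separately, the two lists can drift. The sketch below transcribes both tables from this README into plain Python data and reports any skill that no guide covers, plus any guide entry that names an unknown skill. The function names are hypothetical, and the data would need updating whenever the tables change.

```python
# Skill directories, transcribed from the "Individual Skills" table.
INDIVIDUAL_SKILLS = [
    "archive-run", "benchmark-audit", "check-infra", "compare-configs",
    "cost-report", "evaluate-traces", "generate-report", "ir-analysis",
    "mcp-audit", "quick-rerun", "reextract-metrics", "repo-health",
    "run-benchmark", "run-status", "scaffold-task", "score-tasks",
    "sync-metadata", "triage-failure", "validate-tasks",
    "watch-benchmarks", "whats-next",
]

# Guide file -> covered skills, transcribed from the "Consolidated Guides" table.
GUIDE_COVERAGE = {
    "csb/pre-run.md": ["check-infra", "validate-tasks", "run-benchmark"],
    "csb/monitoring.md": ["run-status", "watch-benchmarks"],
    "csb/triage-rerun.md": ["triage-failure", "quick-rerun"],
    "csb/analysis.md": ["compare-configs", "mcp-audit", "ir-analysis",
                        "cost-report", "evaluate-traces"],
    "csb/maintenance.md": ["sync-metadata", "reextract-metrics", "archive-run",
                           "generate-report", "whats-next"],
    "csb/task-authoring.md": ["scaffold-task", "score-tasks", "benchmark-audit"],
}


def uncovered_skills(skills, coverage):
    """Return skills that appear in no consolidated guide, sorted by name."""
    covered = {name for names in coverage.values() for name in names}
    return sorted(set(skills) - covered)


def unknown_references(skills, coverage):
    """Return guide entries that do not match any individual skill directory."""
    covered = {name for names in coverage.values() for name in names}
    return sorted(covered - set(skills))


if __name__ == "__main__":
    print("Not covered by any guide:", uncovered_skills(INDIVIDUAL_SKILLS, GUIDE_COVERAGE))
    print("Unknown guide entries:", unknown_references(INDIVIDUAL_SKILLS, GUIDE_COVERAGE))
```

With the tables as transcribed here, the first list contains only repo-health and the second is empty. Whether repo-health is left out of the guides on purpose is not stated in this README.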

How to Use These Skills

With Cursor (.cursor/rules/)

All CSB skills are already installed as individual .mdc rules in .cursor/rules/. Cursor will auto-load them based on file glob matching. The rules are named to match the skill directories (e.g., .cursor/rules/check-infra.mdc corresponds to skills/check-infra/SKILL.md).
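The naming convention described above can be expressed as a small path mapping. This is only a sketch of the stated correspondence; `rule_for_skill` is a hypothetical helper, not part of Cursor or of this repository.

```python
from pathlib import PurePosixPath


def rule_for_skill(skill_file: str, rules_dir: str = ".cursor/rules") -> PurePosixPath:
    """Map skills/<name>/SKILL.md to <rules_dir>/<name>.mdc.

    The rule file takes its name from the skill's directory.
    """
    skill_dir = PurePosixPath(skill_file).parent.name
    return PurePosixPath(rules_dir) / f"{skill_dir}.mdc"


if __name__ == "__main__":
    print(rule_for_skill("skills/check-infra/SKILL.md"))
```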

With Claude Code (CLAUDE.md)

Reference skills from your CLAUDE.md or AGENTS.md:

## Skills Reference
See `skills/` for operational runbooks:
- Pre-run checklist: `skills/check-infra/SKILL.md`
- Failure triage: `skills/triage-failure/SKILL.md`

Claude Code will read the files when relevant context is needed.

With Other Agents

Skills are plain markdown — any agent that can read files can use them. Point the agent at the relevant skill file when starting a task:

Read skills/mcp-audit/SKILL.md and then run an MCP audit for the latest benchmark run.