skyswordw/skillmatrix

skillmatrix

npm · CI · license: MIT · node >=20

Behavioral cross-agent testing for agent skills. Run a skill for real on Claude Code and Codex, assert deterministic outcomes, and get a pass/fail behavior matrix.

If skillport is the static check ("does this SKILL.md use anything that won't port?"), skillmatrix is the behavioral one: it actually runs the skill headlessly on each agent against an isolated workspace and checks what the skill did.

Static vs behavioral. skillport reads your skill in milliseconds and flags Claude-only syntax. skillmatrix runs it on each agent and verifies the result. Use skillport as the fast pre-check; use skillmatrix to prove behavior. Lint first, then run.

How it works

For every (test × agent) pair, skillmatrix:

  1. Creates a fresh, isolated workspace.
  2. Installs the skill into that agent's discovery path (.claude/skills/ for Claude Code, .agents/skills/ for Codex).
  3. Runs the agent headlessly (claude -p / codex exec) with your prompt.
  4. Evaluates deterministic filesystem assertions against the resulting workspace.

The robust, cross-agent signal is the filesystem effect (did the skill produce the files it promised, with the right content?) plus a clean exit — identical semantics on every agent.
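
Steps 1 and 2 could look roughly like the sketch below. It is only an illustration: `prepareWorkspace` and `SKILL_DIRS` are hypothetical names, not skillmatrix's actual API, and installing the skill under a subdirectory named after the skill folder is an assumption. The two discovery paths are the ones listed above.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

type Agent = "claude-code" | "codex";

// Discovery paths as listed in step 2 above.
const SKILL_DIRS: Record<Agent, string> = {
  "claude-code": ".claude/skills",
  codex: ".agents/skills",
};

// Hypothetical helper: create a fresh workspace and copy the skill into
// the agent's discovery path. Returns the workspace directory.
function prepareWorkspace(agent: Agent, skillDir: string): string {
  const workspace = fs.mkdtempSync(path.join(os.tmpdir(), "skillmatrix-"));
  // Assumption: the skill lands in a subdirectory named after its folder.
  const target = path.join(workspace, SKILL_DIRS[agent], path.basename(skillDir));
  fs.mkdirSync(path.dirname(target), { recursive: true });
  fs.cpSync(skillDir, target, { recursive: true });
  return workspace;
}
```

Because every call gets its own `mkdtempSync` directory, one cell cannot see files left behind by another.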

Quick start

A test is a *.smtest.json file:

{
  "name": "make-marker creates the marker file",
  "skill": "./skills/make-marker",
  "prompt": "Use your make-marker skill to create its marker file.",
  "agents": ["claude-code", "codex"],
  "assert": [{ "file": "MARKER.txt", "equals": "skillmatrix-marker-ok" }]
}
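
For readers who prefer types, the test file above could be modelled as follows. This is a sketch inferred from the example only: the `SkillTest` interface and `parseTest` function are hypothetical, and which fields are actually required is an assumption (here `agents` is treated as optional because the CLI defaults to all agents).

```typescript
type Agent = "claude-code" | "codex";

// Shape inferred from the example test file; not the project's own types.
interface SkillTest {
  name: string;
  skill: string;
  prompt: string;
  agents?: Agent[];
  assert: Array<Record<string, unknown>>;
}

// Hypothetical loader: parse the JSON text and check the basic shape.
function parseTest(text: string): SkillTest {
  const raw = JSON.parse(text) as Partial<SkillTest>;
  for (const key of ["name", "skill", "prompt"] as const) {
    if (typeof raw[key] !== "string" || raw[key] === "") {
      throw new Error(`"${key}" must be a non-empty string`);
    }
  }
  if (!Array.isArray(raw.assert) || raw.assert.length === 0) {
    throw new Error(`"assert" must be a non-empty array`);
  }
  return raw as SkillTest;
}
```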

Run it (this invokes the real claude / codex CLIs and uses their quota):

npx github:skyswordw/skillmatrix examples/make-marker
make-marker creates the marker file
  claude-code ✓  ·  codex ✓

2 cell(s): 2 ✓ pass · 0 ✗ fail

When a skill behaves differently on one agent, you see exactly where:

make-marker creates the marker file
  claude-code ✓  ·  codex ✗
  ✗ codex:
     - expected file "MARKER.txt" to exist, but it does not

Assertions

Each assertion targets a file (relative to the workspace) with exactly one matcher:

| Matcher | Passes when |
| --- | --- |
| "exists": true / false | the file is present / absent |
| "equals": "..." | content equals the value (both sides trimmed) |
| "contains": "..." | content contains the substring |
| "matches": "regex" | content matches the JS regex (raw content) |
| "json": { "path": "a.b.0", "equals": ... } | the file parses as JSON and the value at the dot-path deep-equals "equals" |
| "directory": true | file is a directory |
| "glob": "data/*.json", "count": 2 | files match the glob (optional exact count; default ≥1) |
| "trace": { "path": "is_error", "equals": false } | the agent's JSON stdout: value at the dot-path equals / contains |

File paths are confined to the workspace — absolute paths and .. escapes are rejected (at parse and at evaluation). Content is CRLF-normalized before matching.
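
To make the rules above concrete, here is a minimal sketch of path confinement and of four content matchers. All names (`resolveInWorkspace`, `evaluate`, `getPath`) are hypothetical and the real implementation may differ. Using strict deep equality for the json matcher is an assumption, and the path check is purely lexical, so it does not account for symlinks.

```typescript
import * as path from "node:path";
import { isDeepStrictEqual } from "node:util";

// Reject absolute paths and ".." escapes; return the resolved path otherwise.
function resolveInWorkspace(workspace: string, file: string): string {
  if (path.isAbsolute(file)) throw new Error(`absolute path rejected: ${file}`);
  const root = path.resolve(workspace);
  const resolved = path.resolve(root, file);
  const rel = path.relative(root, resolved);
  if (rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel)) {
    throw new Error(`path escapes the workspace: ${file}`);
  }
  return resolved;
}

type ContentMatcher =
  | { equals: string }
  | { contains: string }
  | { matches: string }
  | { json: { path: string; equals: unknown } };

// Walk a dot-path such as "a.b.0"; numeric segments index into arrays.
function getPath(value: unknown, dotPath: string): unknown {
  let current: unknown = value;
  for (const key of dotPath.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

function evaluate(rawContent: string, m: ContentMatcher): boolean {
  const content = rawContent.replace(/\r\n/g, "\n"); // CRLF-normalize first
  if ("equals" in m) return content.trim() === m.equals.trim();
  if ("contains" in m) return content.includes(m.contains);
  if ("matches" in m) return new RegExp(m.matches).test(content);
  try {
    return isDeepStrictEqual(getPath(JSON.parse(content), m.json.path), m.json.equals);
  } catch {
    return false; // the file did not parse as JSON
  }
}
```

With this sketch, `evaluate("skillmatrix-marker-ok\r\n", { equals: "skillmatrix-marker-ok" })` passes because both sides are trimmed, while `resolveInWorkspace(ws, "../x")` throws.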

Seed input files into the workspace before the run with a top-level "files" map ({ "data/input.txt": "..." }), in addition to (or instead of) a "fixture" directory.
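
Seeding from a "files" map amounts to writing each entry before the agent runs. A minimal sketch, assuming keys are workspace-relative paths and values are file contents (`seedFiles` is a hypothetical helper, not part of the documented API):

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: write each entry of a "files" map into the workspace,
// creating parent directories as needed.
function seedFiles(workspace: string, files: Record<string, string>): void {
  for (const [rel, content] of Object.entries(files)) {
    const target = path.join(workspace, rel);
    fs.mkdirSync(path.dirname(target), { recursive: true });
    fs.writeFileSync(target, content, "utf8");
  }
}

// Usage: seed a throwaway workspace with the map from the sentence above.
const seededWorkspace = fs.mkdtempSync(path.join(os.tmpdir(), "skillmatrix-seed-"));
seedFiles(seededWorkspace, { "data/input.txt": "..." });
```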

examples/scaffold-report is a runnable example that exercises the directory, glob+count, and json matchers — verified passing on both Claude Code and Codex:

npx @skyswordw/skillmatrix examples/scaffold-report
# scaffold-report builds the report tree
#   claude-code ✓  ·  codex ✓

CLI

skillmatrix [path] [options]

  -a, --agent <list>   claude-code,codex,all   (default: all)
      --json           machine-readable output
      --keep           keep run workspaces for debugging
      --work-dir <d>   base dir for workspaces (default: OS temp)
      --timeout <sec>  per-run timeout (default: 240)

Exit code is non-zero if any cell fails — drop it into CI. Note: CI needs the claude / codex CLIs authenticated on the runner.
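
The relationship between cells, the summary line, and the exit code can be sketched as below. The `Cell` shape and `summarize` function are hypothetical; the README only promises a non-zero exit code on failure, so the specific value 1 is an assumption.

```typescript
// Hypothetical result record for one (test × agent) cell.
interface Cell {
  test: string;
  agent: string;
  pass: boolean;
}

// Build the summary line shown in the Quick start output and pick an exit code.
function summarize(cells: Cell[]): { line: string; exitCode: number } {
  const passed = cells.filter((c) => c.pass).length;
  const failed = cells.length - passed;
  return {
    line: `${cells.length} cell(s): ${passed} ✓ pass · ${failed} ✗ fail`,
    exitCode: failed > 0 ? 1 : 0, // assumption: 1 stands in for "non-zero"
  };
}
```

A CI step then only needs to propagate `exitCode`; a single failing cell is enough to fail the job.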

Status

v0.1 — deterministic filesystem assertions across Claude Code + Codex, proven end-to-end by a feasibility spike (spike/). Roadmap: more assertion kinds (command-was-run via the agents' JSON traces), Cursor once its skill CLI matures, and tighter integration with skillport (lint → run in one pass).

Development

npm install      # project-local; no global installs
npm run check    # typecheck
npm test         # build + run the node:test suite (uses a fake executor — no real agent calls)
npm run build    # emit dist/

The agent runner is injectable (AgentExecutor), so the full orchestration is unit-tested without spawning real agents. The real executor (src/exec.ts) shells out to the CLIs.
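
As an illustration of why an injectable executor makes this testable, the sketch below runs one cell against a fake. The `AgentExecutor` signature shown here is a guess, not the one in the repo, and it is synchronous only to keep the example short; `AgentRun`, `AgentResult`, `fakeExecutor`, and `runCell` are likewise hypothetical.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical shapes; the real AgentExecutor in the repo may differ.
interface AgentRun {
  agent: string;
  prompt: string;
  cwd: string;
}
interface AgentResult {
  exitCode: number;
  stdout: string;
}
type AgentExecutor = (run: AgentRun) => AgentResult;

// Fake executor: pretends the agent ran and wrote the marker file.
const fakeExecutor: AgentExecutor = ({ cwd }) => {
  fs.writeFileSync(path.join(cwd, "MARKER.txt"), "skillmatrix-marker-ok\n");
  return { exitCode: 0, stdout: "{}" };
};

// One cell: run the executor in a fresh workspace, then check the
// filesystem effect and the exit code.
function runCell(agent: string, prompt: string, execute: AgentExecutor): boolean {
  const cwd = fs.mkdtempSync(path.join(os.tmpdir(), "skillmatrix-cell-"));
  const result = execute({ agent, prompt, cwd });
  const marker = path.join(cwd, "MARKER.txt");
  const fileOk =
    fs.existsSync(marker) &&
    fs.readFileSync(marker, "utf8").trim() === "skillmatrix-marker-ok";
  return result.exitCode === 0 && fileOk;
}
```

Swapping `fakeExecutor` for an executor that spawns the real CLI would leave `runCell` unchanged, which is the point of the injection.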

Related

Part of a small set of honest-by-default QA tools for AI-assisted development:

  • skillport — static cross-agent skill linter: does your SKILL.md port across agents?
  • skillmatrix (this repo) — behavioral cross-agent skill testing
  • claimcheck — a CI receipt for the claims your PR makes

License

MIT © skyswordw
