Behavioral cross-agent testing for agent skills. Run a skill for real on Claude Code and Codex, assert deterministic outcomes, and get a pass/fail behavior matrix.
If skillport is the static check ("does this SKILL.md use anything that won't port?"), skillmatrix is the behavioral one: it actually runs the skill headlessly on each agent against an isolated workspace and checks what the skill did.
Static vs behavioral. skillport reads your skill in milliseconds and flags Claude-only syntax. skillmatrix runs it on each agent and verifies the result. Use skillport as the fast pre-check; use skillmatrix to prove behavior. Lint first, then run.
For every (test × agent) pair, skillmatrix:
- Creates a fresh, isolated workspace.
- Installs the skill into that agent's discovery path (.claude/skills/ for Claude Code, .agents/skills/ for Codex).
- Runs the agent headlessly (claude -p / codex exec) with your prompt.
- Evaluates deterministic filesystem assertions against the resulting workspace.
The robust, cross-agent signal is the filesystem effect (did the skill produce the files it promised, with the right content?) plus a clean exit — identical semantics on every agent.
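The four steps above can be sketched for a single cell as follows. This is a minimal illustration, not skillmatrix's actual code: the names `runCell`, `Executor`, `RunResult`, and `fakeExec` are hypothetical, and only the `equals` matcher is handled.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical shapes for illustration; the real types may differ.
type AgentId = "claude-code" | "codex";
interface RunResult { exitCode: number; stdout: string }
type Executor = (agent: AgentId, prompt: string, cwd: string) => RunResult;
interface Cell { agent: AgentId; pass: boolean; failures: string[] }

// Discovery paths as listed in the steps above.
const SKILL_DIRS: Record<AgentId, string> = {
  "claude-code": ".claude/skills",
  codex: ".agents/skills",
};

function runCell(
  agent: AgentId,
  skillDir: string,
  prompt: string,
  expected: { file: string; equals: string },
  exec: Executor,
): Cell {
  // 1. Fresh, isolated workspace.
  const ws = fs.mkdtempSync(path.join(os.tmpdir(), "smx-"));
  try {
    // 2. Install the skill into the agent's discovery path.
    const dest = path.join(ws, SKILL_DIRS[agent], path.basename(skillDir));
    fs.mkdirSync(path.dirname(dest), { recursive: true });
    fs.cpSync(skillDir, dest, { recursive: true });
    // 3. Run the agent headlessly (injected, so a fake can stand in).
    const result = exec(agent, prompt, ws);
    // 4. Deterministic filesystem assertion plus a clean exit.
    const failures: string[] = [];
    if (result.exitCode !== 0) failures.push(`agent exited with code ${result.exitCode}`);
    const target = path.join(ws, expected.file);
    if (!fs.existsSync(target)) {
      failures.push(`expected file "${expected.file}" to exist, but it does not`);
    } else if (fs.readFileSync(target, "utf8").trim() !== expected.equals.trim()) {
      failures.push(`expected file "${expected.file}" to equal "${expected.equals}"`);
    }
    return { agent, pass: failures.length === 0, failures };
  } finally {
    fs.rmSync(ws, { recursive: true, force: true });
  }
}

// Fake executor: pretends that only Claude Code produced the file.
const fakeExec: Executor = (agent, _prompt, cwd) => {
  if (agent === "claude-code") {
    fs.writeFileSync(path.join(cwd, "MARKER.txt"), "skillmatrix-marker-ok\n");
  }
  return { exitCode: 0, stdout: "" };
};
```

Because the executor is passed in, the same function can be exercised with a fake like `fakeExec` or with a wrapper around the real CLIs.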
A test is a *.smtest.json file:
{
  "name": "make-marker creates the marker file",
  "skill": "./skills/make-marker",
  "prompt": "Use your make-marker skill to create its marker file.",
  "agents": ["claude-code", "codex"],
  "assert": [{ "file": "MARKER.txt", "equals": "skillmatrix-marker-ok" }]
}

Run it (this invokes the real claude / codex CLIs and uses their quota):

npx github:skyswordw/skillmatrix examples/make-marker

make-marker creates the marker file
claude-code ✓ · codex ✓
2 cell(s): 2 ✓ pass · 0 ✗ fail
When a skill behaves differently on one agent, you see exactly where:
make-marker creates the marker file
claude-code ✓ · codex ✗
✗ codex:
- expected file "MARKER.txt" to exist, but it does not
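As a rough sketch of how per-cell results could be folded into output like the above, the snippet below groups cells by test, prints one row per test, lists failure reasons, and derives an exit code. The `Cell` shape and the `render` function are hypothetical names used only for illustration.

```typescript
// Hypothetical result shape; one Cell per (test x agent) pair.
interface Cell { test: string; agent: string; pass: boolean; failures: string[] }

function render(cells: Cell[]): { text: string; exitCode: number } {
  // Group cells by test name, preserving first-seen order.
  const rows: Record<string, Cell[]> = {};
  const order: string[] = [];
  for (const c of cells) {
    let row = rows[c.test];
    if (!row) {
      row = [];
      rows[c.test] = row;
      order.push(c.test);
    }
    row.push(c);
  }

  const lines: string[] = [];
  for (const test of order) {
    const row = rows[test] ?? [];
    lines.push(test);
    lines.push(row.map((c) => `${c.agent} ${c.pass ? "✓" : "✗"}`).join(" · "));
    // Spell out why each failing cell failed.
    for (const c of row) {
      if (c.pass) continue;
      lines.push(`✗ ${c.agent}:`);
      for (const f of c.failures) lines.push(`- ${f}`);
    }
  }

  const failed = cells.filter((c) => !c.pass).length;
  lines.push(`${cells.length} cell(s): ${cells.length - failed} ✓ pass · ${failed} ✗ fail`);
  // Non-zero exit when any cell fails, so CI can gate on it.
  return { text: lines.join("\n"), exitCode: failed === 0 ? 0 : 1 };
}
```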
Each assertion targets a file (relative to the workspace) with exactly one matcher:
| Matcher | Passes when |
|---|---|
| "exists": true / false | the file is present / absent |
| "equals": "..." | content equals the value (both sides trimmed) |
| "contains": "..." | content contains the substring |
| "matches": "regex" | content matches the JS regex (raw content) |
| "json": { "path": "a.b.0", "equals": ... } | the file parses as JSON and the value at the dot-path deep-equals the given equals value |
| "directory": true | the file is a directory |
| "glob": "data/*.json", "count": 2 | files match the glob (optional exact count; default ≥1) |
| "trace": { "path": "is_error", "equals": false } | in the agent's JSON stdout, the value at the dot-path satisfies equals / contains |
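To make the dot-path semantics of the json matcher concrete, here is a small sketch. It assumes a path like a.b.0 is split on dots and that numeric segments index into arrays; `getPath` and `jsonMatches` are hypothetical helper names, and the real implementation may treat edge cases differently.

```typescript
import { isDeepStrictEqual } from "node:util";

// Walk a parsed JSON value along a dot-path such as "a.b.0".
// Array indices work because "0" is a valid property key on arrays.
function getPath(value: unknown, dotPath: string): unknown {
  let cur: unknown = value;
  for (const key of dotPath.split(".")) {
    if (cur === null || typeof cur !== "object") return undefined;
    cur = (cur as Record<string, unknown>)[key];
  }
  return cur;
}

// Passes when the content parses as JSON and the value at the path
// deep-equals the expected value. Invalid JSON is a plain failure.
function jsonMatches(raw: string, dotPath: string, expected: unknown): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return false;
  }
  return isDeepStrictEqual(getPath(parsed, dotPath), expected);
}
```

Deep equality matters here because the expected value may itself be an object or array, not just a scalar.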
File paths are confined to the workspace — absolute paths and .. escapes are rejected (at parse and at evaluation). Content is CRLF-normalized before matching.
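The confinement and normalization rules could look roughly like this. It is a sketch under assumptions: `resolveInWorkspace`, `normalizeContent`, and `textEquals` are hypothetical names, and only the file content (not the expected value) is CRLF-normalized, following the wording above.

```typescript
import * as path from "node:path";

// Resolve an assertion path inside the workspace, rejecting absolute
// paths and any ".." sequence that would climb out of it.
function resolveInWorkspace(workspace: string, rel: string): string {
  if (path.isAbsolute(rel)) throw new Error(`absolute path rejected: ${rel}`);
  const root = path.resolve(workspace);
  const full = path.resolve(root, rel);
  const back = path.relative(root, full);
  // If the relative route from root starts with "..", the target is outside.
  if (back.split(path.sep)[0] === ".." || path.isAbsolute(back)) {
    throw new Error(`path escapes workspace: ${rel}`);
  }
  return full;
}

// Convert CRLF line endings to LF so results do not depend on platform.
function normalizeContent(content: string): string {
  return content.replace(/\r\n/g, "\n");
}

// "equals" semantics as described in the table: both sides trimmed.
function textEquals(content: string, expected: string): boolean {
  return normalizeContent(content).trim() === expected.trim();
}
```

Checking the resolved path rather than the raw string catches escapes such as a/../../x that a simple prefix test would miss.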
Seed input files into the workspace before the run with a top-level "files" map ({ "data/input.txt": "..." }), in addition to (or instead of) a "fixture" directory.
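Seeding from a "files" map amounts to writing each entry under the workspace before the agent runs. The sketch below assumes keys are workspace-relative paths and values are file contents; `seedFiles` is a hypothetical helper name, and path confinement checks are left out for brevity.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Write every entry of a { relativePath: content } map into the workspace,
// creating intermediate directories as needed.
function seedFiles(workspace: string, files: Record<string, string>): void {
  for (const rel of Object.keys(files)) {
    const target = path.join(workspace, rel);
    fs.mkdirSync(path.dirname(target), { recursive: true });
    fs.writeFileSync(target, files[rel] ?? "", "utf8");
  }
}

// Usage: seed a throwaway workspace with one input file.
const seededWorkspace = fs.mkdtempSync(path.join(os.tmpdir(), "smx-seed-"));
seedFiles(seededWorkspace, { "data/input.txt": "hello" });
```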
examples/scaffold-report is a runnable example that exercises the directory, glob+count, and json matchers — verified passing on both Claude Code and Codex:
npx @skyswordw/skillmatrix examples/scaffold-report
# scaffold-report builds the report tree
# claude-code ✓ · codex ✓

skillmatrix [path] [options]
  -a, --agent <list>   claude-code,codex,all (default: all)
  --json               machine-readable output
  --keep               keep run workspaces for debugging
  --work-dir <d>       base dir for workspaces (default: OS temp)
  --timeout <sec>      per-run timeout (default: 240)
Exit code is non-zero if any cell fails — drop it into CI. Note: CI needs the claude / codex CLIs authenticated on the runner.
v0.1 — deterministic filesystem assertions across Claude Code + Codex, proven end-to-end by a feasibility spike (spike/). Roadmap: more assertion kinds (command-was-run via the agents' JSON traces), Cursor once its skill CLI matures, and tighter integration with skillport (lint → run in one pass).
npm install     # project-local; no global installs
npm run check   # typecheck
npm test        # build + run the node:test suite (uses a fake executor; no real agent calls)
npm run build   # emit dist/

The agent runner is injectable (AgentExecutor), so the full orchestration is unit-tested without spawning real agents. The real executor (src/exec.ts) shells out to the CLIs.
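The injection pattern might be sketched as below. The interface shape is an assumption (the actual AgentExecutor in the repo may differ), and `cliExecutor`, `makeFakeExecutor`, and `allExitClean` are hypothetical names. The CLI sketch passes only the base commands mentioned earlier (claude -p, codex exec); any additional flags the real src/exec.ts uses are not shown.

```typescript
import { spawnSync } from "node:child_process";

type AgentId = "claude-code" | "codex";
interface RunResult { exitCode: number; stdout: string }

// Assumed interface shape, for illustration only.
interface AgentExecutor {
  run(agent: AgentId, prompt: string, cwd: string): RunResult;
}

// Base commands from the text above; further flags are deliberately omitted.
const COMMANDS: Record<AgentId, { bin: string; args: string[] }> = {
  "claude-code": { bin: "claude", args: ["-p"] },
  codex: { bin: "codex", args: ["exec"] },
};

// Real executor sketch: shells out to the agent CLI with a timeout.
function cliExecutor(timeoutSec = 240): AgentExecutor {
  return {
    run(agent: AgentId, prompt: string, cwd: string): RunResult {
      const { bin, args } = COMMANDS[agent];
      const r = spawnSync(bin, [...args, prompt], {
        cwd,
        encoding: "utf8",
        timeout: timeoutSec * 1000,
      });
      // A null status (spawn failure or timeout) is treated as a failure.
      return { exitCode: r.status ?? 1, stdout: r.stdout ?? "" };
    },
  };
}

// Fake executor for unit tests: records calls and never spawns anything.
function makeFakeExecutor(): AgentExecutor & { calls: AgentId[] } {
  const calls: AgentId[] = [];
  return {
    calls,
    run(agent: AgentId): RunResult {
      calls.push(agent);
      return { exitCode: 0, stdout: "" };
    },
  };
}

// Orchestration code depends only on the interface, not on the CLIs.
function allExitClean(exec: AgentExecutor, agents: AgentId[], prompt: string, cwd: string): boolean {
  return agents.every((a) => exec.run(a, prompt, cwd).exitCode === 0);
}
```

Swapping `cliExecutor()` for `makeFakeExecutor()` keeps tests fast and free of agent quota.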
Part of a small set of honest-by-default QA tools for AI-assisted development:
- skillport: static cross-agent skill linter (does your SKILL.md port across agents?)
- skillmatrix (this repo): behavioral cross-agent skill testing
- claimcheck: a CI receipt for the claims your PR makes
MIT © skyswordw