Catch AI mistakes before they cost weeks of compute. Reproduce papers from arxiv. Debug runs evidence-first. Compare experiments at the right epoch. Launch with discipline.
Built by Fatih Cagatay Akyon (1500+ citations, 7 patents) after 300+ Claude Code sessions, tens of critical AI mistakes caught the hard way, and thousands of hours of PhD research. Every guardrail in this plugin traces to a real mistake.
Claude Code is powerful, but it makes research-specific mistakes that cost weeks of compute:
- It typed "done?" as "dont?" and launched an unwanted upload of thousands of files
- It analyzed my full dataset when I asked for a specific 4k/2k/2k split
- It claimed a test covered a bug it had never actually verified
- It never once looked at a figure it generated, just trusted the numbers
- It restarted a 50-hour training job without diffing the config against the reference run, lost three days
- It claimed an experiment was diverging based on a non-converged proxy metric, killed it before downstream eval would have shown the truth
- It ran
rm -rfon a path it had hallucinated from memory, lost local checkpoints
Other plugins give you more commands. This plugin gives you guardrails.
claude plugin marketplace add fcakyon/phd-skills
claude plugin install phd-skills@phd-skills
The plugin works correctly the moment it is installed. Optional: run /phd-skills:setup for a 30-second tour of what was auto-detected and to opt into extras (notifications, allowlist, LaTeX).
Open Claude Code in your project directory, then:
/phd-skills:reproduce arxiv 2508.12345reproduce a paper from arxiv URL through replication runs"why is my loss diverging?"thedebugskill auto-triggers, runs evidence-first probes"compare run alpha to baseline"thecompareskill auto-triggers, aligns at the same epoch"launch the new training run"thelaunchskill auto-triggers, runs the pre-flight checklist/loop 30m check experiment logs, notify me if metrics beat the baseline or if loss starts to diverge
Notifications (task completion, background agents) forward to ntfy / Slack / email after /phd-skills:setup.
| Command | What it does |
|---|---|
/phd-skills:xray |
Audit paper against code and data (5 parallel dimensions) |
/phd-skills:factcheck |
Verify BibTeX entries and cited claims against DBLP |
/phd-skills:gaps <topic> |
Literature gap analysis with web confirmation |
/phd-skills:fortify [venue] |
Select strongest ablations + anticipate reviewer questions |
/phd-skills:setup |
Auto-detection tour + optional extras |
/phd-skills:help |
Show all features at a glance |
| When you say... | Skill activates |
|---|---|
| "reproduce this arxiv paper" | Reproduce |
| "why is X failing / diverging / OOMing" | Debug |
| "compare run A to baseline" | Compare |
| "launch a new training run" / "kick off training" | Launch |
| "design an ablation study" | Experiment Design |
| "find related papers on X" | Literature Research |
| "check if my numbers match the code" | Paper Verification |
| "review my methods section for consistency" | Paper Writing |
| "analyze dataset bias" | Dataset Curation |
| "prepare code for open-source release" | Research Publishing |
| "what will reviewers ask about this?" | Reviewer Defense |
| "setup latex for CVPR" | LaTeX Setup |
| Agent | What it does | Special |
|---|---|---|
paper-auditor |
Cross-checks paper claims vs code and data | Runs in isolated worktree, remembers patterns across sessions |
experiment-analyzer |
Analyzes results from wandb / neptune / tensorboard / mlflow / local | Hands off to compare and debug skills for discipline |
| phd-skills | flonat/claude-research | Others | |
|---|---|---|---|
| Commands to learn | 6 | 39 | 13-20 |
| Research integrity hooks | 11 (agent + 10 auto-detect) | 1 | 0 |
| Paper reproduction (arxiv to runs) | Yes (7-stage skill) | No | No |
| Paper-code consistency audit | 5-dimension parallel | Read-only, no code cross-ref | None |
| Experiment monitoring + SSH notifications | Yes (ntfy / slack / email) | No | No |
| External dependencies | None | npm + pip + MCP servers | MCP required |
| Install time | 30 seconds | 10+ minutes | Varies |
- Methodology over scripts. Skills teach the approach, Claude generates code for your specific setup (wandb, neptune, local files, whatever)
- Human oversight first. Claude makes premature claims and jumps to conclusions. Every skill builds in verification checkpoints
- Actionable output. Ranked suggestions with specific fixes, never just a list of findings
MIT. Use it, fork it, adapt it to your research.