feat(cli): Add lightweight evals framework to inspector #568
Conversation
- Add lightweight evals system for testing LLM interactions with MCP servers
- Implement tool call validation (required/allowed/prohibited tools)
- Add response scoring with regex, JSON schema, and LLM judge support
- Add multi-step conversation support and improved error messages
- Include sample-evals.json and documentation
Hi @steviec! Wow, this is quite a PR and quite a nice use of the Inspector CLI. I'm quite concerned, though, about the increasing size and complexity of the Inspector project. I was hesitant about adding the CLI to this project originally, and this increases the CLI's complexity quite a bit. It makes me wonder if the way forward isn't to cleave it off into a separate project, or even two: the CLI, and an Evaluator project that uses the CLI as a dependency. This would leave this project as originally conceived - a React UI with a proxy server backend, for simplicity. @olaservo @evalstate, any thoughts?
Thanks for the feedback and consideration! I'm happy to rework this as a standalone CLI tool if that seems like a better approach. I'm just motivated to make evals a more normal/expected part of MCP server implementations after installing so many servers that "work" when you just test the tools directly but fail miserably when an LLM tries to use them.
I was just thinking about building a standalone CLI - I need something like a curl equivalent for quick debugging. Currently I have a complex mcp-within-postman setup to share requests with my team. And then I got the email about this PR. +1 that this would make a great standalone package.
Our internal discussions are leaning toward suggesting you make this a personal project that possibly uses the inspector-cli package as a dependency. The reason is that, while it's definitely a useful thing, we aren't well suited to maintain it. We currently don't support any models in any way; we only test server functionality built on our SDKs. There are plenty of packages out there that do evals, so it would be a bit of a stretch for us to try to compete with them and support the issues that might arise from the use of model x vs model y, etc.
No problem, thanks for considering; I hear you on the maintenance overhead of something like this! I'll take a look at pulling this into a standalone CLI. Part of the reason I built this was because I didn't like the architecture of the existing MCP server eval frameworks; they are all (AFAICT) married to particular SDK languages and usually extend whatever unit testing framework that particular SDK uses. Thinking of evals as a peer to what Inspector does, completely decoupled from the implementation particulars, feels architecturally really useful and also gives it universal utility for MCP server builders. I'll keep you posted on whether I pull this off :).
Hi @cliffhall, I finally had enough bandwidth to pull out the evals framework into a standalone CLI tool called mcp-server-tester. Check it out! |
This PR adds a lightweight evals framework to the MCP Inspector CLI via an `--evals` flag for automated testing of LLM interactions with MCP servers. It helps catch issues where LLMs use tools incorrectly or when tools fail in ways that direct testing might miss.

Why This Might Be Useful
I've been building MCP servers and noticed that while they often work great when you test tool calls directly using Inspector, things can get wonky when LLMs start using them. LLMs might call the wrong tools, ignore instructions, or use the MCP server results in unexpected/incorrect ways. Usually I resort to hooking them up to Claude Code and just manually playing with them for a while, but I miss a lot of problems this way.
This seemed like a natural fit for Inspector since it already evaluates MCP servers "from the outside" as a client. Plus, I think it could help establish a pattern where every MCP server ships with a solid test suite.
Quick example first, then I'll share some real results that show why this might be valuable:
Create `my-evals.json`:
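The contents of the example file aren't reproduced here, but as a rough sketch - assuming a top-level `tests` array and `name`/`prompt` fields, which are illustrative guesses rather than the actual schema - a test built from the `expectedToolCalls` and `responseScorers` stanzas described under "What this feature does" below might look like:

```jsonc
// Hypothetical my-evals.json: only expectedToolCalls/required and the regex
// responseScorer come from this PR's description; the "tests", "name", and
// "prompt" keys and the tool name are illustrative guesses.
{
  "tests": [
    {
      "name": "finds connected devices",
      "prompt": "Which devices are currently available for testing?",
      "expectedToolCalls": {
        // the test fails unless the LLM calls this tool
        "required": ["list_devices"]
      },
      "responseScorers": [
        // the test fails unless the response matches this pattern
        { "type": "regex", "pattern": "device" }
      ]
    }
  ]
}
```

Swap the tool names for whatever your server actually exposes.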
Run the test:
`ANTHROPIC_API_KEY=your-key npx @modelcontextprotocol/inspector --cli --evals my-evals.json node your-server.js`
Evals output:
Real Results from My Maestro MCP Server
Here's what happened when I ran this against the MCP server I helped build for the Maestro automation framework. The results were... enlightening:
What This Revealed
Some really interesting patterns emerged:
- The models were inconsistent about calling `list_devices` when asked to list available tools - it turned out our tool descriptions were confusing. We fixed this and now it consistently works.
- The `query_docs` test failed on both models, but for completely different reasons (one of them involved `dashboard.yaml` and got a server error).

These insights helped us improve our tool descriptions and understand how different models interact with our server.
I realize this is a pretty substantial addition and might not fit with the current project direction. But I've found it critical for catching issues that the existing inspector tool misses, and I believe it would help the MCP ecosystem if testing like this became more common. Inspector seemed like the right place because it already exercises MCP servers from the outside, as a client.
What this feature does
The evals framework enables MCP server developers to define a set of tests that will 1) validate correct tool calling and 2) assess the quality of the tool use.
1. Tool Call Validation (via the `expectedToolCalls` stanza) - validates that the LLM calls the right tools using:
   - `required`: tools that must be called for the test to pass
   - `allowed`: tools that are permitted but not required
   - `prohibited`: tools that should never be called (the test fails if they are used)
2. Response Quality Assessment (via the `responseScorers` stanza) - evaluates that the conversation meets expectations using a few different scoring techniques:
   - `regex`: pattern matching in responses
   - `json-schema`: structured response validation
   - `llm-judge`: use an LLM prompt to evaluate the entire conversation

These tests are defined in a single JSON file with an easy-to-understand schema, and the test output clearly indicates when tools are not being used correctly and why.
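To make the two stanzas concrete, here are two test entries that could slot into the hypothetical `tests` array from the earlier sketch. Only the stanza names and the scorer types above come from this PR; the scorer option names (`schema`, the judge `prompt`), the `run_flow` tool, and the overall shape are assumptions for illustration:

```jsonc
[
  {
    // allowed/prohibited tools plus an LLM judge over the whole conversation
    "name": "answers a docs question without driving a device",
    "prompt": "How do I assert that a button is visible in a Maestro flow?",
    "expectedToolCalls": {
      "required": ["query_docs"],
      "allowed": ["list_devices"],
      "prohibited": ["run_flow"]   // hypothetical tool name
    },
    "responseScorers": [
      {
        "type": "llm-judge",
        "prompt": "Did the assistant answer from the docs without attempting to run a flow?"
      }
    ]
  },
  {
    // structured-output check via a JSON schema
    "name": "returns the device list as JSON",
    "prompt": "List the connected devices as a JSON object with a 'devices' array.",
    "expectedToolCalls": {
      "required": ["list_devices"]
    },
    "responseScorers": [
      {
        "type": "json-schema",
        "schema": {
          "type": "object",
          "required": ["devices"],
          "properties": { "devices": { "type": "array" } }
        }
      }
    ]
  }
]
```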
How Has This Been Tested?
- `cli-tests` suite (though it could be improved with a mock provider)

Breaking Changes
None - this adds new functionality without changing existing APIs.
Types of changes
Checklist
Additional context
For detailed usage instructions and examples, see the Evals Mode section in the updated README.
Potential Next Steps:
Depending on feedback and project direction, here are some potential paths forward:
- A standalone (`@modelcontextprotocol/evaluator`?) package that could be a peer tool to Inspector, focused purely on MCP server evaluation

Happy to go in any of these directions based on maintainer preferences and community needs!