
feat(cli): Add lightweight evals framework to inspector #568


Closed

Conversation


@steviec steviec commented Jun 28, 2025

This PR adds a lightweight evals framework to the MCP Inspector CLI via an --evals flag for automated testing of LLM interactions with MCP servers. It helps catch issues where LLMs use tools incorrectly or when tools fail in ways that direct testing might miss.

Why This Might Be Useful

I've been building MCP servers and noticed that while they often work great when you test tool calls directly using Inspector, things can get wonky when LLMs start using them. LLMs might call the wrong tools, ignore instructions, or use the MCP server results in unexpected/incorrect ways. Usually I resort to hooking them up to Claude Code and just manually playing with them for a while, but I miss a lot of problems this way.

This seemed like a natural fit for Inspector since it already evaluates MCP servers "from the outside" as a client. Plus, I think it could help establish a pattern where every MCP server ships with a solid test suite.

Quick example first, then I'll share some real results that show why this might be valuable:

Create my-evals.json:

{
  "options": {
    "timeout": 30000
  },
  "evals": [
    {
      "name": "read_only_test",
      "prompt": "Please review my README file, but don't make any changes",
      "expectedToolCalls": {
        "required": ["read_file"],
        "prohibited": ["write_file", "delete_file"]
      }
    }
  ]
}

Run the test:

ANTHROPIC_API_KEY=your-key npx @modelcontextprotocol/inspector --cli --evals my-evals.json node your-server.js

Evals output:

Running 1 eval tests across 1 model(s) (1 total runs)...

🤖 Running tests with model: claude-3-haiku-20240307
✅ read_only_test: PASSED

Results: 1/1 tests passed

Real Results from My Maestro MCP Server

Here's what happened when I ran this against the MCP server I helped build for the Maestro automation framework. The results were... enlightening:

Running 7 eval tests across 2 model(s) (14 total runs)...

🤖 Running tests with model: claude-3-haiku-20240307
❌ llm_understands_tools: FAILED
    Prompt: "Please list all available tools and capabilities you have access to."
    • Unexpected tool 'list_devices' was called (not in required or allowed list)
✅ list_devices_tool_call: PASSED
✅ start_device_tool_call: PASSED
❌ query_docs_tool_call: FAILED
    Prompt: "How do I tap on an element with specific text in Maestro?"
    • Required tool 'query_docs' was not called
    • Prohibited tool 'tap_on' was called
    • Prohibited tool 'start_device' was called
    • LLM Judge score 0.4 (threshold 0.8): "The assistant's response does not directly address how to tap on an element with specific text in Maestro. Instead, it focuses on troubleshooting issues with starting a device and listing available devices, which are not directly relevant to the original request."
✅ safety_check: PASSED
✅ check_flow_syntax_tool_call: PASSED
❌ run_flow_files_tool_call: FAILED
    Prompt: "Please run our dashboard flow file"
    • Tool call 'run_flow_files' failed: Files not found: dashboard.yaml

🤖 Running tests with model: claude-sonnet-4-20250514
✅ llm_understands_tools: PASSED
✅ list_devices_tool_call: PASSED
✅ start_device_tool_call: PASSED
❌ query_docs_tool_call: FAILED
    Prompt: "How do I tap on an element with specific text in Maestro?"
    • Tool call 'query_docs' failed: MAESTRO_CLOUD_API_KEY environment variable is required
✅ safety_check: PASSED
✅ check_flow_syntax_tool_call: PASSED
❌ run_flow_files_tool_call: FAILED
    Prompt: "Please run our dashboard flow file"
    • Required tool 'run_flow_files' was not called

Results: 8/14 tests passed

What This Revealed

Some really interesting patterns emerged:

  • Tool description issues: Haiku was calling list_devices when asked to list available tools; it turned out our tool descriptions were confusing. We fixed them, and the test now passes consistently.
  • Different failure modes per model: The query_docs test failed on both models but for completely different reasons:
    • Haiku tried to execute additional commands instead of querying docs
    • Sonnet correctly identified the tool but hit an API key requirement
  • Ambiguous prompts expose different behaviors: The "run our dashboard flow file" prompt also failed differently:
    • Haiku guessed a filename (dashboard.yaml) and got a server error
    • Sonnet just didn't call the tool at all, waiting for clarification

These insights helped us improve our tool descriptions and understand how different models interact with our server.

I realize this is a pretty substantial addition and might not fit with the current project direction. But I've found it critical for catching issues that the existing inspector tool misses, and I believe it would help the MCP ecosystem if testing like this became more common. Inspector seemed like the right place because:

  • the README calls it the "developer tool for testing and debugging MCP servers", and in my experience the most important testing/debugging of MCP servers happens during interactions with a real LLM
  • it's everyone's default MCP server tester, and would be the best place to establish MCP evals as a norm
  • it also has the right interaction model: it evaluates servers from the outside and is not tied to a particular MCP implementation language or framework

What this feature does

The evals framework enables MCP server developers to define a set of tests that will 1) validate correct tool calling and 2) assess the quality of the tool use.

1. Tool Call Validation (via expectedToolCalls stanza) - Validates that the LLM calls the right tools using:

  • required: Tools that must be called for the test to pass
  • allowed: Tools that are permitted but not required
  • prohibited: Tools that should never be called (test fails if used)

2. Response Quality Assessment (via responseScorers stanza) - Evaluates that the conversation meets expectations using a few different scoring techniques:

  • regex: Pattern matching in responses
  • json-schema: Structured response validation
  • llm-judge: Use an LLM prompt to evaluate the entire conversation

These tests are defined in a single JSON file with an easy-to-understand schema, and the test output clearly indicates when tools are not being used correctly and why.
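
As an illustration (not copied verbatim from the implementation), an eval that combines both stanzas might look roughly like the sketch below. The expectedToolCalls fields match the example at the top of this PR, and the tool names and prompt come from the Maestro results above; the exact keys inside responseScorers ("type", "prompt", "threshold") are placeholders for however the scorers are actually configured, so treat the README's schema as authoritative.

{
  "evals": [
    {
      "name": "query_docs_with_judge",
      "prompt": "How do I tap on an element with specific text in Maestro?",
      "expectedToolCalls": {
        "required": ["query_docs"],
        "allowed": ["list_devices"],
        "prohibited": ["tap_on", "start_device"]
      },
      "responseScorers": [
        {
          "type": "llm-judge",
          "prompt": "Does the response explain how to tap on an element by its text in Maestro?",
          "threshold": 0.8
        }
      ]
    }
  ]
}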

How Has This Been Tested?

  • Tested against several real MCP servers (mobile testing, file system)
  • Confirmed multi-model testing reveals different behavior patterns
  • Validated that it catches both tool usage errors and actual tool failures
  • Tested debug logging and error reporting flows
  • Added tests to the Inspector's cli-tests suite (though these could be improved with a mock provider)

Breaking Changes

None - this adds new functionality without changing existing APIs.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

For detailed usage instructions and examples, see the Evals Mode section in the updated README.

Potential Next Steps:

Depending on feedback and project direction, here are some potential paths forward:

  • Add "Evals" tab in UI: Create a new tab that helps to visualize detailed results of evals in the Inspector's UI
  • Add mock provider for testing: Create a mock LLM provider to enable deterministic testing of the evals logic without requiring API keys or external calls
  • Support additional providers: Add support for other LLM providers beyond Anthropic (though this probably isn't a priority given MCP's current ecosystem)
  • Extract to standalone tool: Move this into a separate "evaluator" (@modelcontextprotocol/evaluator?) package that could be a peer tool to Inspector, focused purely on MCP server evaluation
  • Remove this entirely what in God's name were you thinking: I accept this as a possible outcome of this PR :). Either way I'll continue to use it because it's become critical in my own MCP server creation workflow.

Happy to go in any of these directions based on maintainer preferences and community needs!

@cliffhall
Member

Hi @steviec!

Wow, this is quite a PR and quite a nice use of the Inspector CLI.

I'm quite concerned, though, about the increasing size and complexity of the Inspector project. I was hesitant about the addition of the CLI inside this project originally, and this increases the CLI's complexity quite a bit.

It makes me wonder if the way forward isn't possibly to cleave it off into a separate project, or even two: the CLI, and an Evaluator project that uses the CLI as a dependency. This would leave this project as originally conceived: a React UI with a proxy server backend, for simplicity.

@olaservo @evalstate any thoughts?

@steviec
Author

steviec commented Jul 2, 2025

Thanks for the feedback and consideration! I'm happy to rework this as a standalone CLI tool if that seems like a better approach. I'm just motivated to make evals a more normal/expected part of MCP server implementations after installing so many servers that "work" when you just test the tools directly but fail miserably when an LLM tries to use them.

@QuantGeekDev
Contributor

I was just thinking about building a standalone CLI - I need something like a curl equivalent for quick debugging. Currently I have a complex mcp-within-postman setup to share requests with my team. And then I got the email about this PR. +1 that this would make a great standalone package

@cliffhall
Member

Our internal discussions are leaning toward suggesting you make this a personal project that possibly uses the inspector-cli package as a dependency.

The reason is that, while it's definitely a useful thing, we aren't well suited to maintain it. We currently don't support any models in any way; we only test server functionality built on our SDKs. There are plenty of packages out there that do evals, so it would be a bit of a stretch for us to try to compete with them and support the issues that might arise from the use of model x vs model y, etc.

@cliffhall cliffhall closed this Jul 2, 2025
@steviec
Author

steviec commented Jul 2, 2025

No problem, thanks for considering; I hear you on the maintenance overhead of something like this! I'll take a look at pulling this into a standalone CLI.

Part of the reason I built this was because I didn't like the architecture of the existing MCP server eval frameworks; they are all (AFAICT) married to particular SDK languages and usually extend whatever unit testing framework that particular SDK uses. Thinking of evals as a peer to what inspector does, completely decoupled from the implementation particulars, feels architecturally really useful and also gives it universal utility for MCP server builders. I'll keep you posted on whether I pull this off :).

@steviec
Author

steviec commented Jul 24, 2025

Hi @cliffhall, I finally had enough bandwidth to pull out the evals framework into a standalone CLI tool called mcp-server-tester. Check it out!
