
feat(cli): Add lightweight evals framework to inspector #568


Closed

Conversation


@steviec steviec commented Jun 28, 2025

This PR adds a lightweight evals framework to the MCP Inspector CLI via an --evals flag for automated testing of LLM interactions with MCP servers. It helps catch issues where LLMs use tools incorrectly or when tools fail in ways that direct testing might miss.

Why This Might Be Useful

I've been building MCP servers and noticed that while they often work great when you test tool calls directly using Inspector, things can get wonky when LLMs start using them. LLMs might call the wrong tools, ignore instructions, or use the MCP server results in unexpected/incorrect ways. Usually I resort to hooking them up to Claude Code and just manually playing with them for a while, but I miss a lot of problems this way.

This seemed like a natural fit for Inspector since it already evaluates MCP servers "from the outside" as a client. Plus, I think it could help establish a pattern where every MCP server ships with a solid test suite.

Quick example first, then I'll share some real results that show why this might be valuable:

Create my-evals.json:

{
  "options": {
    "timeout": 30000
  },
  "evals": [
    {
      "name": "read_only_test",
      "prompt": "Please review my README file, but don't make any changes",
      "expectedToolCalls": {
        "required": ["read_file"],
        "prohibited": ["write_file", "delete_file"]
      }
    }
  ]
}

Run the test:

ANTHROPIC_API_KEY=your-key npx @modelcontextprotocol/inspector --cli --evals my-evals.json node your-server.js

Evals output:

Running 1 eval tests across 1 model(s) (1 total runs)...

🤖 Running tests with model: claude-3-haiku-20240307
✅ read_only_test: PASSED

Results: 1/1 tests passed

Real Results from My Maestro MCP Server

Here's what happened when I ran this against the MCP server I helped build for the Maestro automation framework. The results were... enlightening:

Running 7 eval tests across 2 model(s) (14 total runs)...

🤖 Running tests with model: claude-3-haiku-20240307
❌ llm_understands_tools: FAILED
    Prompt: "Please list all available tools and capabilities you have access to."
    • Unexpected tool 'list_devices' was called (not in required or allowed list)
✅ list_devices_tool_call: PASSED
✅ start_device_tool_call: PASSED
❌ query_docs_tool_call: FAILED
    Prompt: "How do I tap on an element with specific text in Maestro?"
    • Required tool 'query_docs' was not called
    • Prohibited tool 'tap_on' was called
    • Prohibited tool 'start_device' was called
    • LLM Judge score 0.4 (threshold 0.8): "The assistant's response does not directly address how to tap on an element with specific text in Maestro. Instead, it focuses on troubleshooting issues with starting a device and listing available devices, which are not directly relevant to the original request."
✅ safety_check: PASSED
✅ check_flow_syntax_tool_call: PASSED
❌ run_flow_files_tool_call: FAILED
    Prompt: "Please run our dashboard flow file"
    • Tool call 'run_flow_files' failed: Files not found: dashboard.yaml

🤖 Running tests with model: claude-sonnet-4-20250514
✅ llm_understands_tools: PASSED
✅ list_devices_tool_call: PASSED
✅ start_device_tool_call: PASSED
❌ query_docs_tool_call: FAILED
    Prompt: "How do I tap on an element with specific text in Maestro?"
    • Tool call 'query_docs' failed: MAESTRO_CLOUD_API_KEY environment variable is required
✅ safety_check: PASSED
✅ check_flow_syntax_tool_call: PASSED
❌ run_flow_files_tool_call: FAILED
    Prompt: "Please run our dashboard flow file"
    • Required tool 'run_flow_files' was not called

Results: 8/14 tests passed

What This Revealed

Some really interesting patterns emerged:

  • Tool description issues: Haiku was calling list_devices when asked to list available tools; it turned out our tool descriptions were confusing. We fixed them, and the test now passes consistently.
  • Different failure modes per model: The query_docs test failed on both models but for completely different reasons:
    • Haiku tried to execute additional commands instead of querying docs
    • Sonnet correctly identified the tool but hit an API key requirement
  • Ambiguous prompts expose different behaviors: The "run our dashboard flow file" prompt also failed differently:
    • Haiku guessed a filename (dashboard.yaml) and got a server error
    • Sonnet just didn't call the tool at all, waiting for clarification

These insights helped us improve our tool descriptions and understand how different models interact with our server.

I realize this is a pretty substantial addition and might not fit with the current project direction. But I've found it critical for catching issues that the existing inspector tool misses, and I believe it would help the MCP ecosystem if testing like this became more common. Inspector seemed like the right place because:

  • the README calls it the "developer tool for testing and debugging MCP servers", and in my experience the most important testing/debugging of MCP servers happens during interactions with a real LLM
  • it's everyone's default MCP server tester, and would be the best place to establish MCP evals as a norm
  • it also has the right interaction model: it evaluates servers from the outside and is not tied to a particular MCP implementation language or framework

What this feature does

The evals framework enables MCP server developers to define a set of tests that will 1) validate correct tool calling and 2) assess the quality of the tool use.

1. Tool Call Validation (via expectedToolCalls stanza) - Validates that the LLM calls the right tools using:

  • required: Tools that must be called for the test to pass
  • allowed: Tools that are permitted but not required
  • prohibited: Tools that should never be called (test fails if used)

2. Response Quality Assessment (via responseScorers stanza) - Evaluates that the conversation meets expectations using a few different scoring techniques:

  • regex: Pattern matching in responses
  • json-schema: Structured response validation
  • llm-judge: Use an LLM prompt to evaluate the entire conversation

These tests are defined in a single JSON file with an easy-to-understand schema, and the test output clearly indicates when tools are not being used correctly and why.
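
As an illustration (not copied verbatim from the implementation), an eval that combines both stanzas might look roughly like the sketch below. The expectedToolCalls fields match the example at the top of this PR, and the tool names and prompt come from the Maestro results above; the exact keys inside responseScorers ("type", "prompt", "threshold") are placeholders for however the scorers are actually configured, so treat the README's schema as authoritative.

{
  "evals": [
    {
      "name": "query_docs_with_judge",
      "prompt": "How do I tap on an element with specific text in Maestro?",
      "expectedToolCalls": {
        "required": ["query_docs"],
        "allowed": ["list_devices"],
        "prohibited": ["tap_on", "start_device"]
      },
      "responseScorers": [
        {
          "type": "llm-judge",
          "prompt": "Does the response explain how to tap on an element by its text in Maestro?",
          "threshold": 0.8
        }
      ]
    }
  ]
}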

How Has This Been Tested?

  • Tested against several real MCP servers (mobile testing, file system)
  • Confirmed multi-model testing reveals different behavior patterns
  • Validated that it catches both tool usage errors and actual tool failures
  • Tested debug logging and error reporting flows
  • Added tests to the Inspector's cli-tests suite (though these could be improved with a mock provider)

Breaking Changes

None - this adds new functionality without changing existing APIs.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

For detailed usage instructions and examples, see the Evals Mode section in the updated README.

Potential Next Steps:

Depending on feedback and project direction, here are some potential paths forward:

  • Add "Evals" tab in UI: Create a new tab that helps to visualize detailed results of evals in the Inspector's UI
  • Add mock provider for testing: Create a mock LLM provider to enable deterministic testing of the evals logic without requiring API keys or external calls
  • Support additional providers: Add support for other LLM providers beyond Anthropic (though this probably isn't a priority given MCP's current ecosystem)
  • Extract to standalone tool: Move this into a separate "evaluator" (@modelcontextprotocol/evaluator?) package that could be a peer tool to Inspector, focused purely on MCP server evaluation
  • Remove this entirely what in God's name were you thinking: I accept this as a possible outcome of this PR :). Either way I'll continue to use it because it's become critical in my own MCP server creation workflow.

Happy to go in any of these directions based on maintainer preferences and community needs!

@cliffhall
Member

Hi @steviec!

Wow, this is quite a PR and quite a nice use of the Inspector CLI.

I'm quite concerned, though, about the increasing size and complexity of the Inspector project. I was hesitant about the addition of the CLI inside this project originally, and this increases the CLI's complexity quite a bit.

It makes me wonder if the way forward isn't possibly to cleave it off into a separate project, or even two: the CLI, and an Evaluator project that uses the CLI as a dependency. This would leave this project as originally conceived: a React UI with a proxy server backend, for simplicity.

@olaservo @evalstate any thoughts?

@steviec
Author

steviec commented Jul 2, 2025

Thanks for the feedback and consideration! I'm happy to rework this as a standalone CLI tool if that seems like a better approach. I'm just motivated to make evals a more normal/expected part of MCP server implementations after installing so many servers that "work" when you just test the tools directly but fail miserably when an LLM tries to use them.

@QuantGeekDev
Contributor

I was just thinking about building a standalone CLI - I need something like a curl equivalent for quick debugging. Currently I have a complex mcp-within-postman setup to share requests with my team. And then I got the email about this PR. +1 that this would make a great standalone package

@cliffhall
Member

Our internal discussions are leaning toward suggesting you make this a personal project that possibly uses the inspector-cli package as a dependency.

The reason is that, while it's definitely a useful thing, we aren't well suited to maintain it. We currently don't support any models in any way; we only test server functionality built on our SDKs. There are plenty of packages out there that do evals, so it would be a bit of a stretch for us to try to compete with them and support the issues that might arise from the use of model x vs model y, etc.

@cliffhall cliffhall closed this Jul 2, 2025
@steviec
Author

steviec commented Jul 2, 2025

No problem, thanks for considering; I hear you on the maintenance overhead of something like this! I'll take a look at pulling this into a standalone CLI.

Part of the reason I built this was because I didn't like the architecture of the existing MCP server eval frameworks; they are all (AFAICT) married to particular SDK languages and usually extend whatever unit testing framework that particular SDK uses. Thinking of evals as a peer to what inspector does, completely decoupled from the implementation particulars, feels architecturally really useful and also gives it universal utility for MCP server builders. I'll keep you posted on whether I pull this off :).

@steviec
Author

steviec commented Jul 24, 2025

Hi @cliffhall, I finally had enough bandwidth to pull out the evals framework into a standalone CLI tool called mcp-server-tester. Check it out!
