Evaluation lets you automatically test your MCP server by simulating real user interactions. It generates mock questions, runs them through your tools, and produces detailed reports with suggestions.

How It Works

1. Deploy your server: deploy your MCP server with `concierge deploy`.
2. Run evaluation: launch an evaluation from the Concierge Platform. Configure the LLM provider, model, and number of mock interactions.
3. Review the report: get a detailed breakdown of how your server handled each interaction.

What Gets Evaluated

Tool Selection Accuracy

Does the LLM pick the right tool for each question?
| Query | Tool Called | Result |
| --- | --- | --- |
| "Find me a laptop" | search_products | Correct |
| "Add that to my cart" | add_to_cart | Correct |
| "What's in my cart?" | search_products | Wrong (should be get_cart) |
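Tool-selection accuracy boils down to comparing the tool the LLM actually called against the tool expected for each mock query. A minimal sketch of that scoring, using illustrative field names (`tool_called`, `expected_tool` are assumptions, not the Concierge report schema):

```python
def tool_selection_accuracy(interactions):
    """Fraction of interactions where the called tool matches the expected one."""
    if not interactions:
        return 0.0
    correct = sum(1 for i in interactions if i["tool_called"] == i["expected_tool"])
    return correct / len(interactions)

# The three interactions from the table above: two correct, one wrong.
interactions = [
    {"query": "Find me a laptop", "expected_tool": "search_products", "tool_called": "search_products"},
    {"query": "Add that to my cart", "expected_tool": "add_to_cart", "tool_called": "add_to_cart"},
    {"query": "What's in my cart?", "expected_tool": "get_cart", "tool_called": "search_products"},
]
print(tool_selection_accuracy(interactions))  # 2 of 3 correct
```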

Stage Flow Compliance

Does the LLM follow the intended stage order?
| Flow | Result |
| --- | --- |
| browse -> cart -> checkout | Correct |
| browse -> checkout | Skipped cart stage |
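One way to check stage compliance is to walk the observed stages and verify each one matches the intended flow at the same position, flagging the first skipped or out-of-order stage. A hedged sketch (the stage names come from the table; the function and its return shape are illustrative):

```python
INTENDED_FLOW = ["browse", "cart", "checkout"]

def check_stage_flow(observed, intended=INTENDED_FLOW):
    """Return (ok, message); ok is True only if no stage was skipped or reordered."""
    for pos, stage in enumerate(observed):
        if pos >= len(intended) or stage != intended[pos]:
            expected = intended[pos] if pos < len(intended) else "end of flow"
            return False, f"expected {expected}, got {stage}"
    return True, "Correct"

print(check_stage_flow(["browse", "cart", "checkout"]))  # (True, 'Correct')
print(check_stage_flow(["browse", "checkout"]))          # flags the skipped cart stage
```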

Response Quality

Are tool responses useful and complete? The evaluator checks for missing fields, incomplete data, and unhelpful error messages.
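A missing-field check is the simplest form of this: for each tool, list the fields a client needs and flag responses that omit them or leave them empty. A sketch under assumed field names (`id`, `name`, `price` are illustrative, not a Concierge contract):

```python
# Required fields per tool; these names are assumptions for illustration.
REQUIRED_FIELDS = {"search_products": ["id", "name", "price"]}

def missing_fields(tool, response_item):
    """Return the required fields that are absent or empty in a response item."""
    required = REQUIRED_FIELDS.get(tool, [])
    return [f for f in required if response_item.get(f) in (None, "")]

item = {"id": "p1", "name": "Laptop"}  # no price field
print(missing_fields("search_products", item))  # ['price']
```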

Error Handling

How does the server handle edge cases? For example, calling checkout with an empty cart, or passing an invalid product ID.
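What "good" error handling looks like for the empty-cart case can be sketched with a stand-in tool: reject the call with a clear, actionable message instead of crashing or silently succeeding. The `checkout` function below is hypothetical, not a real Concierge or MCP API:

```python
def checkout(cart):
    """Stand-in for a checkout tool: refuse an empty cart with a helpful error."""
    if not cart:
        return {"error": "Cart is empty; add items before checking out."}
    return {"status": "ok", "items": len(cart)}

result = checkout([])
assert "error" in result and result["error"]  # non-empty, actionable message
print(result["error"])
```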

Report Output

{
  "summary": {
    "total_interactions": 20,
    "success_rate": 0.85,
    "avg_response_time_ms": 145,
    "stage_compliance": 0.95
  },
  "suggestions": [
    "Add a 'get_cart' tool: users frequently ask about cart contents",
    "checkout tool should validate cart is non-empty before processing",
    "search_products should handle empty query gracefully"
  ],
  "interactions": [...]
}
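A report like this can gate a CI pipeline: parse the summary and fail the build when success or compliance drops below a threshold. The report shape matches the example above; the thresholds and the `gate` helper are assumptions:

```python
import json

report = json.loads("""{
  "summary": {"total_interactions": 20, "success_rate": 0.85,
              "avg_response_time_ms": 145, "stage_compliance": 0.95}
}""")

def gate(summary, min_success=0.8, min_compliance=0.9):
    """True if the run meets both quality thresholds."""
    return (summary["success_rate"] >= min_success
            and summary["stage_compliance"] >= min_compliance)

print("PASS" if gate(report["summary"]) else "FAIL")  # PASS
```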
Evaluation runs are non-destructive: they use the same MCP protocol as any other client, so no server-side changes are needed.