How It Works
Run an Evaluation
Launch an evaluation from the Concierge Platform. Configure the LLM provider, model, and number of mock interactions.
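A launch configuration might look like the following sketch. The field names (`provider`, `model`, `num_interactions`) are illustrative assumptions, not the Concierge Platform's actual schema:

```python
# Hypothetical evaluation config; field names are illustrative,
# not the platform's real schema.
eval_config = {
    "provider": "openai",       # LLM provider to drive the evaluation
    "model": "gpt-4o",          # model used for mock interactions
    "num_interactions": 25,     # how many mock interactions to simulate
}
```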
What Gets Evaluated
Tool Selection Accuracy
Does the LLM pick the right tool for each question?

| Query | Tool Called | Result |
|---|---|---|
| "Find me a laptop" | search_products | Correct |
| "Add that to my cart" | add_to_cart | Correct |
| "What's in my cart?" | search_products | Wrong (should be get_cart) |
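The scoring behind this table can be sketched as a simple comparison of expected versus actual tool calls. The function and case data below are illustrative, not the evaluator's real API:

```python
# Minimal sketch of tool-selection scoring: compare the tool the LLM
# actually called against the expected tool for each query.
def tool_selection_accuracy(cases):
    """cases: list of (query, expected_tool, called_tool) tuples."""
    correct = sum(1 for _, expected, called in cases if expected == called)
    return correct / len(cases)

cases = [
    ("Find me a laptop", "search_products", "search_products"),
    ("Add that to my cart", "add_to_cart", "add_to_cart"),
    ("What's in my cart?", "get_cart", "search_products"),  # wrong tool
]
tool_selection_accuracy(cases)  # 2 of 3 correct
```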
Stage Flow Compliance
Does the LLM follow the intended stage order?

| Flow | Result |
|---|---|
| browse -> cart -> checkout | Correct |
| browse -> checkout | Skipped cart stage |
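A stage-flow check like this can be sketched as an in-order subsequence test: each required stage must appear in the observed flow, in order. The function name and stage list are assumptions for illustration:

```python
# Sketch of a stage-flow compliance check: verify the observed stages
# contain the required stages in order, reporting the first one skipped.
def check_stage_flow(observed, required):
    """Return None if compliant, else the first skipped stage."""
    it = iter(observed)
    for stage in required:
        # `in` advances the iterator, so order is enforced
        if stage not in it:
            return stage
    return None

required = ["browse", "cart", "checkout"]
check_stage_flow(["browse", "cart", "checkout"], required)  # None (compliant)
check_stage_flow(["browse", "checkout"], required)          # "cart" (skipped)
```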
Response Quality
Are tool responses useful and complete? The evaluator checks for missing fields, incomplete data, and unhelpful error messages.

Error Handling
How does the server handle edge cases? For example, calling checkout with an empty cart, or passing an invalid product ID.

Report Output
Evaluation runs are non-destructive. They use the same MCP protocol as any client. No special server-side changes needed.
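The response-quality check described above can be sketched as a scan for missing or empty fields in a tool response. The field names and response shape here are illustrative assumptions, not a real server's schema:

```python
# Sketch of a response-completeness check: flag required fields that are
# absent or empty in a tool response. Field names are hypothetical.
def find_missing_fields(response, required_fields):
    return [f for f in required_fields
            if f not in response or response[f] in (None, "", [])]

resp = {"name": "UltraBook 14", "price": None, "stock": []}
find_missing_fields(resp, ["name", "price", "stock", "sku"])
# -> ["price", "stock", "sku"]
```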