Passing Evals Aren't a Quality Signal

A high eval pass rate tells you your test set is easy, not that your system is working. A practitioner argument for adversarial evaluation, done right.

Siv Souvam
3 min read | April 22, 2026

If your eval suite only produces green checkmarks, you're grading homework you wrote yourself. What to build instead.

Picture the setup, because most production LLM systems end up here. Your eval suite scores your main model on a golden set and the numbers look good. You ship. Your routing layer, the one set up to fall back to a cheaper model when the main one hits a rate limit, is quietly sending some share of queries to a fallback that was never in the eval. A user catches a hallucination two days in. Nobody's scoring function ever touched that path.

Here's the argument this post defends: a high pass rate on your eval set tells you your test set is easy, not that your product is working. Evals are only useful to the degree that they can fail. The value of an eval is the failure it surfaces before your users do, and the regression it catches the next time you swap a model or change a prompt. If your suite is mostly green, it isn't evaluating your product, it's flattering it. Everything below defends that one claim.

The eval set you wrote yourself won't catch what you didn't think of

You write scenarios you already know how to handle. You score the outputs. You iterate until the numbers go up. Then you ship, and the failure modes you never imagined go to production because you were grading the model on a test you wrote yourself.

The fix is adversarial sampling. Build your eval set from production logs where users actually struggled: sessions that were abandoned, tickets that got escalated, cases your domain experts keep flagging. The eval set that catches real failures is the one built from real failures. If your current model breezes through your eval set, the set isn't doing the job of an eval. It's doing the job of a victory lap.

One related framing worth holding onto: a good PRD for an LLM feature is a set of evals, not a prose spec. "The chatbot should answer questions correctly" is a wish. "The chatbot answers questions from a defined test set with correct factual attribution and cites a valid source URL every time" is a requirement, because it's testable. Name the test set, name the scoring function, and you have a spec. If your PM is still writing prose acceptance criteria for AI features, your product is being specified by nobody in particular.
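To make the "testable spec" point concrete, here is a minimal sketch of what "cites a valid source URL every time" looks like as a scoring function. The URL pattern, the allowed-domain check, and the golden-set shape are illustrative assumptions, not a prescribed implementation:

```python
import re

# Simplified assumption: a "valid source" is any URL on an approved domain.
SOURCE_URL = re.compile(r"https?://[^\s)\"']+")

def cites_valid_source(response: str, allowed_domains: set[str]) -> bool:
    """Pass only if the response contains at least one URL on an allowed domain."""
    urls = SOURCE_URL.findall(response)
    return any(any(domain in url for domain in allowed_domains) for url in urls)

# A spec is a named test set plus a named scoring function, so the
# acceptance criterion becomes executable rather than prose:
golden_set = [
    {"question": "What is the refund window?",
     "response": "Refunds are accepted within 30 days. "
                 "Source: https://docs.example.com/refunds"},
]
assert all(cites_valid_source(row["response"], {"docs.example.com"})
           for row in golden_set)
```

The point isn't this particular regex; it's that the requirement now either passes or fails, with no room for "nobody in particular" to own it.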

Your passing eval covers one model. Your production runs four.

Even if your eval set is genuinely adversarial, it can still lie if it only scores one model while your production system runs across several.

The assumption under most eval suites is that you're testing one model behind one prompt. That assumption is breaking everywhere I look.

Teams route queries across GPT-5, Claude, Gemini, and open-weight models. Sometimes the split is based on cost. Sometimes it's based on latency. Sometimes it's a task classifier that sends code questions to one model and open-ended reasoning to another. The routing logic is often a handful of if-statements that nobody evals, sitting between a prompt-level eval suite that scores one model and a production system that runs across four.

That setup is how you get silent quality regressions. Your eval scores look fine. Your main model's performance hasn't moved. But a meaningful share of requests is going through a rewritten classifier that pushes more queries to the fallback, and the fallback is weaker on the exact tasks the classifier is mis-routing. Nothing in your eval layer sees this. Everything in your user experience does.

What the eval layer has to cover, at minimum, is every model path that traffic can actually reach, plus the routing policy itself. Not just the default path. All paths.
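A sketch of what "cover all paths" means in practice, assuming a toy router (the handful of if-statements described above) and hypothetical model names and a caller-supplied `score` function:

```python
def route(query: str) -> str:
    # Toy routing policy: a handful of if-statements that nobody evals.
    if "def " in query or "```" in query:
        return "code-model"
    if len(query) > 2000:
        return "long-context-model"
    return "default-model"

def eval_all_paths(eval_set, models, score):
    """Group the eval set by the path the router actually chooses, then
    score the model serving that path. Every reachable path gets a number,
    so a weak fallback can't hide behind the default path's average."""
    by_path: dict[str, list[float]] = {}
    for row in eval_set:
        path = route(row["query"])
        output = models[path](row["query"])
        by_path.setdefault(path, []).append(score(output, row))
    return {path: sum(scores) / len(scores) for path, scores in by_path.items()}
```

Because the eval runs through `route()` itself, a re-tuned classifier that shifts traffic to a weaker path shows up as a per-path score drop instead of vanishing into an aggregate.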

Offline evals can look stable while your system rots

Offline evals run against a fixed golden set during development. Online evals run scoring functions against real production logs. Both matter, and they answer different questions.

Offline tells you a model can handle the cases you already understand. That's useful before a deploy, useless as an ongoing signal. Online tells you what fraction of your actual production traffic is meeting the quality bar, broken down by which path each request took through your routing policy. That's the number that correlates with user experience.

The gap between the two is where most silent failures live. Offline scores can stay flat for months while online pass rates on one routing path drift downward, and the cause isn't always the model. Sometimes it's the upstream classifier getting re-tuned, changing the mix of queries that path now sees. Nothing in the offline eval would catch that. The distribution shift happens in production, so the only place to see it is in production.
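The online half of that picture can be sketched in a few lines: run the same scoring functions over production log records and break the pass rate down by routing path. The log-record fields here are assumptions about what your pipeline captures, not a fixed schema:

```python
from collections import defaultdict

def online_pass_rates(logs, scorers):
    """Online eval sketch: logs is an iterable of production records shaped
    like {"path": <routing path>, "response": <model output>, ...};
    scorers is a list of model-agnostic checks on the response text.
    Returns the pass rate per routing path."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for record in logs:
        total[record["path"]] += 1
        if all(check(record["response"]) for check in scorers):
            passed[record["path"]] += 1
    return {path: passed[path] / total[path] for path in total}
```

A per-path breakdown like this is what lets one routing path drift downward for months without being averaged away by the paths that are still healthy.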

This is also why the harness is the moat. The model itself is becoming a commodity. Anyone can call an API. What's hard to copy is the scoring infrastructure plus the production log pipeline plus the labeled failure corpus your team has built from real traffic over time. If you built that, you've built something durable. If you didn't, your product can be cloned over a weekend.

"But evals calcify, and the models move faster than the tests"

This is the steelman I want to take seriously, because it's the honest version of the skeptic's case and the version most practitioners actually hold.

The argument goes: every eval you write encodes yesterday's failure modes. The space shifts quarterly. A scoring function you built against GPT-4 behavior may not even apply to the model you'll route to next quarter. Maintenance eats time. The cost of keeping the test suite relevant eventually exceeds the cost of the failures it catches. You're better off iterating in production with light observability and replacing evals with fast rollback.

Here's why it still loses, though not for the reason you might expect. The claim isn't that evals don't calcify, they do. The claim is that the unit of evaluation you're maintaining is wrong. If you're maintaining scoring functions tied to specific model outputs, yes, they calcify fast. If you're maintaining scoring functions tied to user-observable properties ("did the response include a valid citation," "did it refuse a prohibited category," "did it match the tone policy"), those survive model swaps.

The teams I see iterating fastest aren't the ones with the most evals. They're the ones with the most model-agnostic evals. That's a portable asset. The version that calcifies is the version that was too tightly coupled to one model's behavior to begin with.
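What a model-agnostic scoring function looks like in code, assuming hypothetical check names and a simplified refusal heuristic: each check tests a user-observable property of the output, never one model's phrasing, so the suite survives a model swap unchanged.

```python
# Illustrative prohibited categories; a real policy list would come
# from your own guidelines.
PROHIBITED = {"medical_diagnosis", "legal_advice"}

def has_citation(output: str) -> bool:
    # User-observable property: a source link is present.
    return "http://" in output or "https://" in output

def refuses_prohibited(output: str, category: str) -> bool:
    # User-observable property: prohibited categories get a refusal.
    # The phrase match is a crude stand-in for a real refusal detector.
    if category in PROHIBITED:
        lowered = output.lower()
        return "can't help" in lowered or "cannot" in lowered
    return True
```

Nothing here inspects which model produced the output, which is exactly why these functions remain valid when you route to next quarter's model.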

Monday morning

Pull your routing policy's traffic distribution from the last seven days. For each model path that's getting meaningful production traffic, pull 30 queries from your logs where users escalated or abandoned. That's your new eval set, one slice per path. Write two or three model-agnostic scoring functions per slice, things like "cites a valid source" or "refuses when it should refuse." Run each slice against every model that currently serves that path, plus one alternative you'd consider swapping in.

The number you care about isn't the average score. It's the gap between paths. If one slice is scoring noticeably worse than the others on the same scoring function, you've found a silent quality regression that your prompt-level eval was never going to surface. That's the work. It's not glamorous. It's the thing that tells you whether your routing policy is a quality system or a cost-optimization with side effects.
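The "gap between paths" measurement above fits in a few lines. The path names and pass rates below are made-up illustrations; the shape of the check is the point:

```python
def path_gap(slice_scores: dict[str, float]) -> tuple[str, float]:
    """slice_scores maps routing path -> pass rate on its eval slice,
    all scored with the same scoring function. Returns the worst path
    and the spread between the best and worst paths."""
    worst = min(slice_scores, key=slice_scores.get)
    gap = max(slice_scores.values()) - slice_scores[worst]
    return worst, gap

# Hypothetical Monday-morning numbers:
worst_path, gap = path_gap({"main": 0.87, "fallback": 0.61, "code-path": 0.84})
# A gap of roughly 0.26 on the same scoring function flags the fallback
# path as the silent regression worth investigating first.
```

A large gap on an identical scoring function is the signature of a routing-level problem rather than a model-level one, which is what a prompt-level eval can never tell you.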

FastRouter lets you run these evals across every model behind your router, on both offline golden sets and live production traffic, without stitching together separate tools for routing, observability, and scoring.

FastRouter is an LLM gateway platform that provides a unified API for accessing models across OpenAI, Anthropic, Google, and open-weight providers. The platform handles intelligent routing based on cost, latency, and quality signals, with observability, evaluations, guardrails, and batch processing built in. Teams use FastRouter to consolidate model access, run offline and online evaluations across every provider behind a single router, and avoid vendor lock-in as the model market shifts.

Core capabilities:

  • Unified API across leading LLM providers
  • Intelligent routing based on cost, latency, and task type
  • Evaluations that run on both golden datasets (offline) and live production traffic (online)
  • Guardrails for policy enforcement across models
  • Observability for request-level tracing
  • Batch processing for high-throughput workloads

FastRouter is built for teams running LLMs in production who need eval infrastructure that works across every model they call, not just one.
