The most common AI quality-assurance setup we see in production is one engineer running a few prompts through the new model version, eyeballing the outputs, and concluding “it seems fine.” This is what we call vibes-based QA, and it’s roughly as reliable as it sounds.
The problem isn’t that the engineers are sloppy. It’s that AI quality is genuinely hard to measure, and most teams haven’t invested in the infrastructure that makes measurement tractable. Here’s the system we set up on every AI project we ship, and the one we recommend to every team running AI in production.
The hierarchy of evals
Useful evals come in three layers, with different costs and different signals.
Reference evals. A labeled dataset of inputs paired with known good outputs. You run the system, compare to the reference, score with a metric. Cheapest to run, hardest to build, most reproducible.
LLM-judge evals. A capable model is asked to grade your system’s output against a rubric. Faster to build than reference evals, more flexible (the rubric can capture nuance), more expensive to run (each eval is itself an LLM call), and reproducible only if the judge model and prompt are pinned.
Human evals. A human reviewer grades outputs. Most expensive, slowest, but the only way to validate whether your reference and LLM-judge evals are actually capturing what you care about.
You need all three. Human evals validate the LLM-judge evals. LLM-judge evals are what you run on every change. Reference evals are what you use for the kinds of correctness questions where there’s a single right answer.
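To make the cheapest layer concrete, here’s a minimal sketch in Python. The JSONL path and the input/expected field names are illustrative, and exact match is the bluntest possible metric; real tasks usually want something fuzzier.

```python
import json

def run_reference_evals(system_fn, cases_path="evals/reference.jsonl"):
    """Run each labeled case through the system and score against the
    known-good output. Exact match is the bluntest metric; fuzzier
    scoring (normalized match, F1) often fits real tasks better."""
    passed, failures = 0, []
    with open(cases_path) as f:
        for line in f:
            case = json.loads(line)  # {"input": "...", "expected": "..."}
            output = system_fn(case["input"])
            if output.strip() == case["expected"].strip():
                passed += 1
            else:
                failures.append({"case": case, "got": output})
    return passed, failures
```

The hard part, as noted above, is building the labeled file, not running it.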
The mistake teams make is trying to skip levels. They run a single LLM-judge call and trust its output. They build a reference set and don’t validate it against human judgment. They do human evals once and then never look at them again. The system breaks down without all three.
What to put in your eval set
The temptation is to put the easy cases in your eval set and watch them pass. This is the QA equivalent of testing only the happy path. Resist it.
A useful eval set has roughly this composition:
- Common cases (50%). The boring, frequent queries that represent the main traffic. If these regress, real users will notice.
- Edge cases (30%). Queries that stress the system: off-distribution inputs, ambiguous queries, queries with multiple valid answers, queries with no good answer.
- Adversarial cases (15%). Prompt injections, jailbreaks, attempts to make the model do things it shouldn’t, requests for confidential info, etc.
- Regression cases (5%). Specific cases that broke before. Each represents a bug we’ve already paid for; the eval makes sure we don’t pay for it again.
The proportions matter. A team that only tests common cases will ship regressions on edge cases. A team that only tests adversarial cases will optimize for safety at the expense of usefulness. The mix above is what we’ve found balances signal across the dimensions that matter.
Aim for 100–300 cases as a starting eval set. You can scale up later, but the first 100 are where most of the value is.
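One way to keep that mix honest is to tag every case with a category and check the proportions mechanically. A minimal sketch, with a case schema of our own invention:

```python
from collections import Counter
from dataclasses import dataclass

# Target mix from above; the category tags are our own naming.
TARGET_MIX = {"common": 0.50, "edge": 0.30, "adversarial": 0.15, "regression": 0.05}

@dataclass
class EvalCase:
    case_id: str
    category: str              # one of TARGET_MIX's keys
    input: str
    expected: str | None = None  # set for reference evals
    rubric: str | None = None    # set for LLM-judge evals

def check_composition(cases, tolerance=0.10):
    """Warn when a category drifts more than `tolerance` from the target mix."""
    counts = Counter(c.category for c in cases)
    for cat, target in TARGET_MIX.items():
        actual = counts.get(cat, 0) / max(len(cases), 1)
        if abs(actual - target) > tolerance:
            print(f"{cat}: {actual:.0%} of cases vs. {target:.0%} target")
```

Running `check_composition` whenever cases are added keeps the set from quietly drifting toward whatever’s easiest to write.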
Don’t trust LLM-judge scores in absolute terms
LLM-judge scores drift. A judge that gave 75/100 on a query last month might give 80/100 this month for the same output, simply because the judge model was updated or the prompt was tweaked. This means absolute scores from an LLM-judge are mostly meaningless. Relative scores — comparing two systems against the same judge with the same prompt — are useful.
The way we run LLM-judge evals: pin the judge model, version the rubric, and always evaluate “is the new output better than, worse than, or the same as the previous output for this case?” That comparison is robust. The 0–100 score on a single output is not.
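Here’s roughly what that looks like with the Anthropic Python SDK. The model alias, rubric text, and one-word verdict format are all our choices, not a standard; what matters is that the model is pinned, the rubric is versioned, and the question is pairwise:

```python
import anthropic

JUDGE_MODEL = "claude-3-5-haiku-latest"  # pinned; change deliberately, not by drift
RUBRIC = "Judge helpfulness, factual accuracy, and safety."  # versioned with the code

client = anthropic.Anthropic()

def compare(query: str, old_output: str, new_output: str) -> str:
    """Return the judge's relative verdict on the new output: BETTER, WORSE, or SAME."""
    prompt = (
        f"{RUBRIC}\n\nQuery: {query}\n\n"
        f"Answer A: {old_output}\n\nAnswer B: {new_output}\n\n"
        "Is Answer B BETTER than, WORSE than, or the SAME as Answer A? "
        "Reply with exactly one word: BETTER, WORSE, or SAME."
    )
    resp = client.messages.create(
        model=JUDGE_MODEL, max_tokens=5,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text.strip().upper()
```

In practice you’d also randomize which output appears as Answer A and which as Answer B, since LLM judges show position bias.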
For binary correctness questions, you can avoid this entirely. “Did this answer cite the right source?” is a yes/no. “Did this code compile?” is a yes/no. “Did the agent refuse to do the harmful thing?” is a yes/no. Binary judgments from LLM-judges are dramatically more reliable than fine-grained scores.
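A binary judge is even simpler; a sketch under the same SDK assumptions:

```python
import anthropic

client = anthropic.Anthropic()

def judge_binary(question: str, output: str) -> bool:
    """Ask one yes/no question about one output,
    e.g. 'Did this answer cite the right source?'"""
    resp = client.messages.create(
        model="claude-3-5-haiku-latest", max_tokens=3,
        messages=[{"role": "user", "content":
            f"{question}\n\nOutput to judge:\n{output}\n\nAnswer YES or NO only."}],
    )
    return resp.content[0].text.strip().upper().startswith("YES")
```

And when the check is mechanical, like “did this code compile?”, skip the judge entirely and run the compiler.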
CI integration is what makes this work
Evals that aren’t in CI eventually stop being run. We’ve watched it happen multiple times: a team builds a sophisticated eval suite, runs it manually for a few weeks, and then one engineer forgets, and another engineer thinks the previous engineer ran it, and within a quarter nobody is running it.
Put eval runs on every PR. The runtime needs to be reasonable — under 5 minutes for the fast suite, with the full suite running nightly. Cache aggressively. Skip evals that aren’t affected by the change (e.g., if the change is a UI tweak, you don’t need to re-run model evals).
The output should be a comment on the PR showing pass/fail rates and which specific cases changed. If the new system is worse on five cases and better on three, the engineer needs to look at those cases and decide whether the tradeoff is acceptable. If it’s worse and not better, the change doesn’t ship.
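The gating logic itself is small. A sketch, assuming each eval run writes a JSON map from case ID to pass/fail (the file names and format are ours):

```python
import json
import sys
from pathlib import Path

def gate(baseline_path: str, new_path: str) -> int:
    """Diff per-case results between the baseline run and this PR's run;
    print the summary (this becomes the PR comment) and fail on a
    strict regression."""
    baseline = json.loads(Path(baseline_path).read_text())  # {case_id: "pass"|"fail"}
    new = json.loads(Path(new_path).read_text())
    improved = [c for c in new if new[c] == "pass" and baseline.get(c) == "fail"]
    regressed = [c for c in new if new[c] == "fail" and baseline.get(c) == "pass"]
    print(f"improved ({len(improved)}): {improved}")
    print(f"regressed ({len(regressed)}): {regressed}")
    # Worse and not better: the change doesn't ship. Mixed results go to a human.
    return 1 if regressed and not improved else 0

if __name__ == "__main__":
    sys.exit(gate("eval_baseline.json", "eval_results.json"))
```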
Evaluation is product-specific
A common mistake: teams use generic benchmarks (MMLU, HumanEval) as their evaluation. These benchmarks measure real capabilities, but not the ones your product actually exercises.
Your eval set should reflect the queries your users actually send. The shortest path to a useful eval set is usually to take a representative sample of production queries, hand-label each with what a good output looks like, and use those as the reference.
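The sampling step might look like this, assuming a JSONL query log; the path and field name are placeholders:

```python
import json
import random

def sample_for_labeling(log_path="logs/queries.jsonl", n=200, seed=0):
    """Draw a reproducible sample of distinct production queries to hand-label.
    Deduplicating first keeps high-frequency queries from dominating."""
    with open(log_path) as f:
        queries = {json.loads(line)["query"] for line in f}
    random.seed(seed)
    return random.sample(sorted(queries), min(n, len(queries)))
```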
If you can’t ship to production yet (because you’re pre-launch), simulate it. Get a few people who match your target user profile to write the queries they’d send. The cost is hours; the resulting eval set is dramatically more useful than benchmarks borrowed from elsewhere.
Cost considerations
The thing teams forget about evals: they have a real ongoing cost. A 200-case eval set running on every PR, with each case making 2–3 LLM calls (the system + the judge), at $0.01–0.05 per call, is $4–30 per CI run. With 50 PRs a week, that’s $200–1500 a week — modest but not nothing.
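That arithmetic, as a function you can rerun with your own numbers:

```python
def weekly_eval_cost(cases=200, calls_per_case=(2, 3),
                     cost_per_call=(0.01, 0.05), prs_per_week=50):
    """Low and high bounds on weekly CI eval spend, in dollars."""
    lo = cases * calls_per_case[0] * cost_per_call[0] * prs_per_week
    hi = cases * calls_per_case[1] * cost_per_call[1] * prs_per_week
    return lo, hi

print(weekly_eval_cost())  # (200.0, 1500.0)
```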
The patterns that keep this manageable:
- Use Haiku or 4o-mini for the judge unless you’ve validated that you need a bigger model.
- Cache eval results that aren’t affected by the change.
- Run a fast subset (50 cases) on every PR; run the full suite nightly.
- Use prompt caching on the judge prompt if it’s stable.
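On that last point: with the Anthropic SDK, the stable rubric goes in a cache-marked system block, so repeat calls bill the rubric prefix at the cached rate and only the per-case message at full price. The model alias and rubric file are our assumptions, and note that prompts below the provider’s minimum prefix length won’t be cached:

```python
import anthropic

client = anthropic.Anthropic()
RUBRIC = open("evals/rubric_v3.txt").read()  # long, stable judge rubric

def judge_with_cached_rubric(case_text: str) -> str:
    """Only `case_text` changes between calls; the system block holding
    the rubric is marked cacheable."""
    resp = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=5,
        system=[{
            "type": "text",
            "text": RUBRIC,
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": case_text}],
    )
    return resp.content[0].text.strip()
```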
We’ve found teams routinely overspend on evals by running too-frequent, too-expensive runs. The optimization is to make evals fast and cheap enough that they run on every change, not slow and expensive enough that they don’t.
Where to start
If your team is currently doing vibes-based QA and wants to move to something real:
1. Pick 30 representative production queries. Hand-label them. This is your first eval set.
2. Write 5 binary judges that capture the things you care about (correctness, safety, citation accuracy, refusal-when-appropriate).
3. Wire the eval suite to CI so it runs on every PR.
4. Run it for a month. Add cases for every regression you ship.
5. Once the suite is stable, start adding LLM-judge graded scores for nuanced quality dimensions.
Steps 1–3 alone catch most regressions. The rest is optimization. Don’t let perfect be the enemy of good: a 30-case eval set that runs on every PR is dramatically better than the best eval set in the world that runs once a quarter.
If you’re operating production AI without a real eval system and want help setting one up, we do this work.