Most AI products we audit are paying somewhere between 5× and 10× what they need to be paying. The reasons are remarkably consistent across companies — same handful of patterns, same handful of fixes, very few of which are surprising once you see them.

This is the cost-optimization checklist we walk through with clients. It’s ordered by leverage: the early items are usually the ones that move the bill the most.

Prompt caching is free money

If you’re using Anthropic’s API and not using prompt caching, you’re paying full price for tokens that should be costing you 10% of that. The mechanism: cached prompt prefixes get charged at roughly 10% of the normal input rate on subsequent requests within the cache window.

The pattern that works: identify the parts of your prompts that don’t change between requests — system instructions, document context for RAG, few-shot examples — and structure them to be cacheable. The user’s actual question goes at the end. If your prompt structure is “system → context → user”, with stable system and context, the cache hit rate on production traffic is typically 80%+ once you’ve been running long enough to warm caches.
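A minimal sketch of what that looks like with the Anthropic Python SDK (the model name, system prompt, and context string here are placeholders, not recommendations):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Stable across requests: system instructions and retrieved document context.
SYSTEM_PROMPT = "You are a support assistant for Acme. Answer from the provided context."
RAG_CONTEXT = "<the document context that stays the same between requests>"

def answer(user_question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=500,
        system=[
            {"type": "text", "text": SYSTEM_PROMPT},
            # cache_control marks the end of the stable prefix; on cache hits this
            # prefix is billed at the reduced cache-read rate.
            {"type": "text", "text": RAG_CONTEXT, "cache_control": {"type": "ephemeral"}},
        ],
        # Only the user question changes request to request, and it comes last.
        messages=[{"role": "user", "content": user_question}],
    )
    return response.content[0].text
```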

We’ve seen client bills drop by 60% in a week from prompt-caching alone. The engineering effort is hours, not days.

OpenAI has comparable functionality (applied automatically to long prompts since late 2024, though the discount rate differs). Google’s Gemini has implicit caching. The takeaway is the same: structure your prompts for cache friendliness, and your bill drops.

Stop sending everything to the biggest model

The next pattern we see: every request, regardless of complexity, is going to GPT-5 or Claude Opus. The team picked the best model in development and never revisited it.

Most production traffic doesn’t need the best model. A lot of traffic — classification, simple extractions, content moderation, query rewriting, intent detection — works perfectly well on a small, fast, cheap model (Haiku, GPT-5-mini, Gemini Flash). Some of it is better on the smaller model because the smaller model is fast and consistent.

Build a router. The router is itself a small classifier (rules, embeddings + cosine similarity to known query types, or a small LLM call) that picks the right model for each request. Easy queries go to Haiku at $0.80/MTok input. Hard queries go to Opus at $15/MTok input. That is nearly a 19× cost ratio. Even if only half your traffic moves to the small model, your weighted average cost roughly halves.
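A rules-first version is often enough to start. Here is a sketch; the model names, patterns, and thresholds are illustrative assumptions, not measured cutoffs:

```python
import re

CHEAP_MODEL = "claude-3-5-haiku-latest"     # illustrative model identifiers
EXPENSIVE_MODEL = "claude-opus-4-20250514"

# Query shapes this hypothetical product knows are easy: classification,
# intent detection, query rewriting, short extractions.
EASY_PATTERNS = [
    r"\b(classify|categori[sz]e|tag|moderate)\b",
    r"\bwhich (category|intent|label)\b",
    r"\brewrite (this|the) (query|question)\b",
]

def pick_model(query: str, context_tokens: int) -> str:
    """Send a request to the cheapest model that can plausibly handle it."""
    if any(re.search(p, query, re.IGNORECASE) for p in EASY_PATTERNS):
        return CHEAP_MODEL
    if len(query) < 200 and context_tokens < 1_000:  # short, low-context requests
        return CHEAP_MODEL
    return EXPENSIVE_MODEL
```

Log the router’s decisions alongside quality signals so the thresholds can move with data rather than guesses.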

We typically aim for 70–80% of production traffic running on the smallest reasonable model. Most teams hit this target and discover quality is unchanged or improved (because the small model is faster and the user experience is better).

Output tokens are 5× the cost of input

Pricing is usually asymmetric: output tokens cost roughly 5× as much as input tokens. So a prompt that produces a long answer is dramatically more expensive than one that produces a short one.

This becomes a design decision in the product. Are we asking the model for a paragraph when a sentence would do? Are we letting the model emit chain-of-thought tokens that the user never sees? Are we using JSON output with a schema that includes optional fields the model fills in by default?

Three concrete moves that always pay off (a sketch follows the list):

  • Set max_tokens aggressively. Half of all production requests we audit have max_tokens set to 4096 or higher with no business reason. Cut it to what you actually need.
  • Use structured output formats. A JSON schema restricted to required fields forces the model to emit only what’s needed; optional fields tend to get filled in whether or not anyone reads them, so leave them out.
  • For RAG, generate concise answers. The user doesn’t need the model to repeat the retrieved context back to them. Tell the model “in 100 words” and watch the bill drop.
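A sketch of these moves together, again with the Anthropic Python SDK; the field names, schema, and model name are illustrative:

```python
import json
import anthropic

client = anthropic.Anthropic()

# A schema with only required fields and no room for padding.
# additionalProperties: false keeps the model from inventing extra keys.
SCHEMA = json.dumps({
    "type": "object",
    "properties": {
        "answer": {"type": "string", "description": "At most 100 words."},
        "confidence": {"type": "string", "enum": ["high", "medium", "low"]},
    },
    "required": ["answer", "confidence"],
    "additionalProperties": False,
})

response = client.messages.create(
    model="claude-3-5-haiku-latest",   # illustrative
    max_tokens=300,                    # a deliberate budget, not the default 4096
    system=f"Respond with JSON matching this schema and nothing else:\n{SCHEMA}",
    messages=[{"role": "user", "content": "Summarize the retrieved context in 100 words or less."}],
)
print(response.content[0].text)
```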

Batch inference for non-realtime work

Most product traffic is real-time and pays real-time rates. But a surprising amount of AI workload doesn’t actually need a sub-second response — embedding generation, document processing pipelines, async summarization, eval grading, content moderation on user-generated content.

Anthropic and OpenAI both offer batch APIs at roughly 50% off the real-time rate. The catch is that completion takes up to 24 hours. For workloads where that’s acceptable, you cut the cost in half by definition.

The pattern we use: anything in your system that runs on a queue should be evaluated for batch eligibility. If the consumer of the output can wait an hour, it can wait 24 hours. The cost saving is automatic.
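As an illustration, here is roughly what moving a queued summarization job onto Anthropic’s Message Batches API looks like (the queue contents and model name are hypothetical; check the SDK docs for exact request types):

```python
import anthropic

client = anthropic.Anthropic()

# Whatever is sitting on your async queue: doc id -> text.
queued_docs = {
    "doc-001": "first document text ...",
    "doc-002": "second document text ...",
}

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": doc_id,
            "params": {
                "model": "claude-3-5-haiku-latest",  # illustrative
                "max_tokens": 300,
                "messages": [{"role": "user", "content": f"Summarize:\n\n{text}"}],
            },
        }
        for doc_id, text in queued_docs.items()
    ]
)

# Results arrive within the 24-hour window at roughly half the real-time rate;
# poll client.messages.batches.retrieve(batch.id) and fetch results when it completes.
print(batch.id)
```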

Embedding caches are nearly free wins

If you’re computing embeddings repeatedly for the same content (or near-duplicate content), you’re spending money on a deterministic function. Cache the result.

The cache key is a hash of the input text. The cache value is the embedding vector. When a new document or query comes in, hash first, look up, generate only on miss. For production RAG systems, hit rates above 90% are common — the same documents get re-embedded across re-indexing runs, the same user queries get repeated.
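A minimal sketch of the lookup, with a plain dict standing in for whatever store you actually use (Redis, a database table); embed_fn is your existing embedding call:

```python
import hashlib

_embedding_cache: dict[str, list[float]] = {}  # stand-in for a persistent store

def cached_embedding(text: str, embed_fn) -> list[float]:
    """Return the embedding for text, paying for the API call only on a cache miss."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(text)
    return _embedding_cache[key]
```

Normalizing the text before hashing (whitespace, casing) also collapses near-duplicates into one entry, at the cost of slightly less exact keys.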

Same applies to cached LLM responses for FAQ-style traffic. If 30% of your support bot’s queries are some variant of “how do I reset my password”, you should be answering that from a cache, not the LLM, on most invocations.

Streaming reduces perceived latency without reducing actual cost

This is a subtle one. Streaming doesn’t reduce your bill. It reduces the time-to-first-token, which improves user experience, which reduces the rate at which users retry the request because they thought it was hung. Retries are a hidden cost that compounds.

Streaming costs the same as non-streaming. Use it everywhere the user is waiting for output. The retry rate drop alone is meaningful.
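With the Anthropic Python SDK it is a small change to the call shape; the model name and prompt here are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

# Same tokens, same bill, but the first words reach the user immediately.
with client.messages.stream(
    model="claude-3-5-haiku-latest",  # illustrative
    max_tokens=300,
    messages=[{"role": "user", "content": "How do I reset my password?"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # forward each chunk to the UI as it arrives
```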

Eval-driven optimization

The last and most important piece: you can’t optimize what you can’t measure. Without an eval suite, every cost optimization is a leap of faith. With one, you can confidently make changes and verify quality didn’t drop.

Build an eval set of 100–200 representative production queries with known good outputs. Run it against your current setup. Run it again after every cost-cutting change. Anything that improves cost without dropping eval scores ships. Anything that drops eval scores either gets reverted or comes with an explicit, documented quality tradeoff.
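A bare-bones harness is enough. A sketch, where answer() stands in for your pipeline and score() for whatever grader you use (exact match, a rubric, or an LLM judge); both are assumptions rather than prescriptions:

```python
import json
from typing import Callable

def run_evals(eval_path: str, answer: Callable[[str], str],
              score: Callable[[str, str], float]) -> float:
    """Run the pipeline over a fixed eval set and return the mean score."""
    with open(eval_path) as f:
        # One JSON object per line: {"query": ..., "expected": ...}
        cases = [json.loads(line) for line in f]
    scores = [score(answer(case["query"]), case["expected"]) for case in cases]
    return sum(scores) / len(scores)

# Run before and after every cost change; ship only if the score holds.
# baseline = run_evals("evals.jsonl", answer=current_pipeline, score=grader)
```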

We’ve watched teams chase a 20% cost saving and lose 5% eval accuracy without realizing it. The eval suite catches this in development, not in customer support tickets three weeks later.

The order to attack

If you’re starting from scratch on an existing AI product, the order we’d suggest:

  1. Eval suite first (a day or two). Without it, none of the other steps are safe.
  2. Prompt caching (a few hours). Highest ROI per hour of any item on this list.
  3. Output token limits (an hour). Audit max_tokens across the codebase, tighten everywhere.
  4. Model routing (a few days). Build a tiered routing system, move easy traffic to small models.
  5. Batch APIs for async work (a day). Audit what queues exist, move eligible ones to batch.
  6. Embedding caches (a day). For RAG-heavy products especially.
  7. Output structure tightening (ongoing). Look for places the model is emitting more than the product needs.

The compounding effect is significant. Each change saves 20–60% on some slice of traffic. Done together, getting from “bill is too high” to “bill is reasonable” usually takes a single sprint.

If you’re paying for AI inference and want a structured cost audit, we run this exercise frequently.