At a few hundred queries per day, inference cost is invisible. At 100,000 queries per day, it is a line item your CFO has opinions about. We have driven LLM inference costs down 60–80% for clients without degrading output quality — sometimes improving it. Here is exactly how.
Where the money actually goes
Before optimizing, instrument. Most teams assume their most expensive path is their complex reasoning queries. The actual cost breakdown is usually surprising: classification and routing calls — typically simple single-turn prompts — aggregate to 40–60% of total token spend because they run on every request. The expensive multi-step reasoning queries run on 5% of traffic.
- Profile token usage per request type, not per feature
- Track cost per user action, not just API spend totals
- Separate input cost from output cost — they differ by 3–5x per token on most frontier models
Optimization 1: Model routing
Route simple tasks — classification, extraction, yes/no decisions, summarization of short texts — to smaller, cheaper models. Route complex reasoning, multi-document synthesis, and open-ended generation to frontier models. The routing logic itself is a cheap classifier call.
In practice: a query classification call costs ~$0.0001 with Haiku or GPT-4o-mini. That classifier routes 70% of traffic to a model 10x cheaper than GPT-4o. The net saving on those queries is ~90%, net cost of classification overhead is ~3%. The math works.
ROUTING_PROMPT = """Classify this request as: simple | complex
simple = extraction, classification, yes/no, short summary (<500 tokens output)
complex = multi-step reasoning, synthesis, generation (>500 tokens output)
Request: {query}
Output only: simple or complex"""
async def route(query: str) -> str:
resp = await cheap_model.complete(ROUTING_PROMPT.format(query=query))
return resp.text.strip() # "simple" | "complex"
async def execute(query: str, ctx: Context) -> str:
model = CHEAP_MODEL if await route(query) == "simple" else FRONTIER_MODEL
return await model.complete(query, ctx)Optimization 2: Semantic caching
Exact cache hits are rare in LLM systems. Semantic caches match queries by embedding similarity — if a new query is within cosine distance 0.05 of a cached query, return the cached response. For enterprise knowledge bases and internal tooling, where many users ask semantically identical questions with different phrasing, cache hit rates of 20–40% are achievable.
Optimization 3: Prompt compression
RAG system prompts carry retrieved context that is often 60–80% of total tokens. LLMLingua and similar prompt compression models remove redundant tokens from context while preserving the information the LLM needs. In our implementations, 4:1 compression with <2% answer quality degradation is reliably achievable. At high volume this alone cuts 30–40% of total cost.
Optimization 4: Async batching
Background processing jobs — document indexing, nightly summarization, batch analysis — do not need to run as individual real-time requests. Batch API endpoints (Anthropic, OpenAI) offer 50% cost reduction for async workloads. If your background pipeline is using synchronous inference, you are paying twice as much as necessary.
Optimization 5: Output token discipline
Output tokens cost more than input tokens and are the most controllable variable in your system. Add explicit length instructions to every prompt where output length matters. 'Respond in 2–3 sentences' is not a style preference — it is a cost control. Add max_tokens limits at the API level as a hard ceiling. Log output token distribution per prompt type and add alerts when p95 output length increases significantly.
What not to do
- Do not sacrifice accuracy for cost on customer-facing flows — the customer churn from wrong answers costs more than the model bill
- Do not compress system prompts below the point where the model follows instructions reliably
- Do not build a custom caching layer when a vector store you already have supports ANN search
- Do not optimize prematurely — instrument first, then optimize the top 20% of cost drivers