All articles
Cost Engineering8 minApr 14, 2026

LLM Cost Optimization at Enterprise Scale: A Practical Guide

When you're processing millions of tokens per day, inference cost becomes an engineering constraint. These patterns cut LLM costs 60–80%.

At a few hundred queries per day, inference cost is invisible. At 100,000 queries per day, it is a line item your CFO has opinions about. We have driven LLM inference costs down 60–80% for clients without degrading output quality — sometimes improving it. Here is exactly how.

Where the money actually goes

Before optimizing, instrument. Most teams assume their most expensive path is their complex reasoning queries. The actual cost breakdown is usually surprising: classification and routing calls — typically simple single-turn prompts — aggregate to 40–60% of total token spend because they run on every request. The expensive multi-step reasoning queries run on 5% of traffic.

  • Profile token usage per request type, not per feature
  • Track cost per user action, not just API spend totals
  • Separate input cost from output cost — they differ by 3–5x per token on most frontier models

Optimization 1: Model routing

Route simple tasks — classification, extraction, yes/no decisions, summarization of short texts — to smaller, cheaper models. Route complex reasoning, multi-document synthesis, and open-ended generation to frontier models. The routing logic itself is a cheap classifier call.

In practice: a query classification call costs ~$0.0001 with Haiku or GPT-4o-mini. That classifier routes 70% of traffic to a model 10x cheaper than GPT-4o. The net saving on those queries is ~90%, net cost of classification overhead is ~3%. The math works.

python
ROUTING_PROMPT = """Classify this request as: simple | complex
simple = extraction, classification, yes/no, short summary (<500 tokens output)
complex = multi-step reasoning, synthesis, generation (>500 tokens output)

Request: {query}
Output only: simple or complex"""

async def route(query: str) -> str:
    resp = await cheap_model.complete(ROUTING_PROMPT.format(query=query))
    return resp.text.strip()  # "simple" | "complex"

async def execute(query: str, ctx: Context) -> str:
    model = CHEAP_MODEL if await route(query) == "simple" else FRONTIER_MODEL
    return await model.complete(query, ctx)

Optimization 2: Semantic caching

Exact cache hits are rare in LLM systems. Semantic caches match queries by embedding similarity — if a new query is within cosine distance 0.05 of a cached query, return the cached response. For enterprise knowledge bases and internal tooling, where many users ask semantically identical questions with different phrasing, cache hit rates of 20–40% are achievable.

Optimization 3: Prompt compression

RAG system prompts carry retrieved context that is often 60–80% of total tokens. LLMLingua and similar prompt compression models remove redundant tokens from context while preserving the information the LLM needs. In our implementations, 4:1 compression with <2% answer quality degradation is reliably achievable. At high volume this alone cuts 30–40% of total cost.

Optimization 4: Async batching

Background processing jobs — document indexing, nightly summarization, batch analysis — do not need to run as individual real-time requests. Batch API endpoints (Anthropic, OpenAI) offer 50% cost reduction for async workloads. If your background pipeline is using synchronous inference, you are paying twice as much as necessary.

Optimization 5: Output token discipline

Output tokens cost more than input tokens and are the most controllable variable in your system. Add explicit length instructions to every prompt where output length matters. 'Respond in 2–3 sentences' is not a style preference — it is a cost control. Add max_tokens limits at the API level as a hard ceiling. Log output token distribution per prompt type and add alerts when p95 output length increases significantly.

What not to do

  • Do not sacrifice accuracy for cost on customer-facing flows — the customer churn from wrong answers costs more than the model bill
  • Do not compress system prompts below the point where the model follows instructions reliably
  • Do not build a custom caching layer when a vector store you already have supports ANN search
  • Do not optimize prematurely — instrument first, then optimize the top 20% of cost drivers

Built something like this? We can help.

These patterns come from real production systems.

Start a conversation