Every team we talk to has built a RAG prototype. It works on the ten documents in the demo, answers questions fluently, and impresses the stakeholder. Three months later it is serving 8,000 queries a day and answering 30% of them incorrectly. The gap between a working prototype and an enterprise-grade RAG system is an engineering discipline problem, not a model problem.
We use a six-level maturity model (Level 0 through Level 5) internally to assess RAG implementations quickly. Each level represents a meaningful capability step, and each has characteristic failure modes that show a system has reached that level but not yet moved past it.
Level 0 — Naive retrieval
Split documents into fixed-size chunks (512 tokens, with a 64-token overlap). Embed with OpenAI's text-embedding-ada-002. Retrieve top-k by cosine similarity. Pass the results to GPT-4. This works for demos. It fails in production because:
- Fixed-size chunking splits semantic units arbitrarily: a contract clause that straddles a chunk boundary is cut in two
- ada-002 embeddings conflate semantic similarity with topical overlap — 'cat' and 'feline' score high, but so do documents that use the same keywords for different concepts
- Top-k retrieval has no quality floor — it returns the k least-wrong documents regardless of whether any of them are actually relevant
- No metadata filtering means a query about Q3 2024 revenue retrieves Q1 2022 documents
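To make the first and third failure modes concrete, here is a minimal sketch of the Level 0 mechanics. The function names (`fixed_size_chunks`, `cosine`, `naive_top_k`) are ours for illustration, tokens are modeled as plain strings, and the vectors would come from whatever embedding model you use.

```python
import math


def fixed_size_chunks(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Slide a `size`-token window over the document, stepping by `size - overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        # The cut point depends only on the token count, never on sentence
        # or clause boundaries.
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def naive_top_k(
    query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 3
) -> list[tuple[str, float]]:
    """Return the k highest-scoring chunk ids. There is deliberately no minimum score."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Because `naive_top_k` always returns k items when k chunks exist, a query with only one genuinely similar chunk still hands the LLM k - 1 passages of noise.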
Level 1 — Structured chunking + metadata
Replace fixed-size chunking with semantic chunking (sentence boundaries, markdown heading boundaries, paragraph boundaries). Extract and store structured metadata — document type, date, author, section, source system — as filterable fields alongside the vector. Pre-filter by metadata before semantic retrieval.
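As a rough sketch of what this looks like for markdown sources, the snippet below splits on headings and paragraph breaks and attaches metadata to each chunk. The `Chunk` dataclass, the `chunk_by_headings` and `prefilter` helpers, and the metadata keys are illustrative assumptions; in a real system the filter would be pushed down into the vector store's own metadata query rather than run in Python.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_by_headings(markdown: str, base_metadata: dict) -> list[Chunk]:
    """Split on blank lines; track the current markdown heading as `section`."""
    chunks: list[Chunk] = []
    section = ""
    for block in re.split(r"\n\s*\n", markdown.strip()):
        lines = block.strip().splitlines()
        if not lines:
            continue
        heading = re.match(r"#{1,6}\s+(.*)", lines[0])
        if heading:
            # A heading starts a new section; any text on following lines
            # of the same block belongs to that section.
            section = heading.group(1).strip()
            lines = lines[1:]
        text = "\n".join(lines).strip()
        if text:
            chunks.append(Chunk(text, {**base_metadata, "section": section}))
    return chunks


def prefilter(chunks: list[Chunk], **required) -> list[Chunk]:
    """Keep only chunks whose metadata matches every required key/value."""
    return [c for c in chunks if all(c.metadata.get(k) == v for k, v in required.items())]
```

Calling `prefilter(chunks, period="2024-Q3")` before the vector search is what keeps a Q3 2024 revenue question from ever seeing Q1 2022 chunks.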
Level 2 — Hybrid retrieval
Add BM25 (keyword) retrieval alongside dense vector retrieval. Fuse results with Reciprocal Rank Fusion. For enterprise knowledge bases, sparse retrieval dramatically outperforms dense retrieval for exact-match queries — product codes, names, regulatory references, numeric identifiers — which constitute a significant fraction of real enterprise queries.
```python
def hybrid_retrieve(query: str, k: int = 20) -> list[Chunk]:
    """Fuse dense and sparse results with Reciprocal Rank Fusion (RRF).

    `embed`, `vector_store`, `bm25_index`, `chunk_store`, and `Chunk` are
    assumed to be provided by the surrounding application.
    """
    # Dense retrieval
    q_emb = embed(query)
    dense = vector_store.search(q_emb, k=k)
    # Sparse retrieval
    sparse = bm25_index.search(query, k=k)
    # Reciprocal Rank Fusion: each list contributes 1 / (60 + rank),
    # with ranks starting at 1 as in the standard formulation
    scores: dict[str, float] = {}
    for results in (dense, sparse):
        for rank, doc in enumerate(results, start=1):
            scores[doc.id] = scores.get(doc.id, 0.0) + 1 / (60 + rank)
    fused = sorted(scores, key=scores.__getitem__, reverse=True)
    return [chunk_store[doc_id] for doc_id in fused[:k]]
```

Level 3 — Reranking + quality filtering
Run a cross-encoder reranker (Cohere Rerank, BGE-Reranker, or a cross-encoder you have fine-tuned yourself) over the top-k fused results. Drop chunks that score below a relevance threshold before passing the rest to the LLM. This is the first level at which retrieval quality becomes measurably robust: the LLM context is high-precision rather than high-recall noise.
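The rerank-then-filter step can be sketched independently of any particular reranker. In the snippet below, `score_pair` stands in for whatever cross-encoder you call (it is a hypothetical callable, not a real library API), and `toy_overlap_score` is a deliberately crude token-overlap scorer included only so the example runs on its own. The threshold of 0.5 is an arbitrary placeholder that would need tuning against labeled data.

```python
from typing import Callable, Sequence


def rerank_and_filter(
    query: str,
    candidates: Sequence[str],
    score_pair: Callable[[str, str], float],
    min_score: float = 0.5,
    top_n: int = 5,
) -> list[tuple[str, float]]:
    """Score every (query, passage) pair, drop weak passages, return the best top_n."""
    scored = [(text, score_pair(query, text)) for text in candidates]
    kept = [(text, score) for text, score in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_n]


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query tokens found in the passage.

    A real system would call a cross-encoder model here instead.
    """
    q_tokens = set(query.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / len(q_tokens) if q_tokens else 0.0
```

Unlike plain top-k, this can return an empty list. That is a feature: an empty context lets the application say "I don't know" instead of forcing the LLM to answer from irrelevant passages.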
Level 4 — Query intelligence
Multi-step queries that require reasoning over multiple retrieved passages fail with single-pass retrieval. Level 4 adds query decomposition, sub-question generation, and iterative retrieval. The system asks itself: 'what intermediate facts do I need to answer this question?' and retrieves them in sequence before synthesizing.
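One way to picture the control flow is the sketch below. The three callables (`decompose`, `retrieve`, `synthesize`) are hypothetical stand-ins: in practice the first and last would be LLM calls and the middle one your Level 3 retriever. This is only an outline of the loop under those assumptions, not a full agent.

```python
from typing import Callable


def answer_multi_hop(
    question: str,
    decompose: Callable[[str], list[str]],
    retrieve: Callable[[str], list[str]],
    synthesize: Callable[[str, list[str]], str],
    max_steps: int = 4,
) -> str:
    """Break the question into sub-questions, retrieve for each in order, then synthesize."""
    facts: list[str] = []
    for sub_question in decompose(question)[:max_steps]:
        # Append the facts gathered so far, so a later hop can build on
        # entities discovered in an earlier one.
        retrieval_query = " ".join([sub_question, *facts])
        for passage in retrieve(retrieval_query):
            if passage not in facts:
                facts.append(passage)
    return synthesize(question, facts)
```

The `max_steps` cap is a design choice worth keeping: without it, a bad decomposition can turn one user query into an unbounded number of retrieval calls.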
Level 5 — Evaluation loop
The gap between Level 4 and Level 5 is not a technical one — it is an operational one. A Level 5 RAG system has: a ground-truth evaluation dataset (300+ labeled query-answer pairs, representative of production distribution), automated metrics running on every deploy (RAGAS faithfulness, answer relevance, context recall), and a feedback loop from production queries to evaluation data.
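A minimal version of the per-deploy check might look like the sketch below. It computes a simplified, ID-based context recall over a labeled set and gates the deploy on the mean. This is a stand-in for illustration, not the RAGAS implementation (RAGAS metrics are LLM-judged), and `EvalCase`, `retrieve_ids`, and the 0.8 threshold are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    query: str
    relevant_ids: set[str]  # chunk ids a human labeled as necessary to answer


def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of the labeled-relevant chunks that retrieval actually returned."""
    if not relevant_ids:
        return 1.0
    return len(set(retrieved_ids) & relevant_ids) / len(relevant_ids)


def evaluate(
    cases: list[EvalCase], retrieve_ids: Callable[[str], list[str]]
) -> tuple[float, list[EvalCase]]:
    """Return mean recall plus the cases that missed at least one relevant chunk."""
    if not cases:
        raise ValueError("evaluation set is empty")
    scores = [context_recall(retrieve_ids(c.query), c.relevant_ids) for c in cases]
    failures = [case for case, score in zip(cases, scores) if score < 1.0]
    return sum(scores) / len(scores), failures


def deploy_gate(mean_recall: float, min_recall: float = 0.8) -> bool:
    """True if the build may ship; wire this into CI so regressions block the deploy."""
    return mean_recall >= min_recall
```

The `failures` list is the start of the feedback loop: each failing case is a concrete query to inspect, and production queries that users flag as wrong can be labeled and appended to `cases`.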
Most enterprise RAG systems never reach Level 5. They ship at Level 2 or 3, accuracy degrades as the corpus grows, and the team has no instrumentation to detect or diagnose the regression. The evaluation loop is not a nice-to-have — it is what separates a prototype that shipped from a system that works.