RAG in production is mostly retrieval, not generation

Many teams shipping retrieval-augmented generation today started from the same five-line tutorial: chunk your documents, embed them into a vector store, embed the user’s question, fetch the nearest hits, hand them to an LLM. It works beautifully in a notebook. It also has almost nothing to do with what a real RAG product looks like once it has users.

In production, the LLM on top is the cheapest and most interchangeable piece of the system. What actually decides whether the product is useful — whether it finds the right document, ignores the wrong ones, respects who is allowed to see what, and stays current as the underlying data changes — is the retrieval layer. And the retrieval layer is not “AI.” It is a piece of data infrastructure: indexes, change streams, derived tables, sharding, eval.

flowchart LR
    Q[User query] --> EMB[Embed]
    EMB --> ANN[Vector DB<br/>top-k]
    ANN --> P[Prompt]
    P --> LLM[LLM]
    LLM --> A[Answer]

    style ANN stroke:#c62828,stroke-width:3px
    style LLM stroke-dasharray: 4 4

The red box is where almost all the quality lives. The dashed box is what most teams spend their time on.

This post walks through what each piece of a real RAG system actually does, why the naive version breaks the moment you point it at a real corpus, and how every piece maps onto patterns that backend and data engineers already know. The audience I have in mind is a working software engineer — you do not need an ML background, and any piece of infrastructure jargon I lean on (CDC, sharding, inverted indexes, materialized views) gets a one-line gloss the first time it shows up. The argument I want to make is that RAG is much more of a search-and-data-infrastructure problem than an “AI” problem.

Quick vocabulary, anchored to things you already know

  • Embedding — a function that takes a piece of text and turns it into an array of numbers. The array always has the same length, no matter whether the input is one word or one page. Something like:

    embed("annual fees")
    # -> [ 0.0123, -0.2450, 0.0871, ..., -0.0418]   # always exactly 1024 numbers
    embed("yearly charges")
    # -> [ 0.0119, -0.2401, 0.0863, ..., -0.0405]   # also 1024 numbers; very close to the one above
    

    Texts that mean similar things produce arrays that are close to each other (small numeric differences). Texts about unrelated things produce arrays that are far apart. Think of it as a learned hash function — except instead of avalanching similar inputs to random outputs the way a normal hash does, this one deliberately puts similar inputs near each other.

  • Vector — just the name for one of those arrays of numbers. A 1024-dim vector is an array of 1024 numbers. Nothing more exotic than that.
  • Vector database — a database whose superpower is one specific query: “given this vector, find the stored vectors most similar to it.” Plays the same architectural role that a B-tree plays in a relational DB — just over arrays of numbers instead of sortable columns. Many vector DBs are layered on top of engines you already know: pgvector is a Postgres extension; Qdrant uses RocksDB underneath; Pinecone is custom. From the outside they look like a normal CRUD API.
  • Top-k — the search asks for the k closest matches and nothing more. k=10 means “give me the 10 nearest neighbors.” It is a knob, not a concept.
  • ANN (Approximate Nearest Neighbor) — brute-force comparing the query against all 50M stored vectors would be O(N) per query, which falls over at scale. ANN trades a small amount of accuracy for much faster search (sub-linear in N). Same kind of trade as any other database index: pay some storage and write cost to avoid a full scan.
  • HNSW (Hierarchical Navigable Small World) — the ANN algorithm running inside almost every vector DB. The data is stored as a graph: each vector has edges pointing to a handful of its closest neighbors. A query starts at some entry point in the graph and walks greedily toward closer and closer neighbors until it cannot improve. Roughly log(N) hops to converge. Same intuition as a skip list or a routing table — fan out from anywhere, converge quickly to the target.
  • Recall@k — your accuracy metric. Strictly: of the documents that should have come back for a given query (judged by a human or a labeled set), what fraction actually made it into your top-k? This post uses the simpler per-query version, sometimes called hit rate: recall@10 = 0.9 means that for 90% of queries, at least one truly correct document was in the top-10. The two coincide when each query has exactly one correct document. This is the cache-hit-rate of search — the number you optimize.
  • Eval set — a list of (query, [correct_doc_ids]) pairs you’ve curated ahead of time. You run your pipeline against each query and check how often the correct doc appears in the top-k. It plays the same role as a regression test suite. Without one, you cannot tell whether a change helped or hurt — you are guessing.
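To make the vocabulary concrete, here is a minimal sketch in plain Python: exact (brute-force) top-k by cosine similarity, plus recall@k in the per-query sense over a tiny eval set. The 3-dim vectors, document ids, and eval pairs are invented for illustration; a real system would get its vectors from an embedding model and replace the O(N) scan with an ANN index.

```python
import math

def normalize(v):
    """Scale a vector to length 1 so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query_vec, store, k):
    """Exact nearest neighbors: score every stored vector, keep the k best.
    This is the O(N) full scan that ANN indexes exist to avoid."""
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), doc_id)
        for doc_id, vec in store.items()
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

def recall_at_k(eval_set, search, k):
    """Fraction of queries with at least one correct doc in the top-k."""
    hits = sum(
        1 for query_vec, correct_ids in eval_set
        if set(search(query_vec, k)) & set(correct_ids)
    )
    return hits / len(eval_set)

# Toy "embeddings" (hypothetical values; real ones have hundreds of dims).
store = {
    "fees":    [0.9, 0.1, 0.0],
    "refunds": [0.1, 0.9, 0.0],
    "sso":     [0.0, 0.1, 0.9],
}

# Eval set: (query vector, [correct_doc_ids]) pairs, curated ahead of time.
eval_set = [
    ([0.8, 0.2, 0.0], ["fees"]),
    ([0.0, 0.2, 0.8], ["sso"]),
    ([0.6, 0.5, 0.1], ["refunds"]),  # lands closer to "fees" than to its target
]

def search(query_vec, k):
    return top_k(query_vec, store, k)

print("recall@1:", recall_at_k(eval_set, search, k=1))
print("recall@2:", recall_at_k(eval_set, search, k=2))
```

The third query is the interesting one: its correct document only shows up at rank 2, so it counts as a miss at k=1 and a hit at k=2. That is the whole reason k is a knob.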

With those in hand, here is where things go wrong.

Where the naive pipeline breaks

On 3,000 clean Markdown pages, the naive pipeline above gets roughly 75% of its answers rated acceptable on a hand-written eval. Point it at a 4-million-document enterprise SharePoint and you’ll hit every one of these failures in the first week. They are concrete; this is what actually happens:

  • The right document is in the corpus but never makes it into the top-k. A user asks “what is our refund window for enterprise customers?” The answer is one sentence buried in a 90-page contract PDF. After chunking, that sentence ends up in a 512-token chunk that also contains 400 tokens of legal boilerplate. When the embedder averages the meaning of all 512 tokens into one vector, the boilerplate dominates. The query’s embedding doesn’t land close to that chunk. The document exists; the search just can’t find it.
  • The top-10 is ten near-duplicate copies of the same policy from five org units. Engineering, Sales, HR, Legal, and Finance each have their own copy of the refund-policy doc with two-word differences. All ten slots get eaten by the same content; the one chunk that would have actually answered the question is at rank 11.
  • Multi-turn falls apart. The user asks “and for enterprise customers?” as a follow-up. Embedded literally, that phrase has no semantic content — it could be about anything — and lands somewhere random in vector space.
  • Permissions leak. A user from team A submits a query, the vector search ignores who’s asking, and returns a confidential doc that belongs to team B. The LLM helpfully quotes from it.
  • Quality regresses and nobody can tell why. Users complain. Is it the embedder? The chunking? The prompt? The model? Without per-stage metrics, you cannot localize the regression and you cannot fix it.

Every one of these is a retrieval problem. Swapping GPT-4o-mini for Claude Opus doesn’t move any of them.

Chunking: deciding what a row in the index even is

A single row in the vector DB is one chunk of text plus its embedding. You don’t store whole documents — embedding a 50-page PDF as one vector smears all of its meaning into a single point and matches nothing well. So you split.

How you split looks like a config knob — every tutorial just shows chunk_size=512, chunk_overlap=50 and moves on. It is actually a system design decision. The chunk is the smallest unit your search can ever return. Pick badly, and every improvement downstream has a ceiling.

The pattern that wins on long documents is parent-child chunking. Like a covering index plus a row store. The vector DB holds small chunks (256 tokens) because small chunks have sharp, focused embeddings. A separate document store holds larger parent passages (say 2,000 tokens) because the LLM needs surrounding context to actually answer the question. The two are joined by a parent_id foreign key.

flowchart TD
    D[Document<br/>~8000 tokens] --> P1[Parent chunk<br/>~2000 tokens]
    D --> P2[Parent chunk<br/>~2000 tokens]
    D --> P3[Parent chunk<br/>~2000 tokens]
    P1 --> C1[child 256t]
    P1 --> C2[child 256t]
    P1 --> C3[child 256t]
    P2 --> C4[child 256t]
    P2 --> C5[child 256t]
    P3 --> C6[child 256t]
    P3 --> C7[child 256t]

    C1 -.embed.-> V[(Vector index<br/>indexed on children)]
    C2 -.embed.-> V
    C3 -.embed.-> V
    C4 -.embed.-> V
    C5 -.embed.-> V
    C6 -.embed.-> V
    C7 -.embed.-> V

    V -. match on C4 .-> R[Return parent P2<br/>as the actual context]

    style C4 stroke:#c62828,stroke-width:2px
    style R stroke:#c62828,stroke-width:2px

The chunker writes both sides:

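def parent_child_chunks(doc, child_tokens=256, parent_tokens=2000):
    # split_by_headings / split_tokens are the splitters for this doc type
    parents = split_by_headings(doc, max_tokens=parent_tokens)
    out = []
    for i, parent in enumerate(parents):
        # Qualify with the doc id: a bare enumerate() index collides across
        # documents once hits from different sources are deduped by parent_id.
        parent_id = f"{doc.id}:{i}"
        for child in split_tokens(parent.text, child_tokens):
            out.append({
                "child_text": child,           # what gets embedded
                "parent_id": parent_id,        # foreign key into the doc store
                "parent_text": parent.text,    # written once per parent_id to doc store
                "heading_path": parent.heading_path,
                "source_id": doc.id,
            })
    return out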

Only the child gets embedded. The parent text goes into a normal key-value or row store. At query time you do an index lookup, then a join:

# 1. find the most relevant CHILDREN by embedding similarity
child_hits = vector_db.search(embed(query), top_k=10)
# [{"child_id": 4823, "parent_id": 12, "source_id": "policy.pdf"}, ...]

# 2. dedupe up to the PARENTS those children point to, keeping rank order
#    (a plain set would throw the ranking away)
parent_ids = list(dict.fromkeys(h["parent_id"] for h in child_hits))
parents    = doc_store.get_many(parent_ids)  # fetch full parent_text

# 3. ship the PARENT text (not the child) to the LLM
context = "\n\n".join(p["parent_text"] for p in parents)

Two more things matter beyond the split:

  1. Prefix the heading path onto the embedded text. A chunk that reads "Customers may cancel within 30 days" is ambiguous in isolation. Embedding "Billing > Refunds > Enterprise\n\nCustomers may cancel within 30 days" instead is not. Free recall gain, zero infra cost.
  2. Type-aware splitters. Never split inside a function body in source code. Never break a table row apart. Repeat column headers on every chunk of a wide table. These move retrieval quality more than swapping embedding models does.

A serious corpus is typed — code, PDFs, support tickets, transcripts, tables — and each type wants its own splitter. The splitter is the same kind of engineering artifact as any other piece of ingest code: versioned, tested, monitored.
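A minimal sketch of the two ideas above, under stated assumptions: `embedding_text` builds the heading-prefixed string that gets embedded, and `split_table` splits a table on row boundaries while repeating the header on every chunk. Both function names, the " > " separator, and the sample table are illustrative choices, not a fixed convention.

```python
def embedding_text(heading_path, chunk_text):
    """Text that actually gets embedded: heading breadcrumb + chunk body."""
    return " > ".join(heading_path) + "\n\n" + chunk_text

def split_table(header, rows, rows_per_chunk):
    """Split a table on row boundaries only, repeating the header on each chunk."""
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        block = rows[start:start + rows_per_chunk]
        chunks.append("\n".join([header] + block))
    return chunks

text = embedding_text(
    ["Billing", "Refunds", "Enterprise"],
    "Customers may cancel within 30 days",
)

# Hypothetical table content, purely for illustration.
table_chunks = split_table(
    header="plan | refund_window_days",
    rows=["starter | 14", "team | 30", "enterprise | 60"],
    rows_per_chunk=2,
)

print(text)
for chunk in table_chunks:
    print("---")
    print(chunk)
```

Only the embedded text carries the prefix. The text you hand to the LLM can stay unprefixed, since the parent passage already has the surrounding context.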

Embeddings: pick a model, then move on

Which embedding model you pick matters less than three operational decisions about how you store the result.

Dimension and storage cost. A 3072-dim float32 vector is 12 KB on disk (3072 numbers × 4 bytes each). Multiply that out across 50M chunks and you are looking at 600 GB before the HNSW graph’s overhead, which typically adds another ~2x on top.
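The arithmetic is worth keeping as a function you can rerun when the corpus or the dimension changes. A small sketch, assuming float32 storage and counting raw vectors only (no graph overhead, no metadata); the exact figure for 3072 dims comes out slightly above the rounded 600 GB quoted above because 3072 × 4 bytes is 12,288 bytes, not an even 12,000.

```python
def index_size_gb(n_chunks, dim, bytes_per_float=4):
    """Raw vector storage in decimal GB, before any index overhead."""
    return n_chunks * dim * bytes_per_float / 1e9

for dim in (3072, 1024, 512, 256):
    print(f"{dim:>4} dims: {index_size_gb(50_000_000, dim):7.1f} GB")
```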

Many recent embedders support MRL — Matryoshka Representation Learning — a training trick that makes a prefix of the model’s output (say the first 256 or 512 floats of a 3072-float vector) usable as a (slightly lower quality) embedding on its own. Think Russian nesting dolls: every prefix is itself a valid embedding. Store only the first 512 floats and you keep most of the quality at one-sixth the storage. Models that do this: OpenAI’s text-embedding-3-*, Nomic Embed v1.5.

import numpy as np
import openai

v = openai.embeddings.create(model="text-embedding-3-large", input=text).data[0].embedding
v = np.asarray(v[:512], dtype=np.float32)  # MRL truncation: first 512 of 3072 (API returns a list)
v = v / np.linalg.norm(v)                  # re-normalize after truncating (see below)

Normalize your vectors. The standard similarity measure between two embeddings is cosine similarity — the cosine of the angle between the two vectors. For speed, vector DBs implement this as a dot product: multiply pairwise, sum the products. The catch: dot product equals cosine only if both vectors have length 1.

If you skip the normalization step (v / np.linalg.norm(v)), longer vectors mechanically score higher against everything just for being longer. Your top-k starts coming back biased toward the same handful of long chunks regardless of query — like a SQL ORDER BY relevance that is secretly sorting by length(text).

The fix is one line. A surprising fraction of production systems get this wrong.
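Here is a tiny sketch of the failure in plain Python, with made-up 2-dim vectors. One document points almost exactly the same way as the query; the other points somewhere else but is five times longer. Raw dot product picks the long one; normalized dot product (cosine) picks the right one.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

query     = [1.0, 0.0]
on_topic  = [0.9, 0.1]   # nearly the same direction as the query, length ~0.9
off_topic = [3.0, 4.0]   # different direction, length 5

docs = [on_topic, off_topic]

# Unnormalized: the longer vector wins on magnitude alone.
raw_winner = max(docs, key=lambda d: dot(query, d))

# Normalized: only direction counts.
cos_winner = max(docs, key=lambda d: dot(normalize(query), normalize(d)))

print("raw dot product picks:", raw_winner)
print("cosine picks:         ", cos_winner)
```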

Tag every vector with the model that produced it. Embeddings from different models live in incompatible coordinate systems and cannot be compared. The day you upgrade from text-embedding-3-small to text-embedding-3-large, every existing vector is dead and needs re-embedding. If you didn’t tag at write time, you cannot tell which model produced which row, and you cannot do a safe cutover:

chunk_record = {
    "id": chunk_id,
    "vector": v,
    "embedder": "text-embedding-3-large",
    "embedder_version": "2024-01",
    "dim": 512,
}

Then you can dual-write to a v1 and v2 index in parallel, leave reads on v1 until v2’s offline eval beats it, then flip:

flowchart LR
    W[Writer] --> V1[(Index v1<br/>small / 1536d)]
    W --> V2[(Index v2<br/>large / 3072d)]
    Q[Query] --> S{Cutover<br/>flag}
    S -->|reads| V1
    S -. shadow reads .-> V2
    V2 -. compare offline .-> EVAL[(Eval)]

    style V2 stroke:#c62828,stroke-width:2px

Same pattern as any other online-schema-migration with dual writes.
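A rough sketch of that cutover, assuming an index object with `upsert` and `search` methods. `ToyIndex` and `DualIndex` are hypothetical names used only to make the example runnable; a real deployment would wrap two collections in your vector DB and ship the shadow log to the offline eval.

```python
class ToyIndex:
    """Stand-in for a real vector index: exact dot-product search in memory."""

    def __init__(self):
        self.rows = {}

    def upsert(self, chunk_id, vector):
        self.rows[chunk_id] = vector

    def search(self, query_vec, k):
        def score(chunk_id):
            return sum(a * b for a, b in zip(query_vec, self.rows[chunk_id]))
        return sorted(self.rows, key=score, reverse=True)[:k]


class DualIndex:
    """Dual-write to two indexes; serve reads from one, shadow-read the other."""

    def __init__(self, v1, v2, serve_from="v1"):
        self.indexes = {"v1": v1, "v2": v2}
        self.serve_from = serve_from   # the cutover flag
        self.shadow_log = []           # (query, served_ids, shadow_ids)

    def upsert(self, chunk_id, vectors_by_version):
        # Each index only ever receives vectors from its own embedder.
        for version, index in self.indexes.items():
            index.upsert(chunk_id, vectors_by_version[version])

    def search(self, query, query_vecs_by_version, k):
        shadow = "v2" if self.serve_from == "v1" else "v1"
        served = self.indexes[self.serve_from].search(
            query_vecs_by_version[self.serve_from], k)
        shadowed = self.indexes[shadow].search(query_vecs_by_version[shadow], k)
        self.shadow_log.append((query, served, shadowed))  # compared offline
        return served


idx = DualIndex(ToyIndex(), ToyIndex())
idx.upsert("c1", {"v1": [1.0, 0.0], "v2": [1.0, 0.0, 0.0]})
idx.upsert("c2", {"v1": [0.0, 1.0], "v2": [0.0, 0.0, 1.0]})

hits = idx.search("q", {"v1": [0.9, 0.1], "v2": [0.8, 0.1, 0.1]}, k=1)

idx.serve_from = "v2"   # flip only after v2 wins the offline eval
```

Note that the query has to be embedded twice during the migration window, once per model, because a v1 query vector is meaningless against the v2 index.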

The model upgrade itself is a real expense: 50M chunks × ~400 tokens × $0.13/M tokens ≈ $2,600 and 12–48 hours of throughput. Plan for it on day one.
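The same estimate as a reusable function. The price and token counts are the ones quoted above; `tokens_per_minute` is an assumed sustained throughput (your rate limit or GPU fleet will set the real number), and it is what turns the token total into wall-clock hours.

```python
def reembed_plan(n_chunks, avg_tokens, usd_per_million_tokens, tokens_per_minute):
    """Return (total_tokens, cost_usd, hours) for re-embedding a whole corpus."""
    total_tokens = n_chunks * avg_tokens
    cost_usd = total_tokens / 1_000_000 * usd_per_million_tokens
    hours = total_tokens / tokens_per_minute / 60
    return total_tokens, cost_usd, hours

total_tokens, cost_usd, hours = reembed_plan(
    n_chunks=50_000_000,
    avg_tokens=400,
    usd_per_million_tokens=0.13,
    tokens_per_minute=10_000_000,   # assumption, not a quoted limit
)
print(f"{total_tokens:,} tokens, ${cost_usd:,.0f}, {hours:.1f} hours")
```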

Domain fine-tuning of the embedder on (query, relevant-chunk) pairs harvested from your own logs is the single biggest quality lever once everything else is solid. It needs a real eval set — which loops back to the unglamorous evaluation work below.

The vector index, and the recall cliff that bites every team

This section is about what is happening inside the vector DB when you ask it for the top-k nearest matches — what controls quality, what controls latency, and the one failure mode that surprises every team the first time they hit it in production.

HNSW is the algorithm doing the actual work. As mentioned in the vocabulary: it’s a graph where each vector points to its nearby neighbors, and a query walks the graph greedily until it cannot get any closer. Three knobs control its behavior:

import faiss
d = 512
index = faiss.IndexHNSWFlat(d, 32)      # 2nd arg is M (edges-per-node), passed
                                        # positionally. Higher M = better
                                        # recall, more RAM (~linear).
index.hnsw.efConstruction = 200         # how thoroughly to search when building.
                                        # Higher = better graph, slower one-time build.
index.add(vectors)                      # float32 array of shape (n, d)

index.hnsw.efSearch = 100               # at query time: how many candidates to
                                        # consider before stopping. Main quality/
                                        # latency knob.
D, I = index.search(query_vec, k=10)    # query_vec: float32, shape (1, d)

efSearch is the lever you tune in production. Going from 50 → 200 typically moves recall@10 from ~92% → ~99% at 2–4x the query latency. On a 50M-vector index, that’s the difference between p99 of 40 ms and p99 of 180 ms. Same shape as any caching trade-off.

Now the part that catches every team eventually. Real queries are almost never “give me the 10 closest chunks period.” They’re “give me the 10 closest chunks where tenant_id='acme' AND document_type='policy'.” That sounds harmless. Here’s what actually happens:

The graph walk visits up to efSearch candidates (say 100). It doesn’t know about your filter while walking. So it visits 100 candidates — and 97 of them happen to belong to other tenants. After applying your filter you have 3 results. But you asked for 10. The vector DB returns those 3 silently. No error. No warning. Seven of your ten slots are empty, and whatever relevant chunks would have filled them are simply missing. Unless you compare the number of hits against k on every query, you have no way to tell.

flowchart LR
    Q[Query] --> H[HNSW traversal]
    H --> C1((candidate))
    H --> C2((candidate))
    H --> C3((candidate))
    H --> Cn((... efSearch=100<br/>candidates visited))
    C1 -. filter:<br/>tenant_id=X .-> X1[reject]
    C2 -. filter .-> X2[reject]
    C3 -. filter .-> M1[match]
    Cn -. filter .-> X3[reject]
    M1 --> R[Return<br/>only 3 hits for k=10<br/>silently]

    style R stroke:#c62828,stroke-width:2px

The closest relational analog: imagine an index scan that the planner stops after looking at 100 rows. Now add a WHERE tenant_id='X' predicate that matches 1% of rows. The scan visits 100, only 1 passes the filter, you get 1 row back. Postgres wouldn’t do this — it would either use a different index or scan more. HNSW doesn’t have those options at query time.

The more selective the filter, the worse this gets. A filter matching <0.1% of the corpus can return zero results when the unfiltered query would have returned plenty.

Fixes, cheapest to most expensive:

  • Crank efSearch up. Cheap; breaks down at very selective filters.
  • Use a vector DB with proper filtered-HNSW support (Qdrant, Weaviate). Better, still imperfect.
  • Partition the index physically by the dominant filter. If 99% of queries filter by tenant_id, give each tenant its own physical index. The “filter” becomes routing, the way a sharded relational DB routes by tenant_id to the right shard. Recall is bounded again because each shard contains only matching rows.
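A toy model of the cliff, and of the cheapest fix, in plain Python. It is a simplification: the "index" is just a list already sorted by distance to the query, and `ann_then_filter` stands in for a graph walk that examines `ef_search` candidates and filters afterwards. The function names, the 1-in-40 tenant ratio, and the `max_ef` budget are all assumptions for illustration. The part worth copying is the guard: compare the hit count against k, widen the search when it falls short, and surface a flag when even that fails.

```python
def ann_then_filter(corpus, ef_search, k, tenant):
    """Model of post-filtered ANN: take the ef_search nearest candidates
    (corpus is pre-sorted by distance), apply the filter, then cut to k."""
    candidates = corpus[:ef_search]
    return [doc for doc in candidates if doc["tenant_id"] == tenant][:k]

def search_with_guard(corpus, k, tenant, ef_search=100, max_ef=6400):
    """Double ef_search until k hits come back or the budget runs out.
    Returns (hits, ef_used, short) so callers can log or alert on shortfalls."""
    ef = ef_search
    while True:
        hits = ann_then_filter(corpus, ef, k, tenant)
        if len(hits) >= k or ef >= max_ef:
            return hits, ef, len(hits) < k
        ef *= 2

# 10,000 docs in distance order; 1 in 40 belongs to tenant "acme".
corpus = [
    {"id": i, "tenant_id": "acme" if i % 40 == 7 else "other"}
    for i in range(10_000)
]

naive = ann_then_filter(corpus, ef_search=100, k=10, tenant="acme")
hits, ef_used, short = search_with_guard(corpus, k=10, tenant="acme")

print(len(naive), "hits without the guard")
print(len(hits), "hits with the guard, ef_search =", ef_used, "short =", short)
```

When `short` comes back true even at `max_ef`, that is the signal that the filter is too selective for this approach and the query belongs on a partitioned index.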

Hybrid retrieval: dense alone almost never wins

The single highest-leverage architectural decision: run two indexes in parallel — a vector index and a full-text (BM25) index — and merge the results.

The two have opposite failure modes:

  • Dense (vector / embedding-based) is good at meaning. It matches "yearly charges" to a chunk about "annual fees" because their embeddings are close. It fails on exact tokens it has never seen — like the error code ERR_OUTLOOK_4823 — because that token never appeared in pretraining and embeds to noise.
  • Lexical (BM25) is the ranking function behind Elasticsearch / OpenSearch / Lucene. It runs over an inverted index — same structure as the full-text indexes you’ve used for 20 years. Finds ERR_OUTLOOK_4823 instantly because it’s looking for the literal string. Can’t connect "yearly" to "annual" at all.
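The lexical side is easy to see in miniature. The sketch below builds a bare inverted index (token to set of doc ids) and does a boolean AND lookup. It deliberately leaves out BM25 scoring, stemming, and real tokenization; the two documents are invented. It shows both halves of the trade: the error code is found by literal match, and the synonym is not found at all.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def lexical_match(index, query):
    """Doc ids containing every query token (boolean AND, no scoring)."""
    postings = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    "kb-1": "Outlook sync fails with ERR_OUTLOOK_4823 after password reset",
    "kb-2": "Annual fees are billed every January",
}
index = build_inverted_index(docs)

print(lexical_match(index, "ERR_OUTLOOK_4823"))  # literal token: found
print(lexical_match(index, "yearly"))            # synonym of "annual": not found
```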

You almost certainly want both. Run the two in parallel and combine the rank orders with RRF — Reciprocal Rank Fusion.

RRF solves a real problem: when two rankers return ranked lists with non-comparable scores (BM25 might return 0–30; dense returns 0–1), you can’t just add them. RRF throws away the actual scores and uses only the rank position of each doc in each list. For every doc, sum 1 / (k + rank) across all rankers, with k = 60 as the conventional constant. Docs that landed near the top of both lists win.

Worked example. Suppose for one query:

Dense ranks:  [docA, docB, docC]
BM25 ranks:   [docC, docD, docA]

RRF scores (with k=60):

docA: 1/(60+1)  + 1/(60+3)  = 0.0323
docB: 1/(60+2)  + 0         = 0.0161
docC: 1/(60+3)  + 1/(60+1)  = 0.0323
docD: 0         + 1/(60+2)  = 0.0161

Final order: A and C tied at top, B and D tied below. You didn’t need to know anything about either ranker’s score distribution.
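If you want to check the worked example yourself, a synchronous, dependency-free version of the fusion is a few lines. `rrf_fuse` is an illustrative name; ranks are 1-based, matching the 1 / (k + rank) formula.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over any number of ranked id lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)

scores = rrf_fuse([
    ["docA", "docB", "docC"],   # dense ranks
    ["docC", "docD", "docA"],   # BM25 ranks
])

for doc_id, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(doc_id, round(score, 4))
```

Because only rank positions go in, adding a third retriever later is just one more list in the argument.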

flowchart LR
    Q[Query] --> D[Dense ANN<br/>top-100]
    Q --> B[BM25<br/>top-100]
    D --> RRF[RRF fuse<br/>k=60]
    B --> RRF
    RRF --> M[~150 unique<br/>candidates]
    M --> RR[Cross-encoder<br/>rerank top-10]

    style D fill:#fff,stroke:#1565c0
    style B fill:#fff,stroke:#2e7d32
    style RRF stroke:#c62828,stroke-width:2px

In code:

import asyncio
from collections import defaultdict

K_RRF = 60

async def hybrid_search(q, k=100):
    dense, lex = await asyncio.gather(
        dense_index.search(embed(q), k=k),
        bm25_index.search(q, k=k),
    )
    scores = defaultdict(float)
    # rank is 1-based, matching the 1 / (k + rank) formula above
    for rank, doc_id in enumerate(dense, start=1):
        scores[doc_id] += 1.0 / (K_RRF + rank)
    for rank, doc_id in enumerate(lex, start=1):
        scores[doc_id] += 1.0 / (K_RRF + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]

On every real corpus I’ve measured, hybrid moves recall@10 by 5–15 points over dense-only at essentially zero added latency (the two retrievers run in parallel; BM25 via OpenSearch / Tantivy is faster than dense ANN at this scale). Skipping the lexical leg looks like simplicity and is actually a permanent quality ceiling.

Reranking: the slow model that gets the last word

The vector search you’ve built so far is fast but only roughly accurate. It returns the 100 vectors closest in space to the query, but “closest in vector space” is not the same thing as “actually answers this question.” Reranking is the step where a slower, more accurate model takes that shortlist of 100 and reorders it carefully — so the few documents you send to the LLM are the genuinely best ones.

Why is the vector search only roughly accurate? Because the vector DB is fast for a specific reason: every document’s embedding was computed once at ingest time. At query time, the model only runs on the query, never on the query and a candidate document together. The model never gets to look at the two side by side. It only measures whether they happen to land near each other in vector space.

That independence is what makes vector lookups microseconds. It also puts a ceiling on quality.

Two model architectures to keep in mind:

  • Bi-encoder — what your vector store uses. The model runs twice and separately: once on the query and once on each document, producing one vector for each. Then it compares those vectors by distance. Because document vectors are computed ahead of time at ingest, query-time work is just one model run on the query plus a fast vector lookup. Microseconds per lookup.
  • Cross-encoder — the model runs once with the query and a candidate document concatenated together as a single input. The output is a relevance score directly. Because the model gets to see both sides at the same time, the score is much more accurate. The price is speed: milliseconds per pair instead of microseconds per lookup, and you have to re-run for every candidate.

You can’t run a cross-encoder over 50M docs per query — the math doesn’t work. But you can run it over the 150-candidate shortlist from hybrid retrieval. This is a classic candidate-generation-then-ranking architecture, identical in shape to how every recommender or feed system works.

flowchart LR
    C[Corpus<br/>50M chunks] -. ANN .-> H[Hybrid top-150]
    H --> CE[Cross-encoder<br/>~568M params<br/>~30-80ms on L4 GPU]
    CE --> T[Top-10]
    T --> L[LLM]

    style C fill:#eee
    style H fill:#fff
    style T stroke:#c62828,stroke-width:2px

50M → 150 is cheap (precomputed-vector lookups). 150 → 10 is where the expensive joint scoring lives. Doing the joint scoring over the full corpus would be ~300,000x more compute per query.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")  # ~568M-parameter open model

async def retrieve_and_rerank(q, top_k=10):
    # hybrid_search is a coroutine, so it has to be awaited
    candidates = await hybrid_search(q, k=150)       # cheap (bi-encoder ANN + BM25)
    # assumes each candidate exposes its chunk text as .text
    pairs = [(q, c.text) for c in candidates]
    scores = reranker.predict(pairs, batch_size=32)  # ~30-80ms on an L4 GPU
    top = sorted(zip(candidates, scores), key=lambda x: -x[1])[:top_k]
    return [c for c, _ in top]

The reranker is also where you enforce diversity. If the top-10 is ten near-identical paragraphs from ten different files, the user gets no new information. Two cheap fixes both work: a per-source cap (at most 2 chunks from any single document), or MMR (Maximal Marginal Relevance) — a re-ordering that penalizes a candidate for being too similar to ones already picked. Same diversity problem as any feed or search ranking system.
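Both fixes fit in a few lines. This is a sketch under assumptions: each candidate is a dict carrying a relevance `score` from the reranker, an embedding `vec`, and a `source_id`; scores are assumed to be on roughly the same 0 to 1 scale as cosine similarity (raw cross-encoder logits would need squashing first); and `lam` is the usual MMR trade-off weight, with 0.7 picked arbitrarily.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr(candidates, top_k, lam=0.7):
    """Greedy Maximal Marginal Relevance: repeatedly pick the candidate with
    the best (relevance - similarity to what is already picked) trade-off."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < top_k:
        def mmr_score(c):
            redundancy = max(
                (cosine(c["vec"], s["vec"]) for s in selected), default=0.0)
            return lam * c["score"] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

def cap_per_source(ranked, max_per_source=2):
    """Keep rank order, but at most max_per_source chunks per document."""
    counts, kept = {}, []
    for c in ranked:
        counts[c["source_id"]] = counts.get(c["source_id"], 0) + 1
        if counts[c["source_id"]] <= max_per_source:
            kept.append(c)
    return kept

candidates = [
    {"id": "a1", "source_id": "A", "score": 0.95, "vec": [1.0, 0.0]},
    {"id": "a2", "source_id": "A", "score": 0.94, "vec": [0.99, 0.05]},  # near-dup of a1
    {"id": "b1", "source_id": "B", "score": 0.80, "vec": [0.0, 1.0]},
]

diverse = [c["id"] for c in mmr(candidates, top_k=2)]
capped  = [c["id"] for c in cap_per_source(candidates, max_per_source=1)]
print(diverse, capped)
```

The per-source cap is the blunter tool and needs no vectors at all; MMR also catches near-duplicates that live in different files.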

Two more options worth knowing about and usually not picking first:

  • ColBERT stores one embedding per token instead of per chunk. At query time, for each query token, find the most similar document token; sum across query tokens. Lands between bi-encoder and cross-encoder on quality and cost. Trade-off: ~10–20x storage for the per-token embeddings.
  • LLM-as-reranker — feed the candidates into a small LLM and ask it to rank them. Slow, expensive, non-deterministic, and usually worse than a dedicated cross-encoder at the same latency budget. The one place it wins: when you need actual reasoning over candidates (“which of these is most recent and most authoritative”), not just relevance.
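The ColBERT scoring rule described above is short enough to write out. This sketch covers only the scoring step (often called MaxSim), with hand-made 2-dim token vectors; producing real per-token embeddings and indexing them is the expensive part and is not shown.

```python
def maxsim_score(query_token_vecs, doc_token_vecs):
    """Late-interaction score: for each query token, take its best-matching
    document token (max dot product), then sum over query tokens."""
    return sum(
        max(sum(q * d for q, d in zip(q_vec, d_vec)) for d_vec in doc_token_vecs)
        for q_vec in query_token_vecs
    )

# Toy token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]                  # two query tokens
doc_a = [[0.9, 0.1], [0.1, 0.8], [0.5, 0.5]]      # covers both query tokens
doc_b = [[0.9, 0.1], [0.8, 0.2]]                  # covers only the first

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

The storage cost falls straight out of the data shape: `doc_a` is three vectors where a bi-encoder would store one.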

Query understanding: the second-most-underbuilt piece

Users don’t write the queries documents are written for. They write fragments, follow-ups, and pronouns. Your retrieval pipeline has to bridge that gap before nearest-neighbor search has any chance of working.

The single highest-ROI addition to a multi-turn product is query rewrite — a small, fast LLM that turns a follow-up into a self-contained search query.

flowchart LR
    H["History:<br/>'what is the refund policy?'<br/>'how long do we have?'"] --> RW[Small LLM<br/>rewrite ~150ms]
    Q["'and for enterprise?'"] --> RW
    RW --> O["'How long do enterprise<br/>customers have to request<br/>a refund?'"]
    O --> S[Retrieve]

    style Q fill:#fff,stroke:#c62828
    style O fill:#fff,stroke:#1565c0

REWRITE_PROMPT = """Rewrite the user's latest message into a standalone search query.
Resolve pronouns and references using the conversation history.
Output only the rewritten query.

History:
{history}

Latest: {query}
Rewritten:"""

def rewrite(query, history):
    return small_llm(REWRITE_PROMPT.format(history=history, query=query))

A small model (GPT-4o-mini / Haiku / 8B-class) at ~150 ms is enough. Always log the rewrite alongside the original query. When a user complains about a bad result, the rewrite is usually where the wheels came off — and you cannot debug what you didn’t log.

A few other techniques, in rough order of how often they earn their cost:

  • Query decomposition — for compound questions (“compare SSO setup for Okta vs Entra ID”), split into independent sub-queries, retrieve for each, merge. Cuts down on the case where the top-k for the compound query only catches one side of the comparison.
  • HyDE (Hypothetical Document Embeddings) — ask the LLM to draft a plausible answer to the query, embed that, search with it. Useful when corpus vocabulary differs sharply from how users phrase questions (legal, medical). Adds an LLM call to the critical path.
  • Multi-query expansion — generate 3–5 paraphrases, retrieve each, RRF-fuse. Cheap, mildly helpful, easy to abuse past 5.
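For decomposition, the merge step is where the value is. A minimal sketch, assuming the sub-queries have already been produced (typically by the same small LLM that does the rewrite) and that `retrieve(query, k)` returns ranked doc ids. Interleaving round-robin guarantees that each side of a comparison gets slots in the merged list; `fake_retrieve` and its canned results exist only to make the example runnable.

```python
from itertools import zip_longest

def decomposed_retrieve(sub_queries, retrieve, k):
    """Retrieve per sub-query, then interleave round-robin so every
    sub-query is represented in the merged top-k. Duplicates are dropped."""
    per_query = [retrieve(q, k) for q in sub_queries]
    merged, seen = [], set()
    for tier in zip_longest(*per_query):      # rank 1 of each, then rank 2, ...
        for doc_id in tier:
            if doc_id is not None and doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:k]

# Canned results standing in for a real retriever.
FAKE_RESULTS = {
    "SSO setup for Okta":     ["okta-1", "okta-2", "sso-overview"],
    "SSO setup for Entra ID": ["entra-1", "sso-overview", "entra-2"],
}

def fake_retrieve(query, k):
    return FAKE_RESULTS[query][:k]

merged = decomposed_retrieve(list(FAKE_RESULTS), fake_retrieve, k=4)
print(merged)
```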

Multi-tenancy and ACLs: row-level security in your vector store

Production RAG never has a single user. Different users are allowed to see different documents, and the search has to respect that — exactly the same problem as row-level security in a SQL database (where every query is implicitly filtered to “rows this user is allowed to read”), just over vectors instead of rows.

The rules saying who can read what are called ACLs (Access Control Lists). You need to enforce them at retrieval time, not after, because if a forbidden document leaks into the prompt, the LLM will happily quote from it. Three architectures, in increasing strictness — and they map directly onto patterns you’ve seen in any multi-tenant SQL system:

# 1. POST-FILTER: fetch then drop. Simple, silently broken when the
#    accessible subset is small (over-fetch enormously or return nothing).
hits = index.search(v, k=200)
visible = [h for h in hits if user.can_read(h.doc_id)][:10]

# 2. PRE-FILTER: push ACL into the vector search (row-level security
#    inside the vector DB). Correct, but subject to the recall cliff
#    from earlier for selective filters.
hits = qdrant.search(
    query_vector=v,
    query_filter={"must": [{"key": "acl_principals",
                            "match": {"any": list(user.principal_set())}}]},
    limit=10,
)

# 3. PER-TENANT PARTITION: physically separate index per tenant.
#    Same as sharding a SQL DB by tenant_id. Highest ops cost
#    (you now have N indexes to manage), but the only architecture
#    with bounded recall AND bounded latency AND clean blast-radius
#    isolation (one tenant's index corruption can't take down others).
tenant_index = index_for(user.tenant_id)
hits = tenant_index.search(v, k=10)

flowchart TB
    subgraph Post-filter
      direction LR
      Q1[Query] --> I1[(Single index)]
      I1 --> F1[Filter after]
      F1 --> R1[may over-fetch<br/>or return empty]
    end
    subgraph Pre-filter
      direction LR
      Q2[Query + ACL] --> I2[(Single index<br/>with filter)]
      I2 --> R2[recall cliff<br/>on selective filters]
    end
    subgraph Per-tenant partition
      direction LR
      Q3[Query] --> RT{Route by<br/>tenant}
      RT --> T1[(Tenant A)]
      RT --> T2[(Tenant B)]
      RT --> T3[(Tenant C)]
      T1 --> R3[bounded recall<br/>and latency]
    end

    style R1 stroke:#c62828
    style R2 stroke:#c62828
    style R3 stroke:#2e7d32,stroke-width:2px

Pre-filter is fine when at least ~10% of the corpus is accessible to any given user. Above a few thousand tenants, or under data-residency requirements (data physically must stay in a specific region), per-tenant partitioning is the only thing that holds.

ACLs change continuously — users join groups, docs get reshared, folders lock down. The right primitive is a worker that subscribes to permission-change events from your identity system (Active Directory, Okta) and incrementally updates each document’s allowed-users set in the index. Run a periodic full reconciler alongside it to catch missed events. Same pattern as any other CDC (Change Data Capture) pipeline — every change in an upstream source becomes an event, and a downstream worker keeps a derived store in sync. The corpus index is one of those derived stores: its contents are computed from upstream sources of truth and have to be kept up to date. Nothing about that is RAG-specific.
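A sketch of the two halves, the incremental worker and the reconciler, over an in-memory dict standing in for the ACL payload stored next to each vector. The event shape (`doc_id`, `op`, `principal`) is an assumption; real identity systems emit their own formats and you would translate at the edge.

```python
def apply_acl_event(acl_index, event):
    """Incrementally apply one permission-change event to a document's
    allowed-principals set."""
    principals = acl_index.setdefault(event["doc_id"], set())
    if event["op"] == "grant":
        principals.add(event["principal"])
    elif event["op"] == "revoke":
        principals.discard(event["principal"])
    else:
        raise ValueError(f"unknown op: {event['op']}")

def reconcile(acl_index, source_of_truth):
    """Periodic full pass: repair any document whose ACL drifted from the
    source of truth and drop documents that no longer exist.
    Returns the ids that needed repair, which is worth alerting on."""
    repaired = []
    for doc_id, truth in source_of_truth.items():
        if acl_index.get(doc_id) != truth:
            acl_index[doc_id] = set(truth)
            repaired.append(doc_id)
    for doc_id in list(acl_index):
        if doc_id not in source_of_truth:
            del acl_index[doc_id]
            repaired.append(doc_id)
    return repaired

acl_index = {"doc-1": {"group:eng"}}
apply_acl_event(acl_index, {"doc_id": "doc-1", "op": "grant", "principal": "group:sales"})
apply_acl_event(acl_index, {"doc_id": "doc-1", "op": "revoke", "principal": "group:eng"})

# The event for doc-2 was never delivered; the reconciler catches it.
source_of_truth = {"doc-1": {"group:sales"}, "doc-2": {"group:legal"}}
repaired = reconcile(acl_index, source_of_truth)
print(repaired, acl_index)
```

A non-empty `repaired` list is a measurement of how lossy your event stream is. If it is never empty, fix the stream before trusting the index.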

Freshness: the index is a derived table

A corpus indexed nightly is fine for static knowledge bases and unusable for anything operational. The product question — can a user ask about a document they edited five minutes ago? — drives the whole architecture.

Conceptually the index is a derived table: it does not own any data of its own, it is computed from somewhere else (the documents) and has to be kept in sync whenever that somewhere else changes. A SQL materialized view is the same idea. So the mechanics here are the same as any other materialized-view system.

Every edit in the source — a database write captured by a binlog, a SharePoint webhook firing on a doc edit, an S3 object-created event — produces a change event. An ingest worker pool consumes those events, re-chunks the affected document, and re-embeds only the chunks whose text actually changed. Most edits touch a handful of chunks out of hundreds, so the work per edit stays small.

Deletes go through as tombstones — small marker records that say “this doc id no longer exists” so the index can drop it on the next pass. A reconciler runs nightly to diff the corpus snapshot against the index and repair any gaps the event stream missed.

flowchart LR
    S[(Source<br/>SharePoint /<br/>DB / S3)] -- CDC events --> Q[(Change<br/>queue)]
    Q --> W[Ingest worker]
    W --> CH[Re-chunk]
    CH --> DIFF{Diff vs<br/>last snapshot}
    DIFF -- added/modified --> E[Embed only<br/>changed chunks]
    DIFF -- deleted --> TB[Tombstone]
    E --> VI[(Vector index)]
    E --> LI[(BM25 index)]
    TB --> VI
    TB --> LI

    R[Reconciler<br/>nightly] -.full diff.-> VI
    R -.full diff.-> LI

    style DIFF stroke:#c62828,stroke-width:2px

async def ingest_worker(change_stream):
    async for event in change_stream:   # CDC, SharePoint webhook, S3 events
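        if event.type == "delete":
            await vector_index.delete(doc_id=event.doc_id)
            await bm25_index.delete(doc_id=event.doc_id)
            # also drop the stored chunks, or a later re-create of this doc id
            # would be diffed against stale state (assumes chunk_store.delete exists)
            await chunk_store.delete(event.doc_id)
            continue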

        new_chunks = type_aware_chunk(event.doc)
        old_chunks = await chunk_store.get(event.doc_id)
        diff = chunk_diff(old_chunks, new_chunks)  # only re-embed changed

        if diff.added or diff.modified:
            vectors = await embedder.embed_batch([c.text for c in diff.upsert])
            await vector_index.upsert(diff.upsert, vectors)
            await bm25_index.upsert(diff.upsert)
        if diff.deleted:
            await vector_index.delete_ids(diff.deleted)
            await bm25_index.delete_ids(diff.deleted)
        await chunk_store.put(event.doc_id, new_chunks)
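
The worker leans on `chunk_diff` without showing it. One way it could work, as a sketch: compare content hashes keyed by a chunk id. The `Chunk` and `ChunkDiff` shapes here are assumptions chosen to match the attribute names the worker uses (`added`, `modified`, `deleted`, `upsert`), not the actual implementation:

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Chunk:
    chunk_id: str   # assumed stable across re-chunking, e.g. doc id + section path
    text: str


@dataclass
class ChunkDiff:
    added: list = field(default_factory=list)
    modified: list = field(default_factory=list)
    deleted: list = field(default_factory=list)   # chunk ids only

    @property
    def upsert(self):
        # Everything that needs a fresh embedding.
        return self.added + self.modified


def _digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_diff(old_chunks, new_chunks) -> ChunkDiff:
    """Classify new chunks as added or modified, and list vanished chunk ids."""
    old = {c.chunk_id: _digest(c.text) for c in (old_chunks or [])}
    diff = ChunkDiff()
    seen = set()
    for c in new_chunks:
        seen.add(c.chunk_id)
        if c.chunk_id not in old:
            diff.added.append(c)
        elif old[c.chunk_id] != _digest(c.text):
            diff.modified.append(c)
    diff.deleted = [cid for cid in old if cid not in seen]
    return diff
```

The sketch is only as good as the chunk ids. With purely positional ids, an insertion near the top of a document shifts every later id, and the diff degrades to re-embedding most of the document.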

Doing this correctly under real churn (concurrent edits to the same document inside the embed-and-index lag, bulk uploads, upstream replays of a day’s worth of events) is the same family of distributed-systems problems I wrote about in “replacing manual pipelines with workers” and “data infrastructure being the bottleneck for AI.” Nothing new under the sun. The corpus index is a derived table; treat it as one.
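
One common guard against replays and out-of-order events is a per-document version check. The sketch below assumes every change event carries a monotonically increasing `version` for its document (a log sequence number or a source revision id), which is not something the worker above shows. `VersionGate` is a hypothetical name, and the in-memory dict stands in for a durable store:

```python
class VersionGate:
    """Tracks the highest source version already applied per document."""

    def __init__(self):
        self._applied = {}   # doc_id -> highest version applied so far

    def should_apply(self, doc_id: str, version: int) -> bool:
        # Replayed events (same version) and late events (older version)
        # are both rejected.
        return version > self._applied.get(doc_id, -1)

    def mark_applied(self, doc_id: str, version: int) -> None:
        # Never move the watermark backwards, even if called out of order.
        self._applied[doc_id] = max(version, self._applied.get(doc_id, -1))
```

With several workers, the check and the mark have to be a single atomic compare-and-set in the backing store; otherwise two workers can both pass the check for the same document.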

Evaluation: per-stage metrics, or you’re flying blind

Most RAG systems in production are not evaluated. Teams ship, watch the support queue, react. This stops working past the demo.

Eval has to be per-stage, because the failure modes are per-stage. A retrieval miss (the right document never reached the prompt) and a generation failure (the LLM was handed the right document but hallucinated anyway) look identical in a user complaint, but they live in completely different parts of the code, and you cannot fix what you cannot localize.

flowchart LR
    Q[Query] --> RET[Retrieve]
    RET --> M1[/"recall@k<br/>MRR<br/>nDCG"/]
    RET --> RR[Rerank]
    RR --> M2[/"context<br/>precision"/]
    RR --> LLM[LLM]
    LLM --> M3[/"faithfulness<br/>answer relevance"/]
    LLM --> ANS[Answer]
    ANS --> U[User]
    U -. thumbs / clicks /<br/>did-rephrase .-> ON[(Online signal)]

    style M1 fill:#fff,stroke:#1565c0
    style M2 fill:#fff,stroke:#1565c0
    style M3 fill:#fff,stroke:#1565c0
    style ON fill:#fff,stroke:#c62828

Retrieval metrics first. They don’t need an LLM and run offline against a labeled set of (query, [correct_doc_ids]) pairs:

  • recall@k: fraction of queries where at least one correct doc made it into the top-k. Strictly speaking this is hit rate@k (textbook recall@k is the fraction of all correct docs that appear in the top-k); the two coincide when each query has exactly one correct doc. The single most important number. Track k=5, k=10, k=50.
  • MRR (Mean Reciprocal Rank): average of 1/rank of the first correct doc, over all queries. Captures whether correct docs are near the top, not just present. Correct doc at rank 1 contributes 1.0; at rank 5 contributes 0.2; not in top-k at all contributes 0.
  • nDCG@k (Normalized Discounted Cumulative Gain): rank-aware like MRR, but it handles graded relevance (“doc A is very relevant, doc B is somewhat”) and multiple correct docs per query. Standard in IR; useful with richer labels than yes/no.
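
def recall_at_k(retrieved, relevant, k):
    # Per-query hit: 1 if any relevant doc appears in the top-k, else 0.
    return int(bool(set(retrieved[:k]) & set(relevant)))

def mrr(retrieved, relevant):
    # Reciprocal rank of the first relevant doc; 0.0 if none was retrieved.
    relevant = set(relevant)
    for i, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0

def eval_retriever(retriever, labeled):
    # labeled: non-empty list of (query, [correct_doc_ids]) pairs.
    rows = []
    for query, relevant in labeled:
        retrieved = retriever(query)   # run each query once, score it three ways
        rows.append({"r5":  recall_at_k(retrieved, relevant, 5),
                     "r10": recall_at_k(retrieved, relevant, 10),
                     "mrr": mrr(retrieved, relevant)})
    # Average each metric over all queries.
    return {name: sum(row[name] for row in rows) / len(rows) for name in rows[0]}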
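
The functions above cover binary labels. For graded labels, nDCG@k can be sketched as follows, assuming labels arrive as a `{doc_id: grade}` dict per query and using the linear-gain form of DCG (some references use `2**grade - 1` as the gain instead). The function names are illustrative:

```python
import math


def dcg_at_k(gains, k):
    # gains[i] is the graded relevance of the doc at rank i + 1;
    # the log2 discount makes lower ranks count for less.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(retrieved, grades, k):
    """grades: {doc_id: relevance grade}; missing or 0 means not relevant."""
    gains = [grades.get(doc_id, 0) for doc_id in retrieved]
    # The ideal ranking lists the labeled docs from highest grade to lowest.
    ideal = sorted(grades.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that returns the labeled docs in grade order scores 1.0, and every swap that pushes a higher-grade doc down lowers the score.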

End-to-end metrics use an LLM-as-judge: a strong model that scores each (query, retrieved context, final answer) triple against a rubric. The three to implement:

  • Context precision — of the chunks you retrieved, what fraction were actually relevant? Catches “we retrieved 10 docs but 8 were noise.”
  • Faithfulness — of the claims in the answer, what fraction are supported by the retrieved context? Catches hallucination directly.
  • Answer relevance — does the answer address the question, regardless of correctness? Catches off-topic answers.

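The ragas and trulens libraries implement all three with reasonable defaults, under slightly different names (TruLens calls them context relevance, groundedness, and answer relevance). As a rough guide, generic LLM judges agree with humans ~70–80% of the time and domain-tuned ones can reach 90%+, so expect to tune the rubric prompts for your domain.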
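
To make the definitions concrete, here is a library-free sketch of the two fraction-style metrics. It assumes a `judge` callable that answers one yes/no rubric question per call; in practice that would wrap an LLM call. The prompt wording and function names are placeholders, and this is not the exact formula ragas or trulens uses:

```python
from typing import Callable, Sequence

# Assumed interface: the judge takes a rubric prompt and returns True or False.
Judge = Callable[[str], bool]


def context_precision(question: str, chunks: Sequence[str], judge: Judge) -> float:
    """Fraction of retrieved chunks the judge marks relevant to the question."""
    if not chunks:
        return 0.0
    verdicts = [judge(f"Question: {question}\nChunk: {c}\nIs the chunk relevant?")
                for c in chunks]
    return sum(verdicts) / len(verdicts)


def faithfulness(claims: Sequence[str], context: str, judge: Judge) -> float:
    """Fraction of answer claims the judge marks as supported by the context."""
    if not claims:
        return 0.0   # design choice: an answer with no claims scores zero
    verdicts = [judge(f"Context: {context}\nClaim: {c}\nIs the claim supported?")
                for c in claims]
    return sum(verdicts) / len(verdicts)
```

Splitting an answer into individual claims is itself usually a judge call, which the sketch leaves out.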

The labeled set itself is the hard problem. Pragmatic path:

  1. Hand-write 50–100 queries with the answers you’d expect, with product, support, and domain experts.
  2. Mine logs. For every session with positive feedback (thumbs-up, user clicked a citation, user did not rephrase their question), capture the query and the chunks the user implicitly endorsed.
  3. Expand with an offline LLM judge to thousands of graded (query, doc) pairs.
  4. Audit a stratified sample by hand quarterly (some easy queries, some hard, some across each document type) and use the human judgments to calibrate how much you should trust the LLM judge.
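
For the calibration in step 4, one reasonable way to quantify trust is raw agreement plus Cohen's kappa, which corrects agreement for chance. The sketch assumes the human and judge labels come as two equal-length, non-empty lists over the same audited items; the function names are illustrative:

```python
from collections import Counter


def agreement_rate(human, judge):
    """Raw fraction of items where the judge label equals the human label."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    n = len(human)
    observed = agreement_rate(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label if each
    # labeled independently at their own observed label frequencies.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:   # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Raw agreement looks flattering when one label dominates the sample, so it helps to report both numbers per stratum.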

Pair offline metrics with one or two online signals — thumbs, citation-click rate, did-the-user-rephrase rate — and alert if any of them drift week-over-week. Same SRE shape as any other production system.
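
A week-over-week drift check can be as small as the sketch below. It assumes every signal is expressed so that higher is better (track a "did not rephrase" rate, not the rephrase rate). The 5% relative-drop threshold is an arbitrary placeholder, and the signal names in the usage note are made up:

```python
def drifted(this_week: float, last_week: float, max_rel_drop: float = 0.05) -> bool:
    """True when a higher-is-better metric fell by more than max_rel_drop."""
    if last_week <= 0:
        return False   # no usable baseline to compare against
    return (last_week - this_week) / last_week > max_rel_drop


def check_signals(current: dict, previous: dict, max_rel_drop: float = 0.05) -> list:
    """Names of signals present in both weeks that dropped past the threshold."""
    return sorted(name for name in current
                  if name in previous
                  and drifted(current[name], previous[name], max_rel_drop))
```

For example, `check_signals({"thumbs_up_rate": 0.70}, {"thumbs_up_rate": 0.80})` flags `thumbs_up_rate`, since a fall from 0.80 to 0.70 is a 12.5% relative drop.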

The generation step, briefly

By the time the LLM gets the prompt, you’ve already decided how well it can possibly do.

PROMPT = """Answer the question using only the numbered sources below.
Cite sources inline as [1], [2], etc.
If the sources do not contain enough information to answer, say so and do not guess.

Sources:
{sources}

Question: {question}
Answer:"""

def build_prompt(question, chunks):
    sources = "\n\n".join(f"[{i+1}] ({c.source_id}) {c.text}"
                          for i, c in enumerate(chunks))
    return PROMPT.format(sources=sources, question=question)
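
The `[n]` markers the prompt asks for only become useful once they are mapped back to source ids and checked. A sketch of that step, assuming chunks are passed in the same order `build_prompt` numbered them; `resolve_citations` and the stand-in `Chunk` tuple are hypothetical names:

```python
import re
from collections import namedtuple

# Stand-in for the chunk objects used above; only source_id and text are assumed.
Chunk = namedtuple("Chunk", "source_id text")

CITATION = re.compile(r"\[(\d+)\]")


def resolve_citations(answer: str, chunks):
    """Return (cited source ids in first-mention order, invalid citation numbers)."""
    cited, invalid = [], []
    for match in CITATION.finditer(answer):
        n = int(match.group(1))
        if 1 <= n <= len(chunks):
            source_id = chunks[n - 1].source_id   # markers are 1-based
            if source_id not in cited:
                cited.append(source_id)
        elif n not in invalid:
            invalid.append(n)   # the model cited a source it was never given
    return cited, invalid
```

A non-empty `invalid` list is a cheap faithfulness signal that can be logged without a judge model.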

Things worth getting right at this stage:

  • Cap context well below the model’s max. Frontier models advertise 128k–200k context windows, but answer quality often degrades long before that. The “lost in the middle” effect is real: in a long prompt, models tend to use information near the start and the end more reliably than information in the middle. 4–12k tokens of well-ranked context beats 50k tokens of marginal context.
  • Tell the model to refuse on insufficient context. That one sentence in the prompt is the difference between a system that occasionally hallucinates and one that often does.
  • Stream the answer text, but resolve citations only once generation is complete, so a [n] marker that is still half-emitted, or that points at a source that was never provided, is not rendered as a link mid-stream.
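
Capping context can be done with a simple greedy pack over the reranked chunks. The sketch below assumes chunks arrive as strings in rank order. The default token estimate (about four characters per token) is a rough stand-in for the model's real tokenizer, and `pack_context` is a hypothetical helper name:

```python
def pack_context(ranked_chunks, max_tokens, count_tokens=None):
    """Keep the highest-ranked chunks that fit in max_tokens, in rank order."""
    if count_tokens is None:
        # Assumption: ~4 characters per token, a rough estimate for English
        # text. Swap in the model's real tokenizer where the budget is tight.
        count_tokens = lambda text: max(1, len(text) // 4)
    kept, used = [], 0
    for text in ranked_chunks:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break   # stop at the first chunk that does not fit
        kept.append(text)
        used += cost
    return kept
```

Stopping at the first chunk that does not fit, instead of skipping it and trying smaller ones, keeps the prompt a strict prefix of the ranking, which makes retrieval metrics and prompt contents easier to compare.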

Model choice within a generation of the frontier is mostly a cost-and-latency decision. The day you swap GPT-4o-mini for Haiku 4.5, the retrieval pipeline barely notices.

Putting it all together

flowchart LR
    subgraph Ingest
      S[(Sources<br/>SharePoint, Confluence,<br/>Git, S3, DBs)] --> CDC[Change feed<br/>workers]
      CDC --> EX[Extract &amp;<br/>type-aware chunk]
      EX --> EMB[Embed<br/>incrementally]
      EMB --> VI[(Vector index<br/>HNSW / per-tenant)]
      EX --> LI[(Lexical index<br/>BM25)]
      CDC --> ACL[ACL projection]
      ACL --> VI
      ACL --> LI
    end
    subgraph Query
      Q[User query +<br/>conversation] --> QR[Query rewrite /<br/>decompose]
      QR --> DR[Dense retrieve<br/>top-100]
      QR --> BR[BM25 retrieve<br/>top-100]
      DR --> F[RRF fuse]
      BR --> F
      F --> RR[Cross-encoder<br/>rerank top-10]
      RR --> G[LLM with<br/>cited context]
      G --> U[User]
      U -. feedback .-> EVAL[(Eval store)]
      EVAL -. weekly .-> QR
      EVAL -. weekly .-> EMB
    end

Notice how much of the architecture is retrieval and how little is the LLM. The LLM is one box on the right; everything else is the system deciding what gets sent to it.

Closing thought

The framing that RAG is an “AI” technique has done the field a disservice. RAG is an information-retrieval system with an LLM bolted onto the output, and information retrieval is a discipline more than fifty years old, with well-understood eval methodology and most of the hard problems solved at least once. Teams that ship good RAG products recognize this and staff accordingly: search engineers, data engineers, infra engineers, with one or two people on prompting and the model.

Teams that struggle treat the whole thing as a prompting problem and keep swapping models hoping the next one fixes retrieval. It will not. The model cannot answer from documents it never received, cannot cite documents it never saw, cannot refuse based on context it was never given. Every one of those failures is upstream of the model, in the indexes, in the chunking, in the ACL projection, in the freshness pipeline, in the eval. That is where the work is, and that is where the durable advantage of an AI-backed product actually accumulates.

If your RAG system isn’t behaving the way you want, the prompt is almost never the answer. Look upstream.