Building a Production RAG System: Ingestion, Retrieval, Evaluation, and Observability

March 20, 2026

Aloy Sathekge

AuditRAG

Most RAG tutorials stop at "embed some text, search a vector database, call an LLM." That's maybe 20% of what a production system needs. The other 80% is everything nobody talks about: how do you know your retrieval is actually working? What happens when you have thousands of chunks across dozens of documents? How do you track cost and latency per query?

I built AuditRAG to answer those questions — a fullstack RAG platform where you upload any PDF, chat with it, evaluate retrieval quality, and monitor everything. This post covers the architecture, the hard problems I ran into, and what I'd do differently.

The Architecture

The system breaks into five layers:

PDF Upload → Extract → Chunk → Embed → Qdrant (vector DB)
                                  ↓
                        LLM generates Q&A pairs → PostgreSQL

User Question → Hybrid Search (dense + sparse + RRF) → LLM → Cited Answer
                                                         ↓
                                                   Query telemetry → PostgreSQL

Backend: FastAPI serving the ingestion, retrieval, generation, evaluation, and observability APIs. Deployed on EC2 behind nginx.

Frontend: React + TypeScript + Tailwind. Three screens — chat, evaluation dashboard, and observability. Deployed on Vercel.

Storage: Qdrant for vector search, PostgreSQL for telemetry and auto-generated evaluation data.

Ingestion: More Than Just Embedding

The ingestion pipeline has four stages:

  1. Extract — pdfplumber pulls text page-by-page. Simple, but handles most PDFs well.
  2. Chunk — Sliding window over the full text. 512 words per chunk with 50-word overlap. Each chunk carries metadata: doc_name, page number, knowledge_base.
  3. Embed — Either local sentence-transformers (BAAI/bge-small-en-v1.5) or OpenAI's text-embedding-3-small. Local is free but slower on first load; OpenAI is fast and consistent.
  4. Upsert — Batched writes to Qdrant with deduplication.
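The sliding-window chunker in step 2 fits in a few lines. This is a simplified sketch — the real pipeline also attaches doc_name, page number, and knowledge_base metadata to each chunk:

```python
def chunk_words(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into word windows of `size`, sharing `overlap` words
    between consecutive chunks so facts on a boundary appear in both."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last window already reached the end of the document
    return chunks
```

The overlap is what keeps a sentence that straddles a chunk boundary retrievable from at least one side.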

The interesting part is what happens after embedding: the system auto-generates Q&A pairs. For every third chunk, the LLM creates 2-3 question-answer pairs grounded in that chunk's text. These get stored in PostgreSQL and become the evaluation dataset.

This means every document you ingest comes with its own test suite.

QA_SYSTEM_PROMPT = """You are a Q&A dataset generator. Given a text chunk
from a document, generate question-answer pairs that test comprehension
of the key facts in the text.
- Generate 2-3 question-answer pairs per chunk.
- Questions should be answerable ONLY from the given text.
- Answers should be concise and factual.
- Return valid JSON: a list of objects with "question" and "answer" keys."""

Retrieval: Why Hybrid Search Matters

I implemented three retrieval modes:

  • Dense — Semantic vector search in Qdrant. Good at understanding meaning, bad at exact keyword matches.
  • Sparse — BM25 keyword search. Good at exact terms, bad at paraphrasing.
  • Hybrid — Both, fused with Reciprocal Rank Fusion (RRF).

Hybrid is the default because neither dense nor sparse alone is reliable enough. A query like "What was the Q2 2023 revenue?" needs semantic understanding (what "revenue" means in context) AND keyword matching (the exact string "Q2 2023").

RRF fusion is elegant — it doesn't care about score scales:

def _rrf_score(ranks: list[int], k: int = 60) -> float:
    return sum(1.0 / (k + r) for r in ranks)

Each retrieval method ranks its results. RRF combines ranks, not scores, so you don't need to normalize anything.
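Putting it together, fusing the dense and sparse rankings is just summing reciprocal-rank contributions per document. A minimal sketch over document ID lists:

```python
def rrf_fuse(dense_ids: list[str], sparse_ids: list[str],
             k: int = 60, top_n: int = 5) -> list[str]:
    """Combine two ranked ID lists with Reciprocal Rank Fusion.
    Ranks start at 1; a document found by both retrievers accumulates
    score from each list, so agreement is rewarded."""
    scores: dict[str, float] = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A document ranked 2nd by dense and 1st by sparse beats one ranked 1st by dense alone — exactly the agreement bonus you want from fusion.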

The Qdrant Index Lesson

One bug cost me hours: Qdrant requires a payload index to filter on fields like knowledge_base or doc_name. Without it, filtered queries return a 400 error. Unfiltered queries work fine, which makes it confusing to debug.

The fix was creating keyword indexes at startup:

for field in ("knowledge_base", "doc_name"):
    client.create_payload_index(
        collection_name=QDRANT_COLLECTION,
        field_name=field,
        field_schema=PayloadSchemaType.KEYWORD,
    )

Small thing, but it's the kind of production detail that tutorials skip.

Generation: Two Modes

The LLM generates answers using only the retrieved context. The system prompt is strict:

You answer questions using only the provided context.
- Base your answer only on the context below.
- If the context does not contain enough information, say so clearly.
- When you use a number or fact, cite the source as [doc_name, page X].
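Assembling the grounded prompt is mostly string work. A sketch (field names are illustrative) that prefixes each retrieved chunk with the metadata the citation format needs:

```python
def build_user_prompt(question: str, chunks: list[dict]) -> str:
    """Label each retrieved chunk with its source so the model can
    emit [doc_name, page X] citations verbatim."""
    blocks = [
        f"[{c['doc_name']}, page {c['page']}]\n{c['text']}"
        for c in chunks
    ]
    return "Context:\n\n" + "\n\n".join(blocks) + f"\n\nQuestion: {question}"
```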

I added a concise mode specifically for evaluation. The default mode produces verbose, cited answers — great for users, terrible for scoring. The concise prompt returns just the direct answer with no formatting:

Respond with ONLY the direct answer — no preamble, no citations,
no markdown formatting. Example: "Seth Weidman" not
"The author is **Seth Weidman** [source, page 1]."

The evaluation harness uses concise mode automatically. This one change dramatically improved eval scores without changing retrieval quality at all.

Evaluation: The Part Most People Skip

This is what separates a demo from a real system. When you upload a PDF, the auto-generated Q&A pairs become your test suite. Hit /evaluate and the harness:

  1. Loads Q&A pairs from PostgreSQL
  2. Runs each question through the full pipeline (retrieve + generate)
  3. Compares the RAG answer against the gold answer
  4. Computes metrics: exact match rate, token F1, average latency, average cost
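The harness loop itself is simple — the value is in persisting per-question results. A sketch where `answer_fn` stands in for the full retrieve-and-generate pipeline:

```python
def run_eval(qa_pairs: list[dict], answer_fn) -> list[dict]:
    """Run each gold question through the pipeline and record the
    prediction next to the expected answer for per-question review."""
    results = []
    for pair in qa_pairs:
        predicted = answer_fn(pair["question"])
        results.append({
            "question": pair["question"],
            "predicted": predicted,
            "expected": pair["answer"],
            "exact_match": predicted.strip().lower() == pair["answer"].strip().lower(),
        })
    return results
```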

Token F1 is more useful than exact match for RAG. The RAG might answer "Seth Weidman wrote the book" while the gold answer is just "Seth Weidman". Exact match says that's wrong. Token F1 measures token overlap and gives partial credit.

Even with concise mode and answer cleaning (stripping markdown, citations, trailing periods), exact match stays low. That's expected — and it's why token F1 exists.
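Token F1 plus the cleaning pass fits in a few lines. The normalization rules below are illustrative, not the exact ones AuditRAG applies:

```python
import re
import string
from collections import Counter

def clean(text: str) -> str:
    """Strip [doc, page N] citations, markdown emphasis, punctuation, and case."""
    text = re.sub(r"\[[^\]]*\]", " ", text)  # drop citations
    text = text.replace("*", " ").lower()    # drop bold/italic markers
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def token_f1(predicted: str, gold: str) -> float:
    """Harmonic mean of token precision and recall: partial credit for
    answers that overlap the gold answer without matching it exactly."""
    pred_tokens = clean(predicted).split()
    gold_tokens = clean(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

With this cleaning, `token_f1("**Seth Weidman** [source, page 1]", "Seth Weidman")` scores 1.0 while exact match scores 0 — which is the whole point.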

The frontend shows per-question results so you can see exactly where retrieval fails:

| Question | RAG Answer | Expected | Match |
|----------|-----------|----------|-------|
| Who is the author? | Seth Weidman | Seth Weidman | Yes |
| What is Ch. 3 about? | Neural network training | Training neural networks from scratch | No |


The second row is factually correct but doesn't match. Token F1 catches that.

Observability: Know What Your System Does

Every query logs to PostgreSQL: the question, retrieval latency, generation latency, token usage, cost, and which model was used. The /metrics endpoint aggregates:

  • P50 and P95 latency — Median vs. tail performance. If P95 is 5x your P50, you have a consistency problem.
  • Average cost per query — Tracks LLM spend. At scale this matters.
  • Query count — Volume over time.
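Computing P50/P95 over the logged latencies is a one-liner with a nearest-rank percentile. A sketch — in practice the aggregation can also live in SQL via PostgreSQL's percentile_cont:

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p%
    of the data at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```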

The frontend renders this as a dashboard with health checks for each dependency (API, PostgreSQL, Qdrant).

Document Scoping

With multiple documents ingested, retrieval quality drops if you search everything. Asking "Who wrote this book?" across 8 documents and 988 chunks returns irrelevant legal filings instead of the book's preface.

The fix: per-document filtering. The query API accepts a doc_name parameter, and the frontend lets you select which document to chat with. The filter is applied at the Qdrant level for dense search and in Python for BM25 sparse search.

What I Learned

Retrieval quality is everything. Generation is only as good as the chunks you feed it. I spent more time debugging retrieval than any other part.

Evaluate from day one. Auto-generating Q&A pairs on ingestion was the best decision. Without evaluation data, you're guessing whether changes help or hurt.

Formatting kills eval scores. The same correct answer can score 0% or 100% depending on how it's formatted. Concise mode + answer cleaning solved this.

Payload indexes aren't optional. Qdrant's filtered search fails with an opaque 400 without them, while unfiltered search keeps working. Always index fields you filter on.

Hybrid search is worth the complexity. Dense search alone misses keyword-dependent queries. BM25 alone misses semantic meaning. RRF fusion is cheap and effective.

Tech Stack

| Layer | Technology |
|-------|-----------|
| API | FastAPI + Uvicorn |
| Vector DB | Qdrant |
| Relational DB | PostgreSQL |
| Embeddings | BAAI/bge-small-en-v1.5 (local) or OpenAI |
| LLM | OpenAI gpt-4o-mini or Anthropic Claude |
| Retrieval | Dense + BM25 Sparse + RRF Hybrid |
| Frontend | React, TypeScript, Tailwind CSS |
| Deployment | EC2 (backend), Vercel (frontend) |

Try It

The live demo is at auditrag.aloysathekge.com. Upload a PDF, ask questions, run the evaluation, check the metrics. The source is on GitHub.