RAG without the theater
A shiny “chat with your PDF” demo and something people trust on a Tuesday are different animals. The gap is usually not “we picked the wrong model brand.”
TL;DR
- Most of the time, bad answers are bad retrieval or bad chunks, not “the LLM is dumb.”
- People will live with a rough answer if they can see where it came from. They will not live with made-up confidence.
- If you cannot show that a change helped or hurt something measurable, you are guessing.
The demo is lying to you, nicely
Demos love the first question. Short doc, safe question, answer that quotes a paragraph the room already read. Fine as a proof of concept. The real job is ugly PDFs, tables that split weird, the same acronym meaning two things in two chapters, and questions people ask like they talk, not like they write release notes.
When things go wrong there, I rarely think “we need a smarter model.” I think the evidence never got in the prompt, or it got cut in a dumb place. So I spend more energy on whether the pipeline is something I can inspect, swap out, and test, same as I would for anything else in the system.
Chunking is where you design the interface
People file chunking under “data prep.” To me it is the moment you decide what counts as one idea for your user. Fixed size windows are easy to code and awful in the wild: you cut tables in half, you separate bullets from the heading that names them. Smarter splitting costs time up front, but you get answers that do not fall apart when the same point shows up in different words later in the pile.
I ask: what is the smallest blob that still makes sense if it shows up alone in the prompt? If it needs context from its neighbors to mean anything, you are asking the model to fill in blanks. That is where the confident wrong stuff starts.
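As a sketch of that “smallest self-contained blob” idea, here is structure-aware splitting that keeps bullets attached to the heading that names them. The function name, the markdown-heading heuristic, and the size cap are all assumptions for illustration, not a real library:

```python
import re

def chunk_by_heading(text: str, max_chars: int = 1200) -> list[dict]:
    """Split on headings so bullets and body text stay with the heading
    that gives them meaning. Hypothetical sketch: real documents also
    need table- and list-aware splitting."""
    chunks: list[dict] = []
    current_heading = ""
    buf: list[str] = []

    def flush():
        body = "\n".join(buf).strip()
        if body:
            chunks.append({"heading": current_heading, "text": body})

    for line in text.splitlines():
        if re.match(r"^#{1,6}\s", line):  # a heading starts a new chunk
            flush()
            buf = []
            current_heading = line.lstrip("# ").strip()
        buf.append(line)
        if sum(len(l) for l in buf) > max_chars:
            # oversized section: split it, but carry the heading forward
            # so the next piece still makes sense on its own
            flush()
            buf = [f"# {current_heading}"] if current_heading else []
    flush()
    return chunks
```

The point is the invariant, not the regex: every chunk should answer “what is this about?” without its neighbors.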
Search is the product
Hybrid retrieval is not a flex. It is admitting embeddings and keywords miss different things. Dense search helps when people paraphrase. Lexical still wins when someone pastes a policy number, a file name, or an error string. I want to tune and measure those paths on their own, then merge scores without pretending either one is magic.
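One way to merge the two paths without pretending their raw scores are comparable is reciprocal rank fusion, which scores by rank position only. A minimal version, assuming each retriever hands back an ordered list of chunk ids:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked id lists from dense and lexical retrievers.

    RRF ignores raw scores entirely, so neither path has to be
    calibrated against the other. k=60 is the conventional damping
    constant; larger k flattens the advantage of top ranks.
    """
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Because each list is scored independently, you can still tune and measure dense and lexical retrieval on their own before fusing.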
Always pulling the same number of chunks no matter the question is not a strategy. It is a default. Under load you need timeouts, fallbacks, and empty states someone actually designed, not something you discover the morning after launch.
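The timeout-and-fallback shape can be sketched like this. The retriever names and the response dict are hypothetical; the point is that the empty state is an explicit, designed outcome rather than a silent zero-result answer:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def retrieve_with_fallback(query, dense_search, lexical_search,
                           timeout_s: float = 0.8) -> dict:
    """Run dense search under a deadline; fall back to lexical search,
    then to an explicit empty state someone actually designed.
    (Hypothetical callables; wire in your own retrievers. Note that a
    timed-out dense call may still run to completion in the background.)"""
    with ThreadPoolExecutor(max_workers=1) as pool:
        try:
            hits = pool.submit(dense_search, query).result(timeout=timeout_s)
            if hits:
                return {"status": "ok", "source": "dense", "hits": hits}
        except FutureTimeout:
            pass  # dense path missed the deadline; try the cheaper path
    hits = lexical_search(query)
    if hits:
        return {"status": "ok", "source": "lexical", "hits": hits}
    return {"status": "empty", "source": None, "hits": [],
            "message": "No passages matched; try a document name or an exact phrase."}
```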
Citations are trust
If your answer might get forwarded to legal or finance, “trust me” is not enough. If the user cannot see why the model said what it said, you did not build knowledge access. You built a slot machine with footnotes. Clickable sources, page hints, and a clear line between “this came from the file” and “the model inferred this” change whether anyone adopts the thing.
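A sketch of that “came from the file” vs “model inferred this” line, assuming each claim in the answer carries an optional chunk id (the data shapes here are hypothetical):

```python
def render_answer(claims: list[dict], chunks: dict[str, dict]) -> str:
    """Render an answer where every claim either points at a chunk
    (file, page) or is explicitly flagged as model inference.

    claims: [{"text": str, "chunk_id": str | None}]
    chunks: {chunk_id: {"file": str, "page": int}}  -- assumed shapes
    """
    lines = []
    for claim in claims:
        cid = claim.get("chunk_id")
        if cid and cid in chunks:
            src = chunks[cid]
            lines.append(f'{claim["text"]} [{src["file"]}, p. {src["page"]}]')
        else:
            # no supporting chunk: say so instead of faking a citation
            lines.append(f'{claim["text"]} [inferred by the model]')
    return "\n".join(lines)
```

The rendering is trivial; the discipline is upstream, in making the generator emit claims that carry their evidence ids at all.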
That is also product taste. A consumer chatbot can waffle. Something sitting on top of contracts or runbooks cannot. The UI and the guardrails should match how high the stakes are.
Measurement beats vibes
Swapping models without a small set of real questions you care about is a coin flip with a press release. I care less about leaderboard scores than about a boring list of scenarios, maybe with gold answers or a simple rubric, so you notice when something regresses. Spreadsheets and human spot checks count. So do cheap automated checks when the answer is unambiguous.
I am not asking for academic purity. I want someone to be able to say “we changed X” and point at retrieval hits, citation quality, latency, something concrete, not “it feels smarter.”
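That boring list can be a boring script. A minimal regression check under assumed shapes: `pipeline(question)` returns an answer plus the chunk ids it retrieved, and each scenario optionally names expected chunks and a gold answer:

```python
def run_eval(pipeline, scenarios: list[dict]) -> dict:
    """Cheap regression check: replay saved questions, count retrieval
    hits and unambiguous exact matches, report numbers you can diff
    across changes. Assumed contract:
        pipeline(question) -> {"answer": str, "chunk_ids": list[str]}
    """
    results = {"n": len(scenarios), "retrieval_hit": 0, "exact_match": 0}
    for case in scenarios:
        out = pipeline(case["question"])
        # did retrieval surface at least one chunk we know is relevant?
        if set(case.get("expected_chunks", [])) & set(out["chunk_ids"]):
            results["retrieval_hit"] += 1
        # only score exact matches where a gold answer exists
        gold = case.get("gold_answer")
        if gold is not None and gold.strip().lower() in out["answer"].strip().lower():
            results["exact_match"] += 1
    return results
```

Run it before and after “we changed X,” and you have something concrete to point at.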
Swappable pieces, clear edges
I want to change embeddings, models, or the vector store without rewriting the whole app. That means boring contracts: ingestion hands off chunks with stable ids, retrieval returns ranked evidence with metadata, generation consumes that under rules you wrote down. When those edges exist, you can upgrade one layer without the orchestration folder turning into a nest of special case hacks.
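One way to write those boring contracts down is structural typing, so any implementation with the right shape plugs in. The method names and evidence dict here are assumptions, not a real library:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Retriever(Protocol):
    """Contract: ranked evidence with stable ids and metadata attached."""
    def search(self, query: str, k: int) -> list[dict]: ...

@runtime_checkable
class Generator(Protocol):
    """Contract: consumes retrieved evidence under rules written down elsewhere."""
    def answer(self, query: str, evidence: list[dict]) -> str: ...
```

Any object with a matching `search` satisfies `Retriever` structurally, so swapping the vector store or embedding model never touches the callers.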
For me, “without the theater” means the story you tell in the demo is the same story you can maintain on a random Monday: messy files, a question nobody rehearsed, no sparkle emoji saving you.