Are We Overcomplicating RAG Architectures?
Vector databases have become the default for RAG — but lately I've been finding that a simpler, vectorless PageIndex often performs just as well, at a fraction of the complexity.
I spent about three weeks last quarter building what should have been a straightforward RAG pipeline. Embeddings, a vector DB, a reranker, a little evaluation harness — the whole cookbook. It shipped. Then someone on the support team asked why searching for the error code AUTH_EXPIRED_TOKEN was returning every auth article except the one that literally contained the string three times. That was the week I started taking vectorless RAG seriously.
What I mean by vectorless
Not "never use vectors" — more "vectors are one tool, not the whole toolbox." A vectorless PageIndex treats documents as discrete nodes in a standard data store: a relational DB, a document DB, or an in-memory index if the corpus fits. The retrieval layer becomes ordinary code issuing ordinary queries, and the LLM steps in only where it's actually helping.
Here are three patterns I've been mixing in lately, none of which require a single embedding.
1. Lexical search, the one we all forgot about
This is the one that sold me. Cosine similarity is wonderful for conversational queries and absolutely lost when a user types a product SKU, an acronym, or a user ID. In the AUTH_EXPIRED_TOKEN case, the embedding model had decided the page about general auth flows was "more similar" to the query than the page literally titled with the error. BM25 doesn't have opinions. It finds the string.
We added Postgres full-text search in about a day — no new infrastructure, no new SDK — and that entire class of bug just disappeared. If your corpus has anything that looks like an identifier, you want this in the loop.
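To make the contrast concrete, here's a toy BM25 index in pure Python. This is a sketch of the scoring idea, not what we shipped — the production path was Postgres's built-in full-text search (`to_tsvector` / `ts_rank`) — and the tokenizer here deliberately keeps underscored identifiers like AUTH_EXPIRED_TOKEN as single tokens:

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Split on anything that isn't alphanumeric or underscore, so
    # identifiers like AUTH_EXPIRED_TOKEN survive as one token.
    return re.findall(r"[A-Za-z0-9_]+", text.lower())

class BM25Index:
    """Minimal in-memory BM25 with the usual default parameters."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_len = [len(t) for t in self.doc_tokens]
        self.avg_len = sum(self.doc_len) / len(docs)
        self.tf = [Counter(t) for t in self.doc_tokens]  # per-doc term counts
        self.df = Counter()                              # doc frequency per term
        for toks in self.doc_tokens:
            self.df.update(set(toks))
        self.n = len(docs)

    def score(self, query, i):
        s = 0.0
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f == 0:
                continue  # BM25 only rewards documents containing the term
            idf = math.log(1 + (self.n - self.df[term] + 0.5) / (self.df[term] + 0.5))
            denom = f + self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avg_len)
            s += idf * f * (self.k1 + 1) / denom
        return s

    def search(self, query, top_k=3):
        ranked = sorted(range(self.n), key=lambda i: self.score(query, i), reverse=True)
        return [(i, self.score(query, i)) for i in ranked[:top_k]]
```

The point of the sketch: a document that never mentions the exact token scores zero, no matter how "semantically close" it is. That's exactly the behavior the embedding model couldn't give us.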
2. Let the LLM write a WHERE clause
Most real document collections come with metadata you're not fully using: product line, version, author, region, date. On a recent project we tagged every page with a handful of those and gave the LLM a tiny function that let it filter before searching. Its first job wasn't to answer the question; it was to narrow the haystack. Most queries came down to four or five candidate pages before any retrieval happened at all.
Side benefit: the retrieval is now easy to explain. When someone asks "why did the bot surface that page," you can literally show them the filter.
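The "tiny function" we exposed was roughly the shape below. This is a sketch, and the metadata fields (product, version, region) are placeholders — use whatever your corpus actually has. The LLM produces the keyword arguments; the code does the filtering:

```python
def filter_pages(pages, **criteria):
    """Return pages whose metadata matches every criterion.

    A criterion value can be a scalar (exact match) or a list (any-of).
    Each page is assumed to be a dict with a "meta" dict inside it —
    an assumption of this sketch, not a required schema.
    """
    def matches(meta):
        for key, want in criteria.items():
            have = meta.get(key)
            if isinstance(want, list):
                if have not in want:
                    return False
            elif have != want:
                return False
        return True

    return [p for p in pages if matches(p["meta"])]

# The model calls something like:
#   filter_pages(pages, product="billing", version=["2.1", "2.2"])
# and only the survivors go on to text search.
```

Because the model only chooses filter values, not arbitrary SQL, there's nothing to sanitize and nothing surprising to debug — the worst a bad call can do is return an empty list.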
3. Summaries as a routing layer
For a mid-sized corpus — say, a few hundred documents — keep a ~200-token summary per document, version-controlled alongside the content. At query time, the LLM reads all of the summaries in one call, picks the one most likely to contain the answer, and only then do you fetch the full page.
It sounds expensive. It isn't, really. One larger call with summaries is cheaper — and much more accurate in practice — than a dozen small calls against vector matches that turned out to be adjacent but wrong.
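The routing step is mostly prompt plumbing. A minimal sketch, assuming a `call_llm` stand-in for whatever client you use, and a bracketed-id convention that's my choice here, not a fixed format:

```python
def build_routing_prompt(question, summaries):
    """summaries: dict of doc_id -> ~200-token summary."""
    catalog = "\n".join(f"[{doc_id}] {text}" for doc_id, text in summaries.items())
    return (
        "Pick the document most likely to answer the question. "
        "Reply with the bracketed id only.\n\n"
        f"{catalog}\n\nQuestion: {question}\nAnswer:"
    )

def parse_choice(reply, summaries):
    """Tolerate replies like 'doc_3' or '[doc_3] because ...'.

    Longest ids are checked first so one id being a prefix of
    another can't cause a false match.
    """
    for doc_id in sorted(summaries, key=len, reverse=True):
        if doc_id in reply:
            return doc_id
    return None  # fall back to lexical search, or re-ask

def route(question, summaries, call_llm):
    reply = call_llm(build_routing_prompt(question, summaries))
    return parse_choice(reply, summaries)
```

Returning `None` instead of guessing matters: a routing miss should degrade to a broader search, not to a confidently wrong page.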
Why I now reach for this first
- No embedding calls on the hot path. Latency and cost both drop noticeably, especially at scale.
- Nothing new to operate. No vector store to provision, back up, or migrate when the embedding model changes.
- Updates are trivial. Content change? It's a row update. No re-indexing, no "which version of text-embedding-X produced this vector."
Where I still reach for vectors
I don't want to oversell this. Our consumer-facing search still uses embeddings, because users ask things like "how do I stop the thing from doing the thing" and no amount of BM25 is going to help with that. Vectors are the right tool for synonym-heavy, conversational, fuzzy-intent queries. They're just not the right tool for every query.
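The hybrid doesn't need to be clever. A coarse router that sends identifier-looking queries to lexical search and everything else to embeddings covers a lot of ground. The patterns below are heuristics tuned to one corpus, not universal rules — tune yours against your own query logs:

```python
import re

# Matches things users paste verbatim and expect exact hits for.
IDENTIFIER = re.compile(
    r"\b(?:[A-Z0-9]+(?:_[A-Z0-9]+)+"   # SCREAMING_SNAKE error codes
    r"|[A-Za-z]+-\d+"                  # ticket-style ids like AUTH-1234
    r"|[0-9a-f]{8,})\b"                # long hex ids / hashes
)

def route_query(query):
    """Crude query-type router: exact-match tokens go to BM25."""
    return "lexical" if IDENTIFIER.search(query) else "semantic"
```

"How do I stop the thing from doing the thing" routes to embeddings; "AUTH_EXPIRED_TOKEN after login" routes to BM25. Even a router this dumb beats forcing one retrieval mode to handle both.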
Reach for the simplest retrieval that answers the question. Upgrade to vectors when the question actually needs them.
The quiet lesson for me has been that RAG is a retrieval problem first and an AI problem second. Most of the wins I've seen lately came from treating it that way — classic search, honest metadata, a touch of LLM routing — and only pulling out the vector DB when the query truly calls for it.
Curious what other teams are landing on. Has anyone gone fully vectorless in production, or settled on a hybrid that routes between BM25 and embeddings by query type? Would love to hear what's actually working.