RAG: the illusion of AI memory — Gerald Aichholzer


Language models like ChatGPT or Claude have a property that's easy to overlook: they work with a limited context window. Everything discussed in a chat, every document, every instruction, has to fit inside that window. For most models that's a few hundred thousand tokens today, and several million for the largest ones. Sounds like a lot, but it isn't, once you think about the accumulated knowledge of a company: contracts, minutes, manuals, org charts, emails. That quickly adds up to millions of pages. Simply dumping all of it into the chat doesn't work, and even if it were technically possible, it would be unaffordable, because every token costs compute time.
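To get a feel for the proportions, here is a rough back-of-the-envelope sketch. All three numbers are assumptions chosen for illustration (an average of 500 tokens per page, a 200,000-token window, a pool of one million pages); they are not measurements from our company or from any particular model.

```python
# Back-of-the-envelope sketch. Every constant here is an illustrative assumption.
TOKENS_PER_PAGE = 500         # assumed average for a page of business text
CONTEXT_WINDOW = 200_000      # assumed window of "a few hundred thousand" tokens
PAGES_IN_COMPANY = 1_000_000  # assumed size of the company's document pool


def pages_that_fit(window_tokens: int, tokens_per_page: int) -> int:
    """Number of whole pages that fit into the context window."""
    return window_tokens // tokens_per_page


def share_visible(window_tokens: int, tokens_per_page: int, total_pages: int) -> float:
    """Fraction of the document pool the model can see at once."""
    return min(1.0, pages_that_fit(window_tokens, tokens_per_page) / total_pages)


fit = pages_that_fit(CONTEXT_WINDOW, TOKENS_PER_PAGE)
share = share_visible(CONTEXT_WINDOW, TOKENS_PER_PAGE, PAGES_IN_COMPANY)
print(f"{fit} pages fit into the window, {share:.2%} of the pool")
```

Under these assumptions the window holds a few hundred pages. Change the constants to match your own situation; the order of magnitude of the gap is the point, not the exact figures.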

Retrieval-augmented generation, RAG for short, was seen as the elegant solution to exactly this problem. Instead of handing the model everything at once, an upstream system pulls out the supposedly relevant documents for each question and feeds only those into the context. Plenty of companies celebrated it.

I wanted to know for sure, so I set up a concrete test with my team: we built a RAG prototype with LangChain and turned it loose on our internal company documents. The test case was deliberately demanding: we asked about organizational structures that had changed over time, the kind of questions where the system has to keep old and new responsibilities apart. The result was sobering. The system confused departments, mixed outdated structures with current ones, and invented responsibilities that never existed. Exactly the kind of mistake that must not happen in a company.

Partial retrieval instead of understanding

What had happened? RAG works like this: it breaks documents into fragments, so-called chunks, and for each question it pulls out the pieces that fit best semantically. This handful of fragments is passed to the language model as context. Anything that doesn't land in that excerpt doesn't exist for the model. No second look, no “wait, wasn't there something else?”
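The mechanism described above can be sketched in a few lines. This is a minimal toy version, not our LangChain prototype: it splits by sentence instead of by token windows, and it uses word-count vectors with cosine similarity as a stand-in for the embedding model a real system would use. The sample document and all function names are made up for illustration.

```python
import math
import re
from collections import Counter


def chunk(text: str) -> list[str]:
    """Split a document into sentence-sized chunks (real pipelines
    usually use fixed-size token windows, often with overlap)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def vectorize(text: str) -> Counter:
    """Word-count vector; a stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = vectorize(question)
    return sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)[:k]


# Invented sample document, for illustration only.
document = (
    "Travel expenses are approved by the finance department. "
    "The cafeteria opens at eleven and closes at two. "
    "Vacation requests go to the team lead for approval. "
    "Laptops are ordered through the IT service desk."
)
question = "Who handles approval of travel expenses?"

context = retrieve(question, chunk(document), k=2)
# Only these k chunks reach the model; everything else is invisible to it.
prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
print(prompt)
```

Whatever `k` is set to, the prompt contains exactly that many fragments. The sentences about the cafeteria and the laptops never reach the model, which is harmless here and a real problem when the omitted chunk is the one that qualifies or corrects the others.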

You can picture it like a Google search: you get the five most relevant hits, but only five. Not the whole body of knowledge, not the connections between the documents, just isolated pieces. That's why I prefer the more honest name: partial retrieval. It delivers an excerpt, not an understanding.

Infographic: How RAG works. Out of all the company documents, only a few fragments make it into the AI model; the rest is ignored.

In our test with the organizational structures, that was precisely the problem: the system jumbled together chunks from documents of different years without recognizing that an org chart from 2022 can't be combined with one from 2024. The overall context was missing, and without it every answer becomes guesswork.
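The effect is easy to reproduce in miniature. The sketch below uses invented org-chart snippets and plain word overlap (Jaccard similarity) as a stand-in for embedding similarity; it is not the setup from our test. It assumes each chunk carries its year as metadata, and shows that a ranking by similarity alone never consults it.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    year: int  # metadata that the similarity score below never looks at
    text: str


def tokens(text: str) -> set[str]:
    """Lowercased word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(question: str, chunk: Chunk) -> float:
    """Jaccard word overlap; a stand-in for embedding similarity."""
    q, c = tokens(question), tokens(chunk.text)
    return len(q & c) / len(q | c) if q | c else 0.0


def top_k(question: str, chunks: list[Chunk], k: int = 2) -> list[Chunk]:
    """Rank chunks by similarity to the question only."""
    return sorted(chunks, key=lambda c: overlap(question, c), reverse=True)[:k]


# Invented snippets, for illustration only.
chunks = [
    Chunk(2022, "The data team reports to the head of IT."),
    Chunk(2024, "The data team reports to the head of finance."),
    Chunk(2024, "The cafeteria menu changes every Monday."),
]
question = "Who does the data team report to?"

hits = top_k(question, chunks)
years = {c.year for c in hits}
for hit in hits:
    print(hit.year, hit.text)
```

Both org-chart snippets score the same against the question, so the outdated one and the current one land in the context side by side. Nothing in the ranking tells the model that they contradict each other or which of them still applies.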

Google AI Overviews: partial retrieval in action

Anyone who thinks this is some obscure technical issue only needs to watch Google's AI Overviews now and then. These AI summaries above the search results work on a similar principle. And every so often, exactly this happens: statements from three different websites get blended into one answer that sounds coherent but isn't factually correct. Each individual source was correct on its own. Mixed together, they turn into something false. One documented example: for the made-up idiom “you can't lick a badger twice,” Google's AI happily supplied a meaning, assembled from plausible-sounding fragments.

With Google that's annoying. In a company, with compliance questions or strategic decisions, it can get genuinely expensive.

Why real understanding needs the full context

When I get to grips with a topic, I don't read five random paragraphs. I read the whole document. I recognize connections, I weigh contradictions. RAG does the opposite: it grabs fragments by semantic proximity and has no idea what sits to the left and right of them. That so few people raise this point bothers me.

Conclusion

RAG isn't a bad tool, as long as you understand what it is: a search engine with an AI front end, not an LLM that holds the company's entire knowledge. Use RAG for simple lookup questions and for clearly defined sets of data, and you'll get usable results. But anyone who believes they have an AI memory that understands every connection inside the company will sooner or later run into exactly the problems we saw in our test. Whether larger context windows or new knowledge architectures change that remains to be seen. Until then, we should call partial retrieval by its name and stop mistaking it for more.