Artificial Intelligence
What is Retrieval Augmented Generation (RAG) and why does it matter?
Part 1: How does a Large Language Model work?
You may have seen a lot of online posts about RAGs in the past year or two. This is not a coincidence. RAGs are one of the key conceptual advancements in AI that make large language models useful in practice.
But what a RAG is and how it works remain poorly understood. In this three-part series, I will answer these questions in a way that offers both depth and intuition.
But before we get to RAGs, let’s take a step back and look at what makes them possible in the first place: large language models (LLMs).

Large language models (LLMs) are helpful because they model language structure well and understand context.
LLMs such as GPT-4 (and, as of yesterday, GPT-5), Gemini, Claude, etc., the so-called “foundation models”, have been trained on almost the entire internet’s worth of textual and image content (you read that right: we have pretty much exhausted the internet as a source of training data for LLMs), consuming gigawatt-hours of power and billions of euros in development costs in the process.
The core idea behind LLMs rests on two concepts:
First, LLMs are (extremely efficient) compression algorithms. The idea here is that if we wanted to compress an enormous amount of text, we would likely discover that a very efficient way to compress information is to learn the underlying rules that define the text's structure, in addition to a set of core building blocks.
In other words, LLMs are highly efficient at compressing information because they “learned to understand” grammar, vocabulary and other structural features of languages.
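
If you want to make the compression analogy concrete, here is a tiny back-of-the-envelope sketch (the probabilities are invented for illustration, not taken from a real model): the better a model predicts the next word, the fewer bits it needs to encode the text, and learning grammar and vocabulary is exactly what makes those predictions good.

```python
import math

# Toy illustration of "prediction = compression".
# The probabilities below are invented for illustration; a real LLM would
# produce them from its learned parameters.

sentence = ["the", "cat", "sat", "on", "the", "mat"]
vocab_size = 50_000

# A model that knows nothing about language treats every word as equally likely,
# so each word costs log2(vocab_size) bits to encode.
bits_uniform = len(sentence) * math.log2(vocab_size)

# A model that has learned grammar and vocabulary assigns high probability to the
# word that actually comes next; the optimal code length for a word is -log2(p) bits.
predicted_probs = [0.05, 0.30, 0.40, 0.60, 0.70, 0.80]
bits_learned = sum(-math.log2(p) for p in predicted_probs)

print(f"Uniform model: {bits_uniform:.0f} bits")   # ~94 bits
print(f"Learned model: {bits_learned:.0f} bits")   # ~9 bits
```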
Second, a significant challenge with modelling language is that language is contextual, and that context can be spread throughout the text. Capturing this context with AI is something we struggled with until the Attention Mechanism came into play.
Let’s say that I give you two sentences:
- I am going to the bank to pick up some cash,
- I am sitting on the river bank and watching the sunset.
Both sentences use the word “bank”, but its meaning differs depending on the sentence’s context. In the first sentence, the meaning of “bank” is determined by the context word “cash”, which appears at the end of the sentence.
In the second, it is determined by the context word “river”, which appears immediately before “bank”.
Before the attention mechanism, AI language models could efficiently capture only the context that was in the “neighbourhood” of the word itself, which was hardly sufficient to model the full complexity of how we construct meaning from language and text.
The Attention Mechanism was the technological breakthrough that allowed AI models to capture long-distance correlations between words and hence efficiently “understand” context.
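
For the technically curious, here is a minimal sketch of the core computation, scaled dot-product self-attention, in plain NumPy. The vectors are random stand-ins for learned word embeddings, and a real transformer would also use separate learned query/key/value projections and multiple heads; the point is simply that every word’s new representation is a weighted mix of all words in the sentence, so “bank” can draw on “cash” or “river” no matter how far apart they are.

```python
import numpy as np

def self_attention(X):
    """Minimal single-head self-attention: each word's vector becomes a
    weighted mix of every word in the sentence, weighted by relevance."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # relevance of every word to every other word
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ X                               # context-aware representation of each word

# Toy example: 5 "words", each represented by a 4-dimensional vector. In a real
# transformer these come from learned embeddings and query/key/value projections.
rng = np.random.default_rng(0)
words = rng.normal(size=(5, 4))
contextualised = self_attention(words)
print(contextualised.shape)   # (5, 4): same shape, but each row now "sees" the whole sentence
```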
As it turns out, having the ability to express oneself (i.e., language structure) and to understand the context of the text is sufficient to accomplish many previously elusive intellectual tasks. This is precisely what LLMs do.
Large language models are prohibitively expensive to modify and static in time.
LLMs are colossal achievements of humankind, but their colossal scale is also their main limitation:
- Only a handful of entities in this world can afford to train such models. The cost and other resource requirements are just prohibitive. For example, the development costs of state-of-the-art commercial large language models have already reached billions of euros per model.
- Large language models can make things up. We call these “hallucinations”. They are a structural feature of probabilistic models and the way we train them: if an LLM doesn’t know something, it will likely generate an answer based on what it does know. While the rate of hallucinations in modern LLMs is relatively low (about 1-2%), it was historically one of the prominent motivations for introducing RAG systems.
- Large language models represent a snapshot in time of their training data. They compress information and knowledge that was publicly available at the time, but updating them to take into account news, updates to stock indexes, or any other rapidly evolving data would be prohibitively expensive for more than 99% of us.
When you use ChatGPT, you can treat it as you would a search engine. Admit it, you have probably typed “What do you know about me?” into ChatGPT before. You may also have noticed the disclaimers that come with LLMs, saying things like “2022 knowledge cutoff” or something similar. This happens because the training data of LLMs represents a snapshot of the internet from the time when companies such as Anthropic, OpenAI, and Google, legally or illegally, scraped it.
However, most business applications of LLMs don’t care about using AI for searching a stale version of the web. They care about what AI can do with their own, fast-changing, and up-to-date data.
This is where RAGs come in. But before we dig into how RAGs operate, it’s essential to understand a few key points about why LLMs make RAGs possible.
We can derive new functionality from LLMs not by modifying the AI model, but by changing the inputs.
You can ask any LLM: “Write me an offer for building a 2-story house,” and you’ll likely be underwhelmed with the result. The reason is that the LLM doesn’t have all the relevant context about your client, your business, the past conversations you’ve had about the project, etc.
In other words, when you say “Write me an offer for building a 2-story house,” what you really mean to say is “Write me an offer for building a 2-story house, given the following relevant information about the client, the project, etc.”
The question then is, how would you provide the AI model with the relevant information and other context?
In the past, the only way to “inject” new information into an AI model was to fine-tune it, typically using one of the existing transfer learning approaches. Transfer learning is a process by which you take an AI model that has been trained for one purpose and fine-tune it for another in a computationally efficient manner.
For example, let’s say we trained a model to recognize different types of objects, such as trees, cats, buildings, and cars. I could take this model and fine-tune it to identify various kinds of furniture by modifying only a part of the original model, rather than training it from scratch.
This is a very computationally efficient way to derive new functionalities from existing models and has been the cornerstone of AI developments for the past decade.
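
As a rough sketch of what that looks like in practice, here is one possible approach using PyTorch and a pretrained image model; the furniture class count and hyperparameters are assumptions made purely for illustration.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Sketch of transfer learning: start from a model pretrained to recognise general
# objects (ImageNet) and fine-tune only its final layer for a new task.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the original parameters: we keep what the model already "knows".
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head with a fresh one for, say, 10 furniture categories.
num_furniture_classes = 10   # illustrative assumption
model.fc = nn.Linear(model.fc.in_features, num_furniture_classes)

# Only the new head is trained, which is far cheaper than training from scratch.
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)
# ...a standard training loop over the furniture dataset would go here...
```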
This, of course, may not work well for LLMs, as fine-tuning even a part of an LLM can be impossible due to licensing issues or very resource-intensive. However, what is fundamentally different about LLMs is that to derive new functionality from an LLM, we don’t necessarily need to modify the model; we only need to change the inputs.
Let’s go back to our example about writing an offer for building a house. To enable ChatGPT or Gemini to understand your context and purpose, you can attach relevant documents and provide specific instructions in the LLM prompt.
You may even want to give a few examples of input data and desired outputs (this is called few-shot learning). Odds are that if you do that, the quality of the answer will be much, much better.
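
Here is a hypothetical sketch of how such a prompt might be assembled programmatically. The helper function, the document snippets, and the `call_llm` placeholder are all made up for illustration; they don’t correspond to any particular vendor’s API.

```python
# Hypothetical helper for packing context and few-shot examples into a single prompt.
# The documents, examples, and `call_llm` below are placeholders, not a real API.

def build_prompt(task, context_documents, examples):
    parts = ["You are drafting a construction offer on behalf of our company.", ""]
    parts.append("Relevant context:")
    for doc in context_documents:
        parts.append(f"- {doc}")
    parts.append("")
    parts.append("Examples of past requests and the offers we sent (few-shot examples):")
    for request, offer in examples:
        parts.append(f"Request: {request}")
        parts.append(f"Offer: {offer}")
    parts.append("")
    parts.append(f"Task: {task}")
    return "\n".join(parts)

prompt = build_prompt(
    task="Write me an offer for building a 2-story house.",
    context_documents=[
        "Client: family of four, budget of roughly EUR 350,000, plot of 600 m2.",
        "Our standard rates and delivery timelines from the internal pricing sheet.",
    ],
    examples=[
        ("Offer for a garage extension", "Dear client, based on your requirements ..."),
    ],
)

# response = call_llm(prompt)   # hypothetical call to whichever LLM API you use
print(prompt)
```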
What you have done, in effect, is repurpose an LLM, which was trained to understand the general structure of language and the context of a query, to do something particular: write an offer based on your own unique data and information.
Mihailo Backovic