Vincent Min, B12 Consulting
Overcoming failure modes of RAG with agents
Standard RAG implementations are linear in design and hence suffer from various failure modes. Agentic RAG systems have fewer failure modes and are powerful, flexible systems that can tackle a wide set of use cases, including RAG.

Large Language Models (LLMs) have taken the world by storm. Many clients come to B12 Consulting to see how their company can benefit from LLMs. A typical use case that we arrive at is to create an LLM-based system to efficiently and effectively navigate and explore their knowledge base.
The standard implementation of such a system goes by the name of Retrieval Augmented Generation (RAG). A RAG system has a chatbot interface that allows users to ask questions; the system uses the knowledge base to craft an answer and responds to the user in natural language.
Providing the user with a chatbot on the client's knowledge base significantly reduces the need for humans to take on the laborious and monotonous task of sifting through different data sources. However, basic implementations of such a RAG system have certain failure modes that are not immediately visible to the end user. This can lead to mistakes by the chatbot and, in turn, reduced adoption of it.
To understand these failure modes, we need to take a step back and inspect how a basic RAG system is built. You'll read about the necessary ingredients in this insight.
1. Knowledge Base for Efficient Information Retrieval
A database that represents the knowledge base and can be used to efficiently retrieve relevant information based on the user question. Typically, one uses a vector database that enables “semantic search”. This is a powerful approach that goes beyond simple keyword-based search and allows us to retrieve information that is semantically similar to the user question. It is important to note that the retrieval step returns a finite set of relevant pieces of the raw documents in the database.
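As a minimal sketch of what this retrieval step could look like (assuming the sentence-transformers library; the model name, example chunks, and top-k value are purely illustrative), semantic search boils down to embedding the document pieces once and ranking them by similarity to the embedded question:

```python
# Semantic-search sketch: embed the knowledge-base chunks once, then
# return the chunks most similar to the user question.
# Assumes the sentence-transformers package; model name and chunks are illustrative.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# In practice these chunks come from splitting the knowledge-base documents.
chunks = [
    "Harry Potter was sorted into Gryffindor house.",
    "Expelliarmus is the spell used to disarm another wizard.",
    "Sirius Black is Harry Potter's godfather.",
]
chunk_embeddings = model.encode(chunks, normalize_embeddings=True)

def search(query: str, top_k: int = 5) -> list[str]:
    """Return the top_k chunks most semantically similar to the query."""
    query_embedding = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_embeddings @ query_embedding  # cosine similarity on unit vectors
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best]
```

A production setup would use a proper vector database with persistence and filtering, but the contract is the same: a query goes in, a finite list of raw document pieces comes out.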
2. LLM for Data-Driven Answers
An LLM to answer the question based on retrieved information. We can represent the system as a graph as follows:

From the graph we can immediately see that the database plays a crucial role in the RAG system. It is responsible for gathering the relevant pieces of information and presenting them to the LLM so it can effectively answer the user's question. If the database fails to retrieve a relevant piece of information, then the LLM will be unlikely to answer the question well. Significant effort must thus be put into constructing the database well.
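To make the linearity explicit, here is a minimal sketch of the whole pipeline, assuming the `search` function from the previous snippet and the OpenAI chat API (the model name and prompt wording are illustrative assumptions, not the setup of any specific project):

```python
# Linear RAG sketch: exactly one retrieval and one LLM call per question.
# Assumes the `search` function above and the OpenAI Python client;
# model name and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI()

def linear_rag(question: str) -> str:
    context = "\n\n".join(search(question, top_k=5))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer the user's question using only the context below.\n\n"
                        f"Context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```

Nothing in this flow can go back and search again: the single retrieval on the raw user question is all the context the LLM will ever see.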
What is perhaps less obvious is that even with a perfect database, there are limitations to this system. Imagine for example that the database contains a Harry Potter book. The RAG system will be able to answer questions such as “To which Hogwarts house does Harry Potter belong?”, or “What is the name of the spell to disarm another wizard?”. This is because there are likely single paragraphs in the book that contain the answer to these questions. If the database is constructed well, it will retrieve these pieces and the LLM will have all the context needed to answer the question.
However, what would happen if we ask “What animal does the godfather of Harry turn into?”? Let’s follow the flow of the graph:
- The user question comes in.
- We retrieve relevant pieces of the Harry Potter book, based on the user question. For example, as is standard, we retrieve the top 5 most similar paragraphs using a vector database. Note that these are raw paragraphs extracted directly from the book.
- The LLM is passed the user question together with the 5 most similar paragraphs and tries to answer the question.
What would the retrieved paragraphs look like? Most likely there will be a paragraph explaining who the godfather of Harry is. And maybe some paragraphs mentioning animals. But it is unlikely that we find all the information needed in these 5 paragraphs.
How could we do better at handling this question? First, it’s important to recognise that this question is a “multi-hop” question, meaning that two or more steps need to be taken to answer it. This particular question can be decomposed into two steps:
1. Who is the godfather of Harry Potter?
2. What animal could Sirius Black turn into?
Note that formulating the second question requires us to first know the answer to the first question, i.e. “Sirius Black”.
We thus need a system that can handle the execution flow of “multi-hop” questions. There is a natural architecture for such systems, and it is an “agentic” one. In its simplest form, an Agentic RAG (ARAG) can be represented by the following graph:

At the center of this graph we see an Agent node. This agent has access to a set of tools which it can use to respond to the user's question. In this graph the agent only has access to one tool, “Search”, but the setup is easily extendable to multiple tools if the use case requires it. The “Search” tool is powered by the database and, as before, it takes a query and returns a finite set of relevant paragraphs. The agent can repeatedly invoke the “Search” tool until it decides it has enough information, or that it is unable to answer the question with the current set of tools and user-provided information.
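A minimal sketch of such an agent loop, assuming the OpenAI tool-calling API and the `search` function defined earlier (the tool schema, model name, and step limit are illustrative assumptions):

```python
# Agentic RAG sketch: the LLM may call the "Search" tool as many times as it
# needs before answering. Assumes the OpenAI tool-calling API and the `search`
# function defined earlier; names, prompts and the step limit are illustrative.
import json
from openai import OpenAI

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "Search",
        "description": "Retrieve knowledge-base paragraphs relevant to a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def agentic_rag(messages: list, max_steps: int = 5) -> str:
    """Run the agent loop: keep executing tool calls until the model answers."""
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=TOOLS
        )
        message = response.choices[0].message
        if not message.tool_calls:        # the agent decided it can answer
            return message.content
        messages.append(message)          # keep the tool request in the history
        for call in message.tool_calls:   # execute each requested search
            query = json.loads(call.function.arguments)["query"]
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": "\n\n".join(search(query)),
            })
    return "I could not find enough information to answer the question."
```

The essential difference with the linear version is the loop: the model decides how many searches to run and with which queries, instead of being limited to a single retrieval on the raw user question.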
This agentic RAG setup is powerful enough to handle “multi-hop” questions. In the example above, the model will understand that it first needs to find out who is the godfather of Harry Potter before it can search for the type of animal this person can transform into.
There are several other failure cases of the original linear RAG system that are resolved with this agentic version. For example:
- In a chatbot it is common for the user to refer to previous questions that they asked. In the linear RAG setup, the retrieval is based only on the last user question, which may be insufficient. For example, consider the following set of user messages:
1. What is the name of the father of Harry Potter?
2. And the mother?
The linear RAG system will retrieve paragraphs relevant to “And the mother?”, which is missing “name” and “Harry Potter”, resulting in reduced retrieval accuracy. In contrast, the agentic RAG system can look back into the history of messages and make appropriate calls to the “Search” tool. Agentic RAG systems thus handle references to past chat messages with better performance (see the sketch after this list).
- Users may ask multiple questions in a single user message, for example “What is the name of Harry’s broom and who gave it to him?” This can be considered a kind of multi-hop question and is thus handled well by agentic RAG systems, but not by linear RAG systems. This could in principle be avoided by carefully instructing the user on how to ask questions to the chatbot. However, this is undesirable from a change management and user experience perspective. Users of the system should not need an understanding of the RAG internals and their failure modes in order to use the system. Agentic systems allow us to handle more general user inputs to the chatbot in a way that “just works”.
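To make the chat-history point concrete, here is a short example of how the full conversation could be passed to the `agentic_rag` loop sketched above (the message contents and system instruction are illustrative):

```python
# The agent sees the whole conversation, not just the last message, so it can
# reformulate "And the mother?" into a self-contained search query such as
# "name of the mother of Harry Potter" before calling the "Search" tool.
history = [
    {"role": "system",
     "content": "Answer questions about the knowledge base. "
                "Use the Search tool as often as needed."},
    {"role": "user", "content": "What is the name of the father of Harry Potter?"},
    {"role": "assistant", "content": "Harry Potter's father is James Potter."},
    {"role": "user", "content": "And the mother?"},
]
print(agentic_rag(history))
```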
Another benefit of an agentic RAG system is the flexibility to extend to new use cases by adding new tools. For example, the client may want to connect a new source of documents to the chatbot, but the chatbot should only be able to use this additional context under certain conditions. Say we add the screenplays of the Fantastic Beasts movies to our Harry Potter chatbot, and the chatbot should only use these screenplays if the question mentions something about Fantastic Beasts. With the agentic system, this is as simple as adding a “Search_Fantastic_Beasts” tool together with instructions on when and how to use it. This would also be possible with a linear RAG system, but as the complexity of the system grows, the agentic RAG will scale better.
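With the tool-calling sketch from above, such an extension could look as follows (the tool name, description, and usage condition are illustrative assumptions; the agent loop would additionally need to dispatch on the tool name to the matching search backend):

```python
# Extending the agent with a second, conditional search tool.
# The description doubles as the instruction on when the tool may be used.
TOOLS.append({
    "type": "function",
    "function": {
        "name": "Search_Fantastic_Beasts",
        "description": (
            "Retrieve paragraphs from the Fantastic Beasts screenplays. "
            "Only use this tool if the question explicitly mentions "
            "Fantastic Beasts."
        ),
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
})
```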
At B12 Consulting, we have in-depth workshops with the end users of our systems to determine their current workflow and how to assist them with AI tools. This can lead to very powerful agentic architectures with complex flows. Consider for example the following extension of the Harry Potter chatbot:

This version has checks and fallbacks in place to prevent abuse of the system or undesired output from the system, as well as the ability to cite the sources that it used to answer the question. Designing such use-case-specific agentic flows is the natural evolution of the field of building LLM systems. We leave a more in-depth analysis of this flow engineering for a future blog post.

Vincent Min