Experiments and results
We evaluated agentic RAG on FramesQA, which is based on the FRAMES paper. An example multi-hop question is:
“Of the top two most watched television season finales (as of June 2024), which finale ran the longest in length and by how much?”
The RAG system needs to perform multiple steps to arrive at the correct answer. First, it has to identify that the two most watched finales are from the shows M*A*S*H and Cheers. Then, it has to find their running times and calculate the difference. In many RAG settings (vanilla RAG or agentic RAG without sufficient context), the model may instead respond with something like:
“Despite multiple scans, I found no explicit runtimes for M*A*S*H or Cheers. The documents provide viewership data, but not the duration in minutes or hours.”
This does not answer the question.
Fortunately, our agentic RAG can solve this by first searching for the TV shows, then using the Query Rewriter and Sufficient Context Agent to run a targeted search for the runtimes of M*A*S*H and Cheers. Then, Gemini can easily determine which finale ran longer and by how much:
“The M*A*S*H finale ran for 150 minutes, making it the longest of the top two. It was 52 minutes longer than the Cheers finale, which ran for approximately 98 minutes.”
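The retrieve, check, and rewrite loop described above can be sketched roughly as follows. This is a minimal illustration, not our actual implementation: the function names (agentic_answer, toy_search, and so on) are hypothetical, and the toy stand-ins replace the real retriever, Sufficient Context Agent, Query Rewriter, and Gemini so that the example runs without any model or index.

```python
import re
from typing import Callable, List


def agentic_answer(
    question: str,
    search: Callable[[str], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    rewrite_query: Callable[[str, List[str]], str],
    generate: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> str:
    """Retrieve, check whether the context suffices, rewrite the query, repeat."""
    context: List[str] = []
    query = question
    for _ in range(max_rounds):
        context.extend(search(query))
        if is_sufficient(question, context):
            break
        # Context is still missing something: ask for a more targeted query.
        query = rewrite_query(question, context)
    return generate(question, context)


# --- Toy stand-ins (all hypothetical) ---------------------------------------
def toy_search(query: str) -> List[str]:
    if "runtime" in query.lower():
        return ["M*A*S*H finale: 150 minutes.", "Cheers finale: 98 minutes."]
    return ["Top two finales by viewers: M*A*S*H, Cheers."]


def toy_is_sufficient(question: str, context: List[str]) -> bool:
    # Stand-in for the Sufficient Context Agent: do we have any durations yet?
    return any("minutes" in passage for passage in context)


def toy_rewrite(question: str, context: List[str]) -> str:
    # Stand-in for the Query Rewriter.
    return "finale runtimes for M*A*S*H and Cheers"


def toy_generate(question: str, context: List[str]) -> str:
    # Stand-in for the answer model: extract runtimes and compare them.
    runtimes = {}
    for passage in context:
        match = re.match(r"(.+) finale: (\d+) minutes", passage)
        if match:
            runtimes[match.group(1)] = int(match.group(2))
    if len(runtimes) < 2:
        return "Insufficient context."
    longer, shorter = sorted(runtimes, key=runtimes.get, reverse=True)[:2]
    gap = runtimes[longer] - runtimes[shorter]
    return f"{longer} ran {gap} minutes longer than {shorter}."


answer = agentic_answer(
    "Of the top two most watched television season finales, "
    "which finale ran the longest and by how much?",
    toy_search, toy_is_sufficient, toy_rewrite, toy_generate,
)
print(answer)
```

In this toy run, the first search returns only viewership data, the sufficiency check fails, and the rewritten query retrieves the runtimes on the second round.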
We ran an experiment to test this ability at scale (FramesQA has 824 queries along with a corpus containing 2,676 PDF documents). In the “Vanilla” RAG setting, we used Google’s RAG Engine (which has an advanced retrieval engine, LLM parser, and re-ranker). We compared this with our agentic RAG in two settings. In the single-corpus setting, the system retrieves only from the FramesQA documents. In the cross-corpus setting, we also included three other distractor datasets, so the Planner Agent must determine which corpus to retrieve from. This cross-corpus setting mimics use cases where companies have databases managed by separate teams. We computed accuracy by using an LLM-as-a-judge to compare the system responses to the ground truth answers in the dataset.
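As a rough sketch of how judged accuracy can be computed, the snippet below counts the fraction of responses that a judge marks as matching the ground truth. The judge is passed in as a plain function; toy_judge is a hypothetical substring check standing in for the LLM call, and the two example rows are illustrative rather than taken from our evaluation.

```python
from typing import Callable, Iterable, Tuple

# (question, system response, ground truth answer)
Example = Tuple[str, str, str]


def judged_accuracy(
    examples: Iterable[Example],
    judge: Callable[[str, str, str], bool],
) -> float:
    """Fraction of examples where the judge accepts the system response."""
    rows = list(examples)
    if not rows:
        raise ValueError("need at least one example")
    correct = sum(1 for question, response, truth in rows
                  if judge(question, response, truth))
    return correct / len(rows)


def toy_judge(question: str, response: str, truth: str) -> bool:
    # Hypothetical stand-in: a real judge would prompt an LLM to compare
    # the response with the ground truth instead of matching substrings.
    return truth.lower() in response.lower()


rows = [
    ("q1", "The M*A*S*H finale ran 52 minutes longer.", "52 minutes"),
    ("q2", "I found no explicit runtimes.", "52 minutes"),
]
accuracy = judged_accuracy(rows, toy_judge)
print(accuracy)
```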
In the cross-corpus setting, our system nearly matches its single-corpus accuracy. Even when the Planner Agent must select the correct corpus out of four possibilities, the system successfully routes the search queries and answers 90.1% of questions correctly. The latency of the single- and cross-corpus versions is also about the same (within 3% on average). This demonstrates that our agentic RAG system can reason over multiple, unrelated data sources, which opens up possibilities for more flexible retrieval scenarios.
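To make the routing step concrete, here is a deliberately simplified sketch of choosing one corpus out of four. It is an assumption-laden illustration: the Planner Agent is not a keyword matcher, and the corpus names and descriptions below are invented for the example. The sketch only shows the shape of the decision, which is to score each candidate corpus against the query and retrieve from the best match.

```python
import re
from typing import Dict, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase word tokens, used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def route_query(query: str, corpus_descriptions: Dict[str, str]) -> str:
    """Return the corpus whose description shares the most words with the query.

    Word overlap is a stand-in for the planner's judgment. Ties go to the
    corpus listed first.
    """
    query_tokens = tokenize(query)
    return max(
        corpus_descriptions,
        key=lambda name: len(query_tokens & tokenize(corpus_descriptions[name])),
    )


# Hypothetical corpora: one relevant, three distractors.
corpora = {
    "frames_wiki": "encyclopedia articles about television, film, history and people",
    "finance_reports": "quarterly earnings filings and financial statements",
    "medical_papers": "clinical studies and biomedical research abstracts",
    "legal_contracts": "contract clauses and legal agreements",
}

chosen = route_query("runtime of the most watched television finales", corpora)
print(chosen)
```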

