Working with Contexts – O’Reilly

The following article comes from two blog posts by Drew Breunig: “How Long Contexts Fail” and “How to Fix Your Contexts.”

Managing Your Context is the Key to Successful Agents

As frontier model context windows continue to grow,1 with many supporting up to 1 million tokens, I see many excited discussions about how long context windows will unlock the agents of our dreams. After all, with a large enough window, you can simply throw everything into a prompt you might need—tools, documents, instructions, and more—and let the model take care of the rest.

Long contexts kneecapped RAG enthusiasm (no need to find the best doc when you can fit it all in the prompt!), enabled MCP hype (connect to every tool and models can do any job!), and fueled enthusiasm for agents.2

But in reality, longer contexts do not generate better responses. Overloading your context can cause your agents and applications to fail in surprising ways. Contexts can become poisoned, distracting, confusing, or conflicting. This is especially problematic for agents, which rely on context to gather information, synthesize findings, and coordinate actions.

Let’s run through the ways contexts can get out of hand, then review methods to mitigate or entirely avoid context fails.

Context Poisoning

Context poisoning is when a hallucination or other error makes it into the context, where it is repeatedly referenced.

The DeepMind team called out context poisoning in the Gemini 2.5 technical report, which we broke down previously. When playing Pokémon, the Gemini agent would occasionally hallucinate, poisoning its context:

An especially egregious form of this issue can take place with “context poisoning”—where many parts of the context (goals, summary) are “poisoned” with misinformation about the game state, which can often take a very long time to undo. As a result, the model can become fixated on achieving impossible or irrelevant goals.

If the “goals” section of its context was poisoned, the agent would develop nonsensical strategies and repeat behaviors in pursuit of a goal that cannot be met.

Context Distraction

Context distraction is when a context grows so long that the model over-focuses on the context, neglecting what it learned during training.

As context grows during an agentic workflow—as the model gathers more information and builds up history—this accumulated context can become distracting rather than helpful. The Pokémon-playing Gemini agent demonstrated this problem clearly:

While Gemini 2.5 Pro supports 1M+ token context, making effective use of it for agents presents a new research frontier. In this agentic setup, it was observed that as the context grew significantly beyond 100k tokens, the agent showed a tendency toward favoring repeating actions from its vast history rather than synthesizing novel plans. This phenomenon, albeit anecdotal, highlights an important distinction between long-context for retrieval and long-context for multistep, generative reasoning.

Instead of using its training to develop new strategies, the agent became fixated on repeating past actions from its extensive context history.

For smaller models, the distraction ceiling is much lower. A Databricks study found that model correctness began to fall around 32k tokens for Llama 3.1 405B, and even earlier for smaller models.

If models start to misbehave long before their context windows are filled, what’s the point of super large context windows? In a nutshell: summarization3 and fact retrieval. If you’re not doing either of those, be wary of your chosen model’s distraction ceiling.

Context Confusion

Context confusion is when superfluous content in the context is used by the model to generate a low-quality response.

For a minute there, it really seemed like everyone was going to ship an MCP. The dream of a powerful model, connected to all your services and stuff, doing all your mundane tasks felt within reach. Just throw all the tool descriptions into the prompt and hit go. Claude’s system prompt showed us the way, as it’s mostly tool definitions or instructions for using tools.

But even if consolidation and competition don’t slow MCPs, context confusion will. It turns out there can be such a thing as too many tools.

The Berkeley Function-Calling Leaderboard is a tool-use benchmark that evaluates the ability of models to effectively use tools to respond to prompts. Now on its third version, the leaderboard shows that every model performs worse when provided with more than one tool.4 Further, the Berkeley team “designed scenarios where none of the provided functions are relevant…we expect the model’s output to be no function call.” Yet all models will occasionally call tools that aren’t relevant.

Browsing the function-calling leaderboard, you can see the problem get worse as the models get smaller:

Tool-calling irrelevance score for Gemma models (chart from dbreunig.com, source: Berkeley Function-Calling Leaderboard; created with Datawrapper)

A striking example of context confusion can be seen in a recent paper that evaluated small model performance on the GeoEngine benchmark, a trial that features 46 different tools. When the team gave a quantized (compressed) Llama 3.1 8b a query with all 46 tools, it failed, even though the context was well within the 16k context window. But when they only gave the model 19 tools, it succeeded.

The problem is, if you put something in the context, the model has to pay attention to it. It may be irrelevant information or needless tool definitions, but the model will take it into account. Large models, especially reasoning models, are getting better at ignoring or discarding superfluous context, but we continually see worthless information trip up agents. Longer contexts let us stuff in more info, but this ability comes with downsides.

Context Clash

Context clash is when you accrue new information and tools in your context that conflicts with other information in the context.

This is a more problematic version of context confusion. The bad context here isn’t irrelevant, it directly conflicts with other information in the prompt.

A Microsoft and Salesforce team documented this brilliantly in a recent paper. The team took prompts from multiple benchmarks and “sharded” their information across multiple prompts. Think of it this way: sometimes, you might sit down and type paragraphs into ChatGPT or Claude before you hit enter, considering every necessary detail. Other times, you might start with a simple prompt, then add further details when the chatbot’s answer isn’t satisfactory. The Microsoft/Salesforce team modified benchmark prompts to look like these multistep exchanges:

Microsoft/Salesforce team benchmark prompts

All the information from the prompt on the left side is contained within the several messages on the right side, which would be played out in multiple chat rounds.

The sharded prompts yielded dramatically worse results, with an average drop of 39%. And the team tested a range of models—OpenAI’s vaunted o3’s score dropped from 98.1 to 64.1.

What’s going on? Why are models performing worse if information is gathered in stages rather than all at once?

The answer is context clash: The assembled context, containing the entirety of the chat exchange, includes the model’s early attempts to answer the challenge before it had all the information. These incorrect answers remain present in the context and influence the model when it generates its final answer. The team writes:

We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.

This does not bode well for agent builders. Agents assemble context from documents, tool calls, and other models tasked with subproblems. All of this context, pulled from diverse sources, has the potential to disagree with itself. Further, when you connect to MCP tools you didn’t create, there’s a greater chance their descriptions and instructions clash with the rest of your prompt.

Learnings

The arrival of million-token context windows felt transformative. The ability to throw everything an agent might need into the prompt inspired visions of superintelligent assistants that could access any document, connect to every tool, and maintain perfect memory.

But, as we’ve seen, bigger contexts create new failure modes. Context poisoning embeds errors that compound over time. Context distraction causes agents to lean heavily on their context and repeat past actions rather than push forward. Context confusion leads to irrelevant tool or document usage. Context clash creates internal contradictions that derail reasoning.

These failures hit agents hardest because agents operate in exactly the scenarios where contexts balloon: gathering information from multiple sources, making sequential tool calls, engaging in multi-turn reasoning, and accumulating extensive histories.

Fortunately, there are solutions!

Mitigating and Avoiding Context Failures

Let’s run through the ways we can mitigate or avoid context failures entirely.

Everything is about information management. Everything in the context influences the response. We’re back to the old programming adage of “garbage in, garbage out.” Thankfully, there are plenty of options for dealing with the issues above.

RAG

Retrieval-augmented generation (RAG) is the act of selectively adding relevant information to help the LLM generate a better response.

So much has been written about RAG that we’re not going to cover it here beyond saying: it’s very much alive.

Every time a model ups the context window ante, a new “RAG is dead” debate is born. The last significant event was when Llama 4 Scout landed with a 10 million token window. At that size, it’s really tempting to think, “Screw it, throw it all in,” and call it a day.

But, as we’ve already covered, if you treat your context like a junk drawer, the junk will influence your response. If you want to learn more, here’s a new course that looks great.
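If you’re new to the pattern, the smallest useful version looks something like the sketch below: embed your document chunks, retrieve the few that best match the question, and prompt with only that subset. The client, model names, and chunk contents here are placeholder assumptions, not a recommendation of a particular stack.

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    # Placeholder embedding call; any embedding model works here.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Pretend these are chunks of your knowledge base.
chunks = [
    "Alameda is an island city in San Francisco Bay.",
    "Ferries run from Alameda to San Francisco throughout the day.",
    "The city has a mild Mediterranean climate.",
]
chunk_vecs = embed(chunks)

def retrieve(question, k=2):
    # Cosine similarity between the question and every chunk; keep the top k.
    q = embed([question])[0]
    scores = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

question = "What are my options for leaving Alameda?"
context = "\n\n".join(retrieve(question))
answer = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)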

Tool Loadout

Tool loadout is the act of selecting only relevant tool definitions to add to your context.

The term “loadout” is a gaming term that refers to the specific combination of abilities, weapons, and equipment you select before a level, match, or round. Usually, your loadout is tailored to the context—the character, the level, the rest of your team’s makeup, and your own skillset. Here, we’re borrowing the term to describe selecting the most relevant tools for a given task.

Perhaps the simplest way to select tools is to apply RAG to your tool descriptions. This is exactly what Tiantian Gan and Qiyao Sun did, which they detail in their paper “RAG-MCP.” By storing their tool descriptions in a vector database, they’re able to select the most relevant tools given an input prompt.

When prompting DeepSeek-v3, the team found that selecting the right tools becomes critical when you have more than 30 tools. Above 30, the descriptions of the tools begin to overlap, creating confusion. Beyond 100 tools, the model was virtually guaranteed to fail their test. Using RAG techniques to select fewer than 30 tools yielded dramatically shorter prompts and resulted in as much as 3x better tool selection accuracy.
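A minimal sketch of that idea (my own illustration, not Gan and Sun’s implementation): embed the tool descriptions once, then for each incoming prompt keep only the closest matches. The tool list and embedding model below are placeholder assumptions.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

# A handful of placeholder tool definitions; a real deployment might have hundreds.
tools = [
    {"name": "get_weather", "description": "Fetch the current weather for a city."},
    {"name": "send_email", "description": "Send an email to a recipient."},
    {"name": "query_calendar", "description": "Look up events on the user's calendar."},
]
tool_vecs = encoder.encode([t["description"] for t in tools])

def select_tools(prompt, k=3):
    # Rank tools by cosine similarity to the prompt and keep only the top k.
    scores = util.cos_sim(encoder.encode([prompt]), tool_vecs)[0]
    return [tools[int(i)] for i in scores.argsort(descending=True)[:k]]

# Pass only this loadout, not every tool, to the model's function-calling API.
loadout = select_tools("Will it rain in Oakland tomorrow?")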

For smaller models, the problems begin long before we hit 30 tools. One paper we touched on previously, “Less is More,” demonstrated that Llama 3.1 8b fails a benchmark when given 46 tools, but succeeds when given only 19 tools. The issue is context confusion, not context window limitations.

To address this issue, the team behind “Less is More” developed a way to dynamically select tools using an LLM-powered tool recommender. The LLM was prompted to reason about the “number and type of tools it ‘believes’ it requires to answer the user’s query.” This output was then semantically searched (tool RAG, again) to determine the final loadout. They tested this method with the Berkeley Function-Calling Leaderboard, finding Llama 3.1 8b performance improved by 44%.
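The general shape of that two-stage recommender, sketched with placeholder models and prompt wording rather than the paper’s exact setup:

from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

client = OpenAI()
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def recommend_tools(user_query, tools, k=5):
    # Stage 1: a small LLM reasons about what kinds of tools the query needs.
    needs = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the paper used Llama 3.1 8b
        messages=[{
            "role": "user",
            "content": "Describe the number and type of tools required to answer: " + user_query,
        }],
    ).choices[0].message.content

    # Stage 2: semantic search over tool descriptions, using that reasoning as the query.
    tool_vecs = encoder.encode([t["description"] for t in tools])
    scores = util.cos_sim(encoder.encode([needs]), tool_vecs)[0]
    return [tools[int(i)] for i in scores.argsort(descending=True)[:k]]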

The “Less is More” paper notes two other benefits of smaller contexts: lower power consumption and faster speeds, crucial metrics when operating at the edge (meaning, running an LLM on your phone or PC, not on a specialized server). Even when their dynamic tool selection method failed to improve a model’s result, the power savings and speed gains were worth the effort, at 18% and 77%, respectively.

Thankfully, most agents have smaller surface areas that only require a few hand-curated tools. But if the breadth of functions or the number of integrations needs to expand, always consider your loadout.

Context Quarantine

Context quarantine is the act of isolating contexts in their own dedicated threads, each used separately by one or more LLMs.

We see better results when our contexts aren’t too long and don’t sport irrelevant content. One way to achieve this is to break our tasks up into smaller, isolated jobs, each with its own context.

There are many examples of this tactic, but an accessible write-up of the strategy is Anthropic’s blog post detailing their multi-agent research system. They write:

The essence of search is compression: distilling insights from a vast corpus. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens for the lead research agent. Each subagent also provides separation of concerns—distinct tools, prompts, and exploration trajectories—which reduces path dependency and enables thorough, independent investigations.

Research lends itself to this design pattern. When given a question, several subquestions or areas of exploration can be identified and separately prompted using multiple agents. This not only speeds up the information gathering and distillation (if there’s compute available), but it keeps each context from accruing too much information or information not relevant to a given prompt, delivering higher quality results:

Our internal evaluations show that multi-agent research systems excel especially for breadth-first queries that involve pursuing multiple independent directions simultaneously. We found that a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval. For example, when asked to identify all the board members of the companies in the Information Technology S&P 500, the multi-agent system found the correct answers by decomposing this into tasks for subagents, while the single-agent system failed to find the answer with slow, sequential searches.

This approach also helps with tool loadouts, as the agent designer can create several agent archetypes with their own dedicated loadout and instructions for how to utilize each tool.
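A bare-bones sketch of the quarantine pattern, assuming an OpenAI-style client and placeholder prompts: each subquestion runs in its own message history, and only a condensed summary of the findings returns to the lead agent.

from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def run_subagent(subquestion):
    # The subagent's messages live and die inside this function; the lead agent never sees them.
    messages = [
        {"role": "system", "content": "Research the question and reply with a brief summary of your findings."},
        {"role": "user", "content": subquestion},
    ]
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return reply.choices[0].message.content  # only this summary leaves the quarantine

subquestions = [
    "Who sits on the board of Company A?",
    "Who sits on the board of Company B?",
    "Who sits on the board of Company C?",
]
with ThreadPoolExecutor() as pool:
    findings = list(pool.map(run_subagent, subquestions))

# The lead agent's context holds only the distilled findings, not the subagents' histories.
lead_context = "\n\n".join(findings)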

The challenge for agent builders, then, is to find opportunities for isolated tasks to spin out onto separate threads. Problems that require context-sharing among multiple agents aren’t particularly suited to this tactic.

If your agent’s domain is at all suited to parallelization, be sure to read the whole Anthropic write-up. It’s excellent.

Context Pruning

Context pruning is the act of removing irrelevant or otherwise unneeded information from the context.

Agents accrue context as they fire off tools and assemble documents. At times, it’s worth pausing to assess what’s been assembled and remove the cruft. This could be something you task your main LLM with or you could design a separate LLM-powered tool to review and edit the context. Or you could choose something more tailored to the pruning task.

Context pruning has a (relatively) long history, as context lengths were a more problematic bottleneck in the natural language processing (NLP) field prior to ChatGPT. Building on this history, a current pruning method is Provence, “an efficient and robust context pruner for question answering.”

Provence is fast, accurate, simple to use, and relatively small—only 1.75 GB. You can call it in a few lines, like so:

from transformers import AutoModel

provence = AutoModel.from_pretrained("naver/provence-reranker-debertav3-v1", trust_remote_code=True)

# Read in a markdown version of the Wikipedia entry for Alameda, CA
with open('alameda_wiki.md', 'r', encoding='utf-8') as f:
    alameda_wiki = f.read()

# Prune the article, given a question
question = 'What are my options for leaving Alameda?'
provence_output = provence.process(question, alameda_wiki)

Provence edited down the article, cutting 95% of the content, leaving me with only this relevant subset. It nailed it.

One could employ Provence or a similar function to cull down documents or the entire context. Further, this pattern is a strong argument for maintaining a structured5 version of your context in a dictionary or other form, from which you assemble a compiled string prior to every LLM call. This structure would come in handy when pruning, allowing you to ensure the main instructions and goals are preserved while the document or history sections can be pruned or summarized.
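As a sketch of that structured approach (the section names and pruning rules below are assumptions, not a prescribed schema), you might keep the context as a dictionary, prune only the expendable sections, and compile a string right before each LLM call:

# Keep the context structured; only the compiled string touches the model.
context = {
    "instructions": "You are a travel assistant. Always cite your sources.",
    "goals": ["Find transit options out of Alameda"],
    "documents": [],   # safe to prune or summarize
    "history": [],     # safe to prune or summarize
}

def prune(ctx, keep_docs=3, keep_history=10):
    # Trim the expendable sections; never touch instructions or goals.
    ctx["documents"] = ctx["documents"][-keep_docs:]
    ctx["history"] = ctx["history"][-keep_history:]
    return ctx

def compile_context(ctx):
    # Assemble the string that is actually sent to the model.
    return "\n\n".join([
        ctx["instructions"],
        "Goals:\n" + "\n".join(ctx["goals"]),
        "Documents:\n" + "\n".join(ctx["documents"]),
        "History:\n" + "\n".join(ctx["history"]),
    ])

prompt = compile_context(prune(context))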

Context Summarization

Context summarization is the act of boiling down an accrued context into a condensed summary.

Context summarization first appeared as a tool for dealing with smaller context windows. As your chat session came close to exceeding the maximum context length, a summary would be generated and a new thread would begin. Chatbot users did this manually in ChatGPT or Claude, asking the bot to generate a short recap that would then be pasted into a new session.

However, as context windows increased, agent builders discovered there are benefits to summarization beyond staying within the total context limit. As the context grows, it becomes distracting and causes the model to rely less on what it learned during training. We called this context distraction. The team behind the Pokémon-playing Gemini agent found that anything beyond 100,000 tokens triggered this behavior:

While Gemini 2.5 Pro supports 1M+ token context, making effective use of it for agents presents a new research frontier. In this agentic setup, it was observed that as the context grew significantly beyond 100k tokens, the agent showed a tendency toward favoring repeating actions from its vast history rather than synthesizing novel plans. This phenomenon, albeit anecdotal, highlights an important distinction between long-context for retrieval and long-context for multi-step, generative reasoning.

Summarizing your context is easy to do but hard to perfect for any given agent. Knowing what information should be preserved, and detailing that to an LLM-powered compression step, is critical for agent builders. It’s worth breaking out this function as its own LLM-powered stage or app, which allows you to collect evaluation data that can inform and optimize the task directly.
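A minimal sketch of summarization broken out as its own stage, assuming an OpenAI-style client and a placeholder prompt that spells out what must survive the compression:

from openai import OpenAI

client = OpenAI()

SUMMARY_PROMPT = (
    "Condense the conversation below for a follow-on agent. Preserve the user's goal, "
    "decisions already made, open questions, and any facts retrieved from tools. "
    "Drop pleasantries and dead-end explorations."
)

def summarize_context(history_text):
    # A dedicated call whose inputs and outputs you can log and evaluate on their own.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": SUMMARY_PROMPT},
            {"role": "user", "content": history_text},
        ],
    )
    return reply.choices[0].message.content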

Context Offloading

Context offloading is the act of storing information outside the LLM’s context, usually via a tool that stores and manages the data.

This might be my favorite tactic, if only because it’s so simple you don’t believe it will work.

Again, Anthropic has a good write-up of the technique, detailing their “think” tool, which is basically a scratchpad:

With the “think” tool, we’re giving Claude the ability to include an additional thinking step—complete with its own designated space—as part of getting to its final answer… This is particularly helpful when performing long chains of tool calls or in long multi-step conversations with the user.

I really appreciate the research and other writing Anthropic publishes, but I’m not a fan of this tool’s name. If this tool were called scratchpad, you’d know its function immediately. It’s a place for the model to write down notes that don’t cloud its context and are available for later reference. The name “think” clashes with “extended thinking” and needlessly anthropomorphizes the model… but I digress.

Having a space to log notes and progress works. Anthropic shows pairing the “think” tool with a domain-specific prompt (which you’d do anyway in an agent) yields significant gains: up to a 54% improvement against a benchmark for specialized agents.

Anthropic identified three scenarios where the context offloading pattern is useful:

  1. Tool output analysis. When Claude needs to carefully process the output of previous tool calls before acting and might need to backtrack in its approach;
  2. Policy-heavy environments. When Claude needs to follow detailed guidelines and verify compliance; and
  3. Sequential decision making. When each action builds on previous ones and mistakes are costly (often found in multi-step domains).
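To make the pattern concrete, here is a bare-bones scratchpad sketch using a generic function-calling tool schema. The tool name, fields, and storage are placeholder assumptions, not Anthropic’s implementation; notes go to an external store, and only a short acknowledgement returns to the context.

scratchpad = []  # lives outside the model's context

scratchpad_tool = {
    "type": "function",
    "function": {
        "name": "scratchpad",
        "description": "Write down notes, partial results, or plans for later reference.",
        "parameters": {
            "type": "object",
            "properties": {"note": {"type": "string", "description": "The note to store."}},
            "required": ["note"],
        },
    },
}

def handle_scratchpad_call(arguments):
    # Store the note externally; only the acknowledgement re-enters the context.
    scratchpad.append(arguments["note"])
    return "Noted."

def read_scratchpad():
    # Retrieve the notes when, and only when, the agent asks for them.
    return "\n".join(scratchpad)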

Takeaways

Context management is usually the hardest part of building an agent. Programming the LLM to, as Karpathy says, “pack the context windows just right” is the job of the agent designer: smartly deploying tools and information, and performing regular context maintenance.

The key insight across all the above tactics is that context is not free. Every token in the context influences the model’s behavior, for better or worse. The massive context windows of modern LLMs are a powerful capability, but they’re not an excuse to be sloppy with information management.

As you build your next agent or optimize an existing one, ask yourself: Is everything in this context earning its keep? If not, you now have six ways to fix it.


Footnotes

  1. Gemini 2.5 and GPT-4.1 have 1 million token context windows, large enough to throw Infinite Jest in there with plenty of room to spare.
  2. The “Long form text” section in the Gemini docs sums up this optimism nicely.
  3. In fact, in the Databricks study cited above, a frequent failure mode for models given long contexts was returning summaries of the provided context while ignoring any instructions contained within the prompt.
  4. If you’re on the leaderboard, pay attention to the “Live (AST)” columns. These metrics use real-world tool definitions contributed to the product by enterprise, “avoiding the drawbacks of dataset contamination and biased benchmarks.”
  5. Hell, this entire list of tactics is a strong argument for why you should program your contexts.


