Rethinking the Agent Harness – O’Reilly


We kicked off our new weekly series This Week in AI on Monday, and we covered a lot of ground in 30 minutes, including an AI model that found security holes faster than decades of human auditing, a data center in Utah the size of two Manhattans, and a practical argument for why the harness you build around a model now matters more than which model you pick.

Here are a few takeaways from the conversation between host Eric Freeman, faculty member at UT Austin and a longtime friend of O’Reilly, and guest John Berryman, founder of Arcturus Labs, an early production engineer on GitHub Copilot, and coauthor of O’Reilly’s Prompt Engineering for LLMs. Watch the entire episode to find out why you should be building your own agent and why John believes eventually there will be no internet for humans.

AI’s security problem is now a policy problem

You’ve probably already heard about Mythos. Anthropic’s internal testing of the frontier model surfaced thousands of previously unknown security vulnerabilities across major operating systems, browsers, and financial infrastructure, including a 27-year-old bug in OpenBSD. Anthropic chose not to release the model publicly and instead launched Project Glasswing, a restricted program giving monitored access to a small group of trusted partners for defensive patching.

Washington reacted quickly to that decision. In roughly six weeks, the conversation shifted from the light-touch national AI policy released in March to reported White House discussions of an executive order review process modeled on how the FDA handles drugs. Security researcher Bruce Schneier has questioned whether Mythos is uniquely capable here or whether similar results are achievable with cheaper public models, but as Freeman noted (paraphrasing Schneier), either way, it’s a problem that’s coming.

The compute race is getting stranger

Anthropic leased xAI’s entire Colossus 1 supercluster in Memphis: more than 200,000 GPUs and 300 megawatts of power. A month before that deal, Anthropic expanded its agreement with Google and Broadcom for 3.5 gigawatts of capacity coming online in 2027. For context, that’s more than 10 times the power capacity of the Colossus 1 deal, in a single contract. After this episode aired, Anthropic announced that the xAI deal has been expanded to include Colossus 2 as well.

Box Elder County, Utah, just approved a 40,000-acre AI data center called the Stratos project, backed by investor and TV personality Kevin O’Leary (a.k.a. Mr. Wonderful). It’s planned for 9 gigawatts at full buildout. That’s a footprint more than twice the size of Manhattan, powered by the equivalent of nine commercial nuclear reactors. And like many data center deals going forward, including Colossus above, it was approved over local protests.

Infrastructure at this incredible scale takes years to come online, and the companies making these bets are pricing in a world where model capability keeps scaling. Whether that assumption holds will determine a lot about what’s economically viable to build in the next decade.

The harness matters more than the model

John was on hand to rethink the agent harness, which, as he pointed out, entered a new phase with the step change in model capability that occurred in November and December of last year. He took Eric through the arc of AI product development, from document completion and chat loops to tool-calling agents, DAG-based workflows, and now the harness era represented by tools like Claude Code. Each progression added capability, John noted, but also complexity, and each generated a new class of problems around reliability and control. In our current moment, which John has dubbed the “age of the unharnessed agent,” agents are within reach of everyone, not just software developers.

The payoff of this “unharnessed” era is control. John described a client engagement where he replaced a bespoke application with a skills-driven agent. Now domain experts with no development experience can read the agent’s behavior written in plain English and better understand it. As John explained,

Rather than building a bespoke agent. . ., I just built something that was just the agent harness—the agent—and I just gave it skills that describe what basically I learned in interviewing their experts, how they would work with these agents. And it worked perfectly. Not only does the agent stay on track and do what it needs to do these days, but it’s coded, as far as my client is concerned, in English.

The experts don’t have to complain to developers “this doesn’t work.” The experts can look at the English description of what’s going on and see problems, and maybe even fix it themselves. And I’m really excited to basically give that power into the hands of the people that know best how to change it, the experts.

That’s a different relationship between the experts and the tool than anything a wrapped commercial product offers.
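
To make the idea concrete, here is a minimal sketch of what a skills-driven harness could look like. It is not John’s implementation: the `Skill` fields, the `render_skills` and `run_turn` helpers, the insurance-claim skill, and the stand-in `echo_model` are all hypothetical names invented for illustration. The point it shows is that the behavior lives in plain-English text a domain expert can read and edit, while the harness code stays generic.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Skill:
    """A plain-English description of how an expert handles one kind of task."""
    name: str
    when_to_use: str
    instructions: str


def render_skills(skills: List[Skill]) -> str:
    """Turn the skills into the text block the model reads on each turn."""
    sections = []
    for skill in skills:
        sections.append(
            f"Skill: {skill.name}\n"
            f"When to use: {skill.when_to_use}\n"
            f"Instructions: {skill.instructions}"
        )
    return "\n\n".join(sections)


def run_turn(model: Callable[[str], str], skills: List[Skill], user_message: str) -> str:
    """One turn of a generic harness: skills plus the request go in, a reply comes out."""
    prompt = (
        "Follow the skills below when they apply.\n\n"
        f"{render_skills(skills)}\n\n"
        f"User request: {user_message}"
    )
    return model(prompt)


# Hypothetical skill, phrased the way a domain expert might write it.
claims_skill = Skill(
    name="triage-claim",
    when_to_use="The user submits a new insurance claim.",
    instructions="Check that the policy is active, then flag claims over $10,000 for review.",
)


def echo_model(prompt: str) -> str:
    """Stand-in for a real LLM call (an assumption); reports the prompt length."""
    return f"received {len(prompt)} characters"


if __name__ == "__main__":
    print(run_turn(echo_model, [claims_skill], "New claim for $12,000"))
```

In a sketch like this, changing how the agent behaves means editing the `instructions` text, not the Python around it, which is the kind of change an expert could make without a developer.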

As Eric pointed out, recent Stanford research supports this broader point: Performance gaps between a bare model and a well-designed harness now often matter more than which underlying model you’re using. The benchmark that used to dominate buying decisions, which model scores highest, has been displaced by a harder question about which harness fits the task.

John closed with a demo of his personal agent moving from an Obsidian notebook into Wikipedia and back, carrying context across environments. He used it to illustrate a concept he called the “open agent protocol,” his term for a standard in which an agent receives environment-specific skills as it moves between contexts. The protocol doesn’t exist yet, but the demo made the direction clear.
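
Since no such protocol exists, the following is only a toy sketch of the idea as described here: an agent keeps its own context while each environment hands it a fresh set of skills. The `Environment` and `Agent` classes, the `enter` and `remember` methods, and the two example environments are assumptions made up for this illustration, not part of any real standard or of John’s demo.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Environment:
    """A place the agent can work in, advertising its own plain-English skills."""
    name: str
    skills: Dict[str, str]  # skill name -> instructions


@dataclass
class Agent:
    """Keeps its own context while swapping skills as it changes environments."""
    context: List[str] = field(default_factory=list)
    active_skills: Dict[str, str] = field(default_factory=dict)
    location: str = ""

    def enter(self, env: Environment) -> None:
        # Skills from the previous environment are replaced; context is kept.
        self.active_skills = dict(env.skills)
        self.location = env.name
        self.context.append(f"entered {env.name}")

    def remember(self, note: str) -> None:
        # Anything remembered travels with the agent to the next environment.
        self.context.append(note)


# Hypothetical environments standing in for a notebook and an encyclopedia.
notebook = Environment("notebook", {"edit-note": "Append findings to the current note."})
encyclopedia = Environment("encyclopedia", {"look-up": "Find an article and summarize it."})

agent = Agent()
agent.enter(notebook)
agent.remember("question copied from the note")
agent.enter(encyclopedia)
agent.remember("summary of the relevant article")
agent.enter(notebook)  # back where it started, with the summary still in context
```

The design choice worth noticing is the split of responsibilities: the environment owns the skills, the agent owns the context, and moving between environments only swaps the former.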

What’s next

Join us and a rotating lineup of expert guests for weekly live tool demos and deeper dives into the topics that matter in AI. We’re taking next week off for Memorial Day in the US, but we’ll be back on June 1 with host Andreas Welsch and guests Maya Mikhailov and Doug Shannon to cut through another week of AI headlines and separate what actually drives business value from what looks good in a demo but goes nowhere in production. Our first few episodes are free and open to all if you’d like to attend live—register here.

We’ll continue to share full episodes and publish our takeaways here on Radar each Friday. You can also watch or listen on YouTube, Spotify, Apple, or wherever you get your podcasts.
