Wells Fargo has quietly accomplished what most enterprises are still dreaming about: building a large-scale, production-ready generative AI system that actually works. In 2024 alone, the bank’s AI-powered assistant, Fargo, handled 245.4 million interactions – more than doubling its original projections – and it did so without ever exposing sensitive customer data to a language model.
Fargo helps customers with everyday banking needs via voice or text, handling requests such as paying bills, transferring funds, providing transaction details, and answering questions about account activity. The assistant has proven to be a sticky tool for users, averaging multiple interactions per session.
The system works through a privacy-first pipeline. A customer interacts via the app, where speech is transcribed locally with a speech-to-text model. That text is then scrubbed and tokenized by Wells Fargo’s internal systems, including a small language model (SLM) for personally identifiable information (PII) detection. Only then is a call made to Google’s Gemini Flash 2.0 model to extract the user’s intent and relevant entities. No sensitive data ever reaches the model.
“The orchestration layer talks to the model,” Wells Fargo CIO Chintan Mehta said in an interview with VentureBeat. “We’re the filters in front and behind.”
The only thing the model does, he explained, is determine the intent and entity based on the phrase a user submits, such as identifying that a request involves a savings account. “All the computations and detokenization, everything is on our end,” Mehta said. “Our APIs… none of them pass through the LLM. All of them are just sitting orthogonal to it.”
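Put together, the flow reads roughly like the sketch below. It is a minimal illustration only, assuming hypothetical names and toy logic – the PII patterns, redact_pii, call_intent_model, and the intent rules are stand-ins, not Wells Fargo’s actual systems or APIs:

```python
import re

# Illustrative sketch only: the PII patterns, function names, and intent logic
# are hypothetical stand-ins, not Wells Fargo's actual systems or APIs.

PII_PATTERNS = {
    "ACCOUNT": re.compile(r"\b\d{8,12}\b"),  # toy account-number pattern
    "NAME": re.compile(r"\bJohn Doe\b"),     # stand-in for an SLM-based PII detector
}

def redact_pii(text: str):
    """Replace detected PII with opaque tokens; keep a map for detokenization later."""
    token_map = {}
    for label, pattern in PII_PATTERNS.items():
        for i, match in enumerate(pattern.findall(text)):
            token = f"<{label}_{i}>"
            token_map[token] = match
            text = text.replace(match, token)
    return text, token_map

def call_intent_model(redacted_text: str):
    """Stand-in for the external LLM call: it sees only redacted text and
    returns nothing but an intent and entities."""
    if "transfer" in redacted_text.lower():
        return "transfer_funds", {"account_type": "savings"}
    return "unknown", {}

def handle_utterance(text: str) -> str:
    redacted, token_map = redact_pii(text)          # scrubbed before any external call
    intent, entities = call_intent_model(redacted)  # only redacted text leaves the bank
    # Detokenization and the actual banking action happen entirely on the bank's side.
    account = token_map.get("<ACCOUNT_0>", "????")
    return f"intent={intent}, entities={entities}, account ending in {account[-4:]}"

print(handle_utterance("Transfer $200 from savings account 123456789 to John Doe"))
```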
Wells Fargo’s internal stats show a dramatic ramp: from 21.3 million interactions in 2023 to more than 245 million in 2024, with over 336 million cumulative interactions since launch. Spanish language adoption has also surged, accounting for more than 80% of usage since its September 2023 rollout.
This architecture reflects a broader strategic shift. Mehta said the bank’s approach is grounded in building “compound systems,” where orchestration layers determine which model to use based on the task. Gemini Flash 2.0 powers Fargo, but smaller models like Llama are used elsewhere internally, and OpenAI models can be tapped as needed.
“We’re poly-model and poly-cloud,” he said, noting that while the bank leans heavily on Google’s cloud today, it also uses Microsoft’s Azure.
Mehta said model-agnosticism is essential now that the performance delta between the top models is tiny. He added that some models still excel in specific areas – Claude 3.7 Sonnet and OpenAI’s o3-mini-high for coding, OpenAI’s o3 for deep research, and so on – but in his view, the more important question is how they’re orchestrated into pipelines.
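In practice, a “compound system” of this kind comes down to a routing table in the orchestration layer. The sketch below is an illustration under assumptions – the task categories and model identifiers are hypothetical, not Wells Fargo’s actual configuration:

```python
# Minimal sketch of task-based model routing in a "compound system."
# Task categories and model names are assumptions for illustration only.

MODEL_ROUTES = {
    "intent_extraction": "gemini-flash-2.0",   # customer-facing assistant traffic
    "internal_summarization": "llama-small",   # smaller open model used internally
    "deep_research": "openai-o3",              # tapped as needed for specific strengths
}

def route(task_type: str, payload: str) -> str:
    model = MODEL_ROUTES.get(task_type, "gemini-flash-2.0")  # default route
    # A real orchestration layer would call the chosen provider's API here;
    # this sketch just reports the routing decision.
    return f"[{model}] would handle: {payload!r}"

print(route("intent_extraction", "pay my credit card bill"))
print(route("deep_research", "summarize 10-K risk factors"))
```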
Context window size remains one area where he sees meaningful separation. Mehta praised Gemini 2.5 Pro’s 1M-token capacity as a clear edge for tasks like retrieval-augmented generation (RAG), where pre-processing unstructured data can add delay. “Gemini has absolutely killed it when it comes to that,” he said. For many use cases, he added, the overhead of preprocessing data before deploying a model often outweighs the benefit.
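One way to picture the tradeoff: if the documents fit inside a ~1M-token window, they can go straight into the prompt, skipping the chunk/embed/retrieve preprocessing of a classic RAG pipeline. The snippet below is a rough sketch with a crude token estimate, not a production heuristic:

```python
# Sketch of the long-context vs. RAG-preprocessing tradeoff described above.
# Token counts here are crude whitespace estimates, not a real tokenizer.

def rough_tokens(text: str) -> int:
    return len(text.split())  # whitespace proxy; real tokenizers count differently

def choose_path(documents: list[str], question: str, limit: int = 1_000_000) -> str:
    total = sum(rough_tokens(d) for d in documents) + rough_tokens(question)
    if total <= limit:
        # Long-context path: one prompt, no chunking/embedding pipeline to maintain.
        return "long-context: place all documents directly in a single prompt"
    # Fallback: classic RAG preprocessing (chunk, embed, retrieve top-k) first.
    return "RAG: chunk, embed, and retrieve the most relevant passages"

docs = ["... unstructured policy document ...", "... call-center transcript ..."]
print(choose_path(docs, "What fees apply to this account?"))
```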
Fargo’s design shows how large-context models can enable fast, compliant, high-volume automation – even without human intervention. That’s a sharp contrast with some competitors. At Citi, for example, analytics chief Promiti Dutta said last year that the risks of external-facing large language models (LLMs) were still too high. In a talk hosted by VentureBeat, she described a system in which AI assist agents support human employees rather than speaking directly to customers, due to concerns about hallucinations and data sensitivity.
Wells Fargo solves these concerns through its orchestration design. Rather than relying on a human in the loop, it uses layered safeguards and internal logic to keep LLMs out of any data-sensitive path.
Agentic moves and multi-agent design
Wells Fargo is also moving toward more autonomous systems. Mehta described a recent project to re-underwrite 15 years of archived loan documents. The bank used a network of interacting agents, some of which are built on open source frameworks like LangGraph. Each agent had a specific role in the process, which included retrieving documents from the archive, extracting their contents, matching the data to systems of record, and then continuing down the pipeline to perform calculations – all tasks that traditionally require human analysts. A human reviews the final output, but most of the work ran autonomously.
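A stripped-down version of such a pipeline can be expressed as a chain of single-purpose agents. The sketch below uses plain Python rather than LangGraph, and the document contents, parsing logic, and field names are invented for illustration – it only mirrors the shape of the flow Mehta described (retrieve, extract, match, calculate, then hand off for human review):

```python
# Illustrative sketch of a document re-underwriting pipeline as a sequence of
# single-purpose agents. Data, parsing, and field names are invented; a real
# build would plug LLM calls into the extraction and matching steps.

from typing import TypedDict

class LoanState(TypedDict, total=False):
    document_id: str
    raw_text: str
    extracted: dict
    matched_record: dict
    calculations: dict

def retrieve_agent(state: LoanState) -> LoanState:
    state["raw_text"] = f"Archived loan document {state['document_id']}: principal 250000, rate 4.5"
    return state

def extract_agent(state: LoanState) -> LoanState:
    # In production an LLM would parse unstructured text; here, a toy parser.
    parts = state["raw_text"].split("principal")[1].split(", rate")
    state["extracted"] = {"principal": float(parts[0]), "rate_pct": float(parts[1])}
    return state

def match_agent(state: LoanState) -> LoanState:
    state["matched_record"] = {"system_of_record_id": "SOR-" + state["document_id"]}
    return state

def calculate_agent(state: LoanState) -> LoanState:
    e = state["extracted"]
    state["calculations"] = {"annual_interest": e["principal"] * e["rate_pct"] / 100}
    return state

PIPELINE = [retrieve_agent, extract_agent, match_agent, calculate_agent]

def run(document_id: str) -> LoanState:
    state: LoanState = {"document_id": document_id}
    for agent in PIPELINE:   # each agent has one narrow role
        state = agent(state)
    return state             # a human reviews this final output

print(run("LN-2009-0042"))
```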
The bank is also evaluating reasoning models for internal use, where Mehta said differentiation still exists. While most models now handle everyday tasks well, reasoning remains an area where some models are clearly better than others – and they get there in different ways.
Why latency (and pricing) matter
At Wayfair, CTO Fiona Tan said Gemini 2.5 Pro has shown strong promise, especially in the area of speed. “In some cases, Gemini 2.5 came back faster than Claude or OpenAI,” she said, referencing recent experiments by her team.
Tan said that lower latency opens the door to real-time customer applications. Currently, Wayfair uses LLMs mostly for internal-facing apps – including in merchandising and capital planning – but faster inference could let the company extend LLMs to customer-facing products such as its Q&A tool on product detail pages.
Tan also noted improvements in Gemini’s coding performance. “It seems pretty comparable now to Claude 3.7,” she said. The team has begun evaluating the model through products like Cursor and Code Assist, where developers have the flexibility to choose.
Google has since released aggressive pricing for Gemini 2.5 Pro: $1.24 per million input tokens and $10 per million output tokens. Tan said that pricing, plus SKU flexibility for reasoning tasks, makes Gemini a strong option going forward.
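For a rough sense of what those rates mean per request, a quick back-of-the-envelope calculation follows; the token volumes are illustrative assumptions, not Wayfair’s figures:

```python
# Back-of-the-envelope cost at the quoted rates; request sizes are illustrative.
INPUT_RATE = 1.24 / 1_000_000   # dollars per input token
OUTPUT_RATE = 10.0 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g., a product Q&A call with a 2,000-token prompt and a 300-token answer:
print(f"${request_cost(2_000, 300):.4f} per request")  # ~= $0.0055
```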
The broader signal for Google Cloud Next
Wells Fargo and Wayfair’s stories land at an opportune moment for Google, which is hosting its annual Google Cloud Next conference this week in Las Vegas. While OpenAI and Anthropic have dominated the AI discourse in recent months, enterprise deployments may quietly swing back toward Google’s favor.
At the conference, Google is expected to highlight a wave of agentic AI initiatives, including new capabilities and tooling to make autonomous agents more useful in enterprise workflows. Already at last year’s Cloud Next event, CEO Thomas Kurian predicted agents will be designed to help users “achieve specific goals” and “connect with other agents” to complete tasks — themes that echo many of the orchestration and autonomy principles Mehta described.
Wells Fargo’s Mehta emphasized that the real bottleneck for AI adoption won’t be model performance or GPU availability. “I think this is powerful. I have zero doubt about that,” he said of generative AI’s promise to return value for enterprise apps. But he warned that the hype cycle may be running ahead of practical value. “We have to be very thoughtful about not getting caught up with shiny objects.”
His bigger concern? Power. “The constraint isn’t going to be the chips,” Mehta said. “It’s going to be power generation and distribution. That’s the real bottleneck.”