
Audio is being added to AI everywhere: both in multimodal models that can understand and generate audio and in applications that use audio for input. Now that we can work with spoken language, what does that mean for the applications that we can develop? How do we think about audio interfaces—how will people use them, and what will they want to do? Raiza Martin, who worked on Google’s groundbreaking NotebookLM, joins Ben Lorica to discuss how she thinks about audio and what you can build with it.
About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.
Check out other episodes of this podcast on the O’Reilly learning platform.
Timestamps
- 0:00: Introduction to Raiza Martin, who cofounded Huxe and formerly led Google’s NotebookLM team. What made you think this was the time to trade the comforts of big tech for a garage startup?
- 1:01: It was a personal decision for all of us. It was a pleasure to take NotebookLM from an idea to something that resonated so widely. We realized that AI was really blowing up. We didn’t know what it would be like at a startup, but we wanted to try. Seven months down the road, we’re having a great time.
- 1:54: For the 1% who aren’t familiar with NotebookLM, give a short description.
- 2:06: It’s basically contextualized intelligence, where you give NotebookLM the sources you care about and NotebookLM stays grounded in those sources. One of our most common use cases was that students would create notebooks and upload their class materials, and it became an expert that you could talk with.
- 2:43: Here’s a use case for homeowners: put all your user manuals in there.
- 3:14: We have had a lot of people tell us that they use NotebookLM for Airbnbs. They put all the manuals and instructions in there, and users can talk to it.
- 3:41: Why do people need a personal daily podcast?
- 3:57: There are a lot of different ways that I think about building new products. On one hand, there are acute pain points. But Huxe comes from a different angle: What if we could try to build very delightful things? The inputs are a little different. We tried to imagine what the average person’s daily life is like. You wake up, you check your phone, you travel to work; we thought about opportunities to make something more delightful. I think a lot about TikTok. When do I use it? When I’m standing in line. We landed on transit time or commute time. We wanted to do something novel and interesting with that space in time. So one of the first things was creating really personalized audio content. That was the provocation: What do people want to listen to? Even in this short time, we’ve learned a lot about how much opportunity there is.
- 6:04: Huxe is mobile first, audio first, right? Why audio?
- 6:45: One of our learnings from NotebookLM is that you learn fundamentally different things when you change the modality of something. When I go on walks with ChatGPT, I just talk about my day. I noticed that was a very different interaction from when I type things out to ChatGPT. The flip side is less about interaction and more about consumption. Something about the audio format made the types of sources different as well: the sources we uploaded to NotebookLM were different as a result of wanting audio output. By focusing on audio, I think we’ll learn different use cases than the chat use cases. Voice is still largely untapped.
- 8:24: Even in text, people started exploring other form factors: long articles, bullet points. What kinds of things are available for voice?
- 8:49: I think of two formats: one passive and one interactive. With passive formats, there are a lot of different things you can create for the user. The things you end up playing with are (1) what is the content about and (2) how flexible is the content? Is it short, long, malleable to user feedback? With interactive content, maybe I’m listening to audio, but I want to interact with it. Maybe I want to join in. Maybe I want my friends to join in. Both of those contexts are new. I think this is what’s going to emerge in the next few years. I think we’ll learn that the types of things we will use audio for are fundamentally different from the things we use chat for.
- 10:19: What are some of the key lessons from smart speakers, and what mistakes should we avoid repeating?
- 10:25: I’ve owned so many of them. And I love them. My primary use for smart speakers is still setting timers. They’re expensive and don’t live up to the promise. I just don’t think the technology was ready for what people really wanted to do. It’s hard to think about how that could have worked without AI. Second, one of the most difficult things about audio is that there is no UI. A smart speaker is a physical device; there’s nothing that tells you what to do, so the learning curve is steep. You end up with a user who doesn’t know what they can use the thing for.
- 12:20: Now it can do so much more. Even without a UI, the user can just try things. But there’s a risk in that it still requires input from the user. How do we think about a system that is so supportive that you don’t have to come up with how to make it work? That’s the challenge from the smart speaker era.
- 12:56: It’s interesting that you point out the UI. With a chatbot you have to type something. With a smart speaker, people started getting creeped out by surveillance. So, will Huxe surveil me?
- 13:18: I think there’s something simple about it, which is the wake word. Because smart speakers are triggered by wake words, they are always on. If the user says something, it’s probably picking it up, and it’s probably logged somewhere. With Huxe, we want to be really careful about where we believe consumer readiness is. You want to push a little bit but not too far. If you push too far, people get creeped out.
- 14:32: For Huxe, you have to turn it on to use it. It’s clunky in some ways, but we can push on that boundary and see if we can get to something that’s more ambiently on. We’re starting to see the emergence of more tools that are always on. There are tools like Granola and Cluely: They’re always on, looking at your screen, transcribing your audio. I’m curious—are we ready for technology like that? In real life, you can probably get the most utility from something that is always on. But whether consumers are ready is still TBD.
- 15:25: So you’re ingesting calendars, email, and other things from the users. What about privacy? What are the steps you’ve taken?
- 15:48: We’re very privacy focused. I think that comes from building NotebookLM. We wanted to make sure we were very respectful of user data. We didn’t train on any user data; user data stayed private. We’re taking the same approach with Huxe. We use the data you share with Huxe to improve your personal experience. There’s something interesting in creating personal recommendation models that don’t go beyond your usage of the app. It’s a little harder for us to build something good, but it respects privacy, and that’s what it takes to earn people’s trust.
- 17:08: Huxe may notice that I have a flight tomorrow and tell me that the flight is delayed. To do so, it has had to contact an external service, which now knows about my flight.
- 17:26: That’s a good point. I think about building Huxe like this: If I were in your pocket, what would I do? If I saw a calendar that said “Ben has a flight,” I can check that flight without leaking your personal information. I can just look up the flight number. There are a lot of ways you can do something that provides utility but doesn’t leak data to another service. (See the flight-lookup sketch after the timestamps.) We’re trying to understand things that are much more action oriented. We try to tell you about weather, about traffic; these are things we can do without stepping on user privacy.
- 18:38: The way you described the system, there’s no social component. But you end up learning things about me. So there is the potential for building a more sophisticated filter bubble. How do you make sure that I’m ingesting things beyond my filter bubble?
- 19:08: It comes down to what I believe a person should or shouldn’t be consuming. That’s always tricky. We’ve seen what these feeds can do to us. I don’t know the correct formula yet. There’s something interesting about “How do I get enough user input so I can give them a better experience?” There’s signal there. I try to think about a user’s feed from the perspective of relevance and less from an editorial perspective. I think the relevance of information is probably enough. We’ll probably test this once we start surfacing more personalized information.
- 20:42: The other thing that’s really important is surfacing the correct controls: I like this; here’s why. I don’t like this; why not? Where you inject tension in the system, where you think the system should push back—that takes a little time to figure out how to do it right.
- 21:01: What about the boundary between giving me content and providing companionship?
- 21:09: How do we know the difference between an assistant and a companion? Fundamentally the capabilities are the same. I don’t know if the question matters. The user will use it how the user intends to use it. That question matters most in the packaging and the marketing. I talk to people who talk about ChatGPT as their best friend. I talk to others who talk about it as an employee. On a capabilities level, they’re probably the same thing. On a marketing level, they’re different.
- 22:22: For Huxe, the way I think about this is in terms of which set of use cases you prioritize. Beyond a simple conversation, the capabilities will probably start diverging.
- 22:47: You’re now part of a very small startup. I assume you’re not building your own models; you’re using external models. Walk us through privacy, given that you’re using external models. As that model learns more about me, how much does that model retain over time? To be a really good companion, you can’t be clearing that cache every time I log out.
- 23:21: That question pertains to where we store data and how it’s passed off. We opt for models that don’t train on the data we send them. The next layer is how we think about continuity. People expect ChatGPT to have knowledge of all the conversations they’ve had with it.
- 24:03: To support that, you have to build a very durable context layer. But that doesn’t mean all of that context gets passed to the model; a lot of technical limitations prevent you from doing that anyway. That context is stored at the application layer. We store it, and we try to figure out the right things to pass to the model, passing as little as possible. (See the context-layer sketch after the timestamps.)
- 25:17: You’re from Google. I know that you measure, measure, measure. What are some of the signals you measure?
- 25:40: I think about metrics a little differently in the early stages. Metrics in the beginning are nonobvious. You’ll get a lot of trial behavior in the beginning. It’s a little harder to understand the initial user experience from the raw metrics. There are some basic metrics that I care about—the rate at which people are able to onboard. But as far as crossing the chasm (I think of product building as a series of chasms that never end), you look for people who really love it, who rave about it; you have to listen to them. And then there are the people who used the product and hated it. When you listen to them, you discover that they expected it to do something and it didn’t. It let them down. You have to listen to these two groups, and then you can triangulate what the product looks like to the outside world. The thing I’m trying to figure out is less “Is it a hit?” and more “Is the market ready for it? Is the market ready for something this weird?” In the AI world, the reality is that you’re testing consumer readiness and need, and how they are evolving together. We did this with NotebookLM. When we showed it to students, there was zero time between when they saw it and when they understood it. That’s the first chasm. Can you find people who understand what they think it is and feel strongly about it?
- 28:45: Now that you’re outside of Google, what would you want the foundation model builders to focus on? What aspects of these models would you like to see improved?
- 29:20: We share so much feedback with the model providers—I can provide feedback to all the labs, not just Google, and that’s been fun. The universe of things right now is pretty well known. We haven’t touched the space where we’re pushing for new things yet. We always try to drive down latency. It’s a conversation—you can interrupt. There’s some basic behavior there that the models can get better at. Then there’s tool calling: making it better and parallelizing it with voice synthesis. (See the latency sketch after the timestamps.) Even just the diversity of voices, languages, and accents; that sounds basic, but it’s actually pretty hard. Those top three things are pretty well known, but they will take us through the rest of the year.
- 30:48: And narrowing the gap between the cloud model and the on-device model.
- 30:52: That’s interesting too. Today we’re making a lot of progress on the smaller on-device models, but when you think about supporting an LLM with a voice model on top of it, it actually gets a little bit hairy, to the point where most people just go back to the commercial models.
- 31:26: What’s one prediction in the consumer AI space that you would make that most people would find surprising?
- 31:37: A lot of people use AI for companionship, and not in the ways that we imagine. Almost everyone I talk to, the utility is very personal. There are a lot of work use cases. But the emerging side of AI is personal. There’s a lot more area for discovery. For example, I use ChatGPT as my running coach. It ingests all of my running data and creates running plans for me. Where would I slot that? It’s not productivity, but it’s not my best friend; it’s just my running coach. More and more people are doing these complicated personal things that are closer to companionship than enterprise use cases.
- 33:02: You were supposed to say Gemini!
- 33:04: I love all of the models. I have a use case for all of them. But we all use all the models. I don’t know anyone who only uses one.
- 33:22: What you’re saying about the nonwork use cases is so true. I come across so many people who treat chatbots as their friends.
- 33:36: I do it all the time now. Once you start doing it, it’s a lot stickier than the work use cases. I took my dog to get groomed, and they wanted me to upload his rabies vaccine. So I started thinking about how well he was protected. I opened up ChatGPT and spent eight minutes talking about rabies. People are becoming more curious, and now there’s an immediate outlet for that curiosity. It’s so much fun. There’s so much opportunity for us to continue to explore that.
- 34:48: Doesn’t this indicate that these models will get sticky over time? If I talk to Gemini a lot, why would I switch to ChatGPT?
- 35:04: I agree. We see that now. I like Claude. I like Gemini. But I really like the ChatGPT app. Because the app is a good experience, there’s no reason for me to switch. I’ve talked to ChatGPT so much that there’s no way for me to port my data. There’s data lock-in.
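Below are three small code sketches of engineering ideas that come up in the conversation. None of them is Huxe’s actual implementation; every endpoint, function, and parameter name is a stand-in.

First, the data-minimization idea from 17:26: check the status of a flight found in a user’s calendar by sending an external service only the flight number, never the calendar contents. The endpoint URL and the (simplified) flight-number pattern here are hypothetical.

```python
import json
import re
import urllib.request

# Hypothetical flight-status endpoint; a real integration would look
# similar but needs an actual provider and an API key.
STATUS_URL = "https://api.example.com/flights/{number}/status"

# Simplified pattern: two letters followed by 1-4 digits (e.g., UA123).
FLIGHT_NUMBER = re.compile(r"\b([A-Z]{2}\d{1,4})\b")

def check_flight_from_event(event_title: str) -> dict | None:
    """Extract a flight number from a calendar event and look up its status.

    Only the flight number crosses the wire: no name, no email address,
    no other calendar contents.
    """
    match = FLIGHT_NUMBER.search(event_title)
    if match is None:
        return None  # nothing that looks like a flight number
    number = match.group(1)
    with urllib.request.urlopen(STATUS_URL.format(number=number)) as resp:
        return json.load(resp)

# The event "Ben's flight UA123 to SFO" triggers a lookup for "UA123" only.
```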
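Second, the durable context layer from 24:03: memory lives at the application layer, and each model call receives only a small, relevant slice of it. A production system would score relevance with embeddings; keyword overlap keeps this sketch self-contained, and all names are made up.

```python
from dataclasses import dataclass, field

@dataclass
class ContextStore:
    """Durable memory kept at the application layer, never sent
    wholesale to the model provider."""
    snippets: list[str] = field(default_factory=list)

    def remember(self, text: str) -> None:
        self.snippets.append(text)

    def relevant(self, query: str, k: int = 3) -> list[str]:
        # Rank stored snippets by word overlap with the query and keep
        # the top k. Embedding similarity would replace this in practice.
        words = set(query.lower().split())
        ranked = sorted(
            self.snippets,
            key=lambda s: len(words & set(s.lower().split())),
            reverse=True,
        )
        return ranked[:k]

def build_prompt(store: ContextStore, query: str) -> str:
    # Pass as little as possible: a few relevant memories, not the
    # user's entire history.
    context = "\n".join(store.relevant(query))
    return f"Context:\n{context}\n\nUser: {query}"
```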
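Finally, the latency point from 29:20 about parallelizing tool calling with voice synthesis: kick off the slow tool call and the synthesis of a short filler line at the same time, so the user hears audio while the tool result is still in flight. `call_tool` and `synthesize` are hypothetical stand-ins, with sleeps in place of real network calls.

```python
import asyncio

async def call_tool(name: str, args: dict) -> str:
    await asyncio.sleep(1.2)  # stand-in for a weather/traffic/search API
    return f"{name} result for {args}"

async def synthesize(text: str) -> bytes:
    await asyncio.sleep(0.4)  # stand-in for a TTS provider round trip
    return text.encode()      # pretend these are audio bytes

async def respond(query: str) -> tuple[bytes, str]:
    # Start the tool call first, then synthesize the filler line while
    # it is still running; total wait is max(1.2, 0.4), not the sum.
    tool_task = asyncio.create_task(call_tool("traffic", {"q": query}))
    filler_audio = await synthesize("Let me check that for you.")
    tool_result = await tool_task
    return filler_audio, tool_result

print(asyncio.run(respond("How is my commute looking?")))
```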