
Gen AI Needs Synthetic Data. We Need to Be Able to Trust It

by Delarno


Today’s generative AI models, like those behind ChatGPT and Gemini, are trained on reams of real-world data, but even all the content on the internet is not enough to prepare a model for every possible situation.

To continue to grow, these models need to be trained on simulated or synthetic data: scenarios that are plausible but not real. AI developers need to do this responsibly, experts said on a panel at South by Southwest, or things could go haywire quickly.

The use of simulated data to train artificial intelligence models gained new attention this year with the launch of DeepSeek AI, a model produced in China that was trained using more synthetic data than other models, saving money and processing power.

But experts say it’s about more than saving on the collection and processing of data. Synthetic data, which is computer-generated, often by AI itself, can teach a model about scenarios that don’t exist in the real-world information it’s been given but that it could face in the future. That one-in-a-million possibility doesn’t have to come as a surprise to an AI model if it’s seen a simulation of it.
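As a rough sketch of that idea (not any panelist’s method), the snippet below pads a training set that contains almost no rare events with synthetic examples of the rare case, so a model sees the one-in-a-million scenario often enough to learn from it. The distributions, sizes and labels are all invented for illustration.

```python
# A minimal, hypothetical sketch: augment scarce real data with
# synthetic rare-event samples. Numbers and distributions are invented.
import numpy as np

rng = np.random.default_rng(0)

# Pretend real-world data: 10,000 ordinary events, essentially no rare ones.
real_common = rng.normal(loc=0.0, scale=1.0, size=(10_000, 3))

# Synthesize the rare scenario directly instead of waiting to observe it:
# here, simply sampled from a tail region the real data never covers.
synthetic_rare = rng.normal(loc=6.0, scale=0.5, size=(500, 3))

X = np.vstack([real_common, synthetic_rare])
y = np.concatenate([np.zeros(10_000), np.ones(500)])  # 1 marks the rare event

print(f"training set: {len(X):,} rows, {int(y.sum())} synthetic rare cases")
```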

“With simulated data, you can get rid of the idea of edge cases, assuming you can trust it,” said Oji Udezue, who has led product teams at Twitter, Atlassian, Microsoft and other companies. He and the other panelists were speaking on Sunday at the SXSW conference in Austin, Texas. “We can build a product that works for 8 billion people, in theory, as long as we can trust it.”

The hard part is ensuring you can trust it.

The problem with simulated data

Simulated data has a lot of benefits. For one, it costs less to produce. You can crash-test thousands of simulated cars in software, but to get the same results in real life, you have to actually smash cars, which costs a lot of money, Udezue said.

If you’re training a self-driving car, for instance, you’d need to capture some of the less common scenarios a vehicle might encounter on the road, even if they aren’t in the training data, said Tahir Ekin, a professor of business analytics at Texas State University. He pointed to the bats that make spectacular emergences from Austin’s Congress Avenue Bridge. That sight may not show up in training data, but a self-driving car will need some sense of how to respond to a swarm of bats.

The risks come from how a machine trained using synthetic data responds to real-world changes. It can’t exist in an alternate reality, or it becomes less useful, or even dangerous, Ekin said. “How would you feel,” he asked, “getting into a self-driving car that wasn’t trained on the road, that was only trained on simulated data?” Any system using simulated data needs to “be grounded in the real world,” he said, including feedback on how its simulated reasoning aligns with what’s actually happening.

Udezue compared the problem to the creation of social media, which began as a way to expand communication worldwide, a goal it achieved. But social media has also been misused, he said, noting that “now despots use it to control people, and people use it to tell jokes at the same time.”

As AI tools grow in scale and popularity, a scenario made easier by the use of synthetic training data, the potential real-world impacts of untrustworthy training and models becoming detached from reality grow more significant. “The burden is on us builders, scientists, to be double, triple sure that system is reliable,” Udezue said. “It’s not a fantasy.”

How to keep simulated data in check

One way to ensure models are trustworthy is to make their training transparent, so that users can choose which model to use based on their evaluation of that information. The panelists repeatedly used the analogy of a nutrition label, which is easy for a user to understand.

Some transparency exists, such as model cards available through the developer platform Hugging Face that break down the details of the different systems. That information needs to be as clear and transparent as possible, said Mike Hollinger, director of product management for enterprise generative AI at chipmaker Nvidia. “Those types of things must be in place,” he said.
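Those model cards are plain, machine-readable files, so the transparency Hollinger describes can be checked programmatically. As a small illustration, assuming the huggingface_hub Python package and network access ("gpt2" is just a familiar repo id, not one the panel discussed):

```python
# Fetch and inspect a model card from the Hugging Face Hub.
# Assumes: pip install huggingface_hub
from huggingface_hub import ModelCard

card = ModelCard.load("gpt2")  # downloads the repo's README.md

# Structured metadata: license, tags, datasets, etc.
print(card.data.to_dict())

# Free-text sections: intended use, limitations, training details.
print(card.text[:500])
```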

Ultimately, Hollinger said, it will be not just AI developers but also AI users who define the industry’s best practices.

The industry also needs to keep ethics and risks in mind, Udezue said. “Synthetic data will make a lot of things easier to do,” he said. “It will bring down the cost of building things. But some of those things will change society.”

Udezue said observability, transparency and trust must be built into models to ensure their reliability. That includes updating the training models so that they reflect accurate data and don’t magnify the errors in synthetic data. One concern is model collapse, in which an AI model trained on data produced by other AI models drifts further and further from reality, to the point of becoming useless.
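Model collapse is easy to demonstrate in miniature. In the hedged sketch below (a toy, not how any production model is trained), each "generation" learns a word distribution only from text sampled from the previous generation. Because rare words sometimes fail to appear in a finite sample, they go extinct and never return, and the distribution’s diversity steadily shrinks.

```python
# Toy model collapse: each generation trains only on the previous
# generation's synthetic output, so tail diversity is lost for good.
import numpy as np

rng = np.random.default_rng(0)

vocab = 1_000
probs = np.arange(1, vocab + 1, dtype=float) ** -1.2  # Zipf-like word frequencies
probs /= probs.sum()

for generation in range(6):
    corpus = rng.choice(vocab, size=5_000, p=probs)   # finite synthetic "corpus"
    counts = np.bincount(corpus, minlength=vocab)
    probs = counts / counts.sum()                     # next gen sees only what survived
    print(f"generation {generation}: {int((counts > 0).sum())} of {vocab} words survive")
```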

“The more you shy away from capturing the real world diversity, the responses may be unhealthy,” Udezue said. The solution is error correction, he said. “These don’t feel like unsolvable problems if you combine the idea of trust, transparency and error correction into them.”
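One concrete reading of that error-correction idea, continuing the toy example above: anchor every training round with a slice of fresh real-world data. Even a 20 percent real mix keeps rare words from going permanently extinct, because the real distribution keeps reintroducing them. This is again a sketch with invented numbers, not an implementation anyone on the panel described.

```python
# Same toy as before, but each generation mixes 20% real data back in.
import numpy as np

rng = np.random.default_rng(0)

vocab = 1_000
real = np.arange(1, vocab + 1, dtype=float) ** -1.2   # the true word frequencies
real /= real.sum()

probs = real.copy()
for generation in range(6):
    synthetic = rng.choice(vocab, size=4_000, p=probs)  # 80% model output
    grounding = rng.choice(vocab, size=1_000, p=real)   # 20% real-world data
    counts = np.bincount(np.concatenate([synthetic, grounding]), minlength=vocab)
    probs = counts / counts.sum()
    print(f"generation {generation}: {int((counts > 0).sum())} of {vocab} words survive")
```

Run side by side, the survivor count should fall steadily in the synthetic-only loop and hold roughly steady here, which is the grounding-in-the-real-world feedback Ekin argues for.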




