
Privacy-preserving domain adaptation with LLMs for mobile applications

by Delarno


The recent success of machine learning models relies not only on large-scale data but also on high-quality data. The paradigm of pre-training on massive web-collected data and post-training on smaller, high-quality data is used to train both large and small language models (LMs). For large models, post-training has proven vital for aligning models to user intent; for small models, post-training to adapt to the user domain has yielded significant results, for example, 3%–13% improvements in key production metrics for mobile typing applications.
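To make the post-training step concrete, here is a minimal sketch of domain adaptation for a small causal LM. This is not Gboard's actual pipeline: the model name (distilgpt2), the in-domain example texts, and the hyperparameters are all placeholder assumptions for illustration.

```python
# A minimal sketch of domain-adaptive post-training, NOT Gboard's pipeline.
# "distilgpt2" and the example texts are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "distilgpt2"  # stand-in for a small production LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical in-domain (mobile-typing-style) training text.
domain_texts = [
    "omw, running 5 min late",
    "can you pick up milk on the way home?",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    for text in domain_texts:
        batch = tokenizer(text, return_tensors="pt")
        # For causal LM fine-tuning, the labels are the input ids;
        # the model shifts them internally to compute next-token loss.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```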

However, complex LM training systems carry potential privacy risks, such as the memorization of sensitive user instruction data. Privacy-preserving synthetic data provides one path to leveraging user interaction data to improve models while systematically minimizing privacy risks. With the generation capabilities of large LMs (LLMs), synthetic data can be created to mimic user data without the risk of memorization. This synthetic data can then be used in model training just as public data is used, simplifying privacy-preserving model training.
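The generate-then-filter idea can be sketched in a few lines. The following is an illustrative example, not the method from the paper: it prompts a public LLM (gpt2 as a stand-in) with a description of the target domain rather than with any user data, then applies a simple length and novelty filter; a production pipeline would add much stronger privacy filtering.

```python
# An illustrative sketch, NOT the paper's method. The prompt, model
# ("gpt2" as a stand-in LLM), and filters are assumptions; the key point
# is that generation is conditioned on a domain description, not user data.
from transformers import pipeline, set_seed

set_seed(0)
generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Examples of short, informal messages typed on a phone:\n"
    "1. omw, be there in 10\n"
    "2."
)
raw = generator(prompt, max_new_tokens=40, num_return_sequences=4,
                do_sample=True)

# Keep short, novel first lines. A real pipeline would add privacy
# filters here (e.g., screening for strings that look identifying).
synthetic = set()
for out in raw:
    continuation = out["generated_text"][len(prompt):].strip()
    line = continuation.splitlines()[0].strip() if continuation else ""
    if 0 < len(line) < 60:
        synthetic.add(line)

# The surviving synthetic examples can be mixed with public data for
# the post-training step sketched above.
print(sorted(synthetic))
```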

Gboard uses both small LMs and LLMs to improve the typing experience of billions of users. Small LMs support core features like slide to type, next word prediction (NWP), smart compose, and smart completion and suggestion; LLMs support advanced features like proofread. In this blog post, we share our exploration over the past few years of generating and using synthetic data to improve LMs for mobile typing applications. We focus on approaches that adhere to the privacy principles of both data minimization and data anonymization, and show how they are making a real-world impact on small and large models in Gboard. In particular, our recent paper, “Synthesizing and Adapting Error Correction Data for Mobile Large Language Model Applications”, discusses advances in privacy-preserving synthetic data for LLMs in production, building on our continuous research efforts discussed below [1, 2, 3, 4, 5].


