
Privacy-preserving domain adaptation with LLMs for mobile applications

by Delarno


The recent success of machine learning models relies not only on large-scale data but also on high-quality data. The paradigm of pre-training on massive web-collected data and post-training on smaller, high-quality data is used to train both large and small language models (LMs). For large models, post-training has proven vital for aligning models to user intent; for small models, post-training to adapt to the user domain has yielded significant results, for example, 3%–13% improvements in key production metrics for mobile typing applications.
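To make the post-training step concrete, here is a minimal sketch of domain adaptation for a small causal LM. This is not Gboard's actual pipeline: the model name (distilgpt2), the in-domain example texts, and the hyperparameters are all placeholder assumptions for illustration.

```python
# A minimal sketch of domain-adaptive post-training, NOT Gboard's pipeline.
# "distilgpt2" and the example texts are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "distilgpt2"  # stand-in for a small production LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical in-domain (mobile-typing-style) training text.
domain_texts = [
    "omw, running 5 min late",
    "can you pick up milk on the way home?",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    for text in domain_texts:
        batch = tokenizer(text, return_tensors="pt")
        # For causal LM fine-tuning, the labels are the input ids;
        # the model shifts them internally to compute next-token loss.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```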

However, complex LM training systems carry potential privacy risks, such as the memorization of sensitive user instruction data. Privacy-preserving synthetic data provides one path to leveraging user interaction data to improve models while systematically minimizing privacy risks. With the generation capabilities of large LMs (LLMs), synthetic data can be created to mimic user data without the risk of memorization. This synthetic data can then be used in model training just as public data is used, simplifying privacy-preserving model training.
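The generate-then-filter idea can be sketched in a few lines. The following is an illustrative example, not the method from the paper: it prompts a public LLM (gpt2 as a stand-in) with a description of the target domain rather than with any user data, then applies a simple length and novelty filter; a production pipeline would add much stronger privacy filtering.

```python
# An illustrative sketch, NOT the paper's method. The prompt, model
# ("gpt2" as a stand-in LLM), and filters are assumptions; the key point
# is that generation is conditioned on a domain description, not user data.
from transformers import pipeline, set_seed

set_seed(0)
generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Examples of short, informal messages typed on a phone:\n"
    "1. omw, be there in 10\n"
    "2."
)
raw = generator(prompt, max_new_tokens=40, num_return_sequences=4,
                do_sample=True)

# Keep short, novel first lines. A real pipeline would add privacy
# filters here (e.g., screening for strings that look identifying).
synthetic = set()
for out in raw:
    continuation = out["generated_text"][len(prompt):].strip()
    line = continuation.splitlines()[0].strip() if continuation else ""
    if 0 < len(line) < 60:
        synthetic.add(line)

# The surviving synthetic examples can be mixed with public data for
# the post-training step sketched above.
print(sorted(synthetic))
```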

Gboard uses both small LMs and LLMs to improve the typing experience of billions of users. Small LMs support core features like slide to type, next word prediction (NWP), smart compose, and smart completion and suggestion; LLMs support advanced features like proofread. In this blog post, we share our exploration over the past few years of generating and using synthetic data to improve LMs for mobile typing applications. We focus on approaches that adhere to the privacy principles of both data minimization and data anonymization, and show how they are making a real-world impact on small and large models in Gboard. In particular, our recent paper, “Synthesizing and Adapting Error Correction Data for Mobile Large Language Model Applications”, discusses advances in privacy-preserving synthetic data for LLMs in production, building on our continuous research efforts discussed below [1, 2, 3, 4, 5].


