Google AI researchers describe a novel approach to the challenge of generating high-quality synthetic data sets that preserve user privacy, which are essential for training predictive models without compromising sensitive information. As machine learning models increasingly rely on large data sets, protecting the privacy of the people whose data contributes to those models becomes crucial. Differentially private synthetic data is created by generating new data sets that reflect the key characteristics of the original data but are entirely artificial, protecting user privacy while still enabling robust model training.
Current methods for generating privacy-preserving data involve training models directly with differentially private machine learning (DP-ML) algorithms, which provide strong privacy guarantees. However, for high-dimensional data sets used across a variety of tasks, this approach can be computationally demanding and does not always produce high-quality results. Previous work has leveraged large language models (LLMs) combined with differentially private stochastic gradient descent (DP-SGD) to generate private synthetic data. This method fine-tunes an LLM pre-trained on public data with DP-SGD on a sensitive data set, ensuring that the generated synthetic data does not reveal any specific information about the individuals in the sensitive data set.
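The core of DP-SGD is to clip each example's gradient to a fixed norm and add calibrated Gaussian noise before updating the model, so no single individual's data can dominate an update. A minimal NumPy sketch of one such step is below; the hyperparameter values are illustrative, not the paper's.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, clip_norm=1.0,
                noise_mult=1.1, lr=0.1, rng=None):
    """One DP-SGD update: clip each per-example gradient to `clip_norm`,
    sum the clipped gradients, add Gaussian noise scaled to the clip norm,
    then take an averaged gradient step.

    Hyperparameter values here are illustrative placeholders.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    total = np.sum(clipped, axis=0)
    # Gaussian noise calibrated to the clipping norm is what yields
    # the differential-privacy guarantee.
    noisy = total + rng.normal(0.0, noise_mult * clip_norm, size=total.shape)
    return params - lr * noisy / len(per_example_grads)
```

Because every per-example gradient is bounded by `clip_norm`, the added noise masks any one individual's contribution to the update.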
Google researchers proposed an improved approach to generating differentially private synthetic data that leverages parameter-efficient fine-tuning techniques, such as LoRA (low-rank adaptation) and prompt tuning. These techniques update a smaller number of parameters during private training, reducing computational overhead and potentially improving the quality of the synthetic data.
The first step of the approach is to train an LLM on a large corpus of public data. The LLM is then fine-tuned with DP-SGD on the sensitive data set, with the fine-tuning restricted to a subset of the model's parameters. LoRA fine-tuning replaces each weight matrix W in the model with W + LR, where L and R are low-rank matrices, and trains only L and R. Prompt tuning, on the other hand, inserts a "prompt tensor" at the start of the network and trains only its weights, effectively modifying only the input prompt used by the LLM.
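The two techniques can be sketched for a single toy layer as follows. This is a minimal NumPy illustration, not the paper's implementation; the dimensions are hypothetical, chosen only to show how few parameters each technique actually trains.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, p, seq = 64, 4, 8, 16   # hypothetical sizes: hidden dim, LoRA rank,
                              # prompt length, input sequence length

# --- LoRA: replace a frozen weight W with W + L @ R; train only L and R ---
W = rng.normal(size=(d, d))          # frozen pretrained weight
L = rng.normal(size=(d, r)) * 0.01   # trainable low-rank factor
R = np.zeros((r, d))                 # zero-initialized, so W + L @ R == W at start

def lora_forward(x):
    # Only L and R would receive DP-SGD updates; W stays fixed.
    return x @ (W + L @ R)

# --- Prompt tuning: prepend a trainable "prompt tensor" to the input ---
prompt = rng.normal(size=(p, d)) * 0.01   # the only trainable weights

def prompt_forward(token_embeddings):
    return np.concatenate([prompt, token_embeddings], axis=0)

# Trainable-parameter counts for this toy layer:
full_params = d * d        # full fine-tuning: 4096
lora_params = 2 * d * r    # LoRA: 512
prompt_params = p * d      # prompt tuning: 512
```

Even in this toy setting, LoRA and prompt tuning train an order of magnitude fewer parameters than full fine-tuning, which is what makes private training cheaper.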
Empirical results showed that LoRA fine-tuning, which modifies roughly 20 million parameters, outperforms both full-parameter fine-tuning and prompt-based tuning, which modifies only about 41 thousand parameters. This suggests there is a sweet spot in the number of trainable parameters that balances computational efficiency against data quality. Classifiers trained on synthetic data generated by a LoRA-fine-tuned LLM outperformed those trained on synthetic data from other fine-tuning methods and, in some cases, classifiers trained directly on the original sensitive data using DP-SGD. In the experiments, a decoder-only LLM (LaMDA 8B) was trained on public data and then privately fine-tuned on three publicly available data sets, IMDB, Yelp, and AG News, each treated as sensitive. The generated synthetic data was used to train classifiers for tasks such as sentiment analysis and topic classification. The classifiers' performance on held-out subsets of the original data demonstrated the effectiveness of the proposed method.
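The evaluation protocol above (train a downstream classifier on synthetic data, score it on held-out real data) can be sketched as follows. A nearest-centroid classifier stands in for the paper's actual sentiment and topic classifiers, and the feature arrays are toy stand-ins for real text features; both are assumptions for illustration.

```python
import numpy as np

def evaluate_downstream(syn_X, syn_y, real_X, real_y):
    """Train a classifier on synthetic data; score it on held-out real data.

    A nearest-centroid classifier is used here purely as a simple stand-in
    for the downstream classifiers described in the paper.
    """
    classes = np.unique(syn_y)
    # Fit: one centroid per class, computed from the synthetic data only.
    centroids = np.stack([syn_X[syn_y == c].mean(axis=0) for c in classes])
    # Predict: assign each held-out real example to its nearest centroid.
    dists = np.linalg.norm(real_X[:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[np.argmin(dists, axis=1)]
    return float((preds == real_y).mean())
```

The key property being measured is utility: if the synthetic data captures the real data's class structure, a classifier never exposed to the sensitive data should still score well on it.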
In conclusion, Google's approach to generating differentially private synthetic data using parameter-efficient fine-tuning techniques has outperformed existing methods. By fine-tuning a smaller subset of parameters, the method reduces computational requirements and improves the quality of the synthetic data. This approach not only preserves privacy but also maintains high utility for training predictive models, making it a valuable tool for organizations looking to leverage sensitive data without compromising user privacy. The empirical results demonstrate the effectiveness of the proposed method, suggesting its potential for broader applications in privacy-preserving machine learning.
Check out the Paper and Blog. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter. Join our Telegram Channel, Discord Channel, and LinkedIn Group.
If you like our work, you will love our Newsletter.
Don't forget to join our 42k+ ML SubReddit
Pragati Jhunjhunwala is a Consulting Intern at MarktechPost. She is currently pursuing a B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a technology enthusiast with a keen interest in data science software and applications, and is always reading about advancements in different fields of AI and ML.