Large deep learning models are becoming the workhorse of a variety of critical machine learning (ML) tasks. However, it has been shown that, without any protection, it is plausible for bad actors to attack a variety of models, across modalities, to reveal information about individual training examples. As such, it is essential to protect against this type of information leakage.
Differential privacy (DP) provides formal protection against an attacker who aims to extract information about the training data. The most popular method for DP training in deep learning is differentially private stochastic gradient descent (DP-SGD). The core recipe implements a common theme in DP: adding noise to an algorithm's outputs in order to obscure the contribution of any individual input.
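To make that recipe concrete, below is a minimal NumPy sketch of one DP-SGD step for a simple linear model with squared loss: per-example gradients are clipped, summed, and perturbed with Gaussian noise before the update. The clipping norm, noise multiplier, and learning rate here are illustrative placeholders, not settings from the work discussed below.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One DP-SGD update for a linear model with squared loss (illustrative)."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Per-example gradients of the loss 0.5 * (x @ w - y)^2 with respect to w.
    residuals = X @ w - y                         # shape (batch,)
    per_example_grads = residuals[:, None] * X    # shape (batch, dim)
    # Clip each per-example gradient to L2 norm at most clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    per_example_grads = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # Sum, add Gaussian noise calibrated to the clipping norm, then average.
    noisy_sum = per_example_grads.sum(axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=w.shape)
    return w - lr * noisy_sum / len(X)
```

Because each example's influence on the update is bounded by the clipping norm, the added noise hides whether any particular example was present in the batch.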
In practice, DP training can be very expensive, or even ineffective, for very large models. Not only does the computational cost generally increase when privacy guarantees are required, but the noise that must be added also grows with the size of the model. Given these challenges, there has recently been a lot of interest in developing methods that enable efficient DP training. The goal is to develop simple and practical methods for producing high-quality, large-scale private models.
The ImageNet classification benchmark is an effective test bed for this purpose because 1) it is a challenging task even in the non-private setting, requiring models large enough to successfully classify a large number of varied images, and 2) it is a public, open-source dataset that other researchers can access and use for collaboration. With this approach, researchers can simulate a practical situation in which a large model must be trained on private data with DP guarantees.
To that end, today we discuss improvements we have made to training large-scale, highly useful private models. First, in “Large-scale transfer learning for differentially private image classification”, we share strong results on the challenging task of image classification on the ImageNet-1k dataset under DP constraints. We show that with a combination of large-scale transfer learning and carefully chosen hyperparameters, it is indeed possible to significantly reduce the gap between private and non-private performance, even on challenging tasks and with high-dimensional models. Then, in “Differentially private image classification from features”, we further demonstrate that privately fine-tuning just the last layer of a pretrained model with more advanced optimization algorithms improves performance even further, leading to new state-of-the-art DP results on a variety of popular image classification benchmarks, including ImageNet-1k. To encourage further development in this direction and to allow other researchers to verify our findings, we are also releasing the associated source code.
Transfer learning and differential privacy
The main idea behind transfer learning is to reuse the knowledge gained from solving a problem and then apply it to a related problem. This is especially useful when limited or low-quality data is available for the target problem, as it allows us to leverage insights gained from a larger and more diverse public dataset.
In the context of DP, transfer learning has emerged as a promising technique to improve the accuracy of private models by taking advantage of knowledge learned from pre-training tasks. For example, if a model has already been trained on a large public dataset, it can be fine-tuned on a smaller, more specific dataset for the target privacy-sensitive task. More specifically, one first pretrains a model on a large dataset with no privacy concerns, and then privately fine-tunes the model on the sensitive dataset. In our work, we improve the effectiveness of DP transfer learning and illustrate it by simulating private training on publicly available datasets, namely ImageNet-1k, CIFAR-100, and CIFAR-10.
Better pre-training improves DP performance
To begin exploring how transfer learning can be effective for differentially private image classification, we carefully examined the hyperparameters that affect DP performance. Surprisingly, we found that with carefully chosen hyperparameters (e.g., initializing the last layer to zero and choosing large batch sizes), privately fine-tuning only the last layer of a pretrained model yields significant improvements over the baseline. Training only the last layer also significantly improves the cost-utility tradeoff of training a high-quality image classification model with DP.
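The sketch below illustrates this last-layer recipe in spirit: a frozen pretrained backbone (stubbed out here as a placeholder `extract_features` function) supplies fixed features, the classifier weights start at zero, and DP-SGD with large batches updates only that final layer. It is a hypothetical sketch with placeholder values, not the training code behind the results reported here.

```python
import numpy as np

def extract_features(images):
    # Stand-in for a frozen pretrained backbone (e.g., a ViT) returning embeddings.
    return images.reshape(len(images), -1)

def dp_finetune_last_layer(images, labels, num_classes, epochs=5,
                           batch_size=4096, lr=0.1, clip_norm=1.0,
                           noise_multiplier=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    feats = extract_features(images)          # backbone stays frozen
    n, d = feats.shape
    W = np.zeros((d, num_classes))            # last layer initialized to zero
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size): # large batches help DP-SGD
            idx = order[start:start + batch_size]
            logits = feats[idx] @ W
            probs = np.exp(logits - logits.max(axis=1, keepdims=True))
            probs /= probs.sum(axis=1, keepdims=True)
            # Per-example gradients of softmax cross-entropy w.r.t. W.
            g = np.einsum('ni,nk->nik', feats[idx], probs - onehot[idx])
            norms = np.linalg.norm(g.reshape(len(idx), -1), axis=1)
            g *= np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))[:, None, None]
            noisy = g.sum(axis=0) + rng.normal(
                scale=noise_multiplier * clip_norm, size=W.shape)
            W -= lr * noisy / len(idx)
    return W
```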
As shown below, we compared the ImageNet performance of our best hyperparameter recommendations with and without privacy, across a variety of model and pretraining dataset sizes. We found that scaling the model and using a larger pretraining dataset narrows the gap in accuracy that arises from adding the privacy guarantee. Typically, the privacy guarantee of a system is characterized by a positive parameter ε, with smaller ε corresponding to better privacy. In the figure below, we use a privacy guarantee of ε = 10.
Comparison of our best models with and without privacy on ImageNet, across models and pre-training dataset sizes. The X axis shows the different Vision Transformer models used in this study, in ascending order of model size from left to right. We used JFT-300M to pretrain the B/16, L/16, and H/14 models, JFT-4B (a larger version of JFT-3B) to pretrain H/14-4b, and JFT-3B to pretrain G/14-3b, in order to study the effectiveness of jointly scaling the model and the pretraining dataset (JFT-3B or 4B). The Y axis shows top-1 accuracy on the ImageNet-1k test set once the model is fine-tuned (privately or non-privately) on the ImageNet-1k training set. We consistently see that scaling the model and the pretraining dataset size decreases the gap in accuracy that arises from adding the privacy guarantee of ε = 10.
Better optimizers improve DP performance
Surprisingly, we found that privately training only the last layer of a pretrained model provides the best utility with DP. Although previous studies [1, 2, 3] relied heavily on first-order differentially private training algorithms like DP-SGD for training large models, in the specific case of privately learning only the last layer on top of fixed features, the computational burden is often low enough to allow more sophisticated optimization schemes, including second-order methods (e.g., Newton or quasi-Newton methods), which can be more accurate but are also more computationally expensive.
In “Differentially private image classification from features”, we systematically explored the effect of loss functions and optimization algorithms. We found that while the commonly used logistic regression performs better than linear regression in the non-private setting, the situation is reversed in the private setting: least-squares linear regression is much more effective than logistic regression from both a privacy and a computational standpoint for the typical range of ε values ([1, 10]), and even more so for stricter ε values (ε < 1).
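To illustrate why the least-squares route is attractive, here is a minimal sketch of DP linear regression on fixed features via noisy sufficient statistics, one standard way to privatize least squares. The noise scales are placeholders and assume bounded feature and label norms; this is not the exact calibration from the paper.

```python
import numpy as np

def dp_least_squares(X, Y, noise_scale=1.0, ridge=1e-3, rng=None):
    """Solve ridge-stabilized normal equations from noisy sufficient statistics."""
    if rng is None:
        rng = np.random.default_rng(0)
    d = X.shape[1]
    # Sufficient statistics of least squares: X^T X and X^T Y.
    XtX = X.T @ X
    XtY = X.T @ Y
    # Symmetrized Gaussian noise keeps the privatized X^T X symmetric.
    E = rng.normal(scale=noise_scale, size=(d, d))
    XtX_priv = XtX + (E + E.T) / 2
    XtY_priv = XtY + rng.normal(scale=noise_scale, size=XtY.shape)
    # The ridge term keeps the noisy system well conditioned.
    return np.linalg.solve(XtX_priv + ridge * np.eye(d), XtY_priv)
```

Because the statistics are computed once and then noised, the privacy cost does not accumulate over many optimization steps the way it does with iterative DP-SGD.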
We further explored using DP Newton's method to solve logistic regression. We found that this is still outperformed by DP linear regression in the high-privacy regime. Indeed, Newton's method involves computing a Hessian (a matrix that captures second-order information), and making this matrix differentially private requires adding far more noise for logistic regression than for linear regression, which has a highly structured Hessian.
Building on this observation, we introduce a method that we call differentially private SGD with feature covariance (DP-FC), in which we simply replace the Hessian in logistic regression with privatized feature covariance. Since the feature covariance depends only on the inputs (and not on the model parameters or class labels), we can share it across classes and training iterations, which greatly reduces the amount of noise that must be added to protect it. This allows us to combine the benefits of using logistic regression with the efficient privacy protection of linear regression, leading to a better privacy-utility tradeoff.
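The sketch below is a rough illustration of that idea: the feature covariance is privatized once (it depends only on the inputs) and then reused for every class and every iteration as a stand-in for the logistic-regression Hessian, preconditioning a clipped and noised gradient step. Clipping and noise scales are simplified placeholders; this is not the exact algorithm from the paper.

```python
import numpy as np

def dp_fc_logistic_regression(X, y, num_classes, steps=50, lr=1.0,
                              clip_norm=1.0, noise_multiplier=1.0,
                              cov_noise=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    # Privatized feature covariance: noised once, shared by every class and step.
    E = rng.normal(scale=cov_noise, size=(d, d))
    cov = X.T @ X + (E + E.T) / 2 + 1e-3 * np.eye(d)
    cov_inv = np.linalg.inv(cov)

    W = np.zeros((d, num_classes))
    onehot = np.eye(num_classes)[y]
    for _ in range(steps):
        logits = X @ W
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        # Clipped, noised per-example logistic-loss gradients (as in DP-SGD).
        g = np.einsum('ni,nk->nik', X, probs - onehot)
        norms = np.linalg.norm(g.reshape(n, -1), axis=1)
        g *= np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))[:, None, None]
        noisy_grad = g.sum(axis=0) + rng.normal(
            scale=noise_multiplier * clip_norm, size=W.shape)
        # Shared feature-covariance preconditioning replaces a per-step DP Hessian.
        W -= lr * cov_inv @ noisy_grad
    return W
```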
With DP-FC, we significantly outperform previous state-of-the-art results on three private image classification benchmarks, namely ImageNet-1k, CIFAR-10, and CIFAR-100, simply by performing DP fine-tuning on features extracted from a powerful pretrained model.
Comparison of top-1 accuracies (Y axis) with private fine-tuning using the DP-FC method on the three datasets, across a range of ε values (X axis). We observe that better pretraining helps even more at lower values of ε (stricter privacy guarantees).
Conclusion
We show that large-scale pretraining on a public dataset is an effective strategy for obtaining good results when fine-tuning privately. Moreover, scaling both the model size and the pretraining dataset size improves the performance of the private model and narrows the quality gap relative to the non-private model. In addition, we provide strategies for using transfer learning effectively with DP. Note that this work has some limitations worth considering; most importantly, our approach relies on the availability of a large and trustworthy public dataset, which can be challenging to obtain and vet. We hope our work will be useful for training large models with meaningful privacy guarantees!
Acknowledgements
In addition to the authors of this blog post, this research was conducted by Abhradeep Thakurta, Alex Kurakin, and Ashok Cutkosky. We also thank the developers of the JAX, Flax, and Scenic libraries. Specifically, we would like to thank Mostafa Dehghani for helping us with the performance baselines in Scenic, and Lucas Beyer for his help with JFT data deduplication. We also thank Li Zhang, Emil Praun, Andreas Terzis, Shuang Song, Pierre Tholoniat, Roxana Geambasu, and Steve Chien for stimulating discussions about differential privacy throughout the project. In addition, we are grateful to the anonymous reviewers, Gautam Kamath and Varun Kanade, for their helpful comments during the publication process. Finally, we would like to thank John Anderson and Corinna Cortes from Google Research, and Borja Balle, Soham De, Sam Smith, Leonard Berrada, and Jamie Hayes from DeepMind for their generous feedback.