MIT researchers have made significant progress on the challenge of protecting sensitive data encoded within machine learning models. Imagine that a team of scientists has developed a machine learning model that can predict, from lung scan images, whether a patient has cancer. Sharing the model with hospitals around the world, however, would risk exposing the sensitive training data to malicious actors who could extract it from the model. To address this risk, the researchers have introduced a novel privacy metric called Probably Approximately Correct (PAC) Privacy, along with a framework that determines the minimum amount of noise needed to protect sensitive data.
Conventional privacy approaches, such as differential privacy, focus on preventing an adversary from distinguishing whether any specific individual's data was used, which typically requires adding large amounts of noise and thereby reduces the model's accuracy. PAC Privacy takes a different perspective: it evaluates how difficult it would be for an adversary to reconstruct pieces of the sensitive data even after noise has been added. For example, if the sensitive data are human faces, differential privacy would prevent the adversary from determining whether a specific person's face was in the data set. PAC Privacy, by contrast, asks whether an adversary could extract even a rough silhouette that someone might recognize as a particular individual's face.
To implement PAC Privacy, the researchers developed an algorithm that determines the optimal amount of noise to add to a model, guaranteeing privacy even against an adversary with infinite computing power. The algorithm is based on the uncertainty, or entropy, of the original data from the adversary's point of view. It repeatedly subsamples the data, runs the machine learning training algorithm on each subsample, and compares the variance across the resulting outputs to determine how much noise is needed: smaller variance means less noise is required.
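To make the procedure concrete, here is a minimal Python sketch of that variance-based calibration loop. It is not the authors' implementation: the function names, the subsampling fraction, and the use of a per-coordinate Gaussian noise scale are all illustrative assumptions, with `train_fn` standing in for an arbitrary black-box training routine.

```python
import numpy as np

def estimate_noise_scale(data, train_fn, n_trials=100,
                         subsample_frac=0.5, seed=None):
    """Illustrative variance-based noise calibration in the spirit of
    PAC Privacy: retrain on random subsamples and measure how much the
    outputs vary.

    train_fn maps a data subset (an array of rows) to a flat parameter
    vector; its internals are never inspected.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    k = int(subsample_frac * n)
    outputs = []
    for _ in range(n_trials):
        idx = rng.choice(n, size=k, replace=False)  # subsample the data
        outputs.append(train_fn(data[idx]))         # retrain from scratch
    outputs = np.stack(outputs)
    # Small spread across runs means the output reveals little about
    # which data were used, so little noise is needed (and vice versa).
    return outputs.std(axis=0)

def privatize(params, noise_scale, seed=None):
    """Add Gaussian noise, scaled by the calibrated estimate, before release."""
    rng = np.random.default_rng(seed)
    return params + rng.normal(scale=noise_scale, size=params.shape)
```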
A key advantage of the PAC Privacy algorithm is that it needs no knowledge of the model's inner workings or of the training process. A user specifies the desired confidence that an adversary will fail to reconstruct the sensitive data, and the algorithm returns the optimal amount of noise to achieve that goal. However, the algorithm does not estimate how much accuracy the model loses once that noise is added. Moreover, implementing PAC Privacy can be computationally expensive, since it requires repeatedly training machine learning models on many subsampled data sets.
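Continuing the sketch above, a hypothetical end-to-end usage might look like the following, with a simple least-squares "trainer" standing in for a real model. The `confidence_factor` here is a made-up placeholder for the term a real PAC Privacy calibration would derive from the user's requested reconstruction bound, not the paper's formula.

```python
# Hypothetical usage of the helpers defined above.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=1000)
data = np.hstack([X, y[:, None]])  # each row: features followed by the label

def train_fn(subset):
    A, b = subset[:, :-1], subset[:, -1]
    return np.linalg.lstsq(A, b, rcond=None)[0]  # flat parameter vector

# Placeholder for the factor a real calibration would derive from the
# user's chosen confidence level (an assumption for illustration only).
confidence_factor = 2.0
scale = confidence_factor * estimate_noise_scale(data, train_fn, n_trials=50)
released_params = privatize(train_fn(data), scale)
```

The repeated calls to `train_fn` inside `estimate_noise_scale` are exactly where the computational cost mentioned above comes from.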
To make PAC Privacy more practical, the researchers suggest modifying the training process to make it more stable, so that the outputs produced from different subsamples vary less. Greater stability both reduces the algorithm's computational load and shrinks the amount of noise needed. As a bonus, more stable models often exhibit lower generalization error, making more accurate predictions on new data.
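One illustrative way to see this effect, still in the toy setting above: a ridge-regression "trainer" with stronger regularization produces outputs that agree more closely across subsamples, so the calibrated noise shrinks. The regularization strengths below are arbitrary.

```python
def make_ridge_trainer(lam):
    """Ridge regression: larger lam = stronger regularization = more stable."""
    def train_fn(subset):
        A, b = subset[:, :-1], subset[:, -1]
        d = A.shape[1]
        return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)
    return train_fn

loose = estimate_noise_scale(data, make_ridge_trainer(0.01), n_trials=50)
tight = estimate_noise_scale(data, make_ridge_trainer(100.0), n_trials=50)
# Expect the heavily regularized trainer to require noticeably less noise.
print(loose.mean(), tight.mean())
```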
While the researchers acknowledge that the relationship among stability, privacy, and generalization error needs further exploration, their work is a promising advance in protecting sensitive data in machine learning models. By leveraging PAC Privacy, engineers can develop models that protect their training data while maintaining accuracy in real-world applications. With the potential to significantly reduce the amount of noise required, the technique opens up new possibilities for secure data sharing in healthcare and beyond.
Niharika is a technical consulting intern at Marktechpost. She is a third-year student pursuing her B.Tech at the Indian Institute of Technology (IIT) Kharagpur. She is an enthusiastic individual with a keen interest in machine learning, data science, and artificial intelligence, and an avid reader of the latest developments in these fields.