How to evaluate privacy risks when using customer data and leverage privacy-enhancing technologies to minimize the risks
In this age of data-first organizations, no matter what industry you’re in, you’re most likely collecting, processing, and analyzing large volumes of customer data. It could be for fulfilling a customer’s service request, for legal or regulatory reasons, or for providing customers with a better user experience through personalization powered by artificial intelligence or machine learning. However, according to Statista, data breaches are increasing every year, with 1,862 reported data compromises in 2021, up 68% from 2020, and 83% of those involved sensitive information. Such sensitive information falling into the wrong hands can wreak havoc on a customer’s life through identity theft, stalking, ransomware attacks, and more. This, coupled with the rise of privacy laws and regulations across various states, has brought privacy-enhancing data processing technologies to the forefront.
For AI applications such as personalization, privacy and data utility can be viewed as opposite ends of a spectrum. Data that contains nothing personal, i.e., exposes no traits or characteristics of the customer, lends no value to personalization. Data that does contain personal information can be used to deliver highly personalized experiences, but if that dataset ends up in the wrong hands, it can lead to a loss of customer privacy. As a result, there is always an inherent tradeoff between the privacy risk and the utility of the data.
The Health Insurance Portability and Accountability Act (HIPAA), the California Consumer Privacy Act (CCPA), the Children’s Online Privacy Protection Act (COPPA), and the Biometric Identifier Act are just a few of the many privacy-centric laws and regulations in the US. Failure to comply with such regulations can cost an organization billions of dollars in fines. For example, the state of Texas recently sued Facebook’s parent company Meta for billions of dollars in damages for mishandling and exploiting the sensitive biometric data of millions of people in the state. Being privacy-first helps an organization avoid huge fines and, in the worst case, losing its license to operate as a business. Beyond fines, there can be a massive loss of consumer trust and loyalty, along with lasting damage to brand image and perception. Being negligent about consumers’ data privacy can demolish customer lifetime value and hurt conversions and renewals. Companies like Apple have flipped the problem on its head and use privacy as a competitive moat and a differentiator from other technology companies.
There are three key sources of privacy risk within an organization:
- Raw customer data and any of its derivatives. Raw customer data can be customer-entered data such as name, address, age, sex, and other profile details, or data on how the customer uses the product, such as page visits, session duration, items in cart, purchase history, and payment settings.
- Metadata and logs. These include the customer’s location, the location the product or website was accessed from, device IP and MAC addresses, service logs, logs of calls with customer support, and so on.
- ML models that have been trained on customer data. ML models themselves can seem like they contain nothing personal, but they can memorize patterns in the data they were trained on. Models trained on sensitive customer data can retain customer-attributable personal data within the model and present an exposure risk regardless of whether the model is deployed in the cloud or on edge devices. If a malicious actor gains access to such a model, even as a black box, they can run a series of attacks, such as membership inference, to recover personal data and cause a privacy breach (see the sketch after this list).
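To make that last risk concrete, below is a minimal sketch of a loss-based membership inference test against a hypothetical black-box classifier. The `predict_proba` interface, the labels, and the threshold are assumptions for illustration; real attacks (e.g., shadow-model attacks) are considerably more sophisticated, but the core signal is the same: models tend to be more confident on records they were trained on.

```python
import numpy as np

def membership_loss(model, x, true_label):
    """Black-box membership signal: records seen during training tend to get
    higher confidence (lower loss) on their true label than unseen records."""
    probs = model.predict_proba(x.reshape(1, -1))[0]   # assumed sklearn-style API
    return -np.log(probs[true_label] + 1e-12)          # negative log-likelihood

def guess_membership(model, x, true_label, threshold=0.5):
    """Guess the record was in the training set if its loss falls below a
    threshold calibrated on known non-member data (assumed available here)."""
    return membership_loss(model, x, true_label) < threshold
```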
An ML model’s security classification should be determined by the data classification of its training data. ML model artifacts can contain plaintext customer data, and the model itself is susceptible to privacy attacks. If an organization runs a marketplace and shares ML models with external partners, even under NDAs and data sharing agreements, those models present a high risk of privacy attacks.
Organizations that want to ensure data privacy compliance should conduct a gap analysis to identify potential risks and weaknesses. Data privacy impact assessments (DPIAs) are an essential tool for running such a gap analysis. The process involves examining existing practices, policies, and procedures related to privacy and data protection to assess how well they align with current legal requirements. Gap analysis typically sits with the Security and Data Privacy functions within an organization and is usually led by the Data Protection Officer (DPO). It can also be outsourced, but the organization requesting it remains responsible for it.
When conducting a gap analysis, organizations need to consider all aspects of data protection, including physical security measures, access control, and data encryption technologies. They should also review their policies and procedures for information handling, data storage, and sharing. Organizations should consider potential threats from external sources (e.g., cybercriminals) as well as internal threats resulting from human error or malicious intent. For example, under GDPR it is important not only to account for which users have access to customer data, but also to evaluate why employees need that access in the first place. If the use case is not justified under the predefined principles for processing personal data, the permissions should be revoked immediately. The assessment should also consider the likelihood of various threats against protected data assets and the estimated impact of each threat on the organization’s operations if realized.
Once weaknesses have been identified, organizations can take steps to close the gap by implementing the necessary changes, such as adopting new tools or updating existing policies. For example, organizations can implement fine-grained access control, such as access that only works for a short duration (time-bound access control), only within a predefined geographic location, or only from a fixed set of devices or IP addresses (see the sketch below). Additionally, they may need to run extra training sessions so that employees are aware of the latest data protection regulations and take the right measures when handling customer data.
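As an illustration of what such fine-grained checks can look like, the sketch below combines a time-bound and a network-bound condition into a single access decision. The grant structure, names, and values are hypothetical; in practice these rules usually live in an IAM system or policy engine rather than application code.

```python
from datetime import datetime, timezone
from ipaddress import ip_address, ip_network

# Hypothetical grant: expires after a fixed window and only works from approved networks.
GRANT = {
    "user": "analyst-42",
    "resource": "customer-profiles",
    "expires_at": datetime(2024, 6, 30, tzinfo=timezone.utc),
    "allowed_networks": [ip_network("10.0.0.0/8"), ip_network("192.168.1.0/24")],
}

def is_access_allowed(grant, user, resource, source_ip, now=None):
    """Time-bound, network-bound access check (illustrative only)."""
    now = now or datetime.now(timezone.utc)
    if user != grant["user"] or resource != grant["resource"]:
        return False
    if now > grant["expires_at"]:                  # time-bound: access auto-expires
        return False
    ip = ip_address(source_ip)
    return any(ip in net for net in grant["allowed_networks"])  # device/location-bound
```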
A DPIA and gap analysis are not a one-time exercise; organizations should conduct a DPIA whenever they consider introducing new systems or practices that involve personal data. Overall, gap analysis is an essential component of maintaining an effective data privacy program. It helps reduce the risk of breaches and ensures compliance with applicable data protection laws. By taking a proactive approach to gap analysis for data privacy compliance, organizations can better protect their customers’ sensitive information while ensuring a high level of security for all systems and operations that handle personal data.
Privacy-enhancing technologies (PETs) are, as the name suggests, tools that help organizations identify, reduce, or eliminate potential data privacy risks. By deploying PETs across their systems, organizations can minimize leakage of sensitive personal information and demonstrate compliance with applicable data protection requirements. Some examples of PETs include tokenization, differential privacy, homomorphic encryption, federated learning, and secure multi-party computation.
Tokenization: the process of replacing sensitive customer data, such as names or SSNs, with a pseudonym, an anonymous token, or a random string that holds no associated meaning. This prevents malicious actors from accessing valuable customer data should a breach occur. For example, a retailer may store a hypothetical credit card number 1234–5678–9011–2345 by replacing the middle 8 digits with randomly generated characters. The retailer can still identify and use the card, but the real number is never exposed if the database is breached. One shortcoming of this technique is that to use the card again for legitimate purposes (like automated subscription payments), the organization needs a deterministic way to recover the original number from the tokenized value. If that detokenization mechanism falls into the wrong hands, it can lead to a privacy breach.
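A minimal sketch of this idea for the hypothetical card number above: the middle eight digits are replaced with random digits, and the mapping back to the real value is kept in a separate store. The in-memory dictionary stands in for what would be a hardened, access-controlled token vault (or format-preserving encryption) in production.

```python
import secrets

_vault = {}  # token -> original value; in practice a separate, tightly controlled vault

def tokenize_card(card_number: str) -> str:
    """Replace the middle 8 digits of a 16-digit card number with random digits,
    keeping the first 4 and last 4 so the token is still recognizable and routable."""
    digits = card_number.replace("-", "")
    middle = "".join(str(secrets.randbelow(10)) for _ in range(8))
    token = f"{digits[:4]}-{middle[:4]}-{middle[4:]}-{digits[-4:]}"
    _vault[token] = card_number
    return token

def detokenize(token: str) -> str:
    """Deterministically recover the original number, e.g. for a recurring charge."""
    return _vault[token]

token = tokenize_card("1234-5678-9011-2345")
print(token)              # e.g. 1234-9305-4417-2345
print(detokenize(token))  # 1234-5678-9011-2345
```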
Differential Privacy: a method for protecting the privacy of individuals in a dataset by adding random noise in a way that makes it difficult to identify any individual while preserving the overall statistics. The goal is to ensure that nothing specific to any individual in the dataset is revealed, while still allowing useful analysis of the data as a whole. One example is the use of differential privacy in the US Census. The Census Bureau collects a large amount of information from individuals, including sensitive attributes like income and race. To protect individuals’ privacy, the Bureau adds noise to the data before releasing it to researchers. This makes it difficult for anyone to determine information about a specific individual, while overall trends and patterns can still be analyzed. Adding noise also creates challenges by making it harder to extract accurate insights from the data: the stronger the privacy guarantee (i.e., the smaller the privacy budget), the more noise must be added, which reduces the data’s utility for analysis. Differential privacy algorithms can also be complex and difficult to implement, especially for large datasets or certain types of queries, and they can be computationally expensive, sometimes requiring specialized tooling.
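As a concrete illustration, the sketch below releases a differentially private count using the Laplace mechanism: a counting query has sensitivity 1, so adding noise drawn from Laplace(1/ε) gives ε-differential privacy. The dataset and the epsilon value are made up for the example.

```python
import numpy as np

def dp_count(values, predicate, epsilon):
    """Release a count with epsilon-differential privacy via the Laplace mechanism.
    A counting query has sensitivity 1: adding or removing one person changes it by at most 1."""
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical example: how many respondents reported an income above 100k?
incomes = [42_000, 105_000, 87_000, 130_000, 99_000, 250_000]
print(dp_count(incomes, lambda x: x > 100_000, epsilon=0.5))  # true answer is 3, plus noise
```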
Homomorphic Encryption: a type of encryption that allows computations to be performed directly on ciphertext, i.e., on encrypted data. The result of the computation is still encrypted, but it can be decrypted to reveal the result of the computation on the original plaintext. This allows sensitive data to be processed and analyzed without ever being decrypted, maintaining its privacy and security. One example is voting systems: a system can encrypt the ballots and then compute on the encrypted ballots to determine the winner of the election. The encrypted tally can be decrypted to reveal the result, but the individual votes remain private. Homomorphic encryption can be challenging to adopt due to its computational inefficiency, limited functionality, key management and scalability concerns, lack of standardization, implementation complexity, and limited commercial use. More research is needed to improve the efficiency of homomorphic encryption schemes to make them practical in real-world scenarios.
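The toy sketch below shows the voting idea with additively homomorphic (Paillier) encryption, assuming the open-source python-paillier (`phe`) package is installed: encrypted ballots are summed without decrypting any individual vote, and only the final tally is decrypted. Fully homomorphic schemes that support arbitrary computation are considerably heavier than this.

```python
from phe import paillier  # assumes the python-paillier package is available

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Each ballot is 1 (yes) or 0 (no), encrypted on the voter's side.
ballots = [1, 0, 1, 1, 0]
encrypted_ballots = [public_key.encrypt(b) for b in ballots]

# The tally server adds ciphertexts without ever seeing an individual vote.
encrypted_tally = encrypted_ballots[0]
for enc_vote in encrypted_ballots[1:]:
    encrypted_tally = encrypted_tally + enc_vote

# Only the holder of the private key learns the final count.
print(private_key.decrypt(encrypted_tally))  # 3
```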
Federated learning: a machine learning technique that allows multiple parties to train a model on their own data while keeping that data private and on-premises. This is achieved by training the model locally on each device or party and then aggregating the model updates over a secure communication channel, rather than sharing the data itself. One example is mobile devices: a phone maker may want to train a model to improve its keyboard app. With federated learning, the company can train the model on data from users’ devices without ever collecting or sharing that data, and the updated models from each device are aggregated to improve the global model. Federated learning is computationally expensive and may require specialized infrastructure that typical organizations do not have. Additionally, data from different parties may follow different distributions, which can make it difficult to train a single model that works well for all parties.
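To illustrate the mechanics, here is a minimal NumPy simulation of federated averaging (FedAvg): each hypothetical client fits a small linear model on its own private data, and only the weight vectors, never the raw data, are averaged centrally. Real deployments layer secure aggregation, compression, and often differential privacy on top of the updates.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=20):
    """One client's training round on its private data (simple linear regression)."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # mean-squared-error gradient
        w -= lr * grad
    return w

# Hypothetical clients, each with data that never leaves the device.
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

global_w = np.zeros(2)
for round_num in range(5):
    # Each client trains locally; only the updated weights are sent back.
    client_weights = [local_update(global_w, X, y) for X, y in clients]
    global_w = np.mean(client_weights, axis=0)   # FedAvg: average the model updates

print(global_w)  # approaches [2.0, -1.0] without any raw data being shared
```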
Privacy-enhancing technologies are rapidly evolving, with tremendous advances made in the last 5 years. However, PETs are not a magic bullet, and a few challenges still need to be overcome. The biggest is that each PET is unique and offers different capabilities with different privacy-versus-utility tradeoffs. Organizations need to deeply understand their use cases and evaluate which PET works best for them. In addition, some solutions may require significant IT resources or technical expertise to deploy, meaning not all organizations will be able to make use of this type of technology. PETs can also be costly to implement. Finally, these solutions require regular maintenance, such as correcting for model drift or re-training models on up-to-date data, and it can be difficult for organizations to keep up with the updates needed to ensure the protections remain effective.
Passionate people across academia, research labs, and startups are pushing to overcome these challenges and make PETs part of every organization’s SaaS toolkit. I highly encourage anyone interested to dive in and stay up to date by attending conferences, reading research papers, and joining the open-source community.