Recently, differential privacy (DP) has emerged as a mathematically sound notion of user privacy for data aggregation and machine learning (ML), with practical deployments including the 2022 US Census and in industry. Over the last few years, we have open-sourced libraries for privacy-preserving analytics and ML and have been constantly enhancing their capabilities. Meanwhile, the research community has developed new algorithms for several analytic tasks involving the private aggregation of data.
One of these important data aggregation methods is the heat map. Heat maps are popular for visualizing aggregated data in two or more dimensions. They are widely used in many fields, including computer vision, image processing, spatial data analysis, bioinformatics, and more. Protecting the privacy of user data is critical to many heatmap applications. For example, heat maps for genetic microdata are based on private data of individuals. Similarly, a heat map of popular locations in a geographic area is based on user location records that must be kept private.
Motivated by such applications, in “Differentially Private Heatmaps” (presented at AAAI 2023), we describe an efficient DP algorithm for computing heatmaps with provable guarantees and evaluate it empirically. At the core of our DP algorithm for heatmaps is a solution to the basic problem of how to privately aggregate sparse input vectors (i.e., input vectors with a small number of non-zero coordinates) with a small error as measured under the Earth Mover’s Distance (EMD). Using a hierarchical partitioning procedure, our algorithm views each input vector, as well as the output heatmap, as a probability distribution over a number of items equal to the dimension of the data. For the problem of sparse aggregation under EMD, we give an efficient algorithm with error asymptotically close to the best possible.
Algorithm Description
Our algorithm works by privatizing the aggregated distribution (obtained by averaging over all user inputs), which is sufficient for computing a final heatmap that is private due to the post-processing property of DP. This property ensures that any transformation of the output of a DP algorithm remains differentially private. Our main contribution is a new privatization algorithm for the aggregated distribution, which we describe below.
The EMD measure, a distance-like measure of dissimilarity between two probability distributions originally proposed for computer vision tasks, is well-suited for heatmaps since it takes the underlying metric space into account and considers “neighboring” bins. EMD is used in a variety of applications including deep learning, spatial analysis, human mobility, image retrieval, face recognition, visual tracking, shape matching, and more.
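To build intuition for how EMD rewards mass landing in nearby bins (unlike a per-bin comparison), here is a minimal sketch, not the paper’s exact metric: in one dimension, EMD between two distributions on the same uniform grid equals the L1 distance between their cumulative distribution functions. The function name and example values are illustrative.

```python
import numpy as np

def emd_1d(p, q, bin_width=1.0):
    """Earth mover's distance between two 1D probability
    distributions on the same uniform grid: the L1 distance
    between their cumulative distribution functions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * bin_width

# Moving all the mass one bin over costs 1; moving it two bins
# costs 2, while a naive per-bin L1 distance would be 2 in both cases.
print(emd_1d([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 1.0
print(emd_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))  # 2.0
```

This sensitivity to how far mass moves, rather than just whether it moves, is what makes EMD a natural error measure for spatial heatmaps.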
To achieve DP, we need to add noise to the aggregated distribution. We would also like to preserve statistics at different scales of the grid to minimize the EMD error. So, we create a hierarchical partitioning of the grid, add noise at each level, and then recombine into the final DP aggregated distribution. In particular, the algorithm has the following steps:
- Quadtree construction: Our hierarchical partitioning procedure first splits the grid into four cells, then splits each cell into four subcells; it recursively continues this process until each cell is a single pixel. This procedure creates a quadtree over the subcells, where the root represents the entire grid and each leaf represents a single pixel. The algorithm then computes the total probability mass for each tree node (obtained by summing the probabilities of the aggregated distribution over all leaves in the subtree rooted at this node). This step is illustrated below.
In the first step, we take the aggregated (non-private) distribution (top left) and repeatedly subdivide it to create a quadtree. We then compute the total probability mass of each cell (bottom).
- Noise addition: To each tree node’s mass we then add Laplace noise calibrated to the use case.
- Truncation: To help reduce the final amount of noise in our aggregated DP distribution, the algorithm traverses the tree starting from the root and, at each level, discards all but the top-w nodes with the highest (noisy) masses, together with their descendants.
- Reconstruction: Finally, the algorithm solves a linear program to recover the aggregated distribution. This linear program is inspired by the sparse recovery literature, where the noisy masses are viewed as (noisy) measurements of the data.
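The steps above can be sketched as follows. This is a simplified illustration rather than the paper’s implementation: we assume the grid side is a power of two, split the privacy budget evenly across tree levels with a sensitivity-1 Laplace calibration, and replace the linear-program reconstruction with a simple clip-and-renormalize of the leaf masses.

```python
import numpy as np

def dp_quadtree_heatmap(dist, eps, w, rng=None):
    """Simplified sketch: quadtree masses + Laplace noise +
    top-w truncation. A clip-and-renormalize step stands in for
    the paper's linear-program reconstruction."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = dist.shape[0]                      # assume n is a power of 2
    num_levels = int(np.log2(n)) + 1
    eps_level = eps / num_levels           # assumed even privacy split

    # 1) Quadtree construction: node masses from leaves up to root,
    #    each coarser level summing 2x2 blocks of the finer one.
    levels = [dist.astype(float)]
    while levels[-1].shape[0] > 1:
        m = levels[-1]
        levels.append(m.reshape(m.shape[0] // 2, 2, -1, 2).sum(axis=(1, 3)))
    levels.reverse()                       # levels[0] is the 1x1 root

    # 2) Noise addition: Laplace noise on every node's mass.
    noisy = [m + rng.laplace(scale=1.0 / eps_level, size=m.shape)
             for m in levels]

    # 3) Truncation: from the root down, keep only the top-w noisy
    #    nodes per level; zero out the rest and their descendants.
    alive = np.ones_like(noisy[0], dtype=bool)
    for i, m in enumerate(noisy):
        vals = np.where(alive, m, -np.inf)
        if alive.sum() > w:
            cutoff = np.sort(vals, axis=None)[-w]
            alive &= vals >= cutoff
        m[~alive] = 0.0
        if i + 1 < len(noisy):             # expand mask to children
            alive = np.repeat(np.repeat(alive, 2, axis=0), 2, axis=1)

    # 4) Reconstruction (shortcut): clip negatives, renormalize leaves.
    leaves = np.clip(noisy[-1], 0.0, None)
    total = leaves.sum()
    return leaves / total if total > 0 else leaves
```

For a very sparse input and a large ε, the output concentrates on the correct cells; shrinking ε or w trades accuracy for privacy and noise reduction, respectively.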
Experimental results
We evaluated the performance of our algorithm in two different domains: real-world location check-in data and image saliency data. We consider as a baseline the ubiquitous Laplace mechanism, where we add Laplace noise to each cell, zero out any negative cells, and produce the heatmap from this noisy aggregate. We also consider a “thresholding” variant of this baseline that is more suited to sparse data: only keep the top t% of the cell values (based on the probability mass in each cell) after noising while zeroing out the rest. To evaluate the quality of an output heatmap compared to the true heatmap, we use the Pearson coefficient, KL-divergence, and EMD. Note that when the heatmaps are more similar, the first metric increases but the latter two decrease.
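The baseline, its thresholding variant, and two of the comparison metrics can be sketched as follows. The `top_frac` parameter and the smoothing constant in the KL computation are illustrative choices, not values from the paper, and a sensitivity-1 Laplace calibration is assumed.

```python
import numpy as np

def laplace_baseline(dist, eps, rng=None):
    """Baseline: Laplace noise on every cell, zero out negative
    cells, renormalize into a distribution."""
    if rng is None:
        rng = np.random.default_rng(0)
    noisy = dist + rng.laplace(scale=1.0 / eps, size=dist.shape)
    noisy = np.clip(noisy, 0.0, None)
    s = noisy.sum()
    return noisy / s if s > 0 else noisy

def thresholded_baseline(dist, eps, top_frac=0.01, rng=None):
    """Thresholding variant: keep only the top fraction of noisy
    cells by mass, zero the rest, renormalize."""
    noisy = laplace_baseline(dist, eps, rng)
    k = max(1, int(top_frac * noisy.size))
    cutoff = np.sort(noisy, axis=None)[-k]
    kept = np.where(noisy >= cutoff, noisy, 0.0)
    s = kept.sum()
    return kept / s if s > 0 else kept

def kl_divergence(p, q, smooth=1e-12):
    """KL(p || q) with small smoothing to avoid log(0)."""
    p = np.asarray(p, float).ravel() + smooth
    q = np.asarray(q, float).ravel() + smooth
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def pearson(p, q):
    """Pearson correlation between two flattened heatmaps."""
    return float(np.corrcoef(np.ravel(p), np.ravel(q))[0, 1])
```

A heatmap identical to the ground truth yields a KL-divergence of 0 and a Pearson coefficient of 1; increasing noise (smaller ε) pushes KL and EMD up and Pearson down.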
The location dataset is obtained by combining two datasets, Gowalla and Brightkite, both of which contain check-ins by users of location-based social networks. We pre-processed this dataset to consider only check-ins in the continental US, resulting in a final dataset consisting of ~500,000 check-ins by ~20,000 users. Considering the top cells (from an initial partitioning of the entire space into a 300 × 300 grid) that have check-ins from at least 200 unique users, we partition each such cell into subgrids with a resolution of ∆ × ∆ and assign each check-in to one of these subgrids.
In the first set of experiments, we fix ∆ = 256. We test the performance of our algorithm for different values of ε (the privacy parameter, where smaller ε means stronger DP guarantees), ranging from 0.1 to 10, by running our algorithms together with the baseline and its variants on all cells, randomly sampling a set of 200 users in each trial, and then computing the distance metrics between the true heatmap and the DP heatmap. The average of these metrics is presented below. Our algorithm (the red line) performs better than all versions of the baseline across all metrics, with improvements that are especially significant when ε is not too large or small (i.e., 0.2 ≤ ε ≤ 5).
The metrics averaged over 60 runs when varying ε for the location dataset. Shaded areas indicate 95% confidence interval.
Next, we study the effect of varying the number n of users. Fixing a single cell (with > 500 users) and ε, we vary n from 50 to 500 users. As predicted by theory, our algorithms and the baseline perform better as n increases. However, the behavior of the thresholding variants of the baseline is less predictable.
We also perform another experiment in which we fix a single cell and ε, and vary the resolution ∆ from 64 to 256. In agreement with theory, our algorithm’s performance remains nearly constant over the entire range of ∆. However, the baseline suffers across all metrics as ∆ increases, while the thresholding variants occasionally improve as ∆ increases.
Effect of the number of users and grid resolution on EMD.
We also experiment on the Salicon image saliency dataset (SALICON). This dataset is a collection of saliency annotations on the Microsoft Common Objects in Context (MS-COCO) image database. We downsized the images to a fixed resolution of 320 × 240; each [user, image] pair then consists of a sequence of coordinates in the image where the user looked. We repeated the experiments described above on 38 randomly sampled images (with ≥ 50 users each) from SALICON. As we can see from the examples below, the heatmap obtained by our algorithm is very close to the ground truth.
Additional experimental results, including those for other datasets, metrics, privacy parameters, and DP models, can be found in the paper.
Conclusion
We presented a privatization algorithm for sparse distribution aggregation under the EMD metric, which in turn yields an algorithm for producing privacy-preserving heatmaps. Our algorithm extends naturally to distributed models that can implement the Laplace mechanism, including the secure aggregation model and the shuffle model. This does not apply to the more stringent local DP model, and it remains an interesting open question to devise practical local-DP heatmap/EMD aggregation algorithms for “moderate” numbers of users and privacy parameters.
Acknowledgements
This work was done in conjunction with Junfeng He, Kai Kohlhoff, Ravi Kumar, Pasin Manurangsi, and Vidhya Navalpakkam.