We noticed that our internal predecessors to DALL·E 2 would sometimes reproduce training images verbatim. This behavior was undesirable, since we would like DALL·E 2 to create original, unique images by default and not simply “stitch” together pieces of existing images. Additionally, reproducing training images verbatim can raise legal questions around copyright infringement, ownership, and privacy (if photos of individuals were present in the training data).
To better understand the issue of image regurgitation, we collected a dataset of prompts that frequently resulted in duplicated images. To do this, we used a trained model to sample images for 50,000 prompts from our training dataset and ranked the samples by perceptual similarity to the corresponding training image. Finally, we inspected the top matches by hand, finding only a few hundred true duplicate pairs out of the 50,000 total prompts. Even though the regurgitation rate appeared to be less than 1%, we felt it was necessary to push the rate down to 0 for the reasons stated above.
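The exact perceptual similarity metric is not specified here, but the ranking step might look something like the sketch below, which assumes a hypothetical `embed_images` helper that maps a batch of images to unit-normalized feature vectors from some perceptual embedding model:

```python
# Sketch of the regurgitation search: for each prompt, compare the generated
# sample against the training image for that prompt and rank by similarity.
# `embed_images` is a hypothetical helper returning unit-normalized perceptual
# embeddings; the actual metric used internally isn't specified here.
import numpy as np

def rank_regurgitation_candidates(generated, originals, embed_images):
    """Return prompt indices sorted from most to least similar pair."""
    g = embed_images(generated)    # (n_prompts, d), unit-normalized
    o = embed_images(originals)    # (n_prompts, d), unit-normalized
    sims = np.sum(g * o, axis=1)   # cosine similarity per (sample, training image) pair
    order = np.argsort(-sims)      # inspect the top of this list by hand
    return order, sims[order]
```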
When we studied our dataset of regurgitated images, we noticed two patterns. First, the images were almost all simple vector graphics, which were likely easy to memorize due to their low information content. Second, and more importantly, the images all had many near-duplicates in the training dataset. For example, there might be a vector graphic which looks like a clock showing the time 1 o’clock, but then we would discover a training sample containing the same clock showing 2 o’clock, and then 3 o’clock, and so on. Once we realized this, we used a distributed nearest neighbor search to verify that, indeed, all of the regurgitated images had perceptually similar duplicates in the dataset. Other works have observed a similar phenomenon in large language models, finding that data duplication is strongly linked to memorization.
The above finding suggested that, if we deduplicated our dataset, we could solve the regurgitation problem. To achieve this, we planned to use a neural network to identify groups of images that looked alike, and then remove all but one image from each group.[^footnote-2]
However, this would require checking, for each image, whether it is a duplicate of every other image in the dataset. Since our whole dataset contains hundreds of millions of images, we would naively need to check hundreds of trillions of image pairs to find all the duplicates. While this is technically within reach, especially on a large compute cluster, we found a much more efficient alternative that works almost as well at a small fraction of the cost. Consider what happens if we cluster our dataset before performing deduplication. Since nearby samples often fall into the same cluster, most of the duplicate pairs would not cross cluster decision boundaries. We could then deduplicate samples within each cluster without checking for duplicates outside of the cluster, while only missing a small fraction of all duplicate pairs. This is much faster than the naive approach, since we no longer have to check every single pair of images.[^footnote-3]
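As a rough illustration of this idea (not our exact implementation), the sketch below clusters precomputed, unit-normalized image embeddings with K-means and then compares pairs only within each cluster; the similarity threshold is illustrative:

```python
# Sketch of clustering-based deduplication on precomputed image embeddings.
# Assumes `features` is an (N, d) array of unit-normalized vectors; the
# 0.95 cosine-similarity threshold is an illustrative choice.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def dedup_within_clusters(features, n_clusters=1024, threshold=0.95):
    """Return indices of images to keep, comparing pairs only within clusters."""
    labels = MiniBatchKMeans(n_clusters=n_clusters).fit_predict(features)
    keep = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        kept = []
        for i in members:
            # Keep image i only if it is not a near-duplicate of an image we already kept.
            if all(float(features[i] @ features[j]) < threshold for j in kept):
                kept.append(i)
        keep.extend(kept)
    return sorted(keep)
```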
When we tested this approach empirically on a small subset of our data, it found 85% of all duplicate pairs when using K=1024 clusters. To improve the success rate of the above algorithm, we leveraged one key observation: when you cluster different random subsets of a dataset, the resulting cluster decision boundaries are often quite different. Therefore, if a duplicate pair crosses a cluster boundary for one clustering of the data, the same pair might fall inside a single cluster in a different clustering. The more clusterings you try, the more likely you are to discover a given duplicate pair. In practice, we settled on using five clusterings, which means that we search for duplicates of each image in the union of five different clusters. In practice, this found 97% of all duplicate pairs on a subset of our data.
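A hedged sketch of this refinement: fit several K-means models on different random subsets of the embeddings, assign every image a label under each clustering, and take the union of within-cluster pairs as the candidate duplicates to check. The subset size and the use of seeds here are illustrative assumptions:

```python
# Sketch of candidate-pair generation with multiple clusterings. Two images are
# considered candidate duplicates if they share a cluster in ANY clustering.
import numpy as np
from itertools import combinations
from sklearn.cluster import MiniBatchKMeans

def candidate_pairs(features, n_clusters=1024, n_clusterings=5, fit_subset=100_000):
    """Return the set of index pairs that co-occur in at least one cluster."""
    pairs = set()
    for seed in range(n_clusterings):
        rng = np.random.default_rng(seed)
        subset = rng.choice(len(features), size=min(fit_subset, len(features)), replace=False)
        km = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed).fit(features[subset])
        labels = km.predict(features)  # assign every image a cluster under this clustering
        for c in range(n_clusters):
            members = np.where(labels == c)[0]
            pairs.update(combinations(members.tolist(), 2))
    return pairs
```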
Surprisingly, almost a quarter of our dataset was removed by deduplication. When we looked at the near-duplicate pairs that were found, many of them included meaningful changes. Recall the clock example from above: the dataset might include many images of the same clock at different times of day. While these images are likely to make the model memorize this particular clock’s appearance, they might also help the model learn to distinguish between times of day on a clock. Given how much data was removed, we were worried that removing images like this might have hurt the model’s performance.
To test the effect of deduplication on our models, we trained two models with identical hyperparameters: one on the full dataset, and one on the deduplicated version of the dataset. To compare the models, we used the same human evaluations we used to evaluate our original GLIDE model. Surprisingly, we found that human evaluators slightly preferred the model trained on deduplicated data, suggesting that the large quantity of redundant images in the dataset was actually hurting performance.
Once we had a model trained on deduplicated data, we reran the regurgitation search we had previously done over the 50,000 prompts from the training dataset. We found that the new model never regurgitated a training image when given the exact prompt for the image from the training dataset. To take this test another step further, we also performed a nearest neighbor search over the entire training dataset for each of the 50,000 generated images. This way, we thought we might catch the model regurgitating a different image than the one associated with a given prompt. Even with this more thorough check, we never found a case of image regurgitation.
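A sketch of this stricter check, assuming unit-normalized embeddings for both the generated samples and the full training set, with an illustrative similarity threshold:

```python
# Sketch of the full nearest-neighbor check: compare every generated image
# against the entire training set, not just the image paired with its prompt.
import numpy as np

def find_regurgitations(gen_feats, train_feats, threshold=0.95, chunk=4096):
    """Return (generated_idx, training_idx, similarity) for suspiciously close matches."""
    hits = []
    for start in range(0, len(train_feats), chunk):
        block = train_feats[start:start + chunk]   # scan the training set in chunks
        sims = gen_feats @ block.T                 # cosine similarities
        gi, ti = np.nonzero(sims >= threshold)
        hits.extend(zip(gi.tolist(), (ti + start).tolist(), sims[gi, ti].tolist()))
    return hits
```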