Why is AI-generated synthetic data all the rage these days? In this article, I’ll explain my favorite way: with cats!
Let’s say I want to train a cat-not-cat classifier from scratch, but I only have one photo to work with:
(Everything that follows is an analogy for what people do with tabular data and text data, so it applies beyond image data.)
Ideally, I’m going to need a dataset consisting of thousands of cat and not-cat photos. If I have a camera and plentiful access to cats, I can take a bunch of photos like the one I already have, ensuring that I get exactly the dataset I designed:
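But what if I don’t have a camera and I live catless on the moon? I could get the images I need from a vendor, though I ought to be careful: inherited data (collected by someone else, for their own purposes) is riskier than primary data I collect myself, because I have less control over how it was gathered and documented.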
But what if there’s no vendor who’ll sell me some cat photos? (Yes, running out of cat photos on the internet is a situation that’s more sci-fi than living on the moon, but bear with me.)
Well, if I can’t collect them and I can’t buy them, then I’ll have to make them myself. Behold, my creation:
No good? Yeah, drawing was never my strong suit. Another way to make fake data is to copy existing datapoints, but identical copies add no instructional variety.
It’ll be like teaching a human student by giving them the same example over and over again, so all they learn is that one thing. If my dataset is 30,000 copies of this Huxley photo…