Evaluation of the plausibility and usefulness of the data we generate from real data.
Synthetic data serves many purposes and has been attracting attention for some time, in part due to the compelling capabilities of LLMs. But what is “good” synthetic data and how can we know if we are successful in generating it?
Synthetic data is data that has been generated with the intention of looking like real data, at least in some aspects (the schema at least, the statistical distributions, …). It is usually generated randomly, using a wide range of models: random sampling, noise addition, GANs, diffusion models, variational autoencoders, LLMs, …
It is used for many purposes, for example:
- training and education (for example, discovering a new database or teaching a course),
- data augmentation (i.e. creating new samples to train a model),
- sharing data while protecting privacy (especially useful from an open scientific point of view),
- conducting research while protecting privacy.
It is especially used in software testing and in sensitive areas such as healthcare technology: having access to data that behaves like real data without compromising patient privacy is a dream come true.
Individual plausibility
For a sample to be useful it must, in some way, resemble real data. The ultimate goal is for the generated samples to be indistinguishable from real samples: hyper-realistic faces, sentences, medical records, and so on. Obviously, the more complex the source data, the more difficult it will be to generate “good” synthetic data.
Utility
In many cases, especially in data augmentation, we need more than a realistic sample: we need a complete data set. And generating a single sample is not the same as generating a complete data set. The problem is well known under the name of mode collapse, which is especially common when training a generative adversarial network (GAN). Essentially, the generator (more generally, the model that generates synthetic data) may learn to generate a single type of sample and completely skip the rest of the sample space, leading to a synthetic data set that is not as useful as the original data set.
For example, if we train a model to generate images of animals and it finds a very efficient way to generate images of cats, it might stop generating anything other than images of cats (in particular, not images of dogs). The images of cats would then be the “mode” of the generated distribution.
This type of behavior is detrimental if our initial goal is to augment our data or create a data set for training. What we need is a data set that is realistic in itself, which means that any statistic derived from this data set must be close enough to the same statistic computed on the real data. Statistically speaking, the univariate and multivariate distributions should be equal (or at least “close enough”).
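As a quick illustration, before the fuller statistical tests below, mode coverage on a categorical variable can be checked simply by comparing category frequencies between the two data sets. The following is a minimal sketch; the data frames, the column name and the 10% threshold are hypothetical:
import pandas as pd

def mode_coverage(
    real_data: pd.DataFrame,
    synthetic_data: pd.DataFrame,
    column: str
) -> pd.DataFrame:
    # relative frequency of each category in both data sets
    real_freq = real_data[column].value_counts(normalize=True)
    synthetic_freq = synthetic_data[column].value_counts(normalize=True)
    comparison = pd.concat(
        [real_freq, synthetic_freq], axis=1, keys=["real", "synthetic"]
    ).fillna(0.0)
    # a category frequent in the real data but rare or absent in the
    # synthetic data is a sign of mode collapse
    comparison["possible_collapse"] = comparison["synthetic"] < 0.1 * comparison["real"]
    return comparison

# e.g. mode_coverage(real_animals, synthetic_animals, column="species")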
Privacy
We will not delve too deeply into this topic, which would deserve an article on its own. To be brief: depending on our initial goal, we may have to share data (more or less publicly), which means that if it is personal data, it must be protected. For example, we need to ensure that we cannot recover any information about any given individual from the original data set using the synthetic data set. In particular, that means watching out for outliers or checking that the generator hasn't generated any original samples.
One way to consider the privacy issue is to use the differential privacy framework.
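For instance, a basic sanity check is to verify that no synthetic row is an exact (or near-exact) copy of a real row. Below is a minimal sketch of such a check, assuming purely numerical tables like the ones used in the rest of this article; the function name and the way the distances are summarized are my own choices:
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def copy_checks(real_data: pd.DataFrame, synthetic_data: pd.DataFrame) -> None:
    # exact copies: synthetic rows that also appear verbatim in the real data
    exact_copies = synthetic_data.merge(real_data, how="inner")
    print(f"Exact copies of real samples: {len(exact_copies)}")
    # distance from each synthetic sample to its closest real sample:
    # very small distances suggest the generator memorized real records
    nn = NearestNeighbors(n_neighbors=1).fit(real_data)
    distances, _ = nn.kneighbors(synthetic_data)
    print(f"Smallest distance to a real sample: {distances.min():.4f}")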
Let's start by loading a data set and generating a synthetic counterpart from it. We will use the famous “iris” data set, and generate its synthetic counterpart with the Synthetic Data Vault (SDV) package.
pip install sdv
from sklearn.datasets import load_iris
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata.metadata import Metadata

data = load_iris(return_X_y=False, as_frame=True)
real_data = data["data"]
# metadata of the `iris` dataset
metadata = Metadata.load_from_dict({
    "tables": {
        "iris": {
            "columns": {
                "sepal length (cm)": {
                    "sdtype": "numerical",
                    "computer_representation": "Float"
                },
                "sepal width (cm)": {
                    "sdtype": "numerical",
                    "computer_representation": "Float"
                },
                "petal length (cm)": {
                    "sdtype": "numerical",
                    "computer_representation": "Float"
                },
                "petal width (cm)": {
                    "sdtype": "numerical",
                    "computer_representation": "Float"
                }
            },
            "primary_key": None
        }
    },
    "relationships": [],
    "METADATA_SPEC_VERSION": "V1"
})
# train the synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=real_data)
# generate samples - in this case,
# synthetic_data has the same shape as real_data
synthetic_data = synthesizer.sample(num_rows=150)
Sample level
Now we want to test if it is possible to know whether a single sample is synthetic or not.
With this formulation, we easily see that this is fundamentally a binary classification problem (synthetic versus original). Therefore, we can train any model to distinguish original data from synthetic data: if this model achieves good accuracy (which here means significantly above 0.5), the synthetic samples are not realistic enough. Our goal is an accuracy of 0.5 (assuming the test set contains half original samples and half synthetic samples), which would mean that the classifier is making random guesses.
As in any classification problem, we should not limit ourselves to weak models and put a lot of effort into hyperparameter selection and model training.
Now for the code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

def classification_evaluation(
    real_data: pd.DataFrame,
    synthetic_data: pd.DataFrame
) -> float:
    x = pd.concat((real_data, synthetic_data))
    y = np.concatenate(
        (
            np.zeros(real_data.shape[0]),
            np.ones(synthetic_data.shape[0])
        )
    )
    Xtrain, Xtest, ytrain, ytest = train_test_split(
        x,
        y,
        test_size=0.2,
        stratify=y
    )
    clf = RandomForestClassifier()
    clf.fit(Xtrain, ytrain)
    score = accuracy_score(clf.predict(Xtest), ytest)
    return score
classification_evaluation(real_data, synthetic_data)
>>> 0.9
In this case, it seems that the synthesizer was unable to fool our classifier: the synthetic data is not realistic enough.
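As mentioned above, the verdict is only as strong as the classifier, so before concluding it is worth tuning the model. Here is a minimal sketch with a small (and arbitrary) grid search, reusing real_data and synthetic_data from above:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd

x = pd.concat((real_data, synthetic_data))
y = np.concatenate((np.zeros(real_data.shape[0]), np.ones(synthetic_data.shape[0])))
Xtrain, Xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, stratify=y)
# small, arbitrary parameter grid; a real study would search a wider space
grid = GridSearchCV(
    RandomForestClassifier(),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5
)
grid.fit(Xtrain, ytrain)
print(accuracy_score(grid.predict(Xtest), ytest))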
Data set level
Even if our samples were realistic enough to fool a reasonably powerful classifier, we would still need to evaluate our data set as a whole. This time, the problem cannot be translated into a classification one, and we need to use several indicators.
Statistical distributions
The most obvious tests are statistical ones: are the univariate distributions of the original data set the same as those of the synthetic data set? Are the correlations the same?
Ideally, we would like to test the n-variate distributions for any n, which can be especially expensive for a large number of variables. However, even univariate distributions allow us to see whether our data set is subject to mode collapse.
Now for the code:
import pandas as pd
from scipy.stats import ks_2samp

def univariate_distributions_tests(
    real_data: pd.DataFrame,
    synthetic_data: pd.DataFrame
) -> None:
    for col in real_data.columns:
        if real_data[col].dtype.kind in "biufc":
            # two-sample Kolmogorov-Smirnov test on each numerical column
            stat, p_value = ks_2samp(real_data[col], synthetic_data[col])
            print(f"Column: {col}")
            print(f"P-value: {p_value:.4f}")
            print("Significantly different" if p_value < 0.05 else "Not significantly different")
            print("---")
univariate_distributions_tests(real_data, synthetic_data)
>>> Column: sepal length (cm)
P-value: 0.9511
Not significantly different
---
Column: sepal width (cm)
P-value: 0.0000
Significantly different
---
Column: petal length (cm)
P-value: 0.0000
Significantly different
---
Column: petal width (cm)
P-value: 0.1804
Not significantly different
---
In our case, out of the 4 variables, only 2 have similar distributions in the real data set and the synthetic data set. This shows that our synthesizer fails to reproduce the basic properties of this data set.
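The KS tests above only look at one variable at a time. To get a first idea of whether the relationships between variables are preserved, as asked at the beginning of this section, we can also compare the two correlation matrices. Here is a minimal sketch using the mean absolute difference between the Pearson correlation matrices; the function name and the aggregation choice are mine:
import pandas as pd

def correlation_difference(
    real_data: pd.DataFrame,
    synthetic_data: pd.DataFrame
) -> float:
    # Pearson correlation matrices of the numerical columns
    real_corr = real_data.corr(numeric_only=True)
    synthetic_corr = synthetic_data.corr(numeric_only=True)
    # mean absolute difference: 0 would mean identical correlation structures
    return (real_corr - synthetic_corr).abs().mean().mean()

correlation_difference(real_data, synthetic_data)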
Visual inspection
Although it does not constitute a mathematical proof, a visual comparison of the data sets can be helpful.
The first method is to plot bivariate distributions (or correlation graphs).
We can also plot all dimensions of the data set at once: for example, given a tabular data set and its synthetic equivalent, we can plot both data sets using a dimension reduction technique, such as t-SNE, PCA, or UMAP. With a perfect synthesizer, the scatterplots should look the same.
Now for the code:
pip install umap-learn
import pandas as pd
import seaborn as sns
import umap
import matplotlib.pyplot as plt

def plot(
    real_data: pd.DataFrame,
    synthetic_data: pd.DataFrame,
    kind: str = "pairplot"
):
    assert kind in ("umap", "pairplot")
    # work on copies so that the original data frames are not modified
    real_data = real_data.copy()
    synthetic_data = synthetic_data.copy()
    real_data["label"] = "real"
    synthetic_data["label"] = "synthetic"
    x = pd.concat((real_data, synthetic_data))
    if kind == "pairplot":
        sns.pairplot(x, hue="label")
    elif kind == "umap":
        reducer = umap.UMAP()
        embedding = reducer.fit_transform(x.drop("label", axis=1))
        plt.scatter(
            embedding[:, 0],
            embedding[:, 1],
            c=[sns.color_palette()[i] for i in x["label"].map({"real": 0, "synthetic": 1})],
            s=30,
            edgecolors="white"
        )
        plt.gca().set_aspect("equal", "datalim")
        sns.despine(top=True, right=True, left=False, bottom=False)
plot(real_data, synthetic_data, kind="pairplot")
We already see in these graphs that the bivariate distributions are not identical between the real data and the synthetic data, which is a further indication that the synthesis process failed to reproduce the higher-order relationships between the dimensions of the data.
Now let's look at a representation of all four dimensions at once:
plot(real_data, synthetic_data, kind="umap")
In this image it is also clear that the two data sets are different from each other.
Information
A synthetic data set should be as useful as the original data set. In particular, it should be equally useful for prediction tasks, meaning it should capture the complex relationships between features. Hence a comparison: TSTR vs. TRTR, which stands for “Train on Synthetic, Test on Real” versus “Train on Real, Test on Real”. What does it mean in practice?
For a given data set, we choose a task, such as predicting the next token, the next event, or one column given the others. For this task, we train a first model on the synthetic data set and a second model on the original data set. We then evaluate both models on a common test set, which is an extract of the original data set. Our synthetic data set is considered useful if the performance of the first model is close to that of the second, whatever that performance is. It would mean that it is possible to learn the same patterns from the synthetic data set as from the original data set, which is ultimately what we want (especially in the case of data augmentation).
Now for the code:
import pandas as pd
from typing import Tuple
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

def tstr(
    real_data: pd.DataFrame,
    synthetic_data: pd.DataFrame,
    target: str = None
) -> Tuple[float, float]:
    # if no target is specified, use the last column of the dataset
    if target is None:
        target = real_data.columns[-1]
    X_real_train, X_real_test, y_real_train, y_real_test = train_test_split(
        real_data.drop(target, axis=1),
        real_data[target],
        test_size=0.2
    )
    X_synthetic, y_synthetic = synthetic_data.drop(target, axis=1), synthetic_data[target]
    # create regressors (could have been classifiers)
    reg_real = RandomForestRegressor()
    reg_synthetic = RandomForestRegressor()
    # train the models
    reg_real.fit(X_real_train, y_real_train)
    reg_synthetic.fit(X_synthetic, y_synthetic)
    # evaluate both models on the same (real) test set
    trtr_score = reg_real.score(X_real_test, y_real_test)
    tstr_score = reg_synthetic.score(X_real_test, y_real_test)
    return trtr_score, tstr_score
tstr(real_data, synthetic_data)
>>> (0.918261846477529, 0.5644428690930647)
It clearly seems that the “real” regressor learned a certain relationship, while the “synthetic” regressor failed to learn this relationship. This suggests that the relationship was not faithfully reproduced in the synthetic data set.
Synthetic data quality assessment is not based on a single indicator; metrics must be combined to get a complete picture. This article showed some indicators that can easily be built. I hope it has given you some useful tips on how to evaluate synthetic data for your own use case.
Don't hesitate to share and comment