As deep learning models grow larger, so too does the demand for the data needed to train them. Since most organizations don’t have access to data at the scale needed to train some of the newest state-of-the-art deep learning models, it is now becoming common practice to augment datasets with synthetic data, i.e., data that is artificially generated to match the distribution of the training data.
Generative models are the main workhorse of synthetic data creation, and a few broad classes of those models lend themselves well to particular problems. This article will present classes of generative models, give some examples of each, and discuss how to apply them to synthetic data generation.
Probably the best-known class of generative model is the generative adversarial network (GAN). Introduced by Ian Goodfellow and colleagues in their seminal 2014 research paper, “Generative Adversarial Nets,” GANs work by training two networks, one called “the generator” and the other called “the discriminator.”
The generator is trained to generate samples similar to those in the training dataset, and the discriminator is trained to predict whether those samples are from the underlying data distribution (“real”) or not (“fake”). The idea is that the generator tries to fool the discriminator into thinking that it is generating real samples and, in doing so, learns how to approximate the true data distribution.
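The two competing objectives can be sketched in a few lines of plain Python. This is only an illustration of the original minimax formulation, not a training loop: the function names and the discriminator outputs below are made up, and the networks themselves are left out.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: D should score real samples near 1 and fakes near 0.

    d_real and d_fake are lists of discriminator outputs in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """Original minimax generator loss: minimize log(1 - D(G(z)))."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs, for illustration only.
confident = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
confused = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(confident, confused)  # the confident discriminator has the lower loss
```

The generator's loss falls as the discriminator's scores on generated samples rise, which is what "fooling the discriminator" means in practice.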
Since the creation of the original GAN, thousands of variations and improvements have been introduced. These days, it seems as if GANs can learn anything, from 3D faces to emojis to fashion styles to music.
Of course, that’s not to say training a GAN is easy. GANs frequently suffer from unstable training, since the competing loss functions from the generator and discriminator can be difficult to optimize simultaneously.
One common problem is vanishing gradients, which usually arise when the discriminator becomes too strong to be fooled by the generator. In this case, the generator gets almost no signal to train on: none of its samples pass as real, so the back-propagated gradient is zero, or close to it.
A solution in this case might be to add some regularization to the discriminator or reduce the number of parameters it has in order to give the generator a fighting chance. Using different loss functions such as the Wasserstein loss or a modified minimax loss can also help to ameliorate the problem.
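To see why a modified loss helps, it is enough to compare gradients with respect to the discriminator's logit on a generated sample. The sketch below assumes the discriminator ends in a sigmoid and takes the "modified minimax loss" to be the common non-saturating variant, -log D(G(z)); the function names are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(logit):
    """d/d(logit) of log(1 - sigmoid(logit)), the original minimax generator loss."""
    return -sigmoid(logit)

def non_saturating_grad(logit):
    """d/d(logit) of -log(sigmoid(logit)), the non-saturating generator loss."""
    return -(1.0 - sigmoid(logit))

# A logit of -8 means the discriminator is almost certain the sample is fake.
# There the saturating gradient is about -0.0003; the non-saturating one is about -1.0.
for logit in (-8.0, 0.0):
    print(logit, saturating_grad(logit), non_saturating_grad(logit))
```

When the discriminator is confidently rejecting samples, the original loss passes almost nothing back to the generator, while the non-saturating loss still provides a usable gradient.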
Another issue arises from what’s called “mode collapse.” In this pesky scenario, the generator learns to produce a single image, or a handful of images, that fools the discriminator. Having already managed to fool the discriminator, it has no incentive to risk a higher loss by producing novel images that the discriminator may or may not detect.
There are many modifications that can be made to the model to address this. One is to use the Wasserstein loss; another is to use a model called “unrolled GANs,” which changes the loss function to incorporate future discriminator outputs.
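For reference, the Wasserstein loss replaces probabilities with unbounded scores from a "critic." The sketch below shows only the loss arithmetic, with illustrative function names; it omits the constraint that keeps the critic approximately 1-Lipschitz (weight clipping or a gradient penalty), which the method also needs.

```python
def critic_loss(real_scores, fake_scores):
    """Wasserstein critic loss: push scores on real data up and on fakes down."""
    mean_real = sum(real_scores) / len(real_scores)
    mean_fake = sum(fake_scores) / len(fake_scores)
    return mean_fake - mean_real

def generator_loss_w(fake_scores):
    """Wasserstein generator loss: raise the critic's score on generated samples."""
    return -sum(fake_scores) / len(fake_scores)

# Made-up critic scores; they are not probabilities and need not lie in (0, 1).
print(critic_loss([2.0, 4.0], [-1.0, 1.0]))
print(generator_loss_w([1.0, 3.0]))
```

Because the scores are never squashed through a sigmoid, the gradient does not flatten out when the critic becomes confident.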
GANs are used extensively to generate synthetic data in domains where native data is limited. For example, in the medical imaging domain, datasets are often quite small due to the cost and difficulty associated with running MRIs (sometimes with contrast) on patients. Data acquisition is often further limited by HIPAA regulations.
As such, researchers have come up with novel ways to augment such datasets. In one study, a team of researchers led by Veit Sandfort “trained a CycleGAN to transform contrast CT images into non-contrast images.”
They then trained a U-Net on the augmented dataset of synthetic and nonsynthetic images and evaluated its performance compared to the same model trained only on the original data. They found that the model trained on the combined dataset showed significantly improved performance, particularly on segmentation of non-contrast images.
Another famous generative model is the variational autoencoder (VAE), a type of autoencoder neural network trained using principles of variational Bayesian inference. The math underlying it is complex, but in simple terms, the network tries to estimate the data distribution p(x) by writing it as an integral over z of the parameterized joint distribution p(x|z)p(z).
The integral marginalizes out the z’s, which are referred to as “latent variables.” The network is trained to encode an input x into the latent space and then decode it from the latent space, producing a reconstruction that’s as close to the original input as possible. In doing so, it learns latent variables that capture the features most relevant to representing the data.
Once the model is fully trained, researchers can generate synthetic data by perturbing the values in the latent space and then running the decoder to generate new outputs. Not only can you generate new outputs this way, but you can also choose which latent variables to perturb and thereby tweak specific features of the output.
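Written out, the marginalization and the objective usually optimized in its place look as follows. The notation is assumed here rather than taken from the article: θ denotes the decoder parameters, φ the encoder parameters, and q_φ(z|x) the encoder's approximation to the posterior over latent variables. The second line is the standard evidence lower bound, whose two terms correspond to reconstruction quality and to keeping the encoded distribution close to the prior p(z).

```latex
p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz

\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  \;-\; \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```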
For example, if you’re training a VAE to generate faces, there might be separate latent variables that correspond to skin color and hair style, and you can adjust those individually to modify those specific attributes.
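The mechanics of adjusting one latent variable at a time can be sketched with a toy decoder. Everything here is hypothetical: `decode` is a small linear map standing in for a trained VAE decoder, and the weights and latent values are made up for illustration.

```python
def decode(z, weights):
    """Toy linear decoder standing in for a trained VAE decoder (hypothetical)."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]

def traverse(z, dim, values):
    """Return copies of z with coordinate `dim` set to each value in turn."""
    variants = []
    for v in values:
        z_new = list(z)  # copy so the original latent code is left untouched
        z_new[dim] = v
        variants.append(z_new)
    return variants

# Made-up decoder weights: 3 output features, 2 latent variables.
W = [[1.0, 0.0],
     [0.0, 2.0],
     [1.0, 1.0]]
z = [0.5, -1.0]

# Sweep the second latent variable while holding the first fixed.
for z_variant in traverse(z, dim=1, values=[-1.0, 0.0, 1.0]):
    print(z_variant, decode(z_variant, W))
```

In this toy setup the first output feature depends only on the first latent variable, so it stays constant during the sweep. A real VAE offers this kind of clean separation only to the extent that its latent variables happen to be disentangled.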
One issue with VAEs is that they can produce “blurry” or “muddied” outputs. Often more complicated versions of the VAE are required to ameliorate this issue.
The literature on data augmentation with VAEs is also somewhat sparser than that for GANs; however, this seems to be rapidly changing. For example, a recent paper by Clément Chadebec and Stéphanie Allassonnière discussed a VAE that embeds the input data into a latent space characterized as a Riemannian manifold.
While most VAEs embed in zero-curvature Euclidean space, generalized Riemannian manifolds can have positive, negative, or varying curvature, and embedding in such a space allows the model to learn the intrinsic geometry of the input data.
The researchers applied their model to multiple datasets, including MNIST, a fashion dataset, and OASIS, a collection of MRI brain images of patients with and without Alzheimer’s disease. What’s exciting about their model is that classifiers trained on the generated synthetic data alone actually performed better than those trained on the original input data. This suggests that their model learned geometric aspects of the data that were critical to downstream classification.
Flow-based models estimate a probability density using the technique of normalizing flows. The idea is to start with a simple, tractable distribution that can then be translated into a more complex one by applying a series of invertible transformations.
For those who have taken a statistics class, the idea should be somewhat familiar: it amounts to applying the change of variables theorem over and over, each time rescaling the probability density by the absolute value of the determinant of the transformation’s Jacobian matrix.
The key design parameters in a flow-based model are the starting distribution and the transformation functions. Normally, the starting distribution is something simple, such as a zero-mean Gaussian. The transformation functions are chosen to be invertible, easily implemented via a neural network layer, and cheap to apply repeatedly, which in practice means their Jacobian determinants must be easy to compute.
For example, in the RealNVP model, the chosen transformation functions keep the first d dimensions of the D-dimensional input fixed and then scale and shift the remaining dimensions via two simple functions, s and t, that take the fixed dimensions as input. The GLOW model’s transformation functions use an activation normalization, followed by an invertible 1x1 convolution, followed by an affine coupling layer.
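For a single invertible map f that sends a latent sample z to a data point x, and then for a chain of K such maps, the theorem reads as follows. The symbols f, f_k, z_k, and K are notation assumed here for illustration.

```latex
x = f(z), \qquad
p_X(x) = p_Z\big(f^{-1}(x)\big)\, \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|

\log p_X(x) = \log p_Z(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k(z_{k-1})}{\partial z_{k-1}} \right|,
\qquad z_k = f_k(z_{k-1}), \quad x = z_K
```

The log form shows why each transformation needs a cheap Jacobian determinant: one such term is added per layer for every density evaluation.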
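A single affine coupling layer of the kind RealNVP uses can be sketched in plain Python. This is a simplified illustration, not the published architecture: `s` and `t` are toy closed-form functions standing in for neural networks, they return one scalar shared across the transformed dimensions rather than a vector, and `d` is the number of dimensions kept fixed.

```python
import math

def s(x_fixed):
    """Toy scale function (hypothetical; a neural network in practice)."""
    return math.tanh(sum(x_fixed))

def t(x_fixed):
    """Toy shift function (hypothetical; a neural network in practice)."""
    return 0.5 * sum(x_fixed)

def coupling_forward(x, d):
    """Keep x[:d] fixed; scale and shift x[d:]. Returns (y, log|det J|)."""
    fixed, rest = x[:d], x[d:]
    scale, shift = s(fixed), t(fixed)
    y = fixed + [xi * math.exp(scale) + shift for xi in rest]
    # The Jacobian is triangular with exp(scale) on the diagonal of the
    # transformed block, so its log-determinant is just a sum of scales.
    log_det = scale * len(rest)
    return y, log_det

def coupling_inverse(y, d):
    """Invert the layer without ever inverting s or t themselves."""
    fixed, rest = y[:d], y[d:]
    scale, shift = s(fixed), t(fixed)
    return fixed + [(yi - shift) * math.exp(-scale) for yi in rest]

x = [0.3, -0.2, 1.5, 0.7]
y, log_det = coupling_forward(x, d=2)
x_back = coupling_inverse(y, d=2)
print(y, log_det, x_back)
```

Because the fixed dimensions pass through unchanged, the inverse can recompute exactly the same scale and shift, which is why s and t can be arbitrarily complex without breaking invertibility. Real models alternate which dimensions are held fixed from layer to layer so that every dimension is eventually transformed.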
Flow-based models make it easy to model novel and complex probability distributions. They can also be extended to sequences using what are called autoregressive flows, making them great for generating sequential data such as speech.
A 2021 paper by Tomoki Uemura et al. shows the application of flow-based models to data augmentation in the medical imaging domain. They applied a 3D CNN based on the 3D GLOW model to generate synthetic volumes of interest (VOIs) of polyps in a CT colonography dataset.
What’s amazing is that they tested their model by having trained human observers attempt to distinguish between the polyps from real CTs and those generated by their model. The humans were unable to distinguish between the real and synthetic data, a testament to its effectiveness. They found that incorporation of the synthetic VOIs also greatly improved downstream classification accuracy.
However, flow models do have some (manageable) downsides. Defining transformation functions that are flexible enough yet computationally tractable can be tricky. Further, autoregressive flows are slow in one direction: depending on the design, either sampling or density evaluation must proceed one variable at a time, since each variable in the flow depends on the variables earlier in the sequence. That direction can’t be readily parallelized.
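The sequential dependency is easy to see in a toy affine autoregressive flow. The sketch below is an assumption-laden illustration: `conditioner` is a made-up closed-form function standing in for a learned network, and the layout follows the masked-autoregressive style, in which sampling is the slow direction. Inverse autoregressive flows swap the two roles.

```python
import math

def conditioner(prefix):
    """Toy stand-in for a learned network: earlier values -> (shift, log_scale)."""
    total = sum(prefix)
    return 0.5 * total, 0.1 * math.tanh(total)

def sample(z):
    """Noise -> data. Each x[i] needs x[:i], so this loop is inherently sequential."""
    x = []
    for zi in z:
        shift, log_scale = conditioner(x)
        x.append(zi * math.exp(log_scale) + shift)
    return x

def to_noise(x):
    """Data -> noise. Every iteration reads only the already-known x, so the
    iterations are independent of each other and could run in parallel."""
    z = []
    for i, xi in enumerate(x):
        shift, log_scale = conditioner(x[:i])
        z.append((xi - shift) * math.exp(-log_scale))
    return z

z = [0.2, -1.0, 0.4]
x = sample(z)
print(x, to_noise(x))
```

In `sample`, step i cannot start until step i-1 has produced its output, whereas in `to_noise` all inputs are available up front.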
Generative models provide a fantastic means for generating synthetic data in situations where training data is scarce. Each of these three classes of generative models can be applied in different scenarios depending on their various strengths and weaknesses.
GANs are often useful in settings involving images, video, 3D models, or other continuous data, and they can be trained to produce very realistic samples. The downside of using GANs is that they can suffer from unstable training, and it can take some work to correct this. They also do not tend to work well for generating discrete or sequential data, at least not without significant modifications.
VAEs are similar to GANs in that they work well for images, 3D, and other types of continuous data. They are useful in cases where an understanding of the latent space and fine-grained control over the data generation process is desired.
Finally, flow-based models are a relatively new, much-discussed area of research. They can be applied when direct density estimation is needed and for generation of sequential or discrete data, though they work well for continuous data too.