Just a few years ago, the idea of adding synthetic data to your training pipeline for machine learning models might have seemed like either a fool’s errand or a brave experiment with little chance of success. Today, many companies are leveraging synthetic data at scale.
Ever since researchers started using ML to classify images, a small and passionate cohort has asked, “Do I really need to go through the time-consuming and potentially costly process of capturing and labeling my own data?” In some cases, it’s easiest to expand on an existing dataset with augmentation, or even to synthesize additional data in a systematic way.
Today, the practice of augmenting training data with both synthesis-from-scratch and parametric transformations of existing training data has gone from a research niche to a production dependency of many large-scale ML pipelines, especially in computer vision. Businesses often prefer synthetic data because it can ease data scarcity, data bias, and privacy problems, and because it can speed up data collection and annotation.
Synthetic data is no longer an experiment: large enterprises actively use it in production for a variety of use cases. For example, Alphabet’s Waymo subsidiary has invested extensively in both simulation infrastructure and synthetically generated environments, supplementing the “real-world” miles its vehicles have traveled with digital environments they have never physically encountered.
Outside of the realms of pure computer vision and natural language understanding, health insurer Anthem works with Google Cloud to anonymize patient data in order to train better models, while J.P. Morgan and American Express use tabular synthetic data to enhance their fraud detection capabilities.
Still, some experts question whether ML models can benefit if the synthesized data isn’t perfectly convincing to humans. (In the field of simulating lifelike human faces, less-than-convincing renderings are said to occupy the uncanny valley.) In my experience in the field of autonomous vehicles, engineers often questioned whether compressing video frames in a way that is nearly imperceptible to the human eye might obscure useful data from a self-driving-oriented computer vision model. And yet, with every passing year, simulations become more accurate, both visually and in terms of physical kinematics (collisions, deformations, reflections, lighting, and more). As these simulations improve, the “sim2real” gap, the space between simulation and reality, is closing.
It’s not just better visual artifacts we’re creating with more polygons per object. In the field of reinforcement learning (RL), it’s also possible to procedurally generate environments, typically with a process called “domain randomization,” which means that the autonomous agent can simulate its decision-making processes in random and nearly infinite permutations of surroundings—limited only by compute power and the duration of each simulation.
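To make domain randomization concrete, here is a minimal sketch in Python. The `SceneParams` fields, their ranges, and the function names are all hypothetical stand-ins for whatever knobs a particular simulator exposes; the point is only that each episode draws a fresh, independently sampled environment.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneParams:
    """Hypothetical knobs a simulator might expose for one episode."""
    light_intensity: float   # relative brightness, 1.0 = nominal
    friction: float          # ground friction coefficient
    camera_yaw_deg: float    # camera heading offset in degrees
    num_obstacles: int       # how many distractor objects to place


def sample_scene(rng: random.Random) -> SceneParams:
    """Draw one randomized scene; the ranges are illustrative, not tuned."""
    return SceneParams(
        light_intensity=rng.uniform(0.3, 1.7),
        friction=rng.uniform(0.4, 1.0),
        camera_yaw_deg=rng.uniform(-15.0, 15.0),
        num_obstacles=rng.randint(0, 8),
    )


def randomized_scenes(n_episodes: int, seed: int = 0) -> list[SceneParams]:
    """Return n_episodes independently sampled scenes, reproducible per seed."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n_episodes)]


# Each element would configure one simulated episode for the agent.
scenes = randomized_scenes(1000)
```

In a real pipeline each `SceneParams` would be handed to the simulator before an episode starts, so the number of distinct environments is bounded only by how many episodes you can afford to run.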
A more nuanced approach is to train a model by programming the computer to act as an “adversary” that tries to deceive or force an error in the model, and then retrain it to avoid making the same error under those conditions and other similar scenarios. Most ML models are trained to minimize a “loss function,” and many reinforcement learners are trained to maximize a “reward function” (think scoring as many goals as possible, making as many paper clips as possible, and so on). A newer approach applies a loss function to RL models’ individual neural net activations such that the model can no longer discriminate between synthetic and real training data. The goal is to induce the model to treat synthetic and real data as equals and thus be able to generalize across both, potentially reducing the overfitting that might occur if the model were trained on real data alone.
While reading the previous paragraph, you might have seen the word adversary and thought of GANs, or generative adversarial networks. These models are well known for synthesizing convincing images of objects specified as text, or even performing “style transfer” to make a casual photo look as though it were painted by Monet or Van Gogh. These networks can be used to synthesize convincing faces without the need to train the model on personally identifiable information (PII). But in fact, the biggest benefit of the adversarial idea comes when it is applied to the features a model learns internally rather than to its outputs. The research paper “Unsupervised Domain Adaptation by Backpropagation” explains how training a feature extractor adversarially against a domain classifier, alongside the usual label-prediction loss, can result in a viable unsupervised domain adaptation process.
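In roughly the notation of that paper, the idea can be sketched as a saddle-point objective. Here G_f is the feature extractor, G_y the label predictor, and G_d the domain classifier. Treating synthetic data as the labeled source domain (d_i = 0) and real data as the unlabeled target domain (d_i = 1) is an assumption made here to match this article's framing, not something the paper prescribes.

```latex
% Label loss on the labeled (source) samples, minus lambda times
% the domain-classification loss on all samples.
\[
E(\theta_f, \theta_y, \theta_d)
  = \sum_{i:\, d_i = 0} L_y\big(G_y(G_f(x_i; \theta_f); \theta_y),\, y_i\big)
  - \lambda \sum_{i} L_d\big(G_d(G_f(x_i; \theta_f); \theta_d),\, d_i\big)
\]

% Features and label predictor minimize E; the domain classifier maximizes it.
\[
(\hat\theta_f, \hat\theta_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat\theta_d),
\qquad
\hat\theta_d = \arg\max_{\theta_d} E(\hat\theta_f, \hat\theta_y, \theta_d)
\]

% Gradient reversal layer: identity on the forward pass,
% gradient multiplied by -lambda on the backward pass.
\[
R_\lambda(x) = x,
\qquad
\frac{\partial R_\lambda}{\partial x} = -\lambda I
\]
```

The domain classifier is trained to tell the two domains apart while the feature extractor is trained to make that impossible. Placing the gradient reversal layer between them lets ordinary backpropagation pursue both goals in a single pass.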
Many practitioners start off by training on real-world data, then realize that their model cannot generalize to rare scenarios and edge cases. Only then do they think to augment their dataset with synthetic data and retrain the model in order to test it in these uncommon environments. But recently, in the case of small real-world datasets, researchers have discovered that models perform better when the base training dataset is synthetic (and as large as cost, time, and compute constraints allow). Then, transfer learning, fine-tuning, or output layer retraining occurs on only real data.
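The synthetic-first ordering can be illustrated with a deliberately tiny example. The sketch below is a toy, not a reproduction of any published result: it fits a one-variable linear model, and the data-generating slopes, intercepts, dataset sizes, and learning rate are all hypothetical. The "synthetic" data comes from a slightly wrong simulator, and the "real" data is scarce.

```python
import random


def make_data(n, slope, intercept, noise, rng):
    """Sample n points (x, y) with y = slope * x + intercept + Gaussian noise."""
    return [
        (x, slope * x + intercept + rng.gauss(0.0, noise))
        for x in (rng.uniform(0.0, 1.0) for _ in range(n))
    ]


def mse(params, data):
    """Mean squared error of the line y = w * x + b on the given data."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def fit(params, data, steps, lr):
    """Full-batch gradient descent on MSE, starting from the given params."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


rng = random.Random(0)
# Large synthetic set from an imperfect "simulator" (hypothetical parameters).
synthetic = make_data(500, slope=2.0, intercept=0.0, noise=0.05, rng=rng)
# Small real set whose true relationship differs slightly from the simulator.
real = make_data(8, slope=2.2, intercept=0.5, noise=0.05, rng=rng)

# Stage 1: pretrain from scratch on synthetic data only.
pretrained = fit((0.0, 0.0), synthetic, steps=2000, lr=0.1)
# Stage 2: fine-tune the pretrained parameters on real data only.
fine_tuned = fit(pretrained, real, steps=200, lr=0.1)
```

Comparing `mse(pretrained, real)` with `mse(fine_tuned, real)` shows how much of the simulator's bias the short real-data stage removes. In a deep model, the second stage might instead update only the output layer, as described above.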
Rich Sutton, a renowned RL researcher at both the University of Alberta and DeepMind, has articulated a “bitter lesson” for AI: academics spend far too much time making incremental tweaks to models based on domain expertise. Typically, it’s more general approaches, often computationally intensive ones, that triumph in the end. Model performance tends to scale faster with Moore’s law (and its derivatives) than with hand-built expertise, particularly when sheer compute power can unlock a new modeling paradigm, as it did for deep tree search in the case of IBM’s Deep Blue chess-playing supercomputer, hidden Markov models in the case of speech recognition, and, more recently, AlphaZero in the case of the game of Go.
Hand coding conditionals and feature detectors simply doesn’t scale as well as Bayesian, evolutionary, or even random search when it comes to model architectures, data sampling, and hyperparameter tuning. As in the case of training a model first on synthetic data and then fine-tuning on real-world data (a sort of “pre-validation” set), it makes sense to get out of the way of the radical new modeling approaches enabled by large compute clusters, rather than try to micromanage human-scale improvements through small tweaks and adjustments that leverage more fragile domain expertise.
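Random search, the simplest of the strategies named above, takes only a few lines. In this minimal sketch, `validation_loss` is a hypothetical stand-in for "train a model with these hyperparameters and score it"; its bowl-shaped surface and the search ranges are assumptions chosen purely for illustration.

```python
import math
import random


def validation_loss(learning_rate: float, weight_decay: float) -> float:
    """Stand-in for training a model and returning its validation loss.

    Hypothetical bowl-shaped surface whose minimum sits at
    learning_rate = 1e-2 and weight_decay = 1e-4.
    """
    return (math.log10(learning_rate) + 2.0) ** 2 + (math.log10(weight_decay) + 4.0) ** 2


def random_search(n_trials: int, seed: int = 0):
    """Sample hyperparameters log-uniformly and keep the best trial."""
    rng = random.Random(seed)
    best_loss, best_config = math.inf, None
    for _ in range(n_trials):
        config = {
            "learning_rate": 10 ** rng.uniform(-5, 0),
            "weight_decay": 10 ** rng.uniform(-6, -2),
        }
        loss = validation_loss(**config)
        if loss < best_loss:
            best_loss, best_config = loss, config
    return best_loss, best_config


best_loss, best_config = random_search(n_trials=200)
```

Because every trial is independent, the loop parallelizes trivially across a compute cluster, which is exactly the property that lets this kind of search keep improving as hardware grows.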
In some cases, it’s easier to expand your dataset with augmentation or to synthesize additional data than to spend the time and effort required to capture and label your own data. Synthetic data can help with data scarcity, data bias, and privacy issues, and it can speed up the data collection and annotation process, allowing for faster experimentation. And when real-world datasets are small, models tend to perform better when your base training dataset includes synthetic data. The examples above are just a few of the many business applications of synthetic data. There’s never been a better time to discover whether synthetic data can assist you in your journey to train and deploy better-performing ML models.