Many probabilistic machine learning algorithms and deep learning methods require large amounts of data for learning and model building. In many real-world situations, data may be limited or too expensive to collect. Data augmentation addresses the problem of limited data by supplementing the original or incomplete dataset with additional training examples.
In the field of computer vision, many ML models for object recognition and classification require large amounts of data for training. With few or limited examples, these methods tend to overfit the training data. Overfitting refers to a model specializing or adapting itself perfectly to the training examples, including the noise in those examples. Hence, the model loses its generalization capability and performs poorly on unseen data or test examples.
For example, if you train a neural network on only a few images of handwritten digits, it will likely learn to recognize every digit in the training set correctly. However, it will tend to have a high error rate on images of handwritten digits that it has not seen before.
To reduce overfitting, you can add data to the training set and train the model on the larger set. The additional data can consist of data points generated synthetically from existing examples, or it can be acquired through other means, such as open-source datasets for the same application. In this article, we’ll focus on artificially generated data.
Experiments have shown that data augmentation can increase the generalization ability of a learning system and, in some cases, significantly improve its accuracy.
Data augmentation is most useful when the training set contains only a limited number of examples. Here are a few possible scenarios:
There are many data augmentation techniques you can choose from that have successfully solved problems in computer vision. A nice taxonomy of these methods has been defined by Connor Shorten and Taghi Khoshgoftaar in a survey on image data augmentation for deep learning. Figure 1 below has been simplified and adapted from their paper. It shows some of the more common and important methods, and I’ve added sampling/probabilistic methods to it. You can read their paper for more details.
Figure 1: A taxonomy of data augmentation techniques (simplified and adapted from “A Survey on Image Data Augmentation for Deep Learning,” by Shorten and Khoshgoftaar, in the Journal of Big Data, 2019)
Basic image manipulations include simple techniques (see Figure 2) to derive a new image from one or more images. There are many ways to do this, including:
Figure 2: Various geometric transformations. Source: Mehreen Saeed
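As a rough illustration of the geometric transformations mentioned above, here is a minimal NumPy sketch. The function name `augment_geometric`, the choice of transformations, and the shift range of roughly 10% of the image size are assumptions made for this example, not taken from the survey. Note that the translation here wraps pixels around the border, which is only one of several ways to handle shifted-out regions.

```python
import numpy as np

def augment_geometric(image, rng):
    """Return simple geometric variants of an (H, W) or (H, W, C) image array."""
    h, w = image.shape[:2]
    # Maximum shift: about 10% of each dimension, and at least one pixel.
    max_dy = max(1, h // 10)
    max_dx = max(1, w // 10)
    dy = int(rng.integers(-max_dy, max_dy + 1))
    dx = int(rng.integers(-max_dx, max_dx + 1))
    return {
        "horizontal_flip": image[:, ::-1],                    # mirror left-right
        "vertical_flip": image[::-1, :],                      # mirror top-bottom
        "rotate_90": np.rot90(image, k=1, axes=(0, 1)),       # rotate by 90 degrees
        "translate": np.roll(image, shift=(dy, dx), axis=(0, 1)),  # wrap-around shift
    }

# Toy 3x4 "image" so the effect of each transformation is easy to inspect.
rng = np.random.default_rng(0)
image = np.arange(12, dtype=np.float32).reshape(3, 4)
variants = augment_geometric(image, rng)
for name, variant in variants.items():
    print(name, variant.shape)
```

Each variant keeps the original label, so one labeled image yields several training examples. Whether a given transformation is safe depends on the application: for handwritten digits, for instance, a vertical flip can turn a 6 into something resembling a 9.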
Deep learning architectures are neural networks with many layers. Neural networks are ML models inspired by the workings of the human brain. They can be trained to learn a nonlinear function that maps an input to an output, for example, an input image to an output image.
Many deep learning architectures, such as convolutional neural networks (CNNs), implicitly generate new images within the model itself. An image is convolved with different filters to produce new images or representations, which are then passed to subsequent layers for learning.
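To make the idea of convolving an image with different filters concrete, here is a small NumPy sketch. The helper `convolve2d`, the hand-picked Sobel and box-blur kernels, and the toy 6x6 image are assumptions for illustration; in a real CNN the filter values are learned during training, and deep learning libraries typically compute cross-correlation (no kernel flip) rather than the textbook convolution shown here.

```python
import numpy as np

def convolve2d(image, kernel):
    """'Valid' 2-D convolution of a grayscale image with a small kernel."""
    kernel = np.flipud(np.fliplr(kernel))  # flip the kernel for a true convolution
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the window currently under the kernel.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hand-picked filters (a trained CNN would learn its own).
filters = {
    "vertical_edges": np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float),
    "horizontal_edges": np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]], dtype=float),
    "blur": np.full((3, 3), 1.0 / 9.0),
}

# Toy image: dark left half, bright right half (a single vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

feature_maps = {name: convolve2d(image, k) for name, k in filters.items()}
for name, fmap in feature_maps.items():
    print(name, fmap.shape)
```

Each filter produces a different representation of the same input: the vertical-edge filter responds strongly at the boundary, while the horizontal-edge filter stays at zero because the image contains no horizontal edges.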
Below are some of the data augmentation techniques that explicitly generate new images and are based on deep learning architectures:
Other data augmentation methods include:
Figure 3: Synthetic face images generated via sampling from Gaussian mixtures. Source: Mehreen Saeed
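The sampling step behind an approach like the one in Figure 3 can be sketched as follows. This is only a toy example under stated assumptions: the function `sample_gaussian_mixture`, the mixture weights, means, and covariances, and the 2x2 "images" are all made up for illustration. In practice, the mixture parameters would first be estimated from real training images (for example, with the expectation-maximization algorithm), usually after reducing the dimensionality of the images.

```python
import numpy as np

def sample_gaussian_mixture(weights, means, covariances, n_samples, rng):
    """Draw samples from a Gaussian mixture with the given parameters."""
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    # Step 1: choose a mixture component for every sample.
    components = rng.choice(len(weights), size=n_samples, p=weights)
    samples = np.empty((n_samples, means.shape[1]))
    # Step 2: draw each sample from its component's Gaussian.
    for k in range(len(weights)):
        idx = np.flatnonzero(components == k)
        if idx.size:
            samples[idx] = rng.multivariate_normal(
                means[k], covariances[k], size=idx.size
            )
    return samples, components

# Hypothetical mixture over flattened 2x2 grayscale images (4 pixels each).
weights = [0.5, 0.5]
means = np.array([[0.2, 0.2, 0.8, 0.8],    # dark top row, bright bottom row
                  [0.8, 0.2, 0.8, 0.2]])   # bright left column, dark right column
covariances = np.stack([0.01 * np.eye(4), 0.01 * np.eye(4)])

rng = np.random.default_rng(42)
samples, components = sample_gaussian_mixture(weights, means, covariances, 10, rng)
# Clip to the valid pixel range and reshape back into images.
synthetic_images = np.clip(samples, 0.0, 1.0).reshape(-1, 2, 2)
print(synthetic_images.shape)
```

Every sampled vector is a new image that resembles, but does not duplicate, the images the mixture was modeled on.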
Data augmentation techniques can improve the accuracy of computer vision models. Using additional images during the training phase adds variety and more features to your existing data, which helps your model generalize better and reduces overfitting. It’s interesting to note that many augmented images are not comprehensible to humans, and it is not completely understood why such images improve the performance of the system.
You can choose a data augmentation technique based on traditional methods of manipulating images, or you can go with a more sophisticated strategy, such as one based on neural networks.
The method you choose should depend upon your resources and your application. For example, you can use SMOTE to deal with class imbalance or use traditional geometric transformations that are easier to understand, interpret, and generate.
Alternatively, you can use deep learning methods to create a larger dataset when you have a lot of processing power and memory resources available to you.
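For the class-imbalance case mentioned above, the core idea of SMOTE (the synthetic minority oversampling technique) is to create new minority-class points by interpolating between an existing minority sample and one of its nearest minority-class neighbors. The sketch below is a simplified NumPy version of that idea, not a reference implementation; the function name `smote_sketch`, its parameters, and the toy data are assumptions for illustration, and it requires `k` to be smaller than the number of minority samples.

```python
import numpy as np

def smote_sketch(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between minority-class
    samples and their k nearest minority-class neighbors."""
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    # Pairwise squared Euclidean distances between minority samples.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base sample and one of its k neighbors for each new point.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    # Place the new point at a random position on the segment between them.
    gap = rng.random((n_new, 1))
    return minority[base] + gap * (minority[chosen] - minority[base])

# Toy minority class: five 2-D feature vectors.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7]])
rng = np.random.default_rng(7)
synthetic = smote_sketch(minority, n_new=8, k=2, rng=rng)
print(synthetic.shape)
```

Because each synthetic point lies on a line segment between two real minority samples, the new points stay within the region already occupied by the minority class.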