How to Build an Autoencoder Using TensorFlow
Here's how to build an autoencoder for image compression, image reconstruction, and supervised learning using the TensorFlow library.
In this article, I'll discuss using TensorFlow to build a simple autoencoder on a dataset of faces. We’ll use it to reconstruct the original face images, visualize the latent space, and build a supervised classifier on top of it that performs face recognition. For the implementation, we’ll use the TensorFlow and Keras libraries to build our model.
An autoencoder has two parts: an encoder and a decoder. The encoder learns a latent representation of the input data, and the decoder is trained to reconstruct the original inputs from the latent representations. The autoencoder has the following applications.
- The autoencoder approximates the original input points from the latent representations. This makes it useful for recovering data from corrupt inputs.
- As the autoencoder learns a latent representation of the input data, it can be designed so that the dimensionality of this latent space is much smaller than that of the original input. Hence, an autoencoder can be used for data compression.
- Autoencoders can be used for data augmentation. The outputs from the autoencoder represent synthetic data and, hence, can be added to the original training set to increase its size.
- You can use the latent representation from an autoencoder to learn classification and regression tasks.
Here's how to get started building your own autoencoder. (Note: You can run the code shown in this tutorial in Google Colab or download the Python notebook here. The output shown for the code examples in this article will not match the output you get: it will vary with each run of the program because of the stochastic (random) nature of the algorithms involved.)
A Conceptual Diagram of the Autoencoder
Figure 1 below shows a conceptual diagram of the autoencoder we are about to build. The encoder appears in green, the decoder in pink. The input to the encoder is an r x c face image. The encoder uses the Flatten layer to vectorize the image into an r*c dimensional vector. The Flatten layer passes its output to a Dense layer that has half as many units as the image has pixels. The final layer of the encoder is a Dense layer with m units, where m is much smaller than the number of input image pixels. This is the latent representation of the input. Effectively, the encoder compresses the input to a smaller representation.
The decoder expands the encoder’s latent representation to produce an approximation of the input image. Hence, in this block, a smaller dense layer is followed by a bigger dense layer.
Figure 1: The autoencoder model. Flatten and reshape layers reshape the received inputs without changing the values. Source: Mehreen Saeed
The Import Section
Before starting the implementation, import the following libraries/modules in your code.
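The original import listing is not reproduced here. The block below is a plausible set, inferred from the functions and layers named later in this article; the exact import paths are an assumption and may differ from the ones used in the notebook.

```python
# Numerical and plotting utilities
import numpy as np
import matplotlib.pyplot as plt

# Dataset, splitting, and evaluation helpers from scikit-learn
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Model-building pieces from TensorFlow/Keras
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Flatten, Reshape
from tensorflow.keras.utils import to_categorical
```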
Load the ‘Labeled Faces in the Wild’ People Dataset
The scikit-learn library includes the Labeled Faces in the Wild (LFW) dataset, which consists of grayscale face images of different people. Because the dataset is imbalanced, we’ll load only the face images of people who have at least 100 images in the dataset. We’ll also resize each image to half its dimensions to make the dataset more manageable.
Let’s load the dataset and display a few images along with some data statistics.
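A minimal sketch of the loading step, assuming scikit-learn's fetch_lfw_people() with min_faces_per_person=100 and resize=0.5. The helper names load_faces, describe and show_faces are hypothetical and chosen only for illustration. Nothing is downloaded until load_faces() is called.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people


def load_faces(min_faces=100, resize=0.5):
    """Fetch LFW faces for people with at least `min_faces` images.

    The dataset is downloaded on the first call and cached locally.
    """
    faces = fetch_lfw_people(min_faces_per_person=min_faces, resize=resize)
    return faces.images, faces.target, faces.target_names


def describe(images, targets, names):
    """Print basic statistics and return the number of images per person."""
    counts = np.bincount(targets, minlength=len(names))
    print("Images:", images.shape[0], "| image size:", images.shape[1:])
    for name, count in zip(names, counts):
        print(f"  {name}: {count}")
    return counts


def show_faces(images, targets, names, n=8):
    """Draw the first n faces in one row, titled with the person's name."""
    n = min(n, len(images))
    fig, axes = plt.subplots(1, n, figsize=(2 * n, 2.5), squeeze=False)
    for ax, image, target in zip(axes[0], images, targets):
        ax.imshow(image, cmap="gray")
        ax.set_title(names[target], fontsize=8)
        ax.axis("off")
    return fig
```

Calling images, targets, names = load_faces() followed by describe(images, targets, names) and show_faces(images, targets, names) prints the statistics and draws the first few faces.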
Prepare the Train and Test Data
The next step is to prepare the train and test data. The code below implements the following steps:
- Use the train_test_split() function to split the dataset into a 75% training set and a 25% test set.
- Normalize each image’s pixel values to lie between 0 and 1. Because the grayscale images have pixel values between 0 and 255, we can divide each pixel value by 255 to normalize the entire image.
- Use the to_categorical() function to convert each target value (range [0, 4]) to a five-dimensional binary categorical vector.
- Print the training and test data statistics.
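The steps above can be sketched as follows. To keep the example runnable on its own, it uses a random stand-in array with the same 62x47 image size instead of the real faces. The function name prepare_data, the seed, and the check on the pixel range are assumptions: the check is there because the loaded array may already be scaled to [0, 1], in which case dividing by 255 would be wrong.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical


def prepare_data(images, targets, n_classes, seed=0):
    """Split 75/25, scale pixels to [0, 1], and one-hot encode the labels."""
    X_train, X_test, y_train, y_test = train_test_split(
        images, targets, test_size=0.25, random_state=seed)

    # Assumption: divide by 255 only if the pixels are not already in [0, 1].
    scale = 255.0 if images.max() > 1.0 else 1.0
    X_train = X_train.astype("float32") / scale
    X_test = X_test.astype("float32") / scale

    # Each integer label becomes an n_classes-dimensional binary vector.
    Y_train = to_categorical(y_train, num_classes=n_classes)
    Y_test = to_categorical(y_test, num_classes=n_classes)
    return X_train, X_test, Y_train, Y_test


# Stand-in data (NOT the LFW faces): 40 random images, labels in [0, 4].
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(40, 62, 47)).astype("float32")
targets = rng.integers(0, 5, size=40)

X_train, X_test, Y_train, Y_test = prepare_data(images, targets, n_classes=5)
print("Train:", X_train.shape, Y_train.shape)
print("Test: ", X_test.shape, Y_test.shape)
```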
Create the Autoencoder Model
The code below shows how you can use the Flatten, Dense and Reshape layers to create the autoencoder model shown in Figure 2. The value of the latent_dimension is set at 420. You can experiment with different values of this variable.
Because we want the output images to match the input images, we optimize with respect to the mean squared error (MSE). This can be specified as the loss parameter of the compile() method.
Figure 2: The autoencoder model with layer names used in the code. Source: Mehreen Saeed
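A minimal sketch of such a model using the Keras functional API. The layer names vector_images and decoder_hidden come from this article; the remaining layer names, the activation functions and the Adam optimizer are assumptions made for illustration.

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Flatten, Reshape


def build_autoencoder(rows=62, cols=47, latent_dimension=420):
    """Flatten -> Dense -> latent Dense -> Dense -> Dense -> Reshape."""
    n_pixels = rows * cols

    # Encoder: vectorize the image, then compress it to the latent size.
    input_images = Input(shape=(rows, cols), name="input_images")
    x = Flatten(name="vector_images")(input_images)
    x = Dense(n_pixels // 2, activation="relu", name="encoder_hidden")(x)
    latent = Dense(latent_dimension, activation="relu", name="latent")(x)

    # Decoder: expand the latent vector back to an image.
    x = Dense(n_pixels // 2, activation="relu", name="decoder_hidden")(latent)
    x = Dense(n_pixels, activation="sigmoid", name="decoder_output")(x)
    output_images = Reshape((rows, cols), name="output_images")(x)

    model = Model(input_images, output_images, name="autoencoder")
    # Mean squared error between input and reconstructed pixels.
    model.compile(optimizer="adam", loss="mse")
    return model


autoencoder = build_autoencoder()
autoencoder.summary()
```

Changing latent_dimension in the call to build_autoencoder() is the easiest way to experiment with stronger or weaker compression.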
Now let’s look at the summary of the autoencoder model we just built.
Training the Autoencoder
Now we are ready to train the autoencoder model. The fit() method below trains the model and returns a history object with details of the entire training process.
We can visualize the entire learning process by using the values stored in the dictionary object of history. The following code prints the keys of the history.history dictionary object and plots the training and validation loss for each epoch. As expected, the value of the loss function for the training set is lower than the loss value for the validation set.
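A sketch of the training and plotting steps, under stated assumptions: random stand-in images replace the faces so the block runs on its own, and the epoch count, batch size, activations and optimizer are illustrative choices, not the article's settings. With random data the loss curves only demonstrate the mechanics.

```python
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Flatten, Reshape

rows, cols, latent_dimension = 62, 47, 420

# Stand-in data (NOT the LFW faces), already scaled to [0, 1].
rng = np.random.default_rng(0)
X_train = rng.random((48, rows, cols)).astype("float32")
X_test = rng.random((16, rows, cols)).astype("float32")

# Same layer layout as described in the article.
inputs = Input(shape=(rows, cols))
x = Flatten()(inputs)
x = Dense(rows * cols // 2, activation="relu")(x)
x = Dense(latent_dimension, activation="relu")(x)
x = Dense(rows * cols // 2, activation="relu")(x)
x = Dense(rows * cols, activation="sigmoid")(x)
autoencoder = Model(inputs, Reshape((rows, cols))(x))
autoencoder.compile(optimizer="adam", loss="mse")

# The targets are the inputs themselves: the model learns to reproduce them.
history = autoencoder.fit(
    X_train, X_train,
    epochs=3, batch_size=16,
    validation_data=(X_test, X_test),
    verbose=0)
print(history.history.keys())


def plot_history(history):
    """Plot training and validation loss per epoch."""
    fig, ax = plt.subplots()
    ax.plot(history.history["loss"], label="training loss")
    ax.plot(history.history["val_loss"], label="validation loss")
    ax.set_xlabel("epoch")
    ax.set_ylabel("mean squared error")
    ax.legend()
    return fig


fig = plot_history(history)  # call plt.show() when running as a script
```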
Reconstructing the Input Images
After training the autoencoder, we can look at what the reconstructed input images look like. The predict() method returns the output of the autoencoder for the inputs specified as a parameter. The code below displays the first eight images of the test set in the first row and their corresponding reconstructions in the second row.
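One way to draw the two rows is sketched below. The helper show_reconstructions is hypothetical and works with any Keras model that maps images to images. The demonstration uses an untrained stand-in model and random images, so its bottom row is noise; with the trained autoencoder and the test set, the bottom row would show the reconstructed faces.

```python
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Flatten, Reshape


def show_reconstructions(model, images, n=8):
    """Top row: the first n inputs. Bottom row: the model's outputs for them."""
    n = min(n, len(images))
    reconstructed = model.predict(images[:n], verbose=0)
    fig, axes = plt.subplots(2, n, figsize=(2 * n, 4.5), squeeze=False)
    for i in range(n):
        axes[0, i].imshow(images[i], cmap="gray")
        axes[1, i].imshow(reconstructed[i], cmap="gray")
        axes[0, i].axis("off")
        axes[1, i].axis("off")
    return fig, reconstructed


# Untrained stand-in model and random images, only to make the sketch runnable.
rows, cols = 62, 47
inputs = Input(shape=(rows, cols))
x = Dense(64, activation="relu")(Flatten()(inputs))
x = Dense(rows * cols, activation="sigmoid")(x)
stand_in_model = Model(inputs, Reshape((rows, cols))(x))

X_test = np.random.default_rng(0).random((10, rows, cols)).astype("float32")
fig, reconstructed = show_reconstructions(stand_in_model, X_test, n=8)
print(reconstructed.shape)
```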
The reconstructed images are quite interesting. We can see that they are an approximation of the original images and each closely replicates the facial expression of the input face. This makes them useful for augmenting a limited training set with more examples. You can generate as many images as needed by training the autoencoder multiple times. The reconstructed faces will vary with each run as the weights of the autoencoder are initialized randomly.
Dissecting the Encoder
TensorFlow allows you to access the different layers of a model. You can easily retrieve the encoder block of the autoencoder by instantiating the Model class with the input images and the output of the latent layer we created earlier. Let’s look at the summary of the encoder model.
The summary shows that we started out with an input of 2,914 (62x47) pixels and reduced it to only 420 outputs. These 420 outputs are an internal/latent space representation of the corresponding input image. The output of the encoder is hard for us to interpret. However, we can give it a try and visualize it by displaying it as an image. The code below arbitrarily reshapes the latent representation to a 20x21 image and renders it. The top row shows the input training image, and the bottom row shows the corresponding latent representation.
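A sketch of extracting the encoder and reshaping its output, assuming the latent layer is named "latent" (a hypothetical name) and using an untrained model with random stand-in images so the block runs on its own. The 20x21 shape has no meaning beyond 20 * 21 = 420.

```python
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Flatten, Reshape

rows, cols, latent_dimension = 62, 47, 420

# Untrained autoencoder with the article's layout (activations are assumptions).
inputs = Input(shape=(rows, cols), name="input_images")
x = Flatten(name="vector_images")(inputs)
x = Dense(rows * cols // 2, activation="relu", name="encoder_hidden")(x)
latent = Dense(latent_dimension, activation="relu", name="latent")(x)
x = Dense(rows * cols // 2, activation="relu", name="decoder_hidden")(latent)
x = Dense(rows * cols, activation="sigmoid", name="decoder_output")(x)
autoencoder = Model(inputs, Reshape((rows, cols))(x), name="autoencoder")

# The encoder shares its layers (and weights) with the autoencoder.
encoder = Model(autoencoder.input,
                autoencoder.get_layer("latent").output,
                name="encoder")
encoder.summary()

# Stand-in images (NOT the LFW faces).
X_train = np.random.default_rng(0).random((8, rows, cols)).astype("float32")
codes = encoder.predict(X_train, verbose=0)   # one 420-vector per image
latent_images = codes.reshape(-1, 20, 21)     # arbitrary 2D arrangement

fig, axes = plt.subplots(2, 8, figsize=(16, 4.5))
for i in range(8):
    axes[0, i].imshow(X_train[i], cmap="gray")        # input image
    axes[1, i].imshow(latent_images[i], cmap="gray")  # its latent code
    axes[0, i].axis("off")
    axes[1, i].axis("off")
```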
Using the Autoencoder for Supervised Learning
Strictly speaking, an autoencoder is not a supervised learning model, since it is trained with unlabeled images. However, we can use its latent representation to train a supervised learning model. The code below instantiates a classifier model by using all the layers of the autoencoder, from the vector_images layer up to the decoder_hidden layer. It then appends a softmax layer to the model, with as many units as there are classes/categories in our dataset.
In the classifier model, we’ll set all the encoder layers to be nontrainable. This way the input image will be converted to its learned latent representation to be further processed by the classifier. The classifier will start training with the decoder weights tuned by the autoencoder and fine-tune them further to learn the classification of each latent representation. Figure 3 shows a block diagram of the classifier created from the autoencoder model. Now that we have a multiclass classification problem, we can use the ‘categorical_crossentropy’ as our loss function when compiling the model.
Figure 3: The classifier model created from the encoder. Source: Mehreen Saeed
Here is the summary of the classifier model. It has 620,687 trainable parameters associated with the last two layers. The rest of the weights up through the encoder layer are nontrainable.
Let’s train the classifier using the fit() method.
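The classifier described above can be sketched as follows. The layer names other than vector_images and decoder_hidden, the activations, the optimizer, and the training settings are assumptions. The autoencoder here is untrained and the data is a random stand-in, so the block only demonstrates the wiring, the freezing of the encoder layers, and the call to fit().

```python
import numpy as np
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Flatten, Reshape
from tensorflow.keras.utils import to_categorical

rows, cols, latent_dimension, n_classes = 62, 47, 420, 5

# Autoencoder with the article's layout (in practice, train it first).
inputs = Input(shape=(rows, cols), name="input_images")
x = Flatten(name="vector_images")(inputs)
x = Dense(rows * cols // 2, activation="relu", name="encoder_hidden")(x)
x = Dense(latent_dimension, activation="relu", name="latent")(x)
x = Dense(rows * cols // 2, activation="relu", name="decoder_hidden")(x)
out = Dense(rows * cols, activation="sigmoid", name="decoder_output")(x)
autoencoder = Model(inputs, Reshape((rows, cols))(out), name="autoencoder")

# Reuse everything up to decoder_hidden and append a softmax layer.
hidden = autoencoder.get_layer("decoder_hidden").output
probabilities = Dense(n_classes, activation="softmax", name="classes")(hidden)
classifier = Model(autoencoder.input, probabilities, name="classifier")

# Freeze the encoder. The layers are shared, so this flag is also visible
# from the autoencoder object.
for name in ("encoder_hidden", "latent"):
    classifier.get_layer(name).trainable = False

classifier.compile(optimizer="adam",
                   loss="categorical_crossentropy",
                   metrics=["accuracy"])
classifier.summary()

trainable_count = sum(int(np.prod(tuple(w.shape)))
                      for w in classifier.trainable_weights)
print("Trainable parameters:", trainable_count)

# Stand-in data (NOT the LFW faces) to show the fit() call.
rng = np.random.default_rng(0)
X_train = rng.random((40, rows, cols)).astype("float32")
Y_train = to_categorical(rng.integers(0, n_classes, size=40), n_classes)
history = classifier.fit(X_train, Y_train, epochs=2, batch_size=16, verbose=0)
```

Only the decoder_hidden and softmax layers are updated during training, which is where the count of trainable parameters comes from.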
Evaluate the Classifier
The code below prints the classification accuracy on the training and test sets. Because this is a multiclass classification problem, it’s good to observe the confusion matrix. TensorFlow’s math module provides a routine for computing the confusion matrix, but we’ll use the function provided by the scikit-learn library. This library also includes a convenient class for displaying the confusion matrix in color. Now you have reasonably good accuracy on the face classification task without doing any preprocessing, hyper-parameter tuning, or model selection.
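A sketch of the evaluation step using scikit-learn's confusion_matrix() and ConfusionMatrixDisplay. The helper evaluate is hypothetical. It takes the class probabilities that a classifier's predict() method would return, so the small hand-made arrays below stand in for real predictions.

```python
import numpy as np
from sklearn.metrics import (ConfusionMatrixDisplay, accuracy_score,
                             confusion_matrix)


def evaluate(probabilities, Y_true, class_names):
    """Return accuracy, confusion matrix, and a display object for plotting."""
    y_pred = probabilities.argmax(axis=1)  # most probable class per sample
    y_true = Y_true.argmax(axis=1)         # undo the one-hot encoding
    accuracy = accuracy_score(y_true, y_pred)
    matrix = confusion_matrix(y_true, y_pred,
                              labels=np.arange(len(class_names)))
    display = ConfusionMatrixDisplay(confusion_matrix=matrix,
                                     display_labels=class_names)
    return accuracy, matrix, display


# Hand-made stand-in for classifier.predict(X_test): 4 samples, 3 classes.
probabilities = np.array([[0.8, 0.1, 0.1],
                          [0.2, 0.7, 0.1],
                          [0.1, 0.6, 0.3],   # true class is 2: a mistake
                          [0.3, 0.5, 0.2]])
Y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]])

accuracy, matrix, display = evaluate(probabilities, Y_true, ["A", "B", "C"])
print("Accuracy:", accuracy)
print(matrix)
```

Calling display.plot(cmap="Blues") draws the colored matrix. Running evaluate() once with the training predictions and once with the test predictions gives both accuracy figures.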
The Next Step
You just developed an autoencoder that can create an approximate representation of its inputs. You also used its latent representation to develop a classifier and applied it to a face recognition problem. While this may not be the best example to demonstrate the merits of the autoencoder as a supervised classifier, it should give you a fairly good idea of how to implement an autoencoder model and use its latent space representation for other tasks.
Now that you have a basic understanding of the autoencoder model, you can develop more advanced variants, such as a variational autoencoder or sparse autoencoder.