Thanks to the increasing availability of mobile phones with high-quality cameras, it’s easier than ever to digitize documents. People can quickly and easily take photos of documents, bypassing the need to use a scanner.
These images often need additional processing, however, to ensure that they are of high quality and easy to read. For example, raw images are often distorted due to camera angle and have uneven lighting. Both of these issues can negatively affect document readability and optical character recognition, but standard digitization techniques cannot effectively correct these problems.
In the paper “Document Rectification and Illumination Correction Using a Patch-based CNN,” researchers developed a preprocessing technique for improving the quality of document images. They began by building a synthesized dataset containing 1,300 ground-truth and distorted images. They then trained a convolutional neural network (CNN) to rectify patches of the document images, removing distortions, and then stitched the image patches into a single undistorted image. Finally, they trained a second neural network to correct uneven illumination in the rectified images.
The study found that this patch-based approach could outperform document rectification methods trained on far larger datasets of as many as 100,000 images.
To train the networks for this study, the researchers first needed to create a synthetic-image dataset. They collected a variety of electronic documents, such as technical papers, books, and magazines, ensuring that the dataset included differing document structures and fonts. After collecting these files, the researchers converted the documents into images, which formed the ground truth for the models.
The next step was to create distorted-image pairs that would serve as input for the models. The researchers added lighting and texture to the documents to make them look more like real-world photos, and they applied a variety of common transformations, projecting the images onto different surface models such as curved or folded pages.
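To make the idea of distorted-image pairs concrete, here is a minimal NumPy sketch of synthesizing one training pair. The sinusoidal displacement field standing in for page curl, and the `synthesize_pair` helper, are illustrative assumptions, not the paper's actual distortion model (which projects documents onto 3D surfaces).

```python
import numpy as np

def synthesize_pair(ground_truth, amplitude=3.0, period=40.0):
    """Warp a ground-truth document image with a known displacement
    field (a simple vertical sine wave standing in for page curl),
    returning (distorted, displacement) as a training pair."""
    h, w = ground_truth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Vertical displacement varies with x, mimicking a curled page.
    dy = amplitude * np.sin(2 * np.pi * xs / period)
    src_y = np.clip(np.round(ys + dy).astype(int), 0, h - 1)
    distorted = ground_truth[src_y, xs]
    return distorted, dy

# Usage: a flat synthetic "document" with evenly spaced text lines.
page = np.zeros((64, 64))
page[::8, :] = 1.0
warped, field = synthesize_pair(page)
```

Because the displacement field is known exactly, it can serve as the supervision target when training a rectification model on such pairs.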
The final dataset consisted of 1,300 ground-truth and transformed images, which were randomly split into a training set and a test set. The images were then divided into patches, each overlapping its neighbors by 25%. This resulted in a total of 100,000 patches for training and 10,000 for testing. This dataset was then used to train both the rectification and illumination-correction models.
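The overlapping-patch split can be sketched in a few lines of NumPy. The `extract_patches` helper and the 32-pixel patch size are illustrative assumptions; only the 25% overlap comes from the paper.

```python
import numpy as np

def extract_patches(image, patch=32, overlap=0.25):
    """Split an image into square patches whose neighbors share
    `overlap` of their width/height (25% here, as in the paper)."""
    stride = int(patch * (1 - overlap))   # 24 px for 32 px patches
    h, w = image.shape[:2]
    patches, coords = [], []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            patches.append(image[y:y + patch, x:x + patch])
            coords.append((y, x))
    return np.stack(patches), coords

# Usage: a 96x96 image yields a 3x3 grid of overlapping patches.
img = np.arange(96 * 96, dtype=float).reshape(96, 96)
tiles, locs = extract_patches(img)
```

The recorded `(y, x)` coordinates are what later makes it possible to place each rectified patch back into the full document.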
After creating the dataset, the researchers next aimed to improve image quality by rectifying the document images. Researchers used a patch-based learning method to correct various image distortions within a single image. Because document images can have different types of distortion based on region, this patch-based approach makes it possible to identify and correct various types of distortion, rather than applying a single transformation to an entire image.
Researchers trained a neural network with an auto-encoder structure using the image patches described above. The network encodes each patch pair and estimates a transformation for the patch, trained by minimizing the difference between the estimated transformation and the ground truth. The model then applies the transformation to remove the distortion.
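Applying an estimated per-patch transformation amounts to a backward warp: each output pixel samples the input at a displaced location. The sketch below shows that step in NumPy with nearest-neighbor sampling; the `rectify_patch` helper is hypothetical, and the CNN that would predict the flow is omitted.

```python
import numpy as np

def rectify_patch(patch, flow_y, flow_x):
    """Undo a predicted distortion by backward-warping: each output
    pixel samples the input at (y + flow_y, x + flow_x). The flow
    stands in for the per-patch transformation a trained network
    would estimate."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys + flow_y).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow_x).astype(int), 0, w - 1)
    return patch[src_y, src_x]

# Round trip: distort a patch by shifting it down 2 px, then rectify
# with the inverse flow (sample 2 px further down).
patch = np.zeros((8, 8))
patch[2, :] = 1.0
shifted = np.roll(patch, 2, axis=0)           # simulated distortion
restored = rectify_patch(shifted,
                         flow_y=2 * np.ones((8, 8)),
                         flow_x=np.zeros((8, 8)))
```

A production implementation would use bilinear rather than nearest-neighbor sampling, but the structure of the warp is the same.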
After the image patches are transformed, the next step is to stitch them back together into a single document. The simplest approach is to first resample the images and then stitch the patches together based on their transformations. This approach, however, often yields low-quality results due to limited document texture or repetitive features, making feature matching difficult.
Instead, the researchers reversed this approach by first stitching the transformed patches back together and then resampling the image to generate the full document. For this approach to work properly, however, the transformed patches cannot be stitched together directly. Because the applied transformations vary by image patch, the researchers needed to account for the displacement between the transformed patches.
So rather than stitching the image patches directly, they stitched them together within the gradient domain. Because the gradient field in the images is independent of the document transformations themselves, researchers could reconstruct the full document from the input patches in this way.
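The principle behind gradient-domain stitching can be shown in one dimension: differentiate each patch, average the gradients where patches overlap, and integrate the shared gradient field back into a signal. Constant per-patch offsets vanish under differentiation, so seams disappear. The `stitch_gradients` helper is an illustrative assumption; the paper works in 2D, where integration requires solving a Poisson equation rather than a cumulative sum.

```python
import numpy as np

def stitch_gradients(patches, coords, length):
    """Stitch 1-D patches in the gradient domain: accumulate each
    patch's finite differences into a shared gradient field, average
    in overlaps, then integrate. Per-patch brightness offsets cancel
    because differentiation removes constants."""
    grad = np.zeros(length)
    counts = np.zeros(length)
    for p, start in zip(patches, coords):
        d = np.diff(p)                           # patch gradient
        grad[start + 1:start + len(p)] += d
        counts[start + 1:start + len(p)] += 1
    grad /= np.maximum(counts, 1)
    return np.cumsum(grad)   # integrate; defined up to a constant

# Two overlapping patches of a ramp; the second has a brightness
# offset that direct stitching would turn into a visible seam.
signal = np.arange(10.0)
p1, p2 = signal[:6], signal[4:] + 5.0
stitched = stitch_gradients([p1, p2], [0, 4], len(signal))
```

Despite the 5-unit offset between the patches, the reconstruction matches the original ramp, because only the gradients, not the absolute values, are stitched.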
The rectification technique described above does not yet account for lighting issues and may even introduce new shading artifacts as a result of the transformation and stitching process. Researchers resolved these lighting errors by performing illumination correction. Their aim was to correct illumination artifacts while maintaining the original color information within the documents.
They then trained a second neural network using the patches described above. Similar to the first model, this neural network was trained to minimize the difference between the estimated image and the ground-truth image.
Rather than using an encoder-decoder structure, however, this model preserves high-resolution features across all of its layers, ensuring that no detail is lost to downsampling. The model outputs an illumination-corrected image, providing the final, high-quality document image.
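As a rough classical analogue of what the learned correction accomplishes, uneven lighting can be flattened by estimating the low-frequency illumination with a large blur and dividing it out. The `correct_illumination` helper below is a hedged sketch of that idea, not the paper's network.

```python
import numpy as np

def correct_illumination(image, k=15):
    """Flatten uneven lighting: estimate the slowly varying
    illumination with a k x k box blur (via a summed-area table),
    then divide it out, keeping the high-frequency content (text)."""
    pad = k // 2
    padded = np.pad(image, pad, mode="edge")
    s = padded.cumsum(0).cumsum(1)
    s = np.pad(s, ((1, 0), (1, 0)))              # zero border for sums
    h, w = image.shape
    illum = (s[k:k + h, k:k + w] - s[:h, k:k + w]
             - s[k:k + h, :w] + s[:h, :w]) / (k * k)
    return np.clip(image / np.maximum(illum, 1e-6), 0, 1)

# Usage: a uniform page under a left-to-right lighting gradient.
page = np.full((32, 32), 0.8)
light = np.linspace(0.4, 1.0, 32)[None, :]
flat = correct_illumination(page * light)
```

Unlike this simple divide-out, the paper's learned model can also preserve the document's original color information while removing the lighting artifacts.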
You can use the techniques described in this paper to produce high-quality digitized documents based on photos. Rectifying document images and correcting uneven illumination can greatly improve document readability, making it possible to get high-quality document images from a phone camera.
Try out the techniques for yourself by downloading the publicly available code on GitHub.