Document processing has been around since the advent of AI. In fact, one of the first deep learning models—the convolutional neural network (CNN) devised by Yann LeCun in the 1990s—was originally developed for banks to automatically process checks and for post offices to automate the processing of handwritten mailing addresses.
The discipline has progressed by leaps and bounds since those early forays into OCR-based approaches. Today, document processing typically relies on multiple, synergistic machine learning techniques to read a document, extract its relevant fields, and perform intelligent processing on top of that.
These systems are grounded in computer vision and natural language processing techniques that are used to grab visual information from the document and ensure that it’s parsed correctly.
While traditional OCR is notorious for transcription errors, AI-driven intelligent document processing starts at a higher accuracy and only improves over time. Here’s how AI is changing document processing.
The early days of deep learning focused primarily on improvements to computer vision classification models. This required developing better OCR technologies that could improve basic metrics such as character-level accuracy and word-error rate when extracting text from a document such as a PDF. In more recent years, NLP has exploded as its own sub-discipline within machine learning as more data and better sequential deep learning models have brought it to the forefront.
Today, NLP is often used to clean up the text that comes out of the OCR step of the pipeline. This approach typically employs techniques such as edit distance and language modeling, both of which can be used to predict the correct word in a given context.
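As a toy illustration of the edit-distance idea, a small expected vocabulary (the word list and similarity cutoff below are hypothetical) can be used to snap OCR-garbled words back to their most likely intended form. This sketch uses Python's standard library, which ranks candidates by a similarity ratio closely related to edit distance:

```python
import difflib

# Hypothetical vocabulary of words we expect in this document type.
VOCAB = ["invoice", "total", "amount", "date", "customer", "address"]

def correct_word(word: str, vocab=VOCAB, cutoff: float = 0.6) -> str:
    """Replace an OCR-garbled word with its closest vocabulary match.

    difflib scores candidates by a similarity ratio; if nothing is
    close enough to clear the cutoff, the word is kept as-is.
    """
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct_word("lnvoice"))  # OCR read 'i' as 'l'
print(correct_word("t0tal"))    # OCR read 'o' as '0'
```

A production system would condition the cutoff on OCR confidence and use a language model to break ties between equally close candidates, but the core "closest valid word" logic is the same.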
While simple n-gram-based language models have been around since the dawn of AI, recent developments in transformers have allowed the creation of language models such as BERT, GPT-3, and XLNet, which can predict words in context with near-human accuracy.
These models can be used for word sense disambiguation or to choose among candidates when the OCR engine is not confident in a word or makes an uncalibrated error. A common example is character confusion, such as transcribing an I instead of a 1 or a 0 instead of an O. Other techniques, such as named entity recognition and even regular expressions, can be used to extract useful information from the transcribed text: names, dates, places, companies, and so on.
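The character-confusion fix and the regex-based extraction can both be sketched with the standard library. The confusion pairs (O/0, I/l/1, S/5) and the date and amount patterns below are illustrative assumptions, not a complete set:

```python
import re

def fix_digit_confusions(token: str) -> str:
    """In a mostly-numeric token, map commonly confused letters
    (O -> 0, I/l -> 1, S -> 5) to the digits OCR likely missed."""
    digits = sum(c.isdigit() for c in token)
    letters = sum(c.isalpha() for c in token)
    if digits > letters:  # treat the token as numeric
        return token.translate(str.maketrans("OoIlS", "00115"))
    return token

# Simple regex-based extraction of dates and amounts from corrected text.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
AMOUNT_RE = re.compile(r"\$\d[\d,]*\.\d{2}")

raw = "Total $1,2O5.0O due 2O23-O1-15"
text = " ".join(fix_digit_confusions(t) for t in raw.split())
print(DATE_RE.findall(text))    # dates recovered after confusion fixes
print(AMOUNT_RE.findall(text))  # dollar amounts
```

Note that the fix is context-sensitive: "Total" is left alone because it is mostly letters, while "2O23-O1-15" is treated as numeric, which is exactly the kind of calibration a transformer-based language model performs with far richer context.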
As much as machine learning models have improved in recent years, they still need quality data to be trained properly. The ease of obtaining quality training data at scale has improved dramatically with the number of both manual and automated data labeling solutions that are available. It is also possible to pay outsourcers who hire qualified human annotators—even domain experts—to manually label datasets at a relatively low cost.
This helps models generalize better and be fine-tuned to any company’s unique datasets through a process called transfer learning. What’s more, due to the rapid turnaround that data labeling services provide, it’s possible to regularly retrain models in response to newly gathered data, allowing them to improve rapidly as a company gains more customers. In addition, models can also be updated on the fly to adapt to different document types without the need to rebuild the entire extraction pipeline.
New computer vision model architectures have allowed smarter extractions of fields from forms and other structured documents. For example, long short-term memory (LSTM) networks, a type of recurrent neural network (RNN), can be trained to predict form boundaries, something that was impossible in the early days of document processing. Thanks to computer vision, these systems no longer simply extract text: QR codes, bar codes, and images are all fair game now.
Fields can also be matched on the basis of contextual language attributes, not solely their position on the page. This is a huge improvement over older techniques that relied on hard-coded OCR templates, which would fail when confronted with a new document type or format.
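To make the contrast with hard-coded templates concrete, here is a minimal sketch of language-anchored field matching. The field names and label synonyms are hypothetical; the point is that each field is located by the phrases that announce it, wherever they fall on the page:

```python
import re

# Hypothetical label synonyms: the same field may be announced
# differently across formats, so we anchor on language, not position.
FIELD_LABELS = {
    "invoice_number": ["invoice no", "invoice #", "inv number"],
    "due_date": ["due date", "payment due", "due by"],
}

def extract_fields(text: str) -> dict:
    """Find each field's value by searching for any of its label
    phrases and capturing the token that follows it."""
    found = {}
    lowered = text.lower()
    for field, labels in FIELD_LABELS.items():
        for label in labels:
            m = re.search(re.escape(label) + r"\s*[:#]?\s*([\w/-]+)", lowered)
            if m:
                found[field] = m.group(1)
                break
    return found

print(extract_fields("Payment due: 2023-04-01 ... Invoice no: A-1042"))
```

A template-based system would break the moment the invoice number moved to a different corner of the page; this approach keeps working because the matching signal is the surrounding language.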
These days, multimodal transformer architectures can extract a document’s text and images while learning its layout in a single step. This has led to orders-of-magnitude improvements in document understanding and opened the door to processing media-rich documents of all forms and functions.
Linters are programs that check computer code for correctness and proper style. This idea can be extended to documents by combining NLP techniques such as named entity recognition, summarization, language modeling, and other methods to apply linting to natural language documents.
This means that many complex tasks can be automated via ML, such as checking legal contracts for validity, assessing hierarchical document structure for correctness, validating financial filings, etc.
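A real contract linter would lean on named entity recognition and language models, but the shape of the idea can be shown with simple rules. The required sections and the date-validity check below are hypothetical stand-ins for learned checks:

```python
import re

# A toy "document linter": the rule set is illustrative only,
# mirroring how a code linter flags style and correctness issues.
REQUIRED_SECTIONS = ["parties", "term", "governing law", "signatures"]

def lint_contract(text: str) -> list:
    """Return a list of problems found in a contract-like document."""
    problems = []
    lowered = text.lower()
    # Structural check: every required section heading must be present.
    for section in REQUIRED_SECTIONS:
        if section not in lowered:
            problems.append(f"missing section: {section}")
    # Validity check: flag obviously malformed dates like 2023-13-01.
    for y, mth, d in re.findall(r"\b(\d{4})-(\d{2})-(\d{2})\b", text):
        if not (1 <= int(mth) <= 12 and 1 <= int(d) <= 31):
            problems.append(f"invalid date: {y}-{mth}-{d}")
    return problems

doc = "PARTIES ... TERM ends 2023-13-01 ... GOVERNING LAW ..."
print(lint_contract(doc))
```

Each rule here is hand-written, but in an ML-driven linter the same interface holds: the document goes in, and a list of structured findings comes out for a human or downstream system to act on.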
Through linting and other enhanced NLP capabilities, the fields you want to extract from documents can be updated on the fly; there’s no need to rebuild your whole extraction pipeline.
The rise of pay-as-you-go cloud computing services such as AWS, Google Cloud, and Microsoft Azure has allowed companies with fewer resources to more easily get started building their own document processing pipelines. Elastic inference services such as AWS Lambda allow companies to rapidly construct their own APIs and offer them to customers as paid services. What’s more, the cloud vendors automatically handle the scaling of compute resources in response to user demand.
These services also make expensive hardware such as graphics processing units and tensor processing units available for on-demand training of machine learning models at a cost that’s a fraction of what it would be for most companies to go out and buy this same hardware outright. There are even services, such as Google Colab, that provide free access to GPUs for small ideation projects.
Furthermore, the growing number of MLOps products on the market simplifies matters further. With all these choices, handling the training, deployment, security, and other relevant aspects of ML inference can be as simple as plugging in a model and letting the software handle the rest of the work.
This has dramatically lowered the barrier to entry for companies that just want to dabble in AI to see how it can speed up their document processing workflows, without making a wholesale investment from the outset.
Businesses that don’t want the hassle of creating their own AI development teams and training their own models now have access to prebuilt document processing services. The major players in this area include Amazon Comprehend on AWS, Google Cloud's Document AI, Microsoft Azure's Form Recognizer, and Scale's Document AI. These services can be queried to perform common NLP tasks such as sentiment analysis, language modeling, document understanding, text annotation, and more.
The OCR piece of the pipeline can be handled via these companies’ corresponding computer vision APIs. There are also many up-and-coming competitors in this field that offer their own API-based solutions.
Finally, businesses that are willing to put in some development work will find thousands of pretrained and precoded models available for download via open-source repositories on GitHub. This means any company can get up and running with basic AI-based document processing in a matter of days or weeks, for little or no cost.
AI-driven intelligent document processing brings much higher accuracy to OCR-based systems, can handle more complex documents, and can run on relatively inexpensive cloud-based services. NLP advances continue to improve accuracy, while outsourced manual data labeling services and automated data labeling advances help reduce costs.
Meanwhile, advances in computer vision have allowed the processing of more complex, media-rich documents. Thanks to these transformative changes in AI-based document processing, it’s easier than ever to get started. Relatively inexpensive data labeling services and cloud-based services that provide powerful hardware on a pay-as-you-go basis can help you get a pilot project up and running quickly.