Why ML Needs a Human in the Loop

Putting humans in the loop gives your machine learning models the final push they need to achieve truly game-changing accuracy. Here's why.

Barrett Williams

Focusing on improving your machine learning models gets you only so far. Engineers and researchers often like to think that if they ensemble, tweak, and tune models indefinitely, they’ll maximize model accuracy. Yet in many scenarios, pesky (and decidedly unexciting) data labeling errors make those final few points of accuracy simply unreachable.

Putting humans in the loop—in other words, using both human and machine intelligence to develop and improve ML models—can make a huge difference.

In intelligent document processing (IDP), for example, humans can easily review and catch common errors, eliminating erroneous samples in the training data. Whether your model is analyzing mortgage applications, invoices, or health insurance cards, the human touch can give your ML models the final push they need to achieve truly game-changing, business-defining accuracy.

Human-in-the-loop (HITL) can take two forms. In the traditional form, an engineer with domain expertise pauses model training to adjust hyperparameters, or restarts a training job with a slightly modified model architecture. The second form, more flexible and perhaps more important for long-term success, is on-the-fly label correction and classification error mitigation. These can and should occur between every training run of your model, especially if new training data is constantly flowing into your training dataset.

Here’s why HITL should be part of your sustainable ML pipeline.

Off-the-Shelf Datasets Are Only Part of the Answer

When most data scientists tackle a new modeling problem, they look for preexisting datasets in the domain they are investigating. For example, an engineer designing a computer vision system for a household robotic vacuum might use the Microsoft COCO dataset to train the device to recognize common household objects.

Inevitably, though, every dataset is intended for a specific purpose, so the engineer will likely need to augment any off-the-shelf dataset with additional training data. Similarly, the engineer may find that a pretrained YOLO model works well for certain classes of objects the vacuum needs to detect, but not for others.

The engineer has two choices:

Augment the dataset (COCO, in this example) with captured images from a prototype robot

Attempt transfer learning by starting from a base model (YOLO, in this example) and retraining it on an entirely new dataset captured by the prototype robot

Both options may prove viable, but both will ultimately yield models that may not identify all listed object classes with equal accuracy, or that at some point produce classification results poor enough to prevent the engineer from shipping her product.

Humans Are Competent at Catching Most Model Errors

ML models often behave nondeterministically in practice: two models trained under slightly different regimes may not deliver exactly the same output for the same input data, and a single model may change its prediction at inference time when an input image’s pixels are slightly shifted or skewed in one direction.

Even if a model delivers a robust, reliable result nine times out of 10, the 10th result could be unexpected. Humans are skilled at catching these “silly” or fundamentally jarring errors in classification or transcription.

Thus, adding an error-catching layer of human intervention, particularly for low-confidence or otherwise uncertain classifications, can help to bring the classification accuracy of your model’s output up from, say, 90% to 99%. Even if some state-of-the-art ML models outpace human capabilities on specific tasks, the combination of ML model plus human review nearly always outpaces the accuracy of either alone.
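One common way to build that error-catching layer is to route predictions by model confidence. The following is a minimal sketch, not a prescribed implementation: the `Prediction` class, the `route_predictions` function, the sample IDs, and the 0.9 threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    sample_id: str
    label: str
    confidence: float  # model's probability for its top label, in [0, 1]


def route_predictions(predictions, threshold=0.9):
    """Split predictions into auto-accepted ones and ones needing human review.

    The threshold is an illustrative assumption; in practice it would be
    tuned against the cost of review and the cost of an undetected error.
    """
    accepted, needs_review = [], []
    for p in predictions:
        if p.confidence >= threshold:
            accepted.append(p)
        else:
            needs_review.append(p)
    return accepted, needs_review


# Hypothetical document-classification outputs.
preds = [
    Prediction("invoice-001", "invoice", 0.98),
    Prediction("card-002", "insurance_card", 0.61),
    Prediction("mortgage-003", "mortgage_application", 0.93),
]
accepted, needs_review = route_predictions(preds, threshold=0.9)
print([p.sample_id for p in needs_review])  # ['card-002']
```

Lowering the threshold sends fewer items to reviewers at the cost of letting more uncertain predictions through, so the value is a business decision as much as a technical one.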

In the case of both document processing and object detection, bringing a model’s accuracy up to 99% might require an army of 100 or even 1,000 humans. You’ll need to choose between deploying an in-house workforce and hiring a distributed, remote labeler workforce to check the validity and performance of a model on a sample dataset or even on individual data objects.

Once data samples or even whole datasets have been reviewed or corrected by humans, they can be oversampled or augmented so that they represent an outsized portion of the training dataset, increasing their relative influence on the model’s accuracy. This does introduce some recency bias toward the new data, which may or may not be desirable.
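A simple way to give reviewed samples that outsized influence is weighted sampling when drawing training batches. The sketch below assumes each sample carries a hypothetical `human_reviewed` flag; the function names and the weight of 3.0 are illustrative choices, not recommendations.

```python
import random


def build_sampling_weights(samples, reviewed_weight=3.0):
    """Give human-reviewed samples a larger sampling weight than the rest."""
    return [reviewed_weight if s["human_reviewed"] else 1.0 for s in samples]


def draw_training_batch(samples, batch_size, reviewed_weight=3.0, seed=0):
    """Draw a batch with replacement, favoring human-reviewed samples."""
    rng = random.Random(seed)
    weights = build_sampling_weights(samples, reviewed_weight)
    return rng.choices(samples, weights=weights, k=batch_size)


# Hypothetical dataset: one reviewed sample, two unreviewed ones.
samples = [
    {"id": "a", "human_reviewed": True},
    {"id": "b", "human_reviewed": False},
    {"id": "c", "human_reviewed": False},
]
weights = build_sampling_weights(samples)
# Expected share of draws that land on the reviewed sample: 3 / (3 + 1 + 1).
reviewed_share = weights[0] / sum(weights)
batch = draw_training_batch(samples, batch_size=10)
print(reviewed_share)  # 0.6
```

The reviewed sample is one third of the data but is expected to make up 60% of each batch, which is exactly the kind of bias toward new data that the weight should be chosen to control.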

That said, human involvement is not ideal for repetitive tasks such as running a hyperparameter grid search or brute-forcing a list of “tips and tricks” to get a model to converge or simply perform better.

Systematize Your HITL

There are a couple of ways to include HITL in your ML training process. Tools such as Apache Airflow are built around directed acyclic graphs (DAGs), which represent the steps in your training and modeling process. A DAG can let you save a model checkpoint for human review or adjustment, while ML infrastructure keeps training the model to completion.

The ML engineer can tweak the checkpoint model and train the new, modified model to completion. This can be particularly useful if members of an ML team are distributed across multiple time zones, since team members can collaboratively guide the modeling process based on shared or even disjoint domain expertise.
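To make the DAG idea concrete, here is a small sketch that uses Python's standard-library `graphlib` module rather than Airflow's own API. The step names are hypothetical; the point is only that human review of a checkpoint and continued training both depend on the saved checkpoint, so they can proceed in parallel.

```python
from graphlib import TopologicalSorter

# Each key lists the steps it depends on. Step names are hypothetical.
pipeline = {
    "train_to_checkpoint": {"load_data"},
    "save_checkpoint": {"train_to_checkpoint"},
    "human_review": {"save_checkpoint"},
    "continue_training": {"save_checkpoint"},
    "compare_models": {"human_review", "continue_training"},
}


def parallel_stages(dag):
    """Group steps into stages; steps in the same stage can run concurrently."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = parallel_stages(pipeline)
for number, steps in enumerate(stages, start=1):
    print(number, steps)
```

The fourth stage contains both `continue_training` and `human_review`, which mirrors the text: training does not have to stop while a person inspects the checkpoint.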

But the more valuable option is to send low-confidence samples or suspected erroneous data to data experts for relabeling and correction. APIs can help send suspected erroneous samples and classification examples to a distributed labeling service with a worldwide, on-demand workforce. Then humans can catch and remedy errors on an almost round-the-clock basis, eliminating the need for any pauses in the modeling pipeline.

You don’t have to wait when a training job completes but an ML practitioner is asleep or otherwise unavailable.
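The relabeling loop described above can be sketched in three steps: flag suspect labels, package them for a labeling service, and merge the corrections back. Everything here is an assumption for illustration, including the record fields, the 0.95 cutoff, and the payload shape; no real labeling API is being described, and the `corrections` dictionary stands in for whatever a service would return.

```python
import json


def find_suspect_labels(records, min_confidence=0.95):
    """Flag records where the model confidently disagrees with the stored label."""
    return [
        r for r in records
        if r["predicted_label"] != r["label"] and r["confidence"] >= min_confidence
    ]


def build_relabel_request(suspects):
    """Serialize suspects into a JSON payload for a hypothetical labeling API."""
    tasks = [
        {
            "sample_id": r["sample_id"],
            "current_label": r["label"],
            "model_suggestion": r["predicted_label"],
        }
        for r in suspects
    ]
    return json.dumps({"tasks": tasks})


def apply_corrections(records, corrections):
    """Return new records with human-corrected labels; inputs are not mutated."""
    return [
        {**r, "label": corrections.get(r["sample_id"], r["label"])}
        for r in records
    ]


# Hypothetical training records with the model's latest predictions attached.
records = [
    {"sample_id": "doc-1", "label": "invoice", "predicted_label": "invoice", "confidence": 0.99},
    {"sample_id": "doc-2", "label": "invoice", "predicted_label": "receipt", "confidence": 0.97},
    {"sample_id": "doc-3", "label": "receipt", "predicted_label": "invoice", "confidence": 0.55},
]
suspects = find_suspect_labels(records)
payload = build_relabel_request(suspects)
corrections = {"doc-2": "receipt"}  # stand-in for the reviewers' response
cleaned = apply_corrections(records, corrections)
print([r["sample_id"] for r in suspects])  # ['doc-2']
```

Only `doc-2` is flagged: `doc-3` also disagrees with its label, but the model is not confident enough for the disagreement to count as evidence of a labeling error.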

Humans Play a Role in Preventing Model Drift

Many models, once deployed in services that ingest real-time, real-world data, will drift over time. Model drift can be caused by many factors, including expanding a product to a new region, changing weather patterns, or unpredictable world events such as a pandemic.

Even if a model performs reliably for months, these changes can reduce accuracy such that the model is no longer viable. Human review, error correction, and relabeling can help mitigate model drift, particularly if the sections of the dataset that are relabeled or verified are pertinent to the changes in the source data.
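Human review can also double as a drift signal. The sketch below tracks how often the deployed model agrees with human reviewers over a rolling window and flags drift when agreement falls well below a baseline. The `DriftMonitor` class, the window size, and the tolerance are hypothetical choices, and this is only one of several ways drift might be detected.

```python
from collections import deque


class DriftMonitor:
    """Track rolling model accuracy on human-audited samples."""

    def __init__(self, baseline_accuracy, window=100, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.results = deque(maxlen=window)  # keeps only the newest audits

    def record(self, model_label, human_label):
        """Store whether the model agreed with the human reviewer."""
        self.results.append(model_label == human_label)

    def rolling_accuracy(self):
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def drift_detected(self):
        """True when rolling accuracy falls below baseline minus tolerance."""
        accuracy = self.rolling_accuracy()
        return accuracy is not None and accuracy < self.baseline - self.tolerance


monitor = DriftMonitor(baseline_accuracy=0.95, window=10, tolerance=0.05)
for _ in range(10):
    monitor.record("invoice", "invoice")  # model and reviewer agree
stable = monitor.drift_detected()

for _ in range(2):
    monitor.record("invoice", "receipt")  # reviewers start disagreeing
drifted = monitor.drift_detected()
print(stable, monitor.rolling_accuracy(), drifted)  # False 0.8 True
```

When the flag trips, the disagreeing samples are natural candidates for relabeling and for inclusion in the next training run.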

Build a Sustainable, Drift-Proof ML Pipeline

HITL QA, error checking, and labeling can play a crucial role in successfully training a new model, or even engaging in transfer learning. At the model level, ML engineers can impart domain expertise to guide a model toward convergence or better performance, but often this type of HITL involvement scales only to the size of the data team.

HITL error checking and label error mitigation can scale to a global, distributed, on-demand workforce, enabling ever-increasing model performance, even assuming that model architecture and hyperparameters never change. DAG-oriented tooling such as Apache Airflow can help deliver low-latency human feedback for both model and dataset adjustments and should serve as an essential component of any modern, sustainable, drift-proof ML pipeline.
