For most machine learning implementations, good isn’t good enough. When stakes are high, only a high-performing ML model will do, but achieving that level of performance is extremely challenging. Below I'll walk you through three lessons I've learned that help ML engineers build high-performance ML models and deploy them in production.
The challenge here isn’t hard to understand. We typically measure the performance of ML models using aggregate metrics that are likely to tell you that the model is correct in perhaps 90% of cases. But what about the remaining 10%? These are often the long tail of the data distribution consisting of rare but highly mission-critical edge cases. Without tackling this long tail, you won’t be able to build and efficiently iterate on high-performing ML systems for critical, real-world applications.
For example, consider autonomous vehicle (AV) systems. They must robustly detect pedestrians in all kinds of situations. Many situations involving pedestrians are not critical, because the pedestrians are on the sidewalk. Some situations, however, such as a pedestrian walking on a crosswalk at night, are mission-critical for safe AV operation. If you rely only on aggregate metrics over a full test dataset, scenes of crosswalks at night are likely to be underrepresented. Aggregate performance on the whole test set might go up while performance on one of the critical cases degrades.
Aggregate metrics are necessary, but not sufficient to properly evaluate model performance. You need to set up evaluations on smaller, mission-critical subsets of data to better understand where your model is performing well and where you're still facing problems.
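To make the gap between aggregate and subset metrics concrete, here is a minimal Python sketch. The record layout, the `scenario` metadata field, and the function name `accuracy_by_slice` are assumptions made for illustration, not part of any particular tool.

```python
from collections import defaultdict

def accuracy_by_slice(records, slice_key):
    """Return (overall_accuracy, {slice_value: accuracy}) for labeled predictions."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        slice_value = record[slice_key]
        totals[slice_value] += 1
        hits[slice_value] += int(record["prediction"] == record["label"])
    overall = sum(hits.values()) / sum(totals.values())
    per_slice = {s: hits[s] / totals[s] for s in totals}
    return overall, per_slice

# Hypothetical evaluation records: 18 easy sidewalk scenes and
# 2 night crosswalk scenes, one of which the model gets wrong.
records = (
    [{"scenario": "sidewalk_day", "label": 1, "prediction": 1}] * 18
    + [
        {"scenario": "crosswalk_night", "label": 1, "prediction": 0},
        {"scenario": "crosswalk_night", "label": 1, "prediction": 1},
    ]
)

overall, per_slice = accuracy_by_slice(records, "scenario")
print(f"overall: {overall:.2f}")
for scenario, accuracy in sorted(per_slice.items()):
    print(f"{scenario}: {accuracy:.2f}")
```

In this made-up data the overall accuracy looks healthy at 95%, while the night crosswalk subset sits at 50%. The aggregate number alone would hide exactly the case that matters most.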
Edge cases include everything that can go wrong in the real world that you never expected would happen when you first trained your model. Identifying, analyzing, and learning from these edge cases means accepting the fact that building ML systems is not a one-time task that you’ll get right on the first try. As Andrew Ng recently stated, the process of ML development does not end with your first model deployment; it merely starts there.
To travel the road to high-performing ML models, you need to continuously identify your edge cases and improve model performance on them. The right tools can simplify this significantly: they let you identify model issues on subsets of data, form granular insights into data biases and performance metrics, and make both the data and the metrics intuitive to explore. The result will be faster development of high-performing ML models.
In many cases, adding as much metadata as possible to the recorded data can help you obtain more granular insights and better structure the data for a more detailed model analysis, which in turn will help you to identify biases and edge cases. In addition, intelligent tools that help you find data that is similar to observed failure cases are a game changer for edge case mining.
Let’s consider another example related to robot grasping. Robots sometimes struggle with robustly detecting objects when objects are reflected on flat surfaces. While you could just avoid having such flat surfaces, the proper fix for this perception model is to find more of those potential failure cases, augment the training dataset, and retrain the model. But how do you find more such potential failure cases among all of the recorded data? The answer is to use intelligent tooling, which allows you to conduct a fine-grained similarity search on the whole dataset and curate what should be annotated and added to the training set.
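One common way to implement such a similarity search is to compare embedding vectors, for instance with cosine similarity. The sketch below assumes that every recorded frame already has an embedding from some feature extractor; the frame ids, the toy three-dimensional vectors, and the helper `most_similar` are hypothetical, and a real dataset would typically need an approximate nearest-neighbor index instead of this brute-force loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(query, pool, k=2):
    """Return the ids of the k pool items closest to the query embedding."""
    scored = [(cosine_similarity(query, emb), item_id) for item_id, emb in pool.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:k]]

# Hypothetical embedding of an observed failure case (e.g. a reflected object).
failure_embedding = [0.9, 0.1, 0.0]

# Hypothetical embeddings of recorded, not yet annotated frames.
unlabeled_pool = {
    "frame_001": [0.8, 0.2, 0.1],
    "frame_002": [0.0, 1.0, 0.0],
    "frame_003": [1.0, 0.0, 0.0],
    "frame_004": [0.1, 0.0, 0.9],
}

candidates = most_similar(failure_embedding, unlabeled_pool, k=2)
print(candidates)  # frames to review, annotate, and add to the training set
```

The frames returned are candidates for review rather than confirmed failures; a person still decides what gets annotated and added to the training set.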
Finding and fixing many such edge cases takes time and effort but is crucial to move from decent to high-performing ML models. To make iterations more efficient, you need to automate as many of these processes as possible.
Because fixing model performance on edge and failure cases can be a lengthy process, it’s especially important to avoid performance regressions on every failure case you fix. Once a bug has been resolved, it needs to stay that way. After all, you don’t want to manually track the performance on all previous bugs to ensure that they remain corrected; doing so would slow development to an unsustainable degree.
Consider an example similar to the one above: After you retrain a perception model with more data on hatched road markings, its performance at detecting pedestrians on crosswalks suddenly starts to degrade. This can cause dangerous situations once you've deployed the model in the vehicle. It is crucial to catch performance regressions early in the process and iterate before a model with new failure modes is deployed.
The problem described above is not new to engineers: Software development has been facing a related problem over the last few decades. Successful software engineers working on complex projects set up sophisticated CI/CD pipelines for unit and integration tests that let them easily define the desired behavior of software and catch regressions if that behavior is not achieved. It’s important to use similar testing pipelines for ML models.
The solution is to stop relying solely on aggregate metrics to judge whether a new ML model is an improvement. Make an effort to better understand performance on mission-critical edge cases and define dedicated input/output tests for your models to ensure that you achieve the desired performance not just for your test sets as a whole, but for every scenario.
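A scenario-level regression check of this kind could look roughly like the following Python sketch. The scenario names, scores, thresholds, and the function `find_regressions` are all made up for illustration; in practice the scores would come from evaluating the baseline and candidate models on curated scenario test sets.

```python
def find_regressions(baseline, candidate, thresholds, tolerance=0.01):
    """Return {scenario: reason} for every scenario the candidate model fails.

    A scenario fails if the candidate scores below the required minimum,
    or if it drops more than `tolerance` below the baseline model.
    """
    failures = {}
    for scenario, minimum in thresholds.items():
        new_score = candidate[scenario]
        old_score = baseline[scenario]
        if new_score < minimum:
            failures[scenario] = f"below required {minimum:.2f} (got {new_score:.2f})"
        elif new_score < old_score - tolerance:
            failures[scenario] = f"regressed from {old_score:.2f} to {new_score:.2f}"
    return failures

# Hypothetical per-scenario requirements and scores (e.g. recall).
thresholds = {"crosswalk_night": 0.95, "hatched_markings": 0.90, "sidewalk_day": 0.90}
baseline = {"crosswalk_night": 0.97, "hatched_markings": 0.85, "sidewalk_day": 0.99}
candidate = {"crosswalk_night": 0.91, "hatched_markings": 0.93, "sidewalk_day": 0.99}

failures = find_regressions(baseline, candidate, thresholds)
for scenario, reason in failures.items():
    print(f"FAIL {scenario}: {reason}")
```

With these invented numbers, the retrained model fixes the hatched markings scenario but falls below the requirement on night crosswalks, so the check flags it. Wired into a CI pipeline, a non-empty `failures` result would block the deployment, much as a failing unit test blocks a software release.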
Once you've set up proper testing pipelines with broad scenario coverage for your ML models, you won’t need to fear introducing regressions with new iterations, not even when you’re dealing with an underrepresented class. You’ll be able to iterate more quickly and ultimately deploy your models with confidence.