Monitoring and Quality Assurance of Complex ML Deployments via Assertions - with Stanford University Dawn Lab

Posted Oct 27, 2021 | Views 3.3K

# Tech Talk

# MLOps & Infrastructure

Speaker

Daniel Kang

PhD Student @ Stanford University, DAWN Lab

Daniel Kang is a sixth-year PhD student in the Stanford DAWN lab, co-advised by Professors Peter Bailis and Matei Zaharia. His research focuses on systems approaches for deploying unreliable and expensive machine learning methods efficiently and reliably. In particular, he focuses on using cheap approximations to accelerate query processing algorithms and on new programming models for ML data management. Daniel is collaborating with autonomous vehicle companies and ecologists to deploy his research. His work is supported in part by the NSF GRFP and the Google PhD Fellowship.

SUMMARY

Machine Learning (ML) is increasingly being deployed in complex situations by teams. While much research effort has focused on the training and validation stages, other parts have been neglected by the research community. In this talk, Daniel Kang will describe two abstractions (model assertions and learned observation assertions) that allow users to input domain knowledge to find errors at deployment time and in labeling pipelines. He will show real-world errors in labels and ML models deployed in autonomous vehicles, visual analytics, and ECG classification that these abstractions can find. He will further describe how they can be used to improve model quality by up to 2x at a fixed labeling budget. This work is being conducted jointly with researchers from Stanford University and Toyota Research Institute.

TRANSCRIPT

Daniel Kang (02:08):

Hi, my name is Daniel Kang, and today I'll be talking about monitoring and quality assurance in complex ML deployments with assertions. To set the stage for this talk, errors in ML models can lead to downstream consequences. For example, there have been serious accidents that have already involved autonomous vehicles. And this is a more general trend where errors in ML models can have extreme consequences. And we think that this is particularly problematic because there's no standard way of doing monitoring or quality assurance over these models and over these pipelines that deploy these models.

Daniel Kang (02:46):

However, if we take a step back and look at Software 1.0 or software without machine learning, this software is also deployed in mission-critical settings. For example, software powers medical devices, rockets, and a whole range of other important devices. And part of the reason we trust the software is that important software is monitored and has rigorous quality assurance. We have a whole suite of tools ranging from assertions, to unit tests, to regression tests, et cetera, that help us vet this important software. And so in our research, we ask the question, can we design monitoring and quality assurance methods that work across the ML deployment stack? And in this talk, I'll describe abstractions for finding errors in ML deployments and in labeling pipelines.

Daniel Kang (03:35):

Throughout this talk, we'll use two properties. And these properties are that errors in ML models and labels can be systematic. For example, I'm showing here an example of an object detector trying to detect the bounding box of this car in this video. And as you can see, the predicted box flickers in and out rapidly. And we can fairly easily write an assertion of the form "cars should not flicker in and out of this video." Similarly, on the right hand side here, I'm showing an example of a missing motorcycle in a labeling pipeline. This is a concrete example of a more general phenomenon where labelers can consistently miss certain objects.

Daniel Kang (04:18):

And while these errors may seem trivial, if we look at the safety report for one self-driving car collision, the safety report says, "As the automated driving system changed the classification of the pedestrian several times, alternating between vehicle, bicycle, and an other, the system was unable to correctly predict the path of the detected object." And so these errors are critical to detect and find. In the remainder of this talk, I'll describe two abstractions that we've developed to help with these tasks.

Daniel Kang (04:51):

The first are model assertions and the second are learned observation assertions or LOA. And we'll start with model assertions. But first to put model assertions in context, there's a whole pipeline for deploying ML models, including data collection, model development and training, statistical validation, and deployment and monitoring. And there's been a lot of work on the statistical validation side of things and some on the model development and training for finding errors, but less so on the data collection and labeling and deployment and monitoring. And this is where model assertions come in.

Daniel Kang (05:25):

And importantly, many users, potentially not even the model builders, can collaboratively add assertions. And so how can they go about doing this? Well, I'll first describe what model assertions actually are. For context, model assertions was work published in MLSys 2020. I'll put a link to the paper later in this presentation. So model assertions are black box functions which indicate when errors in ML models may be occurring. The inputs to model assertions are a history of inputs to the ML model and a history of predictions from the ML model. In this example for the bounding box detection use case, the inputs to the flickering assertion are the recent frames and the recent outputs of the model. A model assertion, in contrast to a [inaudible 00:06:08] software assertion, outputs a severity score. And here, this is a continuous floating-point value, where zero is an abstention by convention.

Daniel Kang (06:18):

Once we have these model assertions defined, we can then use them to find errors in ML models. For example, let's say we have these three frames of a video. We can write down an assertion of the form: if there's a box in frames one and three, there should be a box in frame two. If this assertion triggers, it can be used for corrective action, for example, shaking the wheel of a car, or for runtime monitoring, for example, populating a dashboard. In addition to this particular assertion, there are examples of other ones. What I'm showing here is an assertion that says that predictions from different autonomous vehicle sensors should agree. For example, there are two sensors here, a LIDAR and a camera sensor, and one predicts a truck and the other predicts a car for this vehicle type. And so necessarily one of them is incorrect.
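As a rough illustration of the three-frame flickering assertion described above, here is a minimal Python sketch. The function name `flicker_assertion`, the box format, and the choice to count one unit of severity per flickering frame are assumptions made for this example; they are not taken from the talk or the paper's code.

```python
from typing import List, Sequence, Tuple

# Hypothetical box format for this sketch: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def flicker_assertion(
    recent_inputs: Sequence[object],
    recent_outputs: Sequence[List[Box]],
) -> float:
    """Return a severity score; 0.0 means the assertion abstains.

    recent_inputs is the history of model inputs (frames). It is unused
    here but kept so the signature matches the description in the talk.
    recent_outputs is the history of predicted boxes, one list per frame.
    """
    severity = 0.0
    # Slide a three-frame window over the prediction history.
    for prev, cur, nxt in zip(recent_outputs, recent_outputs[1:], recent_outputs[2:]):
        # A box in frames one and three but none in frame two is a flicker.
        if prev and nxt and not cur:
            severity += 1.0
    return severity
```

A nonzero score could then be routed to a dashboard or used to trigger a corrective action, as described in the talk.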

Daniel Kang (07:03):

Importantly, assertions can be specified in just a few lines of code. For example, I'm showing here the assertion from the previous slides specified in code with [inaudible 00:07:13] code with a couple of helper functions. But nonetheless, these are typically very simple. In addition to specifying model assertions as black box functions, we can also specify model assertions via a consistency API we developed. The consistency API allows users to automatically specify two kinds of assertions. The first is that transitions cannot happen too quickly. For example, what I'm showing here is a unique identifier assigned to an object or a thing, where it appears in timestamps one, two, and four, but not in timestamp three. And this would trigger an assertion. Similarly, within a given scene or time period, attributes with the same identifier must agree. So for example, the predicted gender and hair color here should agree, conditional on the identifier being correct.
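The two kinds of consistency checks described above can be sketched roughly as follows. This is not the actual consistency API: the observation format (dictionaries with `id` and `t` keys), the function names, and the `min_absence` parameter are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def transition_violations(observations, min_absence=3):
    """Flag identifiers that disappear and reappear too quickly.

    Each observation is a dict with an "id" and an integer timestamp "t".
    An absence shorter than min_absence timestamps is treated as suspicious.
    """
    seen = defaultdict(list)
    for obs in observations:
        seen[obs["id"]].append(obs["t"])
    violations = []
    for ident, times in seen.items():
        times = sorted(set(times))
        for a, b in zip(times, times[1:]):
            missing = b - a - 1
            if 0 < missing < min_absence:
                violations.append((ident, list(range(a + 1, b))))
    return violations


def attribute_violations(observations, attributes):
    """Flag (identifier, attribute) pairs whose predicted values disagree."""
    values = defaultdict(lambda: defaultdict(set))
    for obs in observations:
        for name in attributes:
            if name in obs:
                values[obs["id"]][name].add(obs[name])
    return [
        (ident, name)
        for ident, per_attr in values.items()
        for name, vals in per_attr.items()
        if len(vals) > 1
    ]
```

For an identifier seen at timestamps one, two, and four, `transition_violations` would report timestamp three as the suspicious gap, matching the example in the talk.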

Daniel Kang (08:02):

As a concrete example of this, let's take a look at deploying model assertions for TV news analytics. The model predicts Christi Paul on the left hand side and Poppy Harlow on the right hand side. And we can specify that this is problematic using our consistency API by saying that overlapping boxes in the same scene should agree on the attributes. And this is automatically specified via our consistency assertions.

Daniel Kang (08:27):

In addition to specifying model assertions and using them for corrective action and for runtime monitoring, perhaps surprisingly, we can also use model assertions to help train models. The way this works is that given a set of inputs that triggered an assertion, we can use a human labeler to generate human labels, for example, using Scale AI, and use these human-generated labels to train the model. While there are other methods of selecting data points to collect, model assertions are agnostic to data type, task, and model, whereas previous methods for collecting data typically require some specific assumptions about one of these three. And so this is a new data collection API.

Daniel Kang (09:07):

However, this raises several questions, the most important for this particular aspect being how should we select data points to label for active learning? Let's say we have a set of data points here, and one assertion flags these two data points and another flags these two data points. Well, this is a more general trend where many assertions can flag the same data points and the same assertion can flag many data points, which raises a question: which points should we label given these potentially conflicting signals? And we've developed a model assertion-based bandit algorithm. I won't have time to go into full details of this algorithm, but the idea is to select the model assertions with the highest reduction in assertions triggered, and, under assumptions on how model assertions relate to the quality of the model, this can be shown to be a good strategy.
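The talk does not give the details of the bandit algorithm, so the following is only a loose sketch of the stated idea: prefer the assertion whose trigger count dropped the most between rounds. It uses a plain epsilon-greedy rule as a stand-in, which is an assumption and not the algorithm from the paper; the function name, arguments, and data layout are hypothetical.

```python
import random


def select_points(flagged, prev_counts, curr_counts, budget, epsilon=0.1, rng=None):
    """Pick up to `budget` data points to label.

    flagged maps an assertion name to the data points it flagged.
    prev_counts and curr_counts map an assertion name to the number of
    times it triggered in the previous and current round.
    """
    rng = rng or random.Random(0)
    names = sorted(flagged)
    # Reduction in triggers is used as a rough proxy for how much
    # labeling that assertion's points helped the model.
    reduction = {n: prev_counts[n] - curr_counts[n] for n in names}
    remaining = {n: list(flagged[n]) for n in names}
    chosen = []
    while len(chosen) < budget and any(remaining.values()):
        candidates = [n for n in names if remaining[n]]
        if rng.random() < epsilon:
            name = rng.choice(candidates)  # explore
        else:
            name = max(candidates, key=lambda n: reduction[n])  # exploit
        point = remaining[name].pop(0)
        # The same point may be flagged by several assertions.
        if point not in chosen:
            chosen.append(point)
    return chosen
```

Because a point flagged by several assertions is only selected once, the labeling budget is not spent twice on the same data point.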

Daniel Kang (09:57):

We can also use model assertions to train models via weak supervision. In this method, given a set of inputs that triggered an assertion, we can use model assertions paired with corrective rules to generate weak labels and use these weak labels to retrain the model. Similar to active learning, using weak supervision is not a new idea, but this is a new method of generating weak labels which can be used in conjunction with other forms of weak supervision. As a concrete example of a correction rule to generate these weak labels, let's look at the flickering example from before. Here, the green boxes are the predicted boxes and the blue box is the box that's filled in from the surrounding two frames automatically. Similarly, for the consistency API, we can automatically correct or propose corrected labels by using the majority attribute as the updated label. And so, for example, here, we would predict M as the updated label.
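The two correction rules mentioned above might look roughly like this in Python. Averaging the coordinates of the two neighboring boxes is an assumption made for this sketch (the talk only says the box is filled in from the surrounding two frames), and both function names are hypothetical.

```python
from collections import Counter


def fill_missing_box(prev_box, next_box):
    """Propose a weak label for a flickering frame.

    Assumes linear interpolation: each coordinate is the midpoint of the
    corresponding coordinates in the surrounding two frames.
    """
    return tuple((p + n) / 2 for p, n in zip(prev_box, next_box))


def majority_label(labels):
    """Propose the most common attribute value as the corrected label."""
    return Counter(labels).most_common(1)[0][0]
```

For instance, if an identifier is predicted as M in two frames and F in one, `majority_label` would propose M as the weak label for all three.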

Daniel Kang (10:52):

To evaluate model assertions, we deployed model assertions across four real-world datasets: a video analytics dataset, a self-driving car dataset, an ECG reading dataset, and a TV news analytics dataset. The TV news analytics dataset is in the full paper. For the remainder of this section, we deployed five total assertions on the first three datasets. And for our retraining experiment, we used a quality metric of mean average precision, where mean average precision is a standard metric for object detection accuracy. The first thing we show is that model assertions can find errors with a high true positive rate, where here the true positive rate is, given that the model assertion flagged a data point, whether or not that data point actually contained an error. The first three assertions were deployed on the video analytics use case and the next two were deployed on the self-driving car and ECG datasets respectively.

Daniel Kang (11:46):

And as we can see here, the true positive rate across all five assertions we deployed here was at least 88%, meaning that model assertions can be deployed not only with few lines of code, but also with a high true positive rate. In other words, they find errors with high precision. We also show that model assertion-based active learning outperforms baselines. In particular, we compare against random sampling and uncertainty sampling for selecting data points, and a round of active learning corresponds to a fixed number of data points selected. As we can see here, model assertions outperform both baselines for selecting data points. To give a qualitative sense of the improvement, I'm showing the original SSD on the left and the best retrained SSD on the right, where this is over the same clip of data. As we can see, the right hand side is substantially more consistent than the left hand side and makes far fewer errors.

Daniel Kang (12:39):

Now that I've described model assertions, I'll describe learned observation assertions, or LOA. To put LOA in context, this is work published in the AIDB workshop in 2021, and LOA particularly focuses on data collection and labeling, but can also be used for deployment and monitoring. And the reason that we developed a tool specifically for vetting data collection and labeling is that vetting training data is critical for safety and liability reasons. But why do we need this in the first place? Well, perhaps surprisingly, training data is rife with errors. What I'm showing here are real examples from the Lyft Level 5 dataset. As we can see here, there's a whole host of missing cars, a missing truck, and even a missing car in motion. And this dataset has been used to develop models and host competitions. And if we give these models faulty training data, we'll get erroneous results out. And so even the best-in-class labeling services miss critical labels. And this is why it's so important to have systems in place to help find these errors.

Daniel Kang (13:49):

So how can LOA be used to find these errors? Well, at a high level, users define priors over observations or groups of observations, and then these observations or groups of observations are automatically ranked to determine which ones are most likely to contain errors. LOA leverages two organizational resources, which are things already present in existing deployments. The first are existing human labels. Existing human labels can provide examples of expected behavior, for example, expected box volumes, track lengths, and a whole host of other properties. Given these existing labels, we can then learn priors over expected and unexpected values for the properties I mentioned earlier, for example, the box volume.
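To make the idea of learning a prior from existing labels and ranking observations more concrete, here is a small Python sketch. It assumes a Gaussian prior over log box volume and ranks by distance from the mean in standard deviations; both choices, along with the function names and the `volume` field, are assumptions for illustration and are not described in the talk.

```python
import math
import statistics


def fit_volume_prior(label_volumes):
    """Fit a simple prior over box volume from existing human labels.

    Assumes log-volumes are roughly Gaussian; returns (mean, std).
    """
    logs = [math.log(v) for v in label_volumes]
    return statistics.mean(logs), statistics.stdev(logs)


def rank_by_surprise(candidates, prior):
    """Rank observations so the least expected volumes come first."""
    mean, std = prior

    def surprise(item):
        # Distance from the learned mean, in standard deviations.
        return abs(math.log(item["volume"]) - mean) / std

    return sorted(candidates, key=surprise, reverse=True)
```

Observations at the top of the ranking are the ones a reviewer would inspect first when looking for labeling errors.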

Daniel Kang (14:33):

The second set of organizational resources we use are existing ML models. And as we can see here, ML models can actually be fairly accurate most of the time. And as a result, they can provide information about potentially missing tracks in human labels, which are particularly problematic and important to find. As a concrete example, as I showed before with this motorcycle, the human labeler missed this motorcycle which is only present in about half a second of this video, but nonetheless, it's critical to detect because we don't want the vehicle striking this motorcycle because it didn't detect it. The ML model actually picks up this motorcycle just fine and can propose it as a potentially missing label.
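One simple way to use model predictions to propose potentially missing human labels is to look for predicted boxes that no human box overlaps. The sketch below works on 2D boxes in a single frame with an intersection-over-union threshold; the talk deals with tracks in autonomous vehicle data, so this is a simplification, and the function names and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def propose_missing_labels(model_boxes, human_boxes, iou_threshold=0.5):
    """Return model boxes that do not match any human-labeled box."""
    return [
        m for m in model_boxes
        if all(iou(m, h) < iou_threshold for h in human_boxes)
    ]
```

In the motorcycle example from the talk, the model's detection would have no matching human label and so would be proposed for review.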

Daniel Kang (15:12):

Okay, so how do we evaluate LOA? Well, we deployed LOA over two real-world autonomous vehicle datasets. The first dataset we deployed this over is the Lyft Level 5 dataset, which is publicly available. You can in fact download this dataset and see all the errors that I've shown in this talk today. The second dataset, used in conjunction with the Toyota Research Institute (TRI), is an internal dataset. We deployed LOA across both datasets to find errors, particularly for cars, pedestrians, and motorcycles, and we measured the precision at 10, which means that for each method we used to find errors, we took the top 10 ranked errors and counted how many of them were actually errors in the human labels.
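The precision at 10 metric described above can be computed as in the following sketch. The function name and the representation of candidates as hashable identifiers are assumptions for this example.

```python
def precision_at_k(ranked_candidates, true_errors, k=10):
    """Fraction of the top-k ranked candidates that are real errors."""
    top = ranked_candidates[:k]
    hits = sum(1 for c in top if c in true_errors)
    return hits / k
```

For example, if 7 of the top 10 ranked candidates turn out to be real labeling errors, the precision at 10 is 0.7.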

Daniel Kang (15:53):

The first thing we show is that LOA can find errors with high precision. What I'm showing here is the precision at top 10 on the Lyft dataset and our internal dataset. And as we can see here, FIXY, the system that implements LOA, outperforms on both datasets. One thing that is not easy to see from these metrics is that the Lyft dataset is substantially noisier than our internal dataset and results in lower quality models. And so it's perhaps more impressive that FIXY can achieve such high precision for the Lyft dataset given the amount of noise present in this dataset. As concrete examples of errors in human labels, there are these missing motorcycles, missing cars, and even these missing cars in motion.

Daniel Kang (16:35):

In addition to finding errors in human labels, LOA can also find errors in ML models not found by ad-hoc model assertions. As I mentioned, model assertions are often created manually. To measure this, we excluded model errors found by model assertions and then measured the precision of LOA and uncertainty sampling at finding the remaining model errors. And as we can see, LOA achieves a precision of 82% compared to 42% from uncertainty sampling, which is almost a two times improvement. As a concrete example of errors found in ML models, we can see that here, the orange is the ground truth and the boxes in black are predicted by the model. As you can see, the model-predicted boxes are inconsistent over time. And so LOA can flag them as being likely to be erroneous. And so in particular, LOA can find overlapping but unlikely tracks not found by model assertions.

Daniel Kang (17:36):

Everything that I mentioned in this talk is published and has code that is open sourced or will be open sourced, and the links are here. Please take a screenshot if you would like to access them. But in conclusion, errors are rife in both training data and ML models at deployment time. And we present model assertions and LOA, two abstractions for finding errors in ML pipelines. And I think we need much more work on the ML deployment stack beyond just training. And I hope that this talk serves as an inspiration for work that focuses on parts beyond just the training pipeline.

Daniel Kang (18:10):

My email is here and please reach out if you'd like to talk about anything or have any questions. And also my Twitter handle is here as well. Thank you for your time and please reach out if you have any questions or would like to chat.
