Before vehicles can safely drive themselves, they must be able to tackle the final 1% of difficult or ambiguous situations. We asked Yushi Wang, a machine learning research engineer at Scale AI, why these edge cases are so difficult to navigate and how AV (autonomous vehicle) companies are tailoring their data and models to meet the challenge.
Wang works on the Scale AI ML team specializing in autonomous vehicles. He has worked as an ML back-end researcher for several companies, including AV makers Zoox and Tesla. He was also the first employee of Helia, a computer vision company that was acquired by Scale AI.
Q: What are the biggest challenges in building AVs today?
Initially, the teams working on autonomous vehicles were trying to figure out how to approach the problem, what sensors were required, and how to formulate the tech stack. Since then, many of them have started to coalesce around certain approaches. Today, everyone seems to have an approach that works reasonably well, but the challenge they face now is training models to understand the last 1% of the scenarios they encounter on the road. These are the difficult scenarios, such as navigating around complicated traffic, around many obstacles, or in adverse weather conditions: the long tail, so to speak.
Q: What does such a tech stack for AV companies look like?
Many AV companies have certain modules in common in their software stacks. Some of the details might differ, but the common modules include the sensors, the software that turns sensor data into some understanding of the world around the vehicle, and the software that uses this understanding to control the car itself. Some companies also include a mapping component, or they use GPS. But overall, these are the components that you would see the most. The biggest challenge now is understanding the edge cases. The last 1% takes 99% of the time.
Q: What does this last 1% of edge cases consist of?
When I'm talking about the end of the long tail, I'm talking about events that are in the range of 1% or 0.1% or 0.01% of the time that you spend driving and that, as a human, you need to be able to respond to: things like accidents, very crowded scenes at intersections, unusual activity by pedestrians, atypical construction scenes, unexpected vegetation, anything that might be out of the ordinary from what you normally see in your training data. It’s not that we don’t have data. We don't have a good way to understand where the edge cases are. That’s the problem. Edge cases may happen very rarely, but the AV must be prepared for them in order to be fully self-driving. And they often need training data that the machine learning model might not have seen before.
Q: What are some of the strangest edge cases you've seen in autonomous vehicle data?
Anything you might imagine can happen. People leave shopping carts all over the place, you can see raccoons darting across the street, or you might see pedestrians walking willy-nilly about in an intersection, so many that you can't visually track them all. And somebody who darts into the street is always an issue. Double-parked vehicles might present obstacles that you have to navigate around and also plan around.
Q: What type of training data is needed to better handle these edge cases?
Your data might have plenty of examples of regular trucks. But I’ve seen many instances of unusual obstacles or vehicles, such as a truck towing or carrying another truck. Sometimes you might see a bicycle on the back of a truck. How do you plan to drive around that? Because that doesn't look like something that the machine learning model might have seen before. What do you do in those cases? How does the model determine how to navigate in that situation? Classifying and understanding these situations that aren’t reflected in your training data is difficult.
This is something that occurs in any machine learning problem. So the way we approach these edge cases, and the techniques that we use here, can be applied elsewhere.
Even though your training data may cover 99% of the things that you're likely to see on the road, that's not enough. Let's say 99.9% of the time your car does what it needs to do. Now imagine you have a million cars driving around in various cities. Even if your vehicles are stopping at a red light 99.9% of the time, those million vehicles may together encounter millions of red lights per day. A 0.1% failure rate then means thousands of possible accidents. So your training data has to cover all of the edge case scenarios that you may possibly encounter.
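To make the fleet arithmetic above concrete, here is a minimal back-of-the-envelope sketch. The fleet size and the 99.9% success rate come from the answer; the number of red lights each vehicle meets per day is an assumed figure chosen only for illustration.

```python
# Back-of-the-envelope fleet arithmetic.
# RED_LIGHTS_PER_VEHICLE is an assumption for illustration, not a number
# from the interview.
FLEET_SIZE = 1_000_000        # vehicles on the road (from the text)
SUCCESS_RATE = 0.999          # share of red lights handled correctly (from the text)
RED_LIGHTS_PER_VEHICLE = 5    # assumed red lights each vehicle meets per day


def expected_daily_failures(fleet_size, encounters_per_vehicle, success_rate):
    """Expected number of mishandled encounters per day across the whole fleet."""
    total_encounters = fleet_size * encounters_per_vehicle
    return total_encounters * (1.0 - success_rate)


failures = expected_daily_failures(FLEET_SIZE, RED_LIGHTS_PER_VEHICLE, SUCCESS_RATE)
print(f"{failures:,.0f} mishandled red lights per day")
```

Under these assumed numbers, a failure rate of one in a thousand still produces thousands of mishandled red lights every day, which is the point of the example.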
Q: What is the usual workflow for developing an AI-enabled computer vision system for an AV?
An ML researcher or team first determines what information it needs to solve the problem, such as detecting something like a truck in the environment and controlling a vehicle to avoid it. Then they start prototyping, usually modifying an existing machine learning model, reformulating parts of it to give the desired output. Iteration is the key. Nobody ever comes up with a perfect model on the first try, and iteration is standard engineering practice.
Q: How does model iteration typically happen?
Finding the best way forward is a huge challenge. In school, we were taught to tune the hyperparameters of the model, such as the learning rate or how many epochs you train each iteration of the model for. But sometimes, to reduce or eliminate edge cases, you have to change the problem you’re trying to solve.
Sometimes the problem is not even formulated correctly. You might have to reformulate your problem as something new. Maybe you keep your data entirely the same, but instead of trying to predict this kind of behavior or trying to predict this signal, you try to predict something else.
Q: Can you give an example of how you change the problem you’re trying to solve in order to reduce or eliminate an edge case?
At one company, we were working to train the algorithm for detecting lane markers on the road to guide vehicle trajectories. Our first approach was to detect the lane markers in two dimensions and then reproject these into 3D. But after many months of tuning the network and the geometric logic, we still had issues with inaccurate lanes. Instead, as a team, we started exploring approaches that would directly infer lane geometry using a top-down view, which bypassed the projection problem altogether and gave us better results in the long run. Of course, changing the problem formulation isn’t always a magic bullet, but it may give you different or fewer edge cases and a different way forward if one path is blocked.
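The interview does not say why reprojecting 2D lane detections into 3D was inaccurate. One plausible contributor, offered here only as an illustration, is that small pixel errors grow into large distance errors far from the camera. The sketch below assumes a simple pinhole camera looking straight ahead over perfectly flat ground; the focal length, camera height, and horizon row are made-up values, and this is not the team's actual pipeline.

```python
# Flat-ground back-projection for a forward-facing pinhole camera.
# All constants are assumed values for illustration.
FOCAL_PX = 1000.0      # focal length in pixels (assumption)
CAM_HEIGHT_M = 1.5     # camera height above the road in meters (assumption)
HORIZON_ROW = 360.0    # image row of the horizon / principal point (assumption)


def ground_distance_m(pixel_row):
    """Forward distance to the ground point imaged at the given row."""
    rows_below_horizon = pixel_row - HORIZON_ROW
    if rows_below_horizon <= 0:
        raise ValueError("pixel is at or above the horizon and never hits the ground")
    return FOCAL_PX * CAM_HEIGHT_M / rows_below_horizon


def one_pixel_error_m(pixel_row):
    """How far the estimate moves if the detection is off by one row."""
    return ground_distance_m(pixel_row - 1) - ground_distance_m(pixel_row)


for row in (660, 390):
    print(f"row {row}: {ground_distance_m(row):.1f} m ahead, "
          f"a 1 px error shifts it by {one_pixel_error_m(row):.2f} m")
```

In this toy setup the same one-pixel mistake costs a couple of centimeters near the car and more than a meter at 50 m, which hints at why predicting lane geometry directly in a top-down view can sidestep the problem.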
But I’ve also learned that sometimes your model isn't the problem.
Q: If not the model, what else might need to change?
Often you have to change the dataset that you’re using to train the model. You might need to add data that is underrepresented, such as that required to train the model for edge cases like uncommon obstacles.
Q: What kind of data do AV companies use to train AVs?
While some companies are adding newer types of data, such as from LiDAR (light detection and ranging), in general the training data consists of images that are labeled to indicate which objects they contain. For example, if you’re trying to build an algorithm that can detect cars, you need humans to draw a box around the cars or find those labeled images from another source. Then you need to segment the image, to understand which pixels correspond to roads, buildings, vehicles, and all sorts of other classes of objects. Getting this data en masse is probably one of the biggest roadblocks to establishing a successful ML team.
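As a rough picture of what such labeled data can look like, the sketch below defines a hypothetical record holding both kinds of annotation mentioned above: boxes drawn around objects and a per-pixel class mask. The class names, field names, and file path are illustrative assumptions, not any company's real schema.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BoxLabel:
    """One human-drawn box: a class name plus pixel corners (hypothetical schema)."""
    category: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        """Box area in pixels; degenerate boxes count as zero."""
        return max(0, self.x_max - self.x_min) * max(0, self.y_max - self.y_min)


@dataclass
class LabeledImage:
    """An image with box labels and a segmentation mask (hypothetical schema)."""
    image_path: str
    boxes: List[BoxLabel] = field(default_factory=list)
    # mask[row][col] holds an integer class id for that pixel.
    mask: List[List[int]] = field(default_factory=list)

    def pixel_counts(self, class_names: Dict[int, str]) -> Counter:
        """Count how many pixels were assigned to each class name."""
        return Counter(class_names[c] for row in self.mask for c in row)


CLASS_NAMES = {0: "road", 1: "building", 2: "vehicle"}
example = LabeledImage(
    image_path="frame_000.png",  # placeholder path
    boxes=[BoxLabel("car", x_min=1, y_min=1, x_max=3, y_max=2)],
    mask=[
        [1, 1, 1, 1],
        [0, 2, 2, 0],
        [0, 0, 0, 0],
    ],
)
print(example.boxes[0].area())
print(example.pixel_counts(CLASS_NAMES))
```

Even this tiny 4x3 example needs one box and twelve pixel labels, which gives a sense of why collecting such annotations at scale is costly.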
Q: What kind of data is needed for edge cases?
One AV company had a few famous incidents a few years back. They noticed that when people were trying to get off the freeway, when there was a shadow on the white line indicating the edge of the road, the vehicle sometimes confused the shadow with the paint, and this caused accidents. The company then focused on how to train its model to handle this specific situation. So they needed more data, but it had to be the subset of data that focused on the situations their algorithm wasn’t performing well on.
In this case what they did was find a bunch of clips with the same sort of lighting conditions on the part of the freeway where people were trying to get off. And they trained the model to detect the lane line. This is what I mean by understanding your data. You start out saying, “Here's the dataset of everything,” then you go on to discover that there’s one part of the dataset you don’t do well on, and you focus on those kinds of areas.
You find one edge case, you stamp it out; you find another edge case and you stamp it out. And you keep going until you have finally, maybe, stamped out enough that your car crashes far less often than a human driver.
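One simple way to discover the part of the dataset you don't do well on is to tag each evaluation example with its conditions and compare error rates per tag. The sketch below is a minimal version of that idea; the tag names and the evaluation results are made up for illustration and do not come from any real system.

```python
from collections import defaultdict


def error_rate_by_slice(records):
    """records: iterable of (slice_tag, is_correct) pairs. Returns tag -> error rate."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for tag, is_correct in records:
        totals[tag] += 1
        if not is_correct:
            errors[tag] += 1
    return {tag: errors[tag] / totals[tag] for tag in totals}


def worst_slices(records):
    """Slices sorted from highest to lowest error rate."""
    rates = error_rate_by_slice(records)
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Made-up evaluation results: (condition tag, did the model get it right?)
records = (
    [("daylight", True)] * 9 + [("daylight", False)] * 1
    + [("night", True)] * 4 + [("night", False)] * 1
    + [("shadow_on_exit_ramp", True)] * 1 + [("shadow_on_exit_ramp", False)] * 3
)

for tag, rate in worst_slices(records):
    print(f"{tag}: {rate:.0%} error rate")
```

The slice at the top of the list is the candidate for targeted data collection, in the same spirit as gathering more clips with a particular lighting condition.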
But the truth is that when you’re dealing with a space this large, the entire real world, you can’t analytically understand everything. You can only do something that statistically is going to work most of the time, do better than humans, and then continue to improve it. But you have to accept that not everything can be analytically inferred.
Q: How does an ML researcher get enough properly annotated data?
This has always been a difficult and sometimes expensive problem. If I were working by myself, creating a fully annotated, comprehensive dataset from scratch would be incredibly hard. I’d have to find a way to annotate it with high accuracy and likely hire someone to do it for me, since I’m not going to sit in a room and draw 1,000 or 10,000 boxes around certain types of objects.
There are plenty of widely available annotated datasets, which is convenient, but they will only get you so far. Why? Because if the training data is well labeled and well defined for a certain problem, then that problem may have already been largely solved. So there’s less likelihood it will address your edge cases. But if you’re trying to solve a new problem, like trying to control a car in dark, rainy scenes or in a country with different rules of the road and traffic signs, you will likely need to create your own.
Creating an annotated dataset focusing on edge cases is where things get expensive and difficult because the best annotation techniques for that data still involve manual labor. But creating those annotations is absolutely necessary. Fortunately, there are automation tools and services available to help.
Q: Are there any new techniques for dealing with edge cases in AVs?
Over the last few years, work on self-driving cars has propelled a lot of development in active learning, where an algorithm can direct its own learning by querying us, its users, to label new examples. This often happens by having heuristics built in to identify failure cases, which can be sampled to create more effective training examples, as with Tesla’s fleet learning.
Active learning has become more and more studied in machine learning because people are starting to come across some limitations that aren’t from data alone, but instead arise from not effectively using this data. We have finite time and compute resources to create and learn from annotated data. Active learning can help us use our time and resources more efficiently to label and iterate on the important subsets of the data.
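One common active learning heuristic is uncertainty sampling: send for labeling the examples the current model is least sure about. The sketch below ranks unlabeled clips by the entropy of the model's predicted class probabilities. It illustrates the general idea only, not Tesla's or any other company's method, and the clip names and probabilities are invented.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability distribution; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Pick the `budget` example ids whose predictions are most uncertain.

    predictions: dict mapping example id -> list of class probabilities.
    """
    ranked = sorted(predictions, key=lambda key: entropy(predictions[key]), reverse=True)
    return ranked[:budget]


# Invented model outputs for four unlabeled clips (three classes each).
predictions = {
    "clip_a": [0.98, 0.01, 0.01],   # confident
    "clip_b": [0.40, 0.35, 0.25],   # very unsure
    "clip_c": [0.55, 0.45, 0.00],   # torn between two classes
    "clip_d": [0.90, 0.05, 0.05],   # fairly confident
}

print(select_for_labeling(predictions, budget=2))
```

With a labeling budget of two, the annotators' time goes to the two clips the model finds most ambiguous rather than to clips it already handles confidently.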
Q: How do you see this field evolving, and when will AVs be “safe enough”?
That is the billion-dollar question. For some time, we’ll continue to face the challenge of both making AVs at least as safe as humans and convincing the public and regulators that they are safe. But we will get there. I think of this like the history of commercial aircraft, which over the years have come a long way in both their safety and the public’s perception of their safety. But truly universally safe AVs will require constant improvement and close cooperation with regulators.