Taking Autonomous Driving from Research to Reality with Drago Anguelov

Posted Jun 21, 2021 | Views 1.2K
# Transform 2021
# Fireside Chat
SPEAKER
Drago Anguelov
Distinguished Scientist and Head of Research @ Waymo

Drago joined Waymo in 2018 to lead the Research team, which focuses on pushing the state of the art in autonomous driving using machine learning. Earlier in his career he spent eight years at Google, first working on 3D vision and pose estimation for Street View, and later leading a research team which developed computer vision systems for annotating Google Photos. The team also invented popular methods such as the Inception neural network architecture and the SSD detector, which helped win the ImageNet 2014 Classification and Detection challenges. Prior to joining Waymo, Drago led the 3D Perception team at Zoox.

SUMMARY

Drago Anguelov discusses the state of perception and self-driving, future research, and broader trends in the autonomous vehicle industry.

TRANSCRIPT

Brad Porter: Thank you, Raquel. Staying on the topic of autonomous vehicles, I’m excited to introduce our next fireside chat guest, Drago Anguelov. Drago is a Distinguished Scientist and the Head of Research at Waymo. Drago focuses his research on pushing the state of the art in autonomous driving. Prior to joining Waymo, Drago spent eight years at Google, first working on 3D vision and pose estimation for Street View, and later leading the development of computer vision systems for Google Photos. Scale was initially founded with a focus on self-driving and perception, and we are always grateful to connect with Drago to learn about his research and how the Waymo team is working to make autonomous driving a reality. Drago, welcome. And Alex, thank you, the stage is yours.

Alexandr Wang: Drago, thank you so much for joining. We’ve known each other for a while, but always excited to chat with you. And thanks so much for joining the conference.

Drago Anguelov: Thank you for having me. First of all, Alex, it’s a pleasure as always, talking to you.

State of Perception

Alexandr Wang: I want to start off by asking a question about perception. You started working on perception over a decade ago, and there have been incredible advances in the field over that time period. Where do you think the state of the art is for perception? Do you think it’s so good that it’s no longer the bottleneck for many robotics problems?

Drago Anguelov: I have indeed been working on perception for a long time, since around the year 2000, give or take. Initially, when I was at Google, I would say the state of perception was really good for face detection and maybe image matching and a little bit of OCR, but not much else. And then, I mean, it’s gone through a tremendous revolution. In the last five years, I’ve been involved with autonomous driving and perception for autonomous driving. And I would say that every year we still have quite significant advancements and improvements, which is very exciting. I would say that currently there is a lot of this feeling in the community. Some people say, “Oh, perception is solved, and let’s now solve behavior prediction and planning.” I think perception naturally is the first place in robotics that deep learning entered and transformed. For example, Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton’s seminal paper and work, AlexNet in 2012, just blasted through the existing benchmarks of the time.

Drago Anguelov: And so perception is always the place in robotics where deep learning comes first, and then it moves up the stack. It’s been doing this for a while. That said, there are still some interesting problems in perception that we’re working on, and they relate not so much to the common cases but to the long tail. This is being able to address the rare events that come at you; you need to be robust to those. That’s a lot of training data potentially needed, which leads to the question of, how do we develop models that need less and less training data? And there’s a whole bunch of techniques that have emerged in the last two or three years that are very powerful. Also in the last two or three years, there’s been a lot of progress in understanding the three-dimensional nature of the world and building three-dimensional models, whether from Lidar, where it’s very natural, or also with cameras.

Drago Anguelov: And so we’re at this point where perception works very well for the vast majority of cases, but we still need to understand how best to handle it for the autonomous driving stack, both the rare cases and also what types of perception outputs we want. What are the right intermediate representations that should come out of the perception model, that are most amenable and helpful for prediction and planning and, of course, suitable for simulation? This is something that we’re still working out. I’ll give you one example: it’s a paper we published at Waymo called VectorNet. This was work from a while back, and we published it last year. It shows that if you model the map as a set of polylines, as opposed to just rendering it as an image, your capabilities for predicting agent behavior significantly improve.

Drago Anguelov: This is a lot more succinct and structured a representation. And then that leads to the question, “Okay, well, what’s the right intermediate representation of a map that we should be constructing to feed to the next stages in the modeling to deal with behavior? What are the right inductive biases in our domain?” So a lot of these representations we’re working out now. For every specific task, if you take 3D boxes, 3D flow, tracking and so on, we now have very performant models. But how to put the whole thing together, and which outputs really matter? How to combine them? Which leads to self-supervision and a bunch of other ideas. This is something that the field is working through now. And it’s a very exciting time still.
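The contrast between a rendered map image and a polyline map can be made concrete with a small sketch. This is not the VectorNet implementation; it only illustrates the idea of storing each map element as an ordered set of short vectors. The six-column feature layout, the `polyline_to_vectors` helper, and the type ids are assumptions chosen for illustration.

```python
import numpy as np

def polyline_to_vectors(points, polyline_id, type_id):
    """Turn an (N, 2) list of ordered map points into (N-1, 6) vectors.

    Each row is [x_start, y_start, x_end, y_end, type_id, polyline_id].
    This column layout is a hypothetical choice for illustration only.
    """
    points = np.asarray(points, dtype=float)
    starts, ends = points[:-1], points[1:]
    n = len(starts)
    attrs = np.column_stack([np.full(n, type_id), np.full(n, polyline_id)])
    return np.hstack([starts, ends, attrs])

LANE, CROSSWALK = 0, 1  # hypothetical type ids

lane_centerline = [(0.0, 0.0), (5.0, 0.0), (10.0, 1.0), (15.0, 3.0)]
crosswalk_edge = [(12.0, -2.0), (12.0, 4.0)]

# The whole map is a small table of vectors instead of a dense pixel grid.
map_vectors = np.vstack([
    polyline_to_vectors(lane_centerline, polyline_id=0, type_id=LANE),
    polyline_to_vectors(crosswalk_edge, polyline_id=1, type_id=CROSSWALK),
])
print(map_vectors.shape)
```

Four rows describe a lane and a crosswalk edge here, whereas a rasterized image of the same area would need thousands of pixels, most of them empty.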

Limitations of Deep Learning

Alexandr Wang: Yeah. So you mentioned two areas of interesting development, right? One is being able to deal more robustly with rare cases, with the long tail. And the other one is settling on what exactly is the right architecture or set of intermediate representations that should be used within the industry. I wanted to focus on the first one. What do you think are the limitations of deep learning in its current form in achieving the robustness and generalizability that is necessary for truly safe self-driving cars?

Drago Anguelov: Deep learning is a great technology so far. When you specify a loss function and you pick a reasonable architecture search space to optimize that loss function, it works really well at this, right? Typically, what we have done traditionally is specify average losses of some kind, like a loss over many examples, and the neural nets are really good at this. And this is a type of closed-world behavior, which is different from closed-loop, actually. Closed world is … in ImageNet, originally, we had this benchmark where you have a thousand categories and you need to classify between them. At the time we trained the model and it was really good. It was like, “Oh, that’s an amazing model. It can do ImageNet.” And then we start applying it to regular images, and it’s not that good on those images, because it starts seeing things it’s never seen.

Drago Anguelov: And so a little bit of this is at the core of using deep learning robustly in a system. And in our domain, it’s very important to have robust usage. You need, ideally, very large-capacity, powerful neural net models, and I think we’re moving in that direction. But every once in a while, when you go outside the domain of the data it was trained on, it may give you a different answer than what you would think that you want. And it can be very confident in that answer, potentially, if trained naively. I think when you want to build a robust system, in which you can guarantee that certain constraints are met, and you want to put a big neural net at the core of it, that leads to the question, “Well, what’s a reasonable way to do this?” And that’s a very exciting field that is currently developing.

Drago Anguelov: And there are many possible solutions. I think one of the interesting things that probably is part of the answer is having mechanisms in the networks, and there are quite a few of them, that give you a notion of their own confidence in their prediction. And so when the network is telling you it’s not that confident, then you can have a fallback. You can have a more hybrid system, or you can have one that is maybe not quite as general a mapping function, but builds in a lot more inductive bias for the domain and can handle the cases you have not seen as much. Right? And I think we’re working through how to build these types of hybrid systems.
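The fallback idea described above can be sketched in a few lines. This is a minimal illustration, not a description of any production system: it uses the maximum softmax probability as the confidence signal, which is only one of many possible mechanisms and, as noted in the discussion, can itself be overconfident outside the training domain. The `classify_with_fallback` name, the `0.8` threshold, and the `"unknown_object"` fallback label are assumptions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def classify_with_fallback(logits, labels, threshold=0.8):
    """Return (label, source); defer to a fallback when confidence is low.

    The threshold and the fallback label are hypothetical placeholders for
    whatever more conservative, hand-designed handling a hybrid system uses.
    """
    probs = softmax(np.asarray(logits, dtype=float))
    best = int(np.argmax(probs))
    if probs[best] >= threshold:
        return labels[best], "network"
    return "unknown_object", "fallback"

labels = ["vehicle", "pedestrian", "cyclist"]
confident = classify_with_fallback([4.0, 0.5, 0.1], labels)
uncertain = classify_with_fallback([1.0, 0.9, 0.8], labels)
print(confident, uncertain)
```

The first call has one dominant logit, so the network's answer is used; the second is nearly uniform across classes, so the caller receives the conservative fallback instead.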

Alexandr Wang: Yeah, those are super insightful answers. Certainly there is this kind of problem where neural networks are good at solving for average loss, but in the real world, average loss isn’t necessarily what you’re optimizing for. You mentioned a few of the directions that you all, and others you’ve seen in the industry, are exploring for overcoming some of these challenges. What are the directions, in a holistic view, for improving deep learning to close some of these gaps? What are some of your views on the research directions that are most exciting?

Drago Anguelov: I could talk a bit more about the autonomous vehicle domain specifically. We are blessed with having a wealth of very exciting problems to be addressing. I have a few favorite ones, but I would always start with this: at the core of the ability to make fast progress on autonomous vehicles is your ability to specify and optimize a specific goal and objective. And in our case, simulation is a really crucial part of this objective, because you need to ideally observe the system perform, because it’s a robotic system, and the impact of its decisions happens over time. And so with a system like this, you want scalable evaluation of your system performing in realistic circumstances. Now, there are a few ways of doing this. You can do this by enacting a bunch of scenarios like we do at Castle. It’s a 91-acre area or so that we have, where we’ve enacted over 40,000 various scenarios to see how the vehicle stack reacts.

Drago Anguelov: Of course, we have the vehicle out there in the world, driving with safety drivers in the loop. But I think simulation is really core, because that’s the truly scalable medium, where we can safely try a vast variety of scenarios that you might not even want to try out in the real world. And if that world is realistic, in the sense that the behavior of the system there approximates as closely as possible its behavior in the real world, then you have the ability to play out the scenarios you’re interested in at scale. I think that is one of the core problems that unlocks yet bigger penetration of autonomous driving. We can move from one city or two cities or three cities to considering dozens. And so that’s at the core of the problem, and it’s a very exciting problem.

Drago Anguelov: When you talk about difficult problems and areas I’m personally excited about: I’m a bit of an engineer, so one thing I would say is … I keep thinking, as I watch the autonomous driving stack, about “What are the right interfaces, and what are the right representations to power these interfaces through the stack?” And if you’re not directly in the autonomous vehicle space, you can think, “Okay, there is a perception-prediction interface.” And we talked about some of the issues there, like, “What is the right representation for a map or uncertainty in the map, as opposed to just having boxes for a set of agents that you can track, which we can do pretty well?”

Drago Anguelov: “What other things need to be passed on that are helpful in predicting the intent and behavior of the agents in the environment?” Which is one of the hardest cases. Then you move from prediction to planning, right? These two problems are actually tied, and we’re still figuring out the best way to solve them. And I think, quite likely, the solution is a model that is joint prediction and planning in some sense, and I can describe in what sense. You want a plan that takes into account the predictions about the environment, but you want only the predictions that relate to your plan in some sense, and to validate that it’s safe relative to them. So I can say, “Oh, what happens if I turn right here at this intersection?” Right? Then I want predictions: “What would everyone do then?” That’s already tied to my plan. So there’s this beautiful dependence, and I think there are interesting ways to address and solve it.

Drago Anguelov: And last but not least, there is the interesting problem of planning and learning how to plan in an environment. And at the same time, you want to get it validated in simulation. But what is simulation? There are intelligent agents in the simulator that also make their own decisions in response to your decisions.

Drago Anguelov: And so a good simulator potentially can leverage a lot of similar technology, not the same, but similar to what you do for planning. So there are a lot of very interesting questions that, I think, all come in general themes, like, “What is the system design and the framework that allows you to keep scaling this?” Right? I think at Waymo, I would say, we have achieved L4 driving in Chandler. This has been an area where we have given many thousands of rides fully autonomously. And currently our service there, which covers an area the size of San Francisco, is fully autonomous. You get what we call rider-only rides.

Drago Anguelov: And members of the public can and do ride and record videos and post them, right? Waymo has had the muscle to maintain driverless operation in Phoenix, in that area, since 2017 in some capacity, all this time, while evolving our stack, and that is a great muscle that we have, right? And now, yes, we can solve that area. Of course, maybe we would label a lot of data to solve it, and so on, and observe a lot of the scenarios. Now, in the next phase, we want to bring it to many more places, much wider. We want to develop the system such that it learns a lot more from data, with a lot less labeling, right? Leverage as much simulator capability as possible; we tried to build the most advanced simulator and leverage capabilities there, and so on. That’s the direction we’re looking at: scaling.

Behavior Prediction

Alexandr Wang: You mentioned a few areas that you all are diving into or looking into more closely. One of them was behavior prediction and I know we’ll end up talking through all the different things that you just mentioned. But for the problem of behavior prediction, I think a few weeks ago, you actually announced an expansion of the Waymo Open Dataset, beyond perception for the first time. What was the strategy and thinking behind that?

Drago Anguelov: When you think of behavior prediction, it is a core task, and it’s currently a task that is very active. In the research community, there is a lot of progress being made with machine learning for behavior prediction. This is a task that is core, because behavior prediction, at its root, requires a deep understanding of the scene semantics. It requires an understanding of context: “Which traffic lights are on, what are the intersection rules, what are the road signs, what is the construction here telling you?” and tying it to how everyone behaves. And so, it’s essentially imitation learning in its purest sense, with a specific loss function that ultimately you pick, as it relates to your planner. But this is a core area, which is something …

Drago Anguelov: When we released the Open Dataset initially, the Waymo Open Dataset has been incredibly successful. We had a great set of challenges last year, with over 150 participants. It has been cited by a lot of really strong papers, which is I think, the thing I most look forward to, they use this and they developed really great models, usually in 3D perception or detection and tracking.

Drago Anguelov: I think one thing we realized when we made that dataset is that, even at its scale, which is close to 2000 segments in four cities, for behavior prediction purposes this is tiny. And I think, if you do the math in the back of your head, you start understanding why. Well, a lot of the sensor data in the Open Dataset is at 10 hertz, and you maybe have 50 to a hundred agents at least. And the 20-second sequence now has 200 full Lidar spins and shots of the camera with a hundred objects. That’s a lot of examples, so you’re in the, potentially, many tens to hundreds of thousands of examples, just from a sequence. When you look at a specific behavior interaction that happens, it can happen just once, between two agents, in the whole sequence.

Drago Anguelov: And so if you want to build these models that understand behavior, you need dramatically more data, and you also want data that has been mined for interesting interactions in the first place. And so what we did is, we said, “Okay, well, how do we even provide all this data to the research community? Well, let’s give them a processed representation of the scenes with the map.” And we’ve processed it with our research version of offboard perception, as we call it, which is very high-quality models that you can apply on the Waymo Open Dataset, for example, and get very accurate tracks of all the objects in the scene. So we applied that model to 100,000 scenes, and that gave you bounding boxes and tracks for all the objects, super accurate, moving in space-time.

Drago Anguelov: Now that’s a great behavior prediction dataset, because for a lot of these interesting maneuvers like cut-ins, or people negotiating in an intersection, or a bicycle weaving between cars, we have examples for all of that, so now we can study it. And compared to other datasets, we made the benchmarks even longer-term prediction. Some past datasets maybe do benchmarks with three to five seconds into the future. We made our benchmarks for predictions out to eight seconds into the future, and we made the metrics more stringent and demanding in ways we think better reflect the demands on behavior prediction.

Drago Anguelov: So we moved away from … there are common metrics called miss rate or minADE, which is, if you have, say, six trajectory guesses, at least one needs to be close to the true one. Right? From those types of metrics we have moved to something that is more stringent, and we also moved to a specific interactions challenge, where models have the ability not just to predict for each agent what they’re going to do independently, but to predict joint futures for groups of agents, and so we have one task on that. And so I think this dataset actually is highly exciting. It’s a very rare type of data that so far has not been available as much in the community. And in the dataset, we explicitly mark the interactions; we made this a focus. There are a lot of very interesting scenarios, and Waymo has dramatically more data behind the scenes.

Drago Anguelov: But out of that data we caught some interesting examples. And so that’s what we were sharing with the community. I think it makes for a ton of very interesting research. And honestly, this research can be beyond the initial, just individual prediction or shared prediction of a couple of agents. There’s more benchmarks you can do that are very exciting. And so we will be doing this as we go, like our whole intent with Waymo Open Dataset has been, it’s a living dataset. And as we learn and engage with the community, we keep releasing new features and challenges there. And this is the next step in this evolution, but it’s not the last. And we got a lot of positive feedback. And that feedback also helps us kind of, “well, we need to gear up on it internally and organize an effort.” And a lot of people on the team in Waymo research have put a very significant amount of effort into making this dataset and sharing with the community. And yeah, I’m personally very happy to be able to announce those new challenges and see how the world does with them.

Alexandr Wang: One interesting challenge of behavior prediction versus perception is that in perception, the algorithms really achieve near-perfect performance, especially as the field has developed over the past many years. In behavior prediction, there’s this intrinsic uncertainty in the responses, right? Humans can do different things, agents can do different things, and there’s not necessarily a fully right answer, or there’s at least a distribution of outcomes. How do you think about what the limit is, or what the goal in terms of performance for behavior prediction should be, and how do you think about measuring that?

Drago Anguelov: That’s a great question. And I’ll take a step back and observe something, right? What is the output of perception? Well, the output of perception is a representation that is helpful to do behavior prediction and planning on. What is the output of behavior prediction? Well, it’s a representation of the world, or of the behavior of others, that is helpful to plan with. So inherently, in some sense, your true value and definition and metric for behavior prediction goes through your planner. Now, when you define a challenge, you want to abstract yourself a step away from that, right? And there are different representations you could do behavior prediction on. I think the field so far has settled on this kind of multiple trajectories, maybe with location uncertainty on them, if one desires. I think that’s not the only choice. You can do what are known as occupancy grids or maps, which people are considering as well; that is a non-parametric representation, as opposed to the parametric one, which is trajectories with confidences and maybe location uncertainty on the various time positions along the trajectory.

Drago Anguelov: Now, the advantage of these trajectories is that it’s a very compact representation, so you can represent very long-term interactions and behaviors very succinctly, and that allows your planner then to check that ultimately you’re not going places where others are going to be, and not inconveniencing them. It’s a very rich representation in a very compact frame. Now, when you have trajectories, of course, what is a good property to have? Well, you don’t want to have too many, and they need to be in the right places, and so you want a metric that reflects that. Personally, I believe a lot of the current metrics are well correlated with progress in the field, but they don’t quite measure what they should be measuring. And I’ll give you an example: there is a metric called minADE, and there is miss rate.

Drago Anguelov: And both of these metrics, they’re quite good. They say you’re allowed six trajectory guesses, and then we penalize you on the closest one; the closest needs to be close, say within two meters, and then it’s happy. I think there’s something else happening. This is actually a weak requirement, and you can do stronger, because you actually don’t want to produce more trajectories than necessary, if possible, as long as one gets it right. You want diversity in your trajectories, you want to cover the modes, but you don’t want to spray them around; there should be a penalty. And so you want to modify the metric, potentially, and that’s what I personally believe, to have more of a penalty. I mean, ultimately, again, I would say that the final measure of a good BP system is the planning metrics. But if you abstract yourself from that, I think the one that we’re proposing is closer to the truth, even though we’re providing the other ones still, because you want to have continuity; people need to understand how they do on the metrics they’re familiar with already, right?

Drago Anguelov: But we’re trying to introduce a new one that’s a little more stringent. And that’s inspired by object detection, actually; it’s not that original. The difference from object detection is that in behavior prediction, you can never predict everybody perfectly with a single guess, because all the agents in the environment are very naturally multimodal, right? You cannot be sure what the pedestrian is going to do when they are standing at the intersection. They could potentially do two or three things very easily, right? And so we needed a representation and some metrics that can handle this multimodality well. And I think the other thing you need to do is, you don’t want to just go for the average performance.

Drago Anguelov: You say, “Oh, on average, agents need to be predicted well by the trajectories.” The problem is, on average, mostly everyone’s moving boringly, right? I mean, we’re mostly driving straight at constant speed, and we are walking straight at constant speed. That’s great, but for safety, you want to capture the rare behaviors that actually affect you. And you should not just average over everything; you need to be smarter than that. Typically, people deal with this by bucketing into types of behaviors and making sure that you do well for all types of behaviors, not just on average across instances of things. So that’s something else we have put in.
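The metrics discussed above can be written down compactly. The sketch below is a simplified illustration, not the official benchmark definition: minADE is the lowest average displacement error among the allowed guesses, a "miss" here means no guess ends within a fixed two meters of the true final position, and the per-bucket report follows the idea of scoring each behavior type separately instead of averaging over all agents. The function names, the fixed threshold, and the bucket labels are assumptions.

```python
import numpy as np

def min_ade(pred, gt):
    """pred: (K, T, 2) candidate trajectories; gt: (T, 2) true trajectory."""
    dists = np.linalg.norm(pred - gt[None], axis=-1)  # (K, T) point errors
    return float(dists.mean(axis=1).min())            # best of the K guesses

def is_miss(pred, gt, threshold_m=2.0):
    """True if no candidate ends within threshold_m of the true endpoint."""
    final_err = np.linalg.norm(pred[:, -1] - gt[-1], axis=-1)  # (K,)
    return bool(final_err.min() > threshold_m)

def bucketed_miss_rate(examples):
    """examples: iterable of (bucket, pred, gt); returns miss rate per bucket."""
    misses = {}
    for bucket, pred, gt in examples:
        misses.setdefault(bucket, []).append(is_miss(pred, gt))
    return {b: float(np.mean(m)) for b, m in misses.items()}

# Two guesses: one slightly offset from straight-line motion, one mirrored.
gt_straight = np.array([[1, 0], [2, 0], [3, 0], [4, 0]], dtype=float)
gt_turn = np.array([[1, 0], [2, 1], [3, 2], [4, 3]], dtype=float)
pred = np.stack([gt_straight + [0, 0.5], gt_straight * [-1, 1]])

straight_ade = min_ade(pred, gt_straight)
rates = bucketed_miss_rate([("straight", pred, gt_straight),
                            ("turn", pred, gt_turn)])
print(straight_ade, rates)
```

In this toy case the overall miss rate is 50%, which hides that every turning example is missed. Reporting by bucket makes the failure on the rarer behavior visible.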

Simulation Systems

Alexandr Wang: I mean, I think you’re touching on a very deep and embedded issue of machine learning and AI for the future, which is how do we ensure that there’s maximum alignment between what we ultimately care about for the system in the real world and the metrics that we assign to it. Very relatedly, I wanted to actually dig in … we spoke a bunch about simulation a little bit ago, and you spoke about its importance for being able to build scalable self-driving systems. What do you think are the current limitations of simulation systems and the exciting vectors of improvement and research?

Drago Anguelov: Ultimately, at the core of every simulation is that you want to play out scenarios that you care about and check the performance of your whole system in them. Right? And you want, of course, what you play to be realistic, and furthermore, you want to pick the right scenarios to play out, because most of the time it’s boring, right? And so, starting from a high level, “What makes a simulation environment realistic?” There are two factors. I think the first is very clear to everybody that gets involved with it: you want sensor realism, right? So if you move around the environment, you want your camera and Lidar and radar, depending on what sensors you have in your vehicle, to be simulated accurately, such that you can then apply your perception system and the right outputs come out. Right?

Drago Anguelov: And I think that’s one useful notion of simulation realism, so sensor realism. There is another one that is underappreciated, which people have not been talking about until recently, which is behavior realism, and this leads to agents. Ultimately, the big challenge of driving is, how do you navigate and share the world with humans? That involves the pedestrians, and bicyclists, and all the vehicles that people are driving, and that’s a very complex negotiation and interaction that happens. And unfortunately, humans are far from perfect and they’re far from deterministic. And we want our simulator to invent scenarios of the type that you see in the real world that you need to deal with. If you look at accidents, even for the vehicles case, 94% of the accidents happen because there is suboptimal human judgment, right? And you want to be able to replay these accidents and see how you would do, because it’s not just that you should do well in ordinary scenarios. Of course you shouldn’t cause accidents yourself. That’s great, right? Then you need to follow the rules and give people enough space and generally be safe.

Drago Anguelov: It also helps very much that you mitigate the mistakes of others. A good driver needs to be able to also do that in a reasonable measure. Actually, something we just released a couple of weeks ago is a paper where we reconstruct, using our simulation environment, a set of accidents that happened in Chandler, in the area where we drive. And we put our vehicle in the shoes of the drivers that were part of the accidents, and we could show that it can significantly mitigate the vast majority of them. Right? And this is one example of this.

Drago Anguelov: But those same agents are core to your simulation realism, because behavior is one of the core things you need to be ultimately dealing with. Now, beyond just this fact, I would say a simulator is something else, right? I mean, the main thing is, you get some assurance and safety guarantees; it’s part of your safety strategy. And by the way, Waymo is one of the very few companies that have put out a safety strategy. It’s a multi-pronged strategy, it’s a complex space, and we use a whole bunch of techniques, with simulation and replay of scenarios being part of them.

Drago Anguelov: I think when you’re doing this task, you also multiply the amount of experience you have, and I’ll explain what that means; this is a key multiplier that you can have. So you capture a set of scenarios, and you can’t just replay them naively, because if you start doing different things, then the logged agents start doing things that completely don’t make sense. But if you can code the agents, you can start playing completely different endings to scenarios than what you saw. That is essentially a data multiplier; you can get 10 to a hundred different types of outcomes, because now you give the agents different goals and they start executing those goals. And you test yourself in a variety of ways that you did not before, right?

Drago Anguelov: So when you talk about efficiency, that’s your core multiplier, and something that we at Waymo Research have been investing in for years now. I think we have some of the most sophisticated simulated agent frameworks, setups, and agents. We gave a brief talk last year at the NeurIPS conference on some of the work, but I think that’s generally an area that is increasingly key for autonomous driving. Hence, I went on so long. And I’m excited about it. It’s a great area.

Alexandr Wang: Thank you so much for joining us for our inaugural Scale Transform conference, Drago. It’s so great to hear from somebody who’s really at the cutting edge, working on some of the hardest problems facing the self-driving industry, and who has such clear ideas about how AI needs to improve. Thank you again.

Drago Anguelov: Thank you. It’s a pleasure to be a part of your first inaugural conference. Thanks for having me.
