The Challenges of Full-Stack AV Development With Jesse Levinson of Zoox
Jesse Levinson co-founded Zoox in 2014 and is the company's Chief Technical Officer (CTO). He graduated summa cum laude from Princeton University and completed his Computer Science Ph.D. and Postdoc under Sebastian Thrun at Stanford University, where he developed algorithms for the $1M-winning entry in the 2007 DARPA Urban Challenge and went on to lead the self-driving car team for five years. Levinson also co-created Pro HDR, the first HDR app for smartphones, which has been purchased by more than a million people since 2009.
Jesse Levinson joins Scale AI’s Head of Nucleus, Russell Kaplan, for a fireside chat to discuss the challenges of bringing a completely autonomous vehicle stack (vehicle platform + sensors) from the lab to real-world streets for autonomous ride-sharing vehicles. Together Jesse and Russell explore the biggest obstacles that need to be solved when making robot taxis. How do you train a model to generalize for the unknown? How can a model respond to new challenges safely or plan a new response in real-time? What are the limitations in perception for different sensor types? As you train a model to learn edge cases and scenarios, how do you prevent regression in other areas? Jesse shares how Zoox uses simulation to take advantage of real-world data to perform realistic testing at scale. Join this session to hear Zoox's insights on taking an entire end-to-end technology stack from experimentation to reality.
Russell Kaplan (00:22): For the next speaker, I'm excited to welcome Jesse Levinson. Jesse co-founded Zoox in 2014 and is the company's chief technical officer. He graduated summa cum laude from Princeton University and completed his computer science PhD at Stanford, where he developed algorithms for the one million dollar winning entry in the 2007 DARPA Urban Challenge, and went on to lead the self-driving car team for five years. Jesse, it's great to have you. Welcome.
Jesse Levinson (00:47): Sure. Well, great to be here. Thanks for having me on the program.
Russell Kaplan (00:51): Jesse, it's so great to have you. I wanted to start off by hearing a bit more, if you could share, about the journey that led you to founding Zoox.
Jesse Levinson (00:59): So I've been working on artificial intelligence for a couple of decades now, which feels like forever. I've always been pretty inspired by trying to get computers to do things that people are good at that computers aren't good at, particularly in applications that could be genuinely useful to society. And it's hard to think of a use case that would be more impactful and more beneficial than removing the need for humans to have to drive cars. When you look at the time that's wasted, the number of lives that are lost, the environmental impact of all that, needing to own your own car, times 2.2 cars per family, it's pretty profound.
Jesse Levinson (01:37): So this is just something that I've been very passionate about. And I spent a bunch of time during my PhD and postdoc getting to work on and run the small self-driving car team at Stanford. And then I met my co-founder Tim in 2014 and we decided to start this crazy company. At the time, it seemed crazy because we really wanted to rethink the way that people moved around cities, not just from a technology perspective, but how could technology enable a product and a business model that was just very different from anything people were doing before.
Jesse Levinson (02:12): And we got really excited about this idea that if you could unlock the true power of autonomous mobility, not only could you free people from having to drive their own cars, but even more powerfully, you could remove the need for people to own their own cars, at least at the kind of scale you see today in our society. And it's that shared fleet of autonomous electric vehicles, now known as robotaxis, that's truly transformative for transportation in cities, and that's what we've been working on at Zoox since 2014.
Russell Kaplan (02:44): It's a super inspiring vision. And you've made, as you mentioned, so much progress on this for a really long time. I want to go back to that time during your PhD. It seemed like there was a time where at some point in the AV industry, if you talked to any of the other autonomous vehicle startups and asked them, "How are you solving localization?" they would say, "Well, did you read Jesse Levinson's PhD thesis?" And so I wanted to dig in on that with you. What exactly is the problem you worked on for that thesis, and how did you solve this AV localization issue?
Jesse Levinson (03:15): I think that's far too kind. I actually worked on a number of different localization approaches during my PhD. One of them was actually in the context of the DARPA Urban Challenge. So this was the first time anybody tried to get cars driving autonomously with other moving vehicles in a city-like environment. And that one was actually really tough because we weren't able to pre-map the course. So we got a vector graph based description of where the roads were and where the lanes might be. And then we had to drive that live without ever getting to practice it.
Jesse Levinson (03:47): So what we did there was we came up with an approach that was probabilistic that basically tried to match the sensor data. So we'd look for, for example, curbs and lane markings using LIDAR reflectivity. And we tried to probabilistically match that with where we would expect to see those things based on the very coarse map that we got from DARPA and basically just built this probabilistic filter that gave the planner a lateral distribution over where we were relative to the map. And that actually worked surprisingly well. I think we were the only team that always managed to stay in the middle of the lane and we never veered outside the lane.
Jesse Levinson (04:24): And what was cool about it was, it was really simple. I mean, the entire algorithm was I think under 1,000 lines of code. Many of the other teams at the DARPA Urban Challenge had 10 or 100 times more complicated algorithms. So we were proud of how simple that was. But for urban driving, you need to do better than that. You really want high precision. And so what I worked on at Stanford was basically ways of building maps of the environment using 2D or 3D features. And then basically again, probabilistically comparing the sensor data you're seeing in real time with what you'd expect to see in the map and giving a probability distribution over where you might be.
Jesse Levinson (05:03): So instead of saying, "I'm definitely here" or "I'm definitely there," because almost nothing in life is totally deterministic, if you instead can encode a probability distribution over where you are and say, "Look, I'm probably here. I'm almost certainly not here. I might be over here," then that distribution is really useful for the rest of the pipeline. And so that's a lot of what I worked on at Stanford, but at the time people weren't really using LIDAR for that at all. LIDAR was a pretty new technology, especially multi-beam scanning LIDARs. And so the idea of, for example, making reflectivity maps of the environment and then localizing against them, that was a new concept 5,000 years ago when I used to be a PhD student.
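To make the idea of a lateral probability distribution concrete, here is a minimal sketch of a one-dimensional histogram filter. It is an illustration only, not the Stanford or Zoox code: the candidate offsets, the noise values, the single lane-marking measurement, and the function names `predict` and `update` are all assumptions chosen for the example.

```python
import math

def normalize(belief):
    total = sum(belief)
    return [b / total for b in belief]

def predict(belief, spread=0.1):
    """Motion update: blur the belief to reflect possible lateral drift."""
    n = len(belief)
    out = [0.0] * n
    for i, b in enumerate(belief):
        out[i] += b * (1 - 2 * spread)
        out[max(i - 1, 0)] += b * spread       # probability mass stays on the grid
        out[min(i + 1, n - 1)] += b * spread
    return out

def update(belief, offsets, measured_dist, marking_offset, sigma=0.2):
    """Measurement update: weight each candidate offset by how well it
    explains the measured distance to a lane marking listed in the map."""
    weighted = []
    for b, x in zip(belief, offsets):
        expected = marking_offset - x          # distance we would expect to see
        err = measured_dist - expected
        weighted.append(b * math.exp(-0.5 * (err / sigma) ** 2))
    return normalize(weighted)

# Candidate lateral offsets from lane center, in meters (hypothetical grid).
offsets = [-1.0 + 0.25 * i for i in range(9)]
belief = normalize([1.0] * len(offsets))       # start with no idea where we are

# Hypothetical map: a lane marking 1.75 m to the left of lane center.
# Hypothetical noisy measurements of the distance to that marking.
for z in (1.55, 1.45, 1.50):
    belief = update(predict(belief), offsets, z, marking_offset=1.75)

best = offsets[belief.index(max(belief))]
print(f"most likely lateral offset: {best} m")
for x, b in zip(offsets, belief):
    print(f"{x:+.2f} m  p={b:.3f}")
```

The output of the filter is the whole `belief` list rather than a single number, which is the point made above: the planner receives "probably here, almost certainly not there" instead of one deterministic position.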
Russell Kaplan (05:45): It's really interesting to hear about the evolution of that approach from your DARPA days to your PhD days. Well, what about now? As you've built Zoox, what have you had to change in terms of your approach to developing an autonomous vehicle as the company has grown and scaled?
Jesse Levinson (06:02): It's so fascinating because on the one hand you look back at the Stanford days and you're like, "Well, we kind of had all the pieces." We had a planner and we had some perception and we could do some prediction and we had mapping and localization and calibration and we even had a little bit of simulation. But the scale and the fidelity and the quality and the reliability and the number of scenarios we need to handle is just so vastly different in real life in cities to deploy something commercially and say, this is safer than humans compared to doing research and some little experiments here and there in school. So it was awesome what we got to do in school. It was an incredible experience, but the caliber and scope of what we have to do now is just very, very different.
Jesse Levinson (06:43): And so that manifests itself in multiple ways. So some of them are having multiple modalities that you can use to, again, ensure that no one modality's failure is going to bring down your system. So talking about mapping and localization, how do you use not only LIDAR, but also camera and radar to build multiple layers of your map so that in case one modality struggles, you still have something to go on. It also means, for example, not overly relying on your maps.
Jesse Levinson (07:09): If you build a system that's too heavily map based and it can't handle when the world changes, you're not going to be in good shape either from a safety perspective or from a completing the mission perspective and so on for really all the components of the system. It's just a profoundly large research and engineering problem to do this at scale and to deploy it in cities, even though conceptually, a lot of the ideas did arise from academia 10, 15 years ago.
Russell Kaplan (07:36): It's really interesting to see how that approach has scaled and the changes you've had to make. What would you say among those, what are the single biggest obstacles to making a robotaxi operational?
Jesse Levinson (07:51): I think the single biggest obstacle is that there isn't a single biggest obstacle. And what I mean by that is it's all of the little things. A robot is only as good as the worst thing about it. So if there were one super hard thing and everything else was easy, then ironically, that would be easy because then we'd just put all of our smart people on that one hard thing and then we'd probably solve it and we'd be done. There isn't any one thing where we're like, "Oh my God, I have no idea how to solve that." Every single problem that we've uncovered or every single environmental feature... We have good ideas on how to handle that. Nothing has stumped us.
Jesse Levinson (08:27): But again, when you talk about superhuman-level safety, and people are pretty good drivers, especially when they're paying attention, you just really have to get the details right. And you have to balance all these different things. And so the hard thing is that you can't take any shortcuts, and until everything works incredibly well, you don't have a product you can put out on public roads.
Russell Kaplan (08:46): I want to dive a bit deeper into that. All these long tail challenges that you're referring to, are there any that have maybe stood out in the course of Zoox's development that were particularly surprising? Edge cases that you couldn't have possibly forecast?
Jesse Levinson (09:00): I don't think we've seen anything that was, like, completely unforeseeable. I would say that some things that seemed like they'd be really hard ended up not being quite as hard. So one of the things that we used to think was going to be, "Oh my God, this is so hard," is this idea of doubly parked vehicles, where you're driving around and then all of a sudden there's a FedEx delivery truck or somebody's just gotten out of their car to drop somebody off and you have to use your contextual semantic scene awareness and realize this is not the car you want to wait for. You need to go around them. And going around them often requires going into oncoming traffic, sometimes even crossing a double yellow line.
Jesse Levinson (09:35): And before we tackled that problem, we were like, "Oh my God, that is so hard. I don't know. How do you get that right almost all the time?" It turned out to be not that hard of a problem because you have so many contextual clues from the environment and from all of your sensors that you can actually do that probably better than humans. And for the actual mechanics of maneuvering around the vehicle, because we have sensors on all four corners of the vehicle, we can also see around objects better than people can. So that's an example of something that sounded really daunting, but when we got into it, it really wasn't super, super hard.
Russell Kaplan (10:08): So is that an example of a scenario that you have a clear implementation plan for that you've been able to solve? Because I want to tie it back to your earlier comments on dealing with the inherent uncertainty of the world. And so there are many cases where there's a doubly parked car and so there's a behavior that will work to get around them. Is that a behavior that now you've gone and defined and it's generalizing to a lot of those scenarios or are there cases of triple parked cars or rows of doubly parked cars that confound the base implementation and maybe require further refinement?
Jesse Levinson (10:40): That's a great question. It's also one of the reasons why as you build these systems, you want to build them in as generic and extensible ways as possible. So as you might imagine, the first time we ever tried to work on doubly parked vehicles, we started with a pretty naive implementation that would work only with one vehicle, and it created a particular type of path that went around it and it would work 80, 90% of the time. And we were like, "Hey, this is really cool. Sometimes we can go around a DPV." But again, you quickly realize that doesn't really scale. And then you're like, "Well maybe we can enumerate all the possible things and all the possible configurations." But then you're like, "Well, but then we need a double-doubly parked vehicle module and then a triple one. And then what if there's oncoming?"
Jesse Levinson (11:19): You quickly realize that you just can't enumerate that combinatorial explosion of possibilities. And so you just need a smarter, more extensible way of encoding these things and having the planner get around them. And so one of the concepts there that's really important is this idea of we call it unstructured motion. So most of the time you're driving, you're following a lane and maybe you're changing lanes, but there's a limited set of things that you're doing, but you also want the ability to do unstructured motion, which is to say instead of having to predefine these 17 types of trajectories or these exact configurations, we want to be able to just take a generic category of thing.
Jesse Levinson (11:57): For example, the road is blocked and the normal "follow your lane and wait for the thing in front of you" is not the right thing to do. Can you build an unstructured motion model that can do arbitrary path planning in the scene, still trying to follow the rules of the road but exploring the space using tree search? And those are the types of algorithms that, again, are extensible. And sometimes people say, "Well, how are you going to code for every possible edge case?" Of course, you can't do that. If you write your planner so there are 83,000 if statements, you're never going to be able to finish it, and good luck debugging the thing.
Jesse Levinson (12:34): On the other extreme, people are like, "Oh, why don't you just give it a lot of training data. Do some end-to-end learning, and then you'll have this awesome AI driver." I think that's also really challenging because (A) I don't think these end-to-end algorithms are quite there yet, and (B) in the other sense, those are also really hard to debug because then if your vehicle doesn't do what you're hoping it'll do, you're like, "I don't know why the hell that happened. Do I look at this neural network weight somewhere deep in the network?"
Jesse Levinson (13:00): That's pretty opaque too, and there are some ways of trying to introspect those things, but we think that, again, that balance, that hybrid approach where you use machine learning to inform the search but you still have some rules and some basic parameters that you can tune to govern the behavior of the vehicle, makes a lot of sense, and we think that can take you really far.
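As a rough illustration of that hybrid idea (a generic search over the scene, governed by a few tunable rule costs, with a slot where a learned score could inform the search), here is a toy sketch. It is not Zoox's planner: the two-lane grid, the uniform-cost search, the `ONCOMING_PENALTY` value, and the stub `learned_cost` function are all assumptions made for the example.

```python
import heapq

# Toy road: row 0 is the oncoming lane, row 1 is our lane; columns are
# positions along the road. All values here are hypothetical.
ONCOMING_PENALTY = 5.0   # tunable rule: the oncoming lane is costly, not forbidden

def learned_cost(cell):
    """Stand-in for a learned model that scores how undesirable a cell is.
    It returns 0 here; a real system would query a trained network."""
    return 0.0

def step_cost(cell):
    row, _ = cell
    rule_cost = ONCOMING_PENALTY if row == 0 else 0.0
    return 1.0 + rule_cost + learned_cost(cell)

def plan(start, goal, blocked, n_cols=7):
    """Uniform-cost search: no scenario-specific branches, only costs."""
    frontier = [(0.0, start, [start])]
    settled = set()
    while frontier:
        cost, cell, path = heapq.heappop(frontier)
        if cell == goal:
            return cost, path
        if cell in settled:
            continue
        settled.add(cell)
        row, col = cell
        # Move forward in the same lane, or forward while switching lanes.
        for nxt in ((row, col + 1), (1 - row, col + 1)):
            if nxt[1] < n_cols and nxt not in blocked:
                heapq.heappush(frontier, (cost + step_cost(nxt), nxt, path + [nxt]))
    return None

# A doubly parked vehicle occupies our lane at column 3.
cost, path = plan((1, 0), (1, 6), blocked={(1, 3)})
print(cost, path)

# With nothing blocking the lane, the same code simply follows the lane.
clear_cost, clear_path = plan((1, 0), (1, 6), blocked=set())
print(clear_cost, clear_path)
```

The same `plan` function handles one blocking vehicle, several, or none, because the behavior comes from costs rather than from a branch per scenario. Changing `ONCOMING_PENALTY` is the kind of basic, inspectable parameter tuning described above.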
Russell Kaplan (13:22): Super interesting. Yeah. What a tough balance to strike of making it configurable enough that you can react to issues you're seeing in the field, but not have to actually enumerate these edge cases with that tree-based approach. Yeah. Makes a lot of sense. Do you see when you expand to new locations that those techniques, you realize you have to adjust them in some way, or what are some of the challenges you've encountered as you've grown and expanded to more locations?
Jesse Levinson (13:47): One of the most fun parts of our job is when we go to a new geo-fence because we get to see how does everything we built so far generalize and scale to something we haven't seen before. Your question is very relevant to what we were just talking about, because again, if you built a system that was overly constrained based on what it had seen, and by the way, that can be true for heuristic based systems or machine learning systems. You can also have machine learning systems that you take them to a new city and they're like, "I don't know, I've never seen trees like this." God only knows what's going to happen, but that's not a good situation either, and that's yet another reason why we take this balanced approach. And what we found is very encouraging and also, honestly, not that surprising.
Jesse Levinson (14:28): What we found is that when we go into new parts of a city or even brand new cities, as long as the topology is something that we've seen that type of thing before, even if it's otherwise fairly different or the scenery is a bit different, we do great. We really just, even on our first try, we can drive with no interventions. And so, as an example, the very first time we ever tried driving in Las Vegas, which was way back in 2019 now. Our very first several mile drive, no takeovers were necessary. Which is pretty cool, right? We'd never driven an inch in Las Vegas and with just a map of the city, we were able to drive quite well. That doesn't mean we could drive a billion times with no interventions and there was of course work to do.
Jesse Levinson (15:15): But the key thing again is from that road network and topology perspective, if you've seen those kinds of things before you can just drive. And so the more you've figured out, the easier it is to go into new areas. And really the only time you have to do extra work is if there's a feature that you just haven't encoded, how the system should behave in that situation. So for example, if you haven't handled roundabouts before, then you go build a map with a roundabout. You're going to have to teach the planner what to do in a roundabout situation. But again, the more you've seen, the fewer surprises you'll generally see when you go into new areas or even new cities.
Russell Kaplan (15:52): Makes a lot of sense. So from a novelty of scenario perspective, it's mostly around the planner if there are new planning behaviors that need to exist. Have you encountered challenges on the perception or prediction system in those new environments as well?
Jesse Levinson (16:08): Sometimes we have, less so, but absolutely. So in San Francisco, there's a type of vehicle called a pedicab. And it's basically a person wheeling this little mini vehicle around. It's not really a bike, not really a car. It's not that the system, if it hasn't seen one of them before, is going to just run into the thing. By the way, that's, again, one of the powerful things of having this multi-modality sensor approach. If you build, for example, a vision-only system, and then it sees something it's never seen before, there's a very good chance that it totally won't know how to handle it.
Jesse Levinson (16:44): It might even run into it. It might not even know that it's an obstacle. By having radar and LIDAR, you always know where the obstacles are, even if you don't know semantically what they are. Even if your prediction system isn't very good at predicting their behavior, you at least know there's a thing there. So when we saw those for the first time, it wasn't like we were trying to drive into them, but we didn't do a good job of predicting their behavior. So we didn't interact with them as seamlessly as we eventually did once we taught the system what those things were.
Russell Kaplan (17:11): That's really interesting. So when you expand to San Francisco and you encounter pedicabs, you end up making some changes somewhere in the stack that helps you generalize to that scenario. So my next question is you run through that process 100 times, 100 new geographies. How do you prevent regressions on pedicabs?
Jesse Levinson (17:30): Yeah. Great question. So if you don't really try hard to prevent regressions, you'll get regressions whether you want to or not. Fortunately, there's a cure for that and the cure is lots of simulation and lots of log file tests. So you can build scenarios in simulation where you just say, "Hey, we're going to run these 10,000, 100,000, 1,000,000 scenarios every time we change our software and we're going to get quantitative data on what got better and what, if anything, got worse." But you also do the same for log file tests, and you do the same for metrics. So for example, you build a data set of all the different types of categories you care about, and for a given version of your perception system, you can see what is my precision and recall on all these types of objects, occluded and unoccluded.
Jesse Levinson (18:18): And so we don't just throw those away once you've solved some part of the city, plus we never say we've solved anything. We just try to make it better and better and better. So you can always make a system have even better precision, even better recall, even if it's already superhuman level. And so we never give up, we never say we're done. We just try to keep making it better and better. Fortunately, as we get more data, we're also growing the scope and size of our supercomputer cluster, which means that it doesn't get slower and slower to train and test these networks. If you don't grow that infrastructure, then every time you get more data, it's going to take longer and longer to train and test your network. So you have to balance all of that.
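A minimal sketch of that kind of per-category regression check might look like the following. The label names, the tiny hand-made evaluation sets, the tolerance, and the function names `precision_recall` and `regressions` are hypothetical; a real evaluation would also need to match detections to ground truth and split by occlusion.

```python
from collections import Counter

def precision_recall(pairs):
    """pairs: iterable of (true_label, predicted_label).
    A label of None means 'no object' (a false alarm or a miss)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in pairs:
        if truth is not None and truth == pred:
            tp[truth] += 1
        else:
            if pred is not None:
                fp[pred] += 1
            if truth is not None:
                fn[truth] += 1
    metrics = {}
    for label in set(tp) | set(fp) | set(fn):
        detected = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / detected if detected else 0.0
        recall = tp[label] / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return metrics

def regressions(baseline, candidate, tolerance=0.01):
    """Labels whose precision or recall dropped by more than `tolerance`."""
    flagged = []
    for label, (base_p, base_r) in baseline.items():
        cand_p, cand_r = candidate.get(label, (0.0, 0.0))
        if cand_p < base_p - tolerance or cand_r < base_r - tolerance:
            flagged.append(label)
    return sorted(flagged)

# Hypothetical results of two software versions on the same fixed data set.
old_version = [("pedicab", "pedicab")] * 4 + [("car", "car")] * 5
new_version = ([("pedicab", "pedicab")] * 2 + [("pedicab", "cyclist")] * 2
               + [("car", "car")] * 5)

baseline = precision_recall(old_version)
candidate = precision_recall(new_version)
flagged = regressions(baseline, candidate)
print(flagged)
```

Because the evaluation set is kept rather than thrown away, a category that was fine in an earlier release gets flagged as soon as a later change makes it worse.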
Russell Kaplan (18:57): So you mentioned both using simulation and saved logs of real encounters of your autonomous vehicle systems. I'm curious when you encounter those test cases that require some closed loop evaluation like a planner or you're predicting other agents in the scene. Do you have any approaches where you can use those historical real world logs or is that usually solved by making sure it's represented in your simulator?
Jesse Levinson (19:25): Yeah, that's a really important question. So one of the challenges with log file tests is you're just playing back prerecorded data. And that's great, because it's real data, but the problem is that it's static data. It's data that happened once upon a time. And the problem there is if you change your vehicle behavior, because you've updated your algorithms, then as the vehicle tries to do something new, all that prerecorded log file data is all of a sudden obsolete. Because for example if you start moving a little bit faster in a simulator, but all the other agents around you are based on what happened in real life, in the history, then that breaks. So what we have the ability to do now, and this is very powerful, is we can instantiate a simulation from any moment in time where you basically take a log file and then you say, "Okay, at this timestamp, we want to turn it into a simulation."
Jesse Levinson (20:19): And then all the other agents in the scene become what we call smart agents. So we imbue them with their own kind of mini planners. We can then simulate what all those other smart agents would do. And if we want to, we can even resimulate the sensor data itself. And so that gives us the best of both worlds. You get the chaos and whatever's going on in real life of an actual log file, but you get the closed loop iteration ability that you can only do in simulation.
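The difference between open-loop replay and a simulation forked from a log can be sketched with a toy one-dimensional example. Everything here is an assumption for illustration: the `State` type, the gap-keeping controller used as a "smart agent," the braking policy, and the synthetic log are hypothetical and far simpler than any production simulator.

```python
from dataclasses import dataclass

@dataclass
class State:
    pos: float  # meters along the road
    vel: float  # meters per second

def smart_agent_step(agent, leader, dt, desired_gap=10.0, k_gap=0.5, k_vel=1.5):
    """Hypothetical mini planner: hold a gap behind the leader and match its speed."""
    gap = leader.pos - agent.pos
    accel = k_gap * (gap - desired_gap) + k_vel * (leader.vel - agent.vel)
    vel = max(agent.vel + accel * dt, 0.0)     # this agent never reverses
    return State(agent.pos + vel * dt, vel)

def braking_ego(ego, dt, decel=3.0):
    """New ego behavior under test: brake to a stop (differs from the log)."""
    vel = max(ego.vel - decel * dt, 0.0)
    return State(ego.pos + vel * dt, vel)

def simulate_from_log(log, fork_index, ego_policy, steps, dt=0.1):
    """Take the logged scene at `fork_index` as the initial condition, then run
    closed loop: the follower reacts to the new ego instead of replaying the log."""
    logged_ego, logged_follower = log[fork_index]
    ego = State(logged_ego.pos, logged_ego.vel)
    follower = State(logged_follower.pos, logged_follower.vel)
    trace = []
    for _ in range(steps):
        ego = ego_policy(ego, dt)
        follower = smart_agent_step(follower, ego, dt)
        trace.append((ego, follower))
    return trace

# Synthetic "log": ego and a follower 10 m behind, both cruising at 10 m/s.
log = [(State(10.0 + 1.0 * t, 10.0), State(1.0 * t, 10.0)) for t in range(100)]
fork = 20
trace = simulate_from_log(log, fork, braking_ego, steps=80)

# Closed loop: the smart agent slows down and keeps a positive gap.
min_gap = min(e.pos - f.pos for e, f in trace)

# Open loop: the logged follower keeps cruising and drives through the new ego.
replay_overlap = any(logged_f.pos > e.pos
                     for (_, logged_f), (e, _) in zip(log[fork + 1:], trace))
print(round(min_gap, 2), replay_overlap)
```

In the open-loop comparison the recorded follower passes straight through the braking ego vehicle, which is the "that breaks" case described above. Once the follower is given its own small controller from the fork point onward, it reacts to the new ego behavior.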
Russell Kaplan (20:53): That's really cool. Yeah. That ability to flip on the simulation switch in the middle of real world data sounds extremely powerful. You mentioned the key ingredient here is your computing infrastructure. Yeah, I guess, can you talk more about the investment there and how you've scaled it up so far and what you anticipate that being in the future?
Jesse Levinson (21:14): For sure. One of the most important ingredients to being able to make meaningful progress on this problem is having that computer infrastructure and there's the hardware and IT side of things, but there's also the software infrastructure that you run on top of that because you want to make it really easy for your developers to run giant large scale jobs. And I know you guys are thinking about that a bit at Scale as well.
Jesse Levinson (21:38): That's something that we've been focusing on at Zoox for many, many years, because when you have that much data, when you run that many simulations every time you change your code, that's not something you can do on one developer's personal machine. You want to be able to spin up hundreds or thousands of instances quickly. So what we've done at Zoox is we've built our own supercomputer and we're up to several thousand GPUs right now, which is a pretty serious computer.
Jesse Levinson (22:01): I mean, there are obviously bigger computers out there, but it's no joke. And our developers all get pretty instant access to that. Sometimes during busier times, there's a little bit of a queue, but we've also built some wonderful developer tools to make it as easy as possible to run batch jobs or MapReduce or anything they might want to run, almost at the touch of a button. And then again, that's one of the things that helps us make really rapid progress. We also do cloud bursting.
Jesse Levinson (22:28): If you want to build a supercomputer that's so big that, no matter what you might want to use it for, it has enough capacity, that's really expensive. It's also inefficient. And so what we do is we use a mixture of our own on-prem cluster and AWS, so that if we really need a bunch of extra nodes and we want to try something, do a big experiment, or we just want an answer quickly, we can fire up a whole bunch of machines on the cloud and balance it between the two clusters.
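The cloud-bursting idea can be sketched as a simple placement rule: fill the on-prem cluster first and send whatever does not fit to the cloud. This is only a toy, not Zoox's scheduler; the job names, GPU counts, and the `place_jobs` function are hypothetical, and a real scheduler would also weigh queue time, cost, and data locality.

```python
def place_jobs(jobs, on_prem_free):
    """jobs: list of (name, gpus_needed), in submission order.
    Returns (placement, remaining on-prem GPUs, GPUs requested from the cloud)."""
    placement = {}
    cloud_gpus = 0
    for name, gpus in jobs:
        if gpus <= on_prem_free:
            placement[name] = "on_prem"
            on_prem_free -= gpus
        else:
            placement[name] = "cloud"      # burst: rent capacity only when needed
            cloud_gpus += gpus
    return placement, on_prem_free, cloud_gpus

# Hypothetical workload against 1,000 free on-prem GPUs.
jobs = [("nightly_sim", 600), ("train_perception", 300), ("big_experiment", 500)]
placement, free_left, cloud_gpus = place_jobs(jobs, on_prem_free=1000)
print(placement, free_left, cloud_gpus)
```

The on-prem cluster can then be sized for typical load instead of the rare peak, which is the efficiency argument made above.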
Russell Kaplan (22:57): Really, really interesting. So you mentioned earlier in your answer that part of this investment in these techniques is what helps you continue to make progress on every edge case, as opposed to just squashing one edge case at a time. And so I'm curious, you mentioned you're always improving and clearly there's going to be room always to improve with autonomous vehicles. But how do you know when your system is reliable enough to kind of move into production? Like talk to me about that chasm between experiment and reality.
Jesse Levinson (23:30): Yeah. That's the million dollar question, not just for Zoox, but for the industry: how do you decide when it's safe enough? That's maybe the problem that I personally spent the most time thinking about over the last three or four years. Because again, you can have a bunch of metrics, you can make these ones better. Maybe these ones stay the same. Maybe this one gets a little worse, but when is it good enough? And how do you decide? And so at Zoox, we have several different ways of looking at that. And the answer is it's only good enough when all of those check marks are checked. One way of looking at it is systems engineering. What are the best practices from all the different industries that have contributed to this kind of thinking?
Jesse Levinson (24:13): So automotive industry, aerospace, aviation, medical robotics, they've all solved these types of safety critical problems before in their own way. Nobody's ever done this before though. So we really look at the best insights and best practices from all those different industries. In automotive, you have ISO 26262. So you talk about functional safety. How do you make sure that the hardware you're building, and the architecture and the configuration of that hardware, is robust and that you can detect failures? This is important. You can have some part of your system go wrong. That will always be possible, but what matters is, is that rare, and when it does happen, do you still do something safe? And so there's this whole body of work called functional safety, including things like requirements and traceability and how can we actually guarantee that the thing that we said our system is going to do is what it's actually going to do, because not all of our system is some weird AI neural network thing.
Jesse Levinson (25:06): A lot of it is just good old fashioned engineering. That's not easy stuff. In some ways, it's even harder. But that's a big part of what we do. And so again, by following industry best practices, we're leveraging many, many decades of knowledge and really brilliant ideas from many industries. And we're saying, "Look, we are doing these, and that's the right, responsible thing to do." But that's not sufficient all by itself, because you could check all those boxes and say, "Well, we tried our best." You might still end up with a system that's less safe than humans, even if we tried our best and followed all the industry best practices. So nobody wants that, and we certainly wouldn't put that on the road.
Jesse Levinson (25:47): So the other thing we do is we've built some really sophisticated pipelines and I can't go into too many details here, but basically they take everything we've done. All of our simulations, all of our log tests, everything, and they basically try to estimate how many miles could the system drive between collisions, injuries and fatalities. I think some companies almost don't want to know the answer to that question. It's more like, well, let's hope that we did a good job and we'll see what happens. And in our view, let's go into this [inaudible 00:26:22].
Jesse Levinson (26:21): We know that those numbers aren't going to be infinity, but humans aren't infinity either. And so what we need to do is we need to set a really high bar for ourselves and we've done that. And we haven't yet shared publicly exactly what those bars are, but they're pretty strict. And there's even some more nuance beyond what I just shared, but at a high level we're doing that in literally every single version of our software.
Jesse Levinson (26:46): It's not just the software, it's the configuration of hardware that the software is running on. And the way we deploy it in an operational design domain. That entire thing is run through these frameworks we built and we get those numbers and we look at those, and until those numbers are where we need them to be on top of all the other stuff we've done, we aren't going to put these things on public roads without a driver. And we think that's the most responsible way to tackle this problem. It doesn't make it easy, but it's not supposed to be easy. We have a responsibility as developers before putting safety critical hardware on public roads to absolutely do the best that we can and that's exactly what we're doing.
Russell Kaplan (27:25): So tell me about some of those challenges of adapting those best practices from aerospace, from other industries, with a lot of experience in safety critical systems. How do you adapt those systems to the parts of the stack that use machine learning in an extensive way?
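Zoox's pipelines are not described here, so the following is not their method. It is only a textbook-style sketch of why "miles between events" is hard to establish from driving alone: under a simple Poisson assumption, observing zero events over some number of miles gives an upper confidence bound on the event rate. The 100-million-mile target and the function names are hypothetical.

```python
import math

def rate_upper_bound_zero_events(miles, confidence=0.95):
    """Upper confidence bound on events per mile after `miles` with zero
    events, assuming events follow a Poisson process."""
    return -math.log(1.0 - confidence) / miles

def event_free_miles_needed(target_miles_between_events, confidence=0.95):
    """Event-free miles needed before the bound drops below 1 / target."""
    return -math.log(1.0 - confidence) * target_miles_between_events

# Hypothetical bar: at most one event per 100 million miles.
target = 100e6
needed = event_free_miles_needed(target)
print(f"{needed / 1e6:.0f} million event-free miles at 95% confidence")

# Sanity check: after that many clean miles the bound equals the target rate.
bound = rate_upper_bound_zero_events(needed)
print(f"bound: one event per {1 / bound / 1e6:.0f} million miles")
```

Under these assumptions the required mileage is roughly three times the target interval, and it would have to be re-earned for each software and hardware release. That helps explain why simulation, log tests, and engineering arguments all feed into the estimate rather than road miles alone.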
Jesse Levinson (27:42): Yeah. Great question. So it's fascinating. When you look at, for example aerospace there's not a lot of consideration, at least until very recently, perhaps given to machine learning or neural networks or AI. And even worse than that, there's just not a lot of thought given to chaotic environments or unpredictable agents, which makes sense. If you're launching a rocket, you're not worried about what are the other agents in the scene going to do, because there aren't any. You're not launching a rocket and trying to maneuver around 33 other rockets that are sharing your airspace with you. Thank God.
Jesse Levinson (28:16): So in some ways, launching a rocket is a much easier problem. Now, in other ways there's a lot of really hard things about launching a rocket that I'm happy we don't have to solve, but they're actually pretty different problems. And yet you can still use some of the same methodologies. Same thing in traditional automotive. If you look at ISO 26262 talking about functional safety, there's not much to be said about sensors and probability distributions on sensor failures especially when it comes to what if your computer vision algorithm fails or what if you can't predict what this other agent's going to do?
Jesse Levinson (28:48): So the industry started to respond to that. There's something called SOTIF, Safety Of The Intended Functionality, which tries to get at that other half of the equation. But again at Zoox we're constantly having to combine, taking best practices that the industry's already figured out, but also innovating and creating some of our own because again, nobody solved this problem before. And so the way in which you combine everything the industry's figured out with both being probabilistic about the world and trying to quantify how well neural networks and machine learning behave, that's really tricky.
Jesse Levinson (29:22): One of the things that helps make that a little bit less tricky is the fact that we do have a multi-modality system. If you try to build a system that's just camera only, even if you could get that to work, which I personally don't think is happening for some number of decades, but even if you could, to quantify that it was actually that good would be incredibly difficult because you would either need to drive many, many billions of miles per release or you'd need a simulator that was basically perfect. I don't think either of those things is very feasible even if you have a fleet of lots of vehicles in the world.
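To put a rough number on the "billions of miles per release" point, here is a minimal sketch of a standard statistical argument. It is not Zoox's validation method. It assumes failures arrive as a Poisson process at a constant rate per mile and that the test drive itself sees zero failures; the function name and the target rate are hypothetical.

```python
import math

def miles_needed(max_failure_rate_per_mile, confidence=0.95):
    """Failure-free miles needed to claim the true rate is below the target.

    Assumption: failures follow a Poisson process, so the chance of seeing
    zero failures in m miles is exp(-rate * m). We solve for the m at which
    a system that is exactly at the target rate would survive with
    probability (1 - confidence).
    """
    return -math.log(1.0 - confidence) / max_failure_rate_per_mile

# Hypothetical target: at most one serious failure per 100 million miles.
print(f"{miles_needed(1e-8):.3g} failure-free miles")
```

Under these assumptions the requirement is roughly three divided by the target rate at 95% confidence (the "rule of three"), and it has to be repeated for every release, which is why driving alone scales so poorly as the target gets stricter.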
Jesse Levinson (29:55): But by building a multi-modality system, you can quantify the extent to which your modalities are independent or not. Unfortunately, they're not completely independent, but they're actually quite different. What that means is that if your simulator isn't perfectly accurate, or if you don't have some insane number of miles, you can still give yourself a lot of confidence that not all of your modalities are going to fail at the same time in the same way. And therefore you can still build a safe system.
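The independence argument can be illustrated with a small calculation. This is a sketch under stated assumptions, not a description of Zoox's safety framework: each modality's failure on a given scene is modeled as a yes/no event, and `rho` is a hypothetical correlation coefficient between the two failure events.

```python
import math

def joint_failure_prob(p_a, p_b, rho=0.0):
    """Probability that two modalities fail on the same scene.

    For two yes/no events, P(both) = p_a * p_b + covariance, and the
    covariance equals rho times the product of the standard deviations.
    The result is clamped to the valid range [0, min(p_a, p_b)].
    """
    cov = rho * math.sqrt(p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return max(0.0, min(p_a * p_b + cov, p_a, p_b))

# Illustrative numbers only: each modality misses 1 scene in 10,000.
for rho in (0.0, 0.01, 1.0):
    print(rho, joint_failure_prob(1e-4, 1e-4, rho))
```

With fully independent failures the joint probability is the product of the two rates; with perfectly correlated failures the second modality adds nothing. Even a small correlation dominates the product term, which is why quantifying how independent the modalities really are matters so much.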
Russel Kaplan (30:21): Tell me more about the multi-modality implementation and how you use those, in many cases, complementary sensors to build a complete AV stack. I'm particularly interested in this question of, do you do early fusion or late fusion? And at what point do you take into account the outputs from all these different sensors and combine them into one logical stack? So yeah, how do you think about that?
Jesse Levinson (30:48): Yeah, it's a super interesting question. We think that the answer is both. So all of the above. Of course, there are pros and cons of early and late fusion. So early fusion basically means, okay, we get all of our sensor data and the best thing you can do in terms of being intelligent about it is let's process it all together. Instead of trying to make all the inferences you can about LIDAR and then make all the inferences you can about camera and all the inferences you can about radar and then compare their outputs at the end. By leveraging all that richness of data early on, you can maximally infer what's going on in the world. That is true. And so you don't want to disallow yourself from doing that in my view.
Jesse Levinson (31:29): However, there are risks. There are risks there because even though you might be really damn smart and super insightful, because the algorithms are great, something might go wrong. You might just... Your early fusion algorithm might just occasionally make a really bad mistake. And that could be again, arbitrarily dangerous. Unless you're driving that thing for billions of miles per release, or you have an essentially perfect simulator, you can never be that sure. So what we do is we do some early fusion and we get the best results we can, but we also have safety systems that do more late fusion and they say, "Okay, let's do what we can with LIDAR data. Let's do what we can with radar data. And let's just build a model of where there's stuff around us."
Jesse Levinson (32:17): And that model might not be good enough to drive incredibly smoothly and precisely in every situation by itself, but what it can do is help give us a sanity check that we're not about to run into something. And you don't need that most of the time. It's not doing anything most of the time, but it's able to react really, really fast, because it's simpler and lower latency. You're able to run it on more automotive grade, high integrity hardware. And again, it's a wonderful safety check to make sure that your fancier algorithms running on your bigger machine didn't, for whatever reason, do something wrong. And so we look at that hybrid approach again, as a way of getting the most performance we can, but also giving ourselves a safety net in case something goes wrong.
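A late-fusion sanity check of the kind described here can be sketched in a few lines. This is an illustrative toy, not Zoox's safety system: the `Obstacle` type, the straight-ahead corridor, and the deceleration, latency, and width values are all assumptions. Each sensor reports obstacles on its own, and the checker only asks whether anything from any sensor sits inside the distance the vehicle needs to stop.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Obstacle:
    x: float      # meters ahead of the vehicle
    y: float      # meters left (+) or right (-) of the path centerline
    source: str   # which sensor reported it, e.g. "lidar" or "radar"

def stopping_distance(speed_mps, decel_mps2=6.0, latency_s=0.1):
    """Distance covered during reaction latency plus braking to a stop."""
    return speed_mps * latency_s + speed_mps ** 2 / (2.0 * decel_mps2)

def should_brake(lidar_obstacles, radar_obstacles, speed_mps, half_width_m=1.2):
    """Late fusion: take the union of per-sensor detections.

    A detection from either sensor alone is enough to trigger braking,
    so one modality missing an object does not hide it from the check.
    """
    limit = stopping_distance(speed_mps)
    for ob in list(lidar_obstacles) + list(radar_obstacles):
        if 0.0 <= ob.x <= limit and abs(ob.y) <= half_width_m:
            return True
    return False

# Radar sees something 8 m ahead that LIDAR did not report.
print(should_brake([], [Obstacle(8.0, 0.2, "radar")], speed_mps=10.0))
```

The check is deliberately simple: no classification, no tracking, no learned model. That simplicity is what makes low latency and high-integrity hardware plausible for this layer, at the cost of being too conservative to drive on by itself.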
Russel Kaplan (32:59): Super interesting. I think one approach that's really distinguished Zoox from so many in the industry has been from the beginning, thinking about not just redundancy from multiple sensors, but actually complete control over the entire AV platform. Very famously Zoox started by designing and building cars almost from the ground up and thinking about that from day one. What are the pros and cons of owning the entire mobility platform when you're trying to build a self-driving car?
Jesse Levinson (33:30): Well, the cons are people tell you you're out of your mind for several years. That was fun. But honestly, I think over time, more and more folks in the industry and in general are realizing that it's a really good idea. It's hard. There's no question about that, but it's so powerful to own that entire ecosystem end-to-end. And it's powerful in two ways. The first way is when you have a blank sheet of paper and you can choose the shape of your vehicle, the system architecture, where you're building redundancy. You can build a dramatically safer system having 270-degree camera, radar, and LIDAR on the top four corners of our vehicle. And we've talked about that a lot over the last year. That's a key differentiator in terms of being able to see the environment and see objects behind other objects.
Jesse Levinson (34:21): That's so important in dense urban areas. And you see slowly our competitors trying to get closer and closer to that architecture just because from first principles, it's the best way to perceive the environment. So that's something that we've been doing for years now because we weren't constrained by the limitations and shape of a car early on in Zoox's history. But the other reason why we think it's so important is back to that point about how do we quantify and certify that the system we built is safe?
Jesse Levinson (34:50): If you have to treat your car like a black box and hope that the OEM has done all the right things, then you're missing a big part of that overall safety case. You're missing the ability to get into that hardware and firmware and say, "Look, I understand from a functional safety perspective, exactly how this platform I'm running on works." And what is that interplay between the AI and the lower level systems on the vehicle.
Jesse Levinson (35:19): If you can't do that, if you have to treat something as a black box, it makes the problem much harder. It doesn't mean it's literally impossible. I mean, you can work with a car company and hope that they do that well and that the communication goes back and forth. But as you can imagine, if you want to change something, if you own that vehicle architecture, you just go downstairs, talk to the vehicle team, make a change, test it, validate it. I'm not saying that's easy, but that's something you can do in days or weeks or worst case months. If you're working with a car company, they're great at building cars, but it also takes a long time to make the cars.
Jesse Levinson (35:50): So if you want them to change something and then you want to get into the firmware and figure out all the failure modes and all the safety monitors you need on top of it, that can be quarters at best and realistically it's probably multiple years. And so that's one of the things that I think has helped us move relatively quickly in the industry, even though we haven't been doing this as long or with as many people as some of the other companies have.
Russel Kaplan (36:12): I think that's an interesting segue into a question about end-to-end control of the platform and how that might relate to where machine learning fits in. So you talked about this a little bit earlier in terms of what parts of the system does it make sense for machine learning to play a role and maybe what parts are these solved problems already on one spectrum of more end-to-end learning and on another spectrum of maybe just machine learning in a perception stack. Where is that balance for you and Zoox and how has that shifted over time?
Jesse Levinson (36:46): Sure. We're kind of, I would say, somewhere in the middle there and we're not religious about that. So I do think over time, there is a shift towards replacing more classical systems with more machine-learned versions. I think that will continue for the foreseeable future. But again, I don't think you want to be religious about that. I don't want you to say, "Well, end-to-end is the future. Let's just do that." Okay. You can try that. Good luck. Probably not going to work and even again, if it does, you probably can't quantify that very well. So that doesn't seem appealing, but you also want to take advantage of the incredible breakthroughs in machine learning and the performance that can bring. So again, we don't look at it as you must do this or you must do that.
Jesse Levinson (37:26): We look at each problem and subproblem individually and say, "Hey, what's the best way to solve this?" And sometimes that changes over time. Sometimes we'll take two or three different systems that we had built independently and say, "You know what, let's try combining them together and making that part of it end-to-end." And sometimes that works a lot better. Sometimes it doesn't work better. Sometimes you just have to find out. But again, one of the themes at Zoox is being balanced in almost all the things that we do. And so you won't hear me being an extremist on one side or the other.
Russel Kaplan (37:58): Do you see that basically changing during your path to full production? You mentioned maybe there's different approaches. You'll try different parts of the stack with ML, try different parts of the stack without, and kind of keep an open mind. Where do you envision things by the time Zoox is deployed everywhere?
Jesse Levinson (38:18): I honestly don't think it's going to be that different from what we're doing now, and we're improving a lot of our systems. But I think architecturally, we have something really special. There are still a couple of components that we're rearchitecting and we're going to be able to plug more machine learning into, but I don't think fundamentally you're going to see anything radically different than what we've been doing for the last couple years, because we think we're on the right track. And again, there's plenty of things that we're improving. But we're not saying, "Oh my goodness, we're totally stuck and we're never going to be able to work our way around this unless we just feed it tons of data and it magically figures out what to do." We're not in that regime right now.
Russel Kaplan (38:55): Totally. And do you see that data acquisition strategy continuing to evolve or maybe you could talk a little bit more about the current status quo. How you think about what data that you collect maybe is worth getting labeled or using for supervision for the stack.
Jesse Levinson (39:12): It's fascinating because I think the system architecture you have also is pretty tightly coupled to how much and what kind of data you need. If you're trying to build, for example, an end-to-end vision only system, you do need a massive, massive, massive amount of data. I still think you'll fail for the next decades, but at a minimum you need an absolute shitload of data and then maybe you'll have a chance. And frankly, we don't have a fleet of hundreds of thousands of vehicles. So we don't have the ability to snap our fingers and get all these wonderful edge cases the next day, which is a pretty cool thing to be able to do.
Jesse Levinson (39:54): On the other hand, we don't need to do that because, like we discussed, in our simulation environment we can simulate an arbitrary number of edge cases, and our simulator doesn't have to be absolutely perfect because we have separate sensor modalities. And so with camera, radar and LIDAR combined, you don't have to have seen every possible thing to give yourself really, really high competency. You can still handle it safely if it happens. So in my view, that's a much better way to solve the problem, especially in these early to medium days of autonomy.
Russel Kaplan (40:31): Do you see a path for the data that you do collect in the real world being able to inform updates to the behavior of those simulated agents more directly? Because it sounds like this is a really essential part of the stack.
Jesse Levinson (40:46): Yeah, for sure. I mean, anytime we get new data, we are able to train our models and those are models that do prediction. One of the cool things about prediction is you almost get data for free in the sense that if you want to know what's going to happen in three seconds, you can just rewind three seconds ago and then say, "Well, okay, what happened three seconds later?" So that's really cool. But you can also use that to inform how your smart agents behave. There's really almost no limit to what you can do with that wealth of data. And by the way, even if you don't have a 100,000-car fleet, you still get a massive amount of data with hundreds of vehicles and you still have to be selective about what of that data you keep and you do things with and you care about.
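The "rewind three seconds" idea is easy to show concretely. The following is a minimal sketch, assuming a logged track is a list of (x, y) positions sampled at a fixed rate; the function name, the 10 Hz rate, and the track format are hypothetical. No human labeling is involved: the label for each timestep is simply where the same agent was observed three seconds later in the same log.

```python
def make_prediction_examples(track, horizon_s=3.0, hz=10):
    """Turn one logged track into (observed state, future position) pairs.

    track: list of (x, y) positions sampled at `hz` samples per second.
    Each input is the position at time t; its label is the position the
    log shows at t + horizon_s. The last horizon_s seconds of the track
    produce no examples because their future was never recorded.
    """
    step = int(round(horizon_s * hz))
    return [(track[i], track[i + step]) for i in range(len(track) - step)]

# A 4-second log of an agent moving 1 m per sample along x.
log = [(float(i), 0.0) for i in range(40)]
examples = make_prediction_examples(log)
print(len(examples), examples[0])
```

A real pipeline would feed the model a window of history and scene context rather than a single position, but the supervision signal comes from the log in the same way.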
Jesse Levinson (41:32): I think one of the misconceptions is if you train a car model on 80 million cars, it's going to be way better than if you train on eight million cars and that's just not true. I mean, eight million cars is going to be a lot better than 8,000 cars, but any machine learning algorithm we've ever built asymptotes in performance after a certain point and you don't need insane amounts of data. So really the only thing you get with insane amounts of data is you do get more edge cases, which again, if you don't have good simulation capabilities or if you don't have multi-modality sensor systems, you do need that. But if you have those things, you don't necessarily need that much of that.
Russel Kaplan (42:12): Really interesting. So on that path to full production, you mentioned the approach looks sound, there's just a lot of grinding out the details left to get things deployed. Are there any shortcuts that you've found? Anything that can move up the timeline that's been able to save you a lot of time in getting there?
Jesse Levinson (42:31): Yeah, probably the best shortcut is teleoperation. And the idea there is if you try to solve absolutely everything that could happen with an AI system, you're almost asking yourself to build generalized AI. And just like I said about computer vision taking some number of decades to truly solve, it's probably even harder to just solve general AI. And so our view is you have to build a safe system that doesn't rely on teleoperation.
Jesse Levinson (43:01): You don't want to be relying on that network or the latency or a human paying attention to make sure you don't hit anything, but it's actually okay if once every large number of miles, the vehicle sees a situation where semantically, it's not entirely sure what to do. And a human can take a look at the sensor data and help the vehicle navigate that. And so we've built a lot of really powerful tools to do real time collaboration between the AI on the vehicle and the human in the command center.
Jesse Levinson (43:28): And what's so powerful about that is again, 99 point something percent of the time, the AI doesn't need any help and it's confident about what to do. That means that the ratio of vehicles to humans is a very big number. And as the AI gets more and more competent, that number just continues to increase. But I think it's going to be a very long time until it gets to infinity. Meaning that no matter what could possibly happen in the world, the AI always knows the best thing to do.
Jesse Levinson (43:54): And the great thing is it doesn't have to get to infinity. It can get to 100 and then 1,000 and then 10,000 and someday you might have 1,000,000 of these things running around and maybe you only need 100 people paying attention. But you're competing with that one-to-one, one driver per vehicle, which is a pretty sad state of affairs. And so that I think is probably our best shortcut.
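The vehicles-to-humans ratio follows from simple arithmetic. This is a back-of-the-envelope sketch, not Zoox's staffing model: it assumes each vehicle needs remote help for a fixed fraction of its operating time, that an operator helps one vehicle at a time, and that requests can be spread evenly across operators. The function names and the utilization parameter are hypothetical.

```python
import math

def vehicles_per_operator(help_fraction, operator_utilization=1.0):
    """How many vehicles one operator can cover, on average.

    help_fraction: share of operating time a vehicle needs human help.
    operator_utilization: share of an operator's time spent helping.
    """
    if not 0.0 < help_fraction <= 1.0:
        raise ValueError("help_fraction must be in (0, 1]")
    if not 0.0 < operator_utilization <= 1.0:
        raise ValueError("operator_utilization must be in (0, 1]")
    return operator_utilization / help_fraction

def operators_needed(fleet_size, help_fraction, operator_utilization=1.0):
    """Operators required for a fleet, rounded up to whole people."""
    return math.ceil(fleet_size * help_fraction / operator_utilization)

# Illustrative: a vehicle that needs help 1% of the time.
print(vehicles_per_operator(0.01))
```

Read in reverse, the example in the conversation of 1,000,000 vehicles with 100 people corresponds to a ratio of 10,000 to 1, which under these assumptions means each vehicle would need help about 0.01% of the time.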
Russel Kaplan (44:17): That makes a lot of sense and it's super interesting to hear you lay out the strategy like that. Jesse, thanks so much for taking the time today. I really enjoyed the conversation.
Jesse Levinson (44:24): Likewise, thanks so much for having me.