Since the autonomous driving industry got its start at the 2005 DARPA Grand Challenge, numerous companies have trained and tested machine learning models to control every aspect of autonomous driving. The market leaders are continually updating and improving their ensembles of computer vision and control systems models to build safer, more efficient, and more adaptable autonomous vehicles.
Similarly, the world of pure computer vision has also advanced by leaps and bounds, first with hand-engineered feature detection and matching techniques such as SIFT and SURF, now surpassed by deep neural networks and transformers. Just seven years ago, papers such as DynamicFusion: Reconstruction and Tracking of Non-rigid Scenes in Real-Time (Richard A. Newcombe, Dieter Fox, and Steven M. Seitz, 2015) explained the state of the art of reconstructing 3D surfaces from a sequence of depth-camera frames. But today, as with many other algorithmic challenges, deep neural networks have surpassed the performance of these earlier approaches.
In mid-2020, a few super-sharp UC Berkeley researchers and their collaborators determined that a model can be trained to produce a 3D object from a series of images by taking five coordinates as input: a position (x, y, z) and a viewing direction (\theta, \phi). The network is trained on sequential “radial” frames of a 3D object, that is, views captured from positions around the object, using synthetic data. This is more sophisticated than simply asking a network to “imagine” or synthesize additional frames as the camera moves around a stationary object.
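To make the five-coordinate input concrete, here is a minimal sketch of how a position and a pair of viewing angles might be packed into one network input. The angle convention (\theta as the polar angle from the +z axis, \phi as the azimuth in the x-y plane) and the function names are assumptions made for illustration, not details taken from the article.

```python
import math


def view_direction(theta: float, phi: float) -> tuple:
    """Convert two viewing angles (radians) into a 3D unit vector.

    Assumed convention: theta is the polar angle measured from +z,
    phi is the azimuth measured in the x-y plane from +x.
    """
    return (
        math.sin(theta) * math.cos(phi),
        math.sin(theta) * math.sin(phi),
        math.cos(theta),
    )


def nerf_input(x: float, y: float, z: float, theta: float, phi: float) -> tuple:
    """Pack a position and a viewing direction into one input tuple.

    Hypothetical helper: the two angles are expanded to a unit vector,
    so the five coordinates become six numbers.
    """
    return (x, y, z) + view_direction(theta, phi)


# A point at (0.5, -1.0, 2.0) viewed along the +x axis.
sample = nerf_input(0.5, -1.0, 2.0, math.pi / 2, 0.0)
```

Expanding the two angles into a unit vector avoids the wrap-around discontinuity that raw angles have at 0 and 2\pi, which is one reason a direction vector is a convenient input form.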
With NeRF (short for “neural radiance fields”), a model also predicts the view-angle-dependent radiance (or “brightness”) of each point in the scene. The model is trained to render an image close to the CGI-rendered view of the ground-truth 3D model, and the color difference between the two images is the error on which the model trains.
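The training signal described above can be sketched for a single pixel. The snippet below assumes the model has already produced a density and a color for a handful of sample points along one camera ray, composites them front to back, and compares the result with the ground-truth pixel. The per-sample density values, the spacing between samples, and the function names are illustrative assumptions; a real pipeline would do this for many rays at once on a GPU.

```python
import math


def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray, front to back.

    densities: non-negative density predicted at each sample
    colors:    (r, g, b) radiance predicted at each sample
    deltas:    distance between consecutive samples
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for ch in range(3):
            pixel[ch] += weight * rgb[ch]
        transmittance *= 1.0 - alpha
    return pixel


def photometric_loss(rendered, target):
    """Sum of squared per-channel color differences for one pixel."""
    return sum((r - t) ** 2 for r, t in zip(rendered, target))


# Empty space first, then a dense green surface that hides a blue one.
densities = [0.0, 50.0, 50.0]
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
deltas = [0.1, 0.1, 0.1]

rendered = composite_ray(densities, colors, deltas)
loss = photometric_loss(rendered, (0.0, 1.0, 0.0))
```

In this toy ray the green surface absorbs almost all of the light, so the rendered pixel is nearly pure green and the loss against a green target is close to zero.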
Figure 1: A visual comparison of NeRF versus other 3D reconstruction techniques. Source: Ben Mildenhall, et al., “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis,” 2020
NeRF also focuses a fair bit on efficiency, training only on images that are scored to contribute substantially to the final model. For example, if a large, flat surface obscures fine geometric detail from the camera or viewpoint, images at that and nearby angles can safely be omitted from training.
The biggest victory for NeRF, perhaps, is that its radial analysis eliminates the need for voxel-based representations, which produce models that are impractically large to store.
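A back-of-the-envelope calculation shows why dense voxels get expensive. The grid resolution, channel count, and network shape below are illustrative assumptions (a simplified fully connected network with no input encoding or skip connections), not figures from the article.

```python
def voxel_grid_bytes(resolution: int, channels: int = 4, bytes_per_value: int = 4) -> int:
    """Storage for a dense cubic grid holding `channels` floats per voxel."""
    return resolution ** 3 * channels * bytes_per_value


def mlp_bytes(layer_widths, bytes_per_value: int = 4) -> int:
    """Storage for the weights and biases of a fully connected network."""
    params = sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_widths, layer_widths[1:])
    )
    return params * bytes_per_value


# Assumed sizes: a 512^3 grid of RGB + density, versus a network with
# 6 inputs, 8 hidden layers of 256 units, and 4 outputs.
grid = voxel_grid_bytes(512)
network = mlp_bytes([6] + [256] * 8 + [4])

print(f"voxel grid: {grid / 2**20:.0f} MiB")
print(f"network:    {network / 2**20:.2f} MiB")
```

Under these assumptions the grid needs 2 GiB while the network needs under 2 MiB, and doubling the grid resolution multiplies its cost by eight while the network size stays fixed.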
NeRF appears useful for generating 3D models of monolithic, individual objects in front of the camera, but can it be useful for synthesizing training data for autonomous vehicles? The perception team at Waymo decided it made sense to find out, particularly if generating an entire 3D world with NeRF is simply a matter of training on a larger cluster of compute resources, such as a Google data center.
Enter Block-NeRF, an extension of NeRF that synthesized 3D “blocks” and then “stitched” them together to render an entire San Francisco neighborhood. Researchers created this from roughly 2.8 million individual images, captured on Waymo’s test vehicles (typically Jaguar I-Paces with eight roof-mounted cameras).
A video walkthrough of Block-NeRF synthetic output representing a San Francisco neighborhood. Source: Waymo
The authors point out that autonomous vehicles encounter widely varying physical environments, including moving obstacles and weather. AI has been remarkably effective at identifying and locating pedestrians, emergency vehicles, and many other recognized “objects” of importance to the “ego” vehicle.
Because an autonomous vehicle must reliably handle unusual combinations of these scenarios (perhaps an ambulance driving through snow), it is incredibly important to train the ML models that run on these vehicles on large datasets consisting of both real and synthetic data. But is the synthetic data of high enough accuracy to be useful to the trained algorithm?
It’s a challenging problem to robustly re-create an elaborate 3D environment from simple 2D stored imagery, but breaking down the problem proves to be effective, according to the Waymo researchers who wrote the paper. While voxel-based reconstruction was once seen as a workable approach to the problem of creating large-scale virtual worlds, the storage cost of voxels at that scale proved to be prohibitive.
Thus, the researchers divided the environment into blocks, each represented by its own compact NeRF, rather than storing the whole neighborhood as a single large model.
The paper explains the methodology for matching both images and geometries from adjacent synthetic “scenes,” such that the final rendered output can encompass an entire neighborhood. Camera pose estimation is learned and tracked as part of the model training process, and time-of-day becomes a tunable parameter, as shown in the demo video.
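One simple way to picture the stitching step is to blend the renders of nearby blocks, giving more influence to blocks whose centers are closer to the camera. The sketch below uses inverse-distance weighting for this. The function names, the 2D block centers, and the `power` value are assumptions made for illustration, and the rule shown here should be read as one plausible blending scheme rather than a restatement of the paper's exact procedure.

```python
import math


def blend_weights(camera_xy, block_centers, power: float = 2.0):
    """Normalized inverse-distance weights, one per block.

    Blocks whose centers are closer to the camera get larger weights.
    `power` controls how sharply the influence falls off with distance.
    """
    raw = []
    for center in block_centers:
        distance = max(math.dist(camera_xy, center), 1e-9)  # avoid divide-by-zero
        raw.append(distance ** -power)
    total = sum(raw)
    return [w / total for w in raw]


def blend_colors(colors, weights):
    """Weighted average of the per-block colors for one pixel."""
    return tuple(
        sum(w * color[ch] for w, color in zip(weights, colors))
        for ch in range(3)
    )


# The camera is 1 unit from block A and 3 units from block B.
weights = blend_weights((0.0, 0.0), [(1.0, 0.0), (3.0, 0.0)])
pixel = blend_colors([(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)], weights)
```

Because the weights change smoothly as the camera moves, the output shifts gradually from one block's render to the next instead of switching abruptly at a block boundary.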
As a result, the model can reasonably guess what a specific scene will look like at night, even if the images captured of that scene are daytime photos. Thus, the researchers’ extension to NeRF can perform a time-of-day modification akin to a “style transfer” modification, from day to night or vice versa. This can be incredibly valuable for simulating additional permutations of scenarios that a self-driving car might encounter.
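A tunable time-of-day control can be pictured as moving between two learned appearance codes that are fed to the model alongside the position and viewing direction. The sketch below only shows the interpolation step. The three-element codes and their values are made up for illustration, and the model that would consume the result is not shown.

```python
def lerp_code(day_code, night_code, t: float):
    """Linearly interpolate between two appearance codes.

    t = 0.0 returns the day code, t = 1.0 returns the night code.
    """
    if len(day_code) != len(night_code):
        raise ValueError("appearance codes must have the same length")
    return [(1.0 - t) * d + t * n for d, n in zip(day_code, night_code)]


# Hypothetical learned codes; real ones would come out of training.
day = [0.8, -0.2, 0.1]
night = [-0.6, 0.4, 0.9]

dusk = lerp_code(day, night, 0.5)  # halfway between day and night
```

Sweeping `t` from 0 to 1 while holding the camera fixed would then give a sequence of renders of the same scene under gradually changing lighting.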
Waymo has collected one of the larger autonomous vehicle datasets because it has had multiple generations of vehicles on the road for more than a decade. Thus, the researchers explain, any attempts to synthetically generate an entire neighborhood from common, existing open-source datasets such as KITTI, Cityscapes, and Berkeley DeepDrive are unlikely to yield robust results because they simply don’t contain the same amount of per-frame data that Waymo is currently capturing.
That said, the authors do identify ways to improve on Block-NeRF in the future, most notably by generating worlds with more detail at a distance. They propose that this might be possible at a higher compute cost, as was shown in NeRF++. At the neighborhood level, though, this may be cost-prohibitive or may need to rely on future generations of efficiency-improved hardware and software.