
Embedding Synthetic Assets to Train AI Models

Posted Oct 06, 2021 | Views 2K
# TransformX 2021
# Breakout Session
SPEAKER
Dr. Jonathan Laserson
Head of AI Research @ Datagen Technologies

Dr. Jonathan Laserson is the Head of AI Research at Datagen Technologies, and an early adopter of deep learning algorithms. He did his bachelor's studies at the Israel Institute of Technology, and has a PhD from the Computer Science AI lab at Stanford University. After a few years at Google, he ventured into the startup world, and has been involved in many practical applications of ML and deep learning. Most recently, at Zebra Medical Vision, he led the development of two FDA-approved clinical products for the detection of breast cancer in mammography images, and pneumothorax in x-ray images. Today he is interested in the space where graphics meets AI, for the sake of generating photorealistic synthetic images.

SUMMARY

Dr. Jonathan Laserson, Head of AI Research at Datagen Technologies, is an expert in the field of photorealistic synthetic images. He shares how Neural Radiance Fields (NeRF) can be used to generate a nearly infinite number of synthetic assets to train AI models. Dr. Laserson also explains how synthetic objects can be represented in a latent space where features can be perturbed to modify the shape and texture of each asset. Join this session to learn how NeRF can accelerate the curation of data for your computer vision use cases.

TRANSCRIPT

Nika Carlson (00:16): Next up, we're excited to welcome Dr. Jonathan Laserson. Dr. Jonathan Laserson is the Head of AI Research at Datagen Technologies, and an early adopter of deep learning algorithms. After a few years at Google, he ventured into the startup world and has been involved in many practical applications of ML and deep learning. He led the development of two FDA-approved clinical products for the detection of breast cancer and collapsed lungs in medical imaging. Today, he works on generating photorealistic synthetic images. He did his bachelor's studies at the Israel Institute of Technology and has a PhD from the Computer Science AI lab at Stanford University. Dr. Laserson, over to you.

Jonathan Laserson (01:05): Hi, guys. My name is Jonathan Laserson. I'm very excited to be here at the Scale AI conference, and I'm super excited to be with so many Stanford colleagues, especially my PhD advisor, who is also speaking later in this conference. You should go see her talk as well. Again, I'll introduce myself: my name is Jonathan and I'm the head AI researcher at Datagen Technologies. Before I start going into the talk, let me give you this short quiz. You can see here a movie. Two movies, actually, of two different chairs. One of the movies is real and the other one is synthetic. Let's see if you can tell which one is the real chair and which one is the synthetic one.

Jonathan Laserson (02:08): Maybe you noticed a little bit of flickering artifacts over here, and with the AC socket you get a hint. The synthetic chair is actually the one on the left, and this is the real one, but I hope you can agree with me that the synthetic chair here, the synthetic video, looks highly photorealistic. This is exactly what we're trying to do at Datagen. Datagen is a pretty young company, three years old, with offices in Tel Aviv and in New York City. We are in the business of generating synthetic data for the purpose of training AI algorithms on that data. We specialize in people and people interacting with environments, especially custom user environments. For example, these in-cabin images of people driving in cars are a custom, user-requested environment. Also this robot right here.

Jonathan Laserson (03:10): Maybe someone wants to train a robot to clean and fold all the clothes in the house. That robot needs to navigate in rooms that look like this. This is a synthetic image of a very, very messy room. Of course, creating a real dataset of this level of messy rooms is going to take forever, so creating this synthetically could help speed up the process by quite a lot. If you create this synthetically, you have complete control of the scene; you can even take the same scene from a different angle, or perhaps from a different lighting perspective. Of course, you can't do this with real data so easily. Control is one thing and the labels are another. With synthetic data, you can get really high quality labels at a pixel level.

Jonathan Laserson (04:09): This is, for example, a depth map. We have a normal map and, of course, the object segmentation map. This is accurate to the level of a single pixel. This is not just hard to do on real data, it is actually impossible to achieve if you're trying to get this level of labels on real data. Control and labels are definitely important, but the really important thing in synthetic datasets is the variety. We have to generate datasets that have a huge variety of things that appear in them. For example, here you're seeing examples of indoor spaces that were automatically generated by algorithms. These spaces are filled with assets. Assets are things like all the furniture, like chairs, tables, the bed, every item that we put inside this indoor space to make it what it is. That's an asset. We need these assets to have the highest possible variance, so that the models that our users train on our data would be able to learn as much as possible from it.

Jonathan Laserson (05:28): In order to do that, we collect a huge library of assets. We have a library of over one million assets, and each one of these assets is a really high quality structure in terms of the way it looks. The library is constantly maintained, and we keep on adding more assets to it. These assets are generated by artists. Here are more examples of them. We're going to talk a lot about chairs in this lecture; here's a bunch of the chairs that we fill our spaces with. These assets are generated by 3D artists using dedicated software. When they generate these assets, they're using their tools, and usually it's based on this mesh structure. This mesh is something that represents the shape of the object, and this here is the texture of the object. This is like a gift wrap that you wrap around the mesh, and this gives you the asset on the left.

Jonathan Laserson (06:32): Each one of our assets is generated by different artists and they don't have the same structure. They don't have the same topology. This mesh might have a different number of vertices than this mesh; they're not the same structure at all. Also, when our 3D artists generate a mesh, they don't just generate a single shape, they do it in a smart way. They add some parameters that allow them to use the shape of the asset that they generated as a prototype, so they can maybe make the chair a little wider or change the color. When they encode the shape, they actually encode a family of assets, not just a single one. That's one way to get to over one million assets. Okay. Having so many assets actually has some issues, because when the artists generate these assets, they don't necessarily write down every single attribute that a user or an engineer might care about.

Jonathan Laserson (07:34): For example, we might want to see only chairs that swivel in the scenes that we're generating. How do we know if a chair is a swivel chair or not? Unless the 3D artists explicitly wrote it down in the metadata for the assets that they generated, there's no way for us to know, except opening the Blender file, that's the software that they use to generate these assets, and actually looking at the shape and tagging it. This is, of course, a laborious task that we don't want to do. Another question might be to find a chair without armrests, or to know how many legs a chair has. This is something that we might not even know in advance, and even if we did know in advance, there would be so many issues in synchronizing all the different artists.

Jonathan Laserson (08:24): Of course, every time you want to add another attribute, you have to do this all over again for all the assets that you already have. This is, of course, not scalable. What we would like to do is to convert all our assets, to put them all in the same space. What do I mean by that? I want to assign a vector to every one of our assets, and it doesn't have to be all of them, it could be all the assets from a specific category. Let's say chairs or, in general, things that you sit on. Maybe one half of this vector would encode the shape of the asset and the second part would encode the appearance of the asset: the material, the color, the texture, that part. When we encode all the assets as vectors, then we don't need to open these Blender files anymore, and we can actually compute all our attributes based on the vectors, which is a much easier task, because we can train SVMs or linear classifiers on top of these vectors. This is a much easier task to do than opening Blender files with meshes of different topologies.
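
As a concrete illustration of that last point, here is a minimal sketch of training a linear classifier on top of asset vectors to tag an attribute such as "swivel chair". All names, dimensions, and data below are hypothetical stand-ins, not Datagen's actual pipeline.

```python
# A minimal sketch: assume each asset already has a 64-d latent vector and a
# small set of manually tagged examples (e.g. "swivel chair" yes/no).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: 1,000 asset vectors (64-d) and binary "swivel" tags for 200 of them.
asset_vectors = rng.normal(size=(1000, 64))
tagged_idx = rng.choice(1000, size=200, replace=False)
tagged_labels = rng.integers(0, 2, size=200)

# Train a linear classifier on the small tagged subset...
clf = LogisticRegression(max_iter=1000)
clf.fit(asset_vectors[tagged_idx], tagged_labels)

# ...and propagate the attribute to the full (potentially million-asset) library.
is_swivel = clf.predict(asset_vectors)
```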

Jonathan Laserson (09:32): We want to put all our assets in the same space, and how do we know that these vectors actually capture all the necessary information from the 3D assets? We need to be able to generate something like this animation on the right. If we're able to generate this animation on the right, basically render the 3D object from any angle that we want, then the vector probably has all the information that is required to describe the 3D asset. This is our goal: to replace the mesh representation with this vector that can encode the 3D structure of the object.

Jonathan Laserson (10:10): In order to do this, I'm going to build on a very important breakthrough that happened in computer vision and graphics in the last one and a half years, and that's NeRF, the neural radiance fields work. In the neural radiance fields work, they did something very similar. The problem that they had to solve was this: there's a 3D object and you want to know the structure of this 3D object, you want to have a 3D model of it. All you can do is take photographs of this 3D object from different directions. As you're seeing in this dome here, maybe you are taking 40, or 80, different photographs of this drum set over there. The goal is to be able to create a 3D structure such that you can take a picture of the structure from any place that you want, so that you can create an animation like the one on the right.

Jonathan Laserson (11:05): The animation on the right shows you a view of the drum set from angles that weren't in the original photographs that were taken of the drum set. Okay. This can only be shown like this if the model that they use for the 3D structure was able to actually encode the 3D structure of the object. The 3D representation used by NeRF was radical in the way they did it: it's actually a neural network, and not a very complicated neural network. A multi-layer [inaudible 00:11:37] with nine fully connected layers was pretty much it. This neural network works in a completely different way than how the mesh representation did. The neural network gets as input a single point in 3D space. Three numbers. It also gets a view direction, which I'm not going to talk about in this talk, but given this point in 3D space, the network tells us two things about this point.

Jonathan Laserson (12:04): First, it gives us sigma, the output density. That basically tells us whether there's an object there or not. If the point is in thin air, there's nothing there, meaning sigma is going to be close to zero. If the point is on the object or inside the object, then sigma is expected to be something closer to one. The other thing that it will tell us about the point in space is its color, the RGB coordinates of the color. Basically, five numbers go in and four numbers go out. Notice what this representation doesn't tell us. It doesn't tell us anything about surfaces, it doesn't tell us anything about normals, it doesn't tell us anything about the number of objects in the scene, or the class of the object in the scene. If we wanted to know these things, we'd have to probe this network and figure out where the boundaries are by aggregating queries over sigma.
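
To make the in/out interface concrete, here is a minimal, hedged sketch of such a network in PyTorch: a 3D point and a view direction go in, a density sigma and an RGB color come out. The layer sizes and the use of a full 3D direction vector are illustrative assumptions, not the exact NeRF architecture.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Illustrative sketch of the NeRF idea: a point (x, y, z) and a view
    direction go in, a density sigma and an RGB color come out. The real
    model uses positional encoding and more layers."""
    def __init__(self, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)        # density: is something there?
        self.color_head = nn.Sequential(              # color also sees the view direction
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, xyz, view_dir):
        h = self.trunk(xyz)
        sigma = torch.relu(self.sigma_head(h))        # non-negative density
        rgb = self.color_head(torch.cat([h, view_dir], dim=-1))
        return sigma, rgb
```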

Jonathan Laserson (12:57): That's not going to be a trivial way to encode the objects, but it is a really good way to allow for rendering objects. What is rendering? Rendering is the ability to take a 3D object, like the one here, and present it in a 2D view, as if you're taking a virtual photograph. How do you render a virtual object? Let's say this is your camera and this is a pixel in the image that you want to render. In order to render it, all you need to do is figure out the color of that pixel. The way it works is that you send a ray from the camera, through the pixel, onwards towards the apple, and then you sample points along this ray. What's going to determine the color of this pixel is exactly the two things that the NeRF neural network is returning.

Jonathan Laserson (13:45): Whether there's an object here and, in general, the density of the object, meaning its transparency, whether you can see through it or not, and then the color. These two points are not going to affect the color that this pixel is going to have, but maybe this one, the one on the apple, is going to have a very strong effect on what this color here is going to be. Of course, if this object were transparent and you could see through it, then the points behind it would also influence this pixel. This is exactly what the NeRF neural network provides. It gives the network a way to return values so that you can render these rays and render these virtual images. How do we make the virtual images rendered by the NeRF network similar to the real photographs?
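
Here is a hedged sketch of that ray-marching idea: sample points along the ray, query a NeRF-style model for density and color at each point, and composite the colors so that empty-space samples contribute little and on-object samples dominate. The near/far bounds, sample count, and the `model` signature are assumptions carried over from the sketch above.

```python
import torch

def render_ray(model, ray_origin, ray_dir, near=0.5, far=3.0, n_samples=64):
    """Minimal volume-rendering sketch: sample points along the ray, query the
    model for (sigma, rgb) at each point, and alpha-composite the colors."""
    t = torch.linspace(near, far, n_samples)                  # depths along the ray
    points = ray_origin + t[:, None] * ray_dir                # (n_samples, 3)
    dirs = ray_dir.expand(n_samples, 3)
    sigma, rgb = model(points, dirs)                          # (n, 1), (n, 3)

    delta = t[1] - t[0]                                       # spacing between samples
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * delta)       # opacity of each segment
    # Transmittance: how much light survives up to each sample.
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0
    )
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)                # final pixel color
```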

Jonathan Laserson (14:37): This is where we use the neural network training mechanism. At the beginning of the process, the neural network would render garbage images. In order to make these images be of a concrete object, we need to make sure that they look like the real photos when we compare them to the real photos. We render this pixel here using the rendering process I showed you before and we compare it to the real pixel in the real photo that was taken from the same angle. We want to make these pixels' colors as close as possible. The difference between the rendered pixel and the real pixel is going to propagate back through the network, so that the next time the network generates this pixel in the rendered image, it's going to be much closer to the real photo.

Jonathan Laserson (15:26): Of course, I'm not going to do this just for a single photo. The way it's going to work is that I'm going to do it for all these 40 or 80 photos, because only then will you have enough information to capture the 3D structure of the object. If you have two different images from different locations and you send rays in order to determine this pixel here and this pixel here, notice that both these rays require the network to answer what the color at this XYZ point over here is. The XYZ point needs to return an answer that is consistent both with this image and with that image. This is what makes everything work, and when the network converges, the rendered images look just like the original images. And when we get there, we can actually look at the network and figure out what the 3D structure is.
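
A hedged sketch of that optimization loop, reusing the `render_ray` sketch above: render a batch of rays, compare each rendered pixel color to the corresponding pixel in the real photograph taken from the same camera, and backpropagate the difference. The batch layout and optimizer choice are assumptions, not the original training recipe.

```python
import torch

def training_step(model, optimizer, rays_o, rays_d, target_pixels):
    """One training-step sketch: render a batch of rays (using the render_ray
    sketch above), compare to the real pixels, and backpropagate the error."""
    rendered = torch.stack(
        [render_ray(model, o, d) for o, d in zip(rays_o, rays_d)]
    )
    loss = ((rendered - target_pixels) ** 2).mean()   # photometric (pixel color) loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```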

Jonathan Laserson (16:17): Okay? So in our work, we're actually going to take it a little bit further. We don't want the network to encode just a single object in the weights of the network. We want it to encode a family of objects. In this example, we wanted to encode not just a single apple, but basically every apple that we have in our collection. By the way, these great slides are borrowed from a Facebook AI team; they used them in their tutorials for NeRF, and I appreciate their willingness to let me show these slides. So we want to extend the network to represent not just the single apple, but to represent the whole family. In order to do that, we add another input to the neural network, which is the code that we are going to associate with each apple.

Jonathan Laserson (17:17): This code is the only thing that's going to differentiate one apple from the next, right? The network needs to tell us what's happening at X, Y, Z, but how is it going to know which apple it's going to be 00:17:30? It has to figure it out, because it has to assign a code for each individual apple. The code is not given by us, it's learned by the network in the same way the parameters of the network are learned. Okay? It's similar to Word2vec, if you think about it. We might call it assets2vec. In Word2vec, the vector of every word is learned such that the network can use that vector to predict the words around it. Here, it's the same thing. The network is actually the one filling in the content of each code.
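
A minimal sketch of this "assets2vec" idea under the usual auto-decoder setup: one learnable code per asset, looked up by asset ID and optimized jointly with the network weights. The dimensions, the stand-in decoder, and the asset ID below are hypothetical, not Datagen's implementation.

```python
import torch
import torch.nn as nn

# One learnable code per asset (apple, chair, ...), trained jointly with the network.
n_assets, code_dim = 1000, 64
asset_codes = nn.Embedding(n_assets, code_dim)
decoder = nn.Linear(3 + code_dim, 4)            # stand-in for the conditional NeRF MLP

optimizer = torch.optim.Adam(
    list(decoder.parameters()) + list(asset_codes.parameters()), lr=5e-4
)

# During training, the code of the asset being rendered is looked up and concatenated
# with the 3D point; gradients from the photometric loss flow into both the network
# weights and the code itself, so the network "fills in" the content of each code.
asset_id = torch.tensor([42])
xyz = torch.rand(1, 3)
code = asset_codes(asset_id)                            # (1, 64) learned latent code
sigma_rgb = decoder(torch.cat([xyz, code], dim=-1))     # (1, 4): density + color
```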

Jonathan Laserson (18:06): We're allocating space for the network to fill in the code that it's going to put for every apple, so that it can render the images of that apple accurately. Each apple is going to get its own code and this is our goal. Our goal is not to generate new assets or generate assets realistically, that's not our goal here. Our goal here is to be able to assign a latent code to every one of our assets. Zooming in on the network, the network has eight fully connected layers and, at the end, two branches. One branch is giving the sigma output, the density, and the other one is the color output 00:18:48, which is just three numbers representing the color. As I said before, we add the latent code to the input. This is the 3D location, by the way.

Jonathan Laserson (18:57): This gamma is just a positional encoding of these three numbers. It's not a learned function, it's a deterministic function that is commonly used. What we're doing here is an idea that we actually took from a paper that was published at a recent CVPR: we're breaking the code into two. If the code was originally 64 elements, we're giving half of it here, in the input to the entire network, and the other half we're giving here. What's special about this other half is that it is only accessible to the color branch. The sigma only depends on this part, the one that was given here. The network has a very strong incentive to decompose the information that it is going to save in the latent code. Everything that goes in the first half is going to affect sigma; everything in the second half is only going to affect the colors. Hopefully, this would give us a natural decomposition of shape and texture.
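
A hedged sketch of that split, assuming a 64-dimensional code broken into two 32-dimensional halves: the shape half conditions the trunk (and therefore sigma), while the texture half only enters the color branch, so it can change appearance but not geometry. Layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class DisentangledNeRF(nn.Module):
    """Sketch of the shape/texture split described above. The shape half
    conditions the whole trunk and therefore the density sigma; the texture
    half is only visible to the color branch."""
    def __init__(self, code_dim=64, hidden=256):
        super().__init__()
        half = code_dim // 2
        self.trunk = nn.Sequential(
            nn.Linear(3 + half, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)                    # depends on shape code only
        self.color_head = nn.Sequential(
            nn.Linear(hidden + half, hidden // 2), nn.ReLU(),     # texture code enters here
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, xyz, shape_code, texture_code):
        h = self.trunk(torch.cat([xyz, shape_code], dim=-1))
        sigma = torch.relu(self.sigma_head(h))
        rgb = self.color_head(torch.cat([h, texture_code], dim=-1))
        return sigma, rgb
```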

Jonathan Laserson (20:08): Now it's time to see how the results look on our own assets. In order to do this, we need to get virtual photographs of our assets. Remember, our assets are in mesh format. We have synthetic images, so we have complete freedom to take any image that we want of our synthetic assets. For every one of our assets, we're going to take something like 18 images from three different levels, looking into the object. These are, for example, the virtual photographs that we're taking of one asset. They are rendered with PyTorch3D, a tool that we use to do this type of work on graphics inside PyTorch. We need to do this for every one of our assets. These animations are basically produced from the original meshes.
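
For readers unfamiliar with PyTorch3D, here is a hedged sketch of rendering 18 such "virtual photographs" of one mesh from three elevation levels. The file name, camera distance, image size, and lighting are illustrative assumptions, not Datagen's actual settings.

```python
import torch
from pytorch3d.io import load_objs_as_meshes
from pytorch3d.renderer import (
    look_at_view_transform, FoVPerspectiveCameras, RasterizationSettings,
    MeshRenderer, MeshRasterizer, SoftPhongShader, PointLights,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
mesh = load_objs_as_meshes(["chair_0001.obj"], device=device)   # hypothetical asset file

images = []
for elev in (10.0, 30.0, 50.0):                 # three elevation "levels"
    for azim in range(0, 360, 60):              # six azimuths per level -> 18 views total
        R, T = look_at_view_transform(dist=2.5, elev=elev, azim=azim)
        cameras = FoVPerspectiveCameras(device=device, R=R, T=T)
        renderer = MeshRenderer(
            rasterizer=MeshRasterizer(
                cameras=cameras,
                raster_settings=RasterizationSettings(image_size=256),
            ),
            shader=SoftPhongShader(
                device=device, cameras=cameras,
                lights=PointLights(device=device, location=[[0.0, 2.0, 2.0]]),
            ),
        )
        images.append(renderer(mesh))           # one virtual photograph per camera pose
```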

Jonathan Laserson (21:00): These are not the reconstructions that NeRF did. Now we're going to see the reconstructions that NeRF did. This is a single neural network that is building this animation. The only thing that makes this animation different from that animation is that the network, the NeRF that was trained over 200,000 iterations, a few hours on a single GPU, was given a different code for this shape than for that shape. NeRF gave these two shapes two different codes, and you can see that the quality of these shapes is not like the original images. We can't use them as-is, but we can definitely tell the difference, both in shape and in color, between all these different chairs.

Jonathan Laserson (21:49): This is the same for the class of tables. These are, again, all reconstructed from the same neural network that encoded all these tables. The only thing that tells the neural network that one table is different from another is the code that it gets as input. I mentioned earlier that we decomposed the latent code into a texture code and a shape code. On the diagonal here, you're seeing five of our original assets. These are assets that actually were in the library of assets that we trained our model on. Every chair that you're seeing off the diagonal is a chair that didn't exist in the original collection. How did we generate these chairs? Simply by splitting and merging the codes from these five chairs. The chair in row i and column j takes the shape vector of chair i and the texture vector of chair j. We can only do this because we decomposed the two halves of the latent space into the texture part and the shape part, and this gives us a natural style transfer.
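
A tiny sketch of how such a grid could be assembled from learned codes: cell (i, j) pairs the shape half of asset i with the texture half of asset j, and each pair would then be rendered with the disentangled model sketched above. The codes here are random stand-ins, not real learned codes.

```python
import torch

# Stand-ins for the five assets' learned shape and texture codes.
shape_codes = torch.randn(5, 32)
texture_codes = torch.randn(5, 32)

# Cell (i, j) of the grid: shape of chair i + texture of chair j.
# Off-diagonal cells are chairs that never existed in the library.
grid = [[(shape_codes[i], texture_codes[j]) for j in range(5)] for i in range(5)]
# Each (shape_code, texture_code) pair would be fed to the disentangled NeRF to render it.
```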

Jonathan Laserson (23:18): We can explore our shape latent dimensions by doing PCA. We can see that some of the chairs become longer or shorter, or that having armrests versus having padded armrests makes a chair more like a sofa, or less like a sofa. The same principal component analysis can be done on the color space. Here we're seeing not only effects of coloring, but also effects of lighting. For example, we can see that the shading is also included in this texture latent space.
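
A hedged sketch of that exploration: fit PCA on the shape codes and walk along a principal direction, rendering each perturbed code to see what that direction controls (length, armrests, and so on). The codes below are random stand-ins.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in shape codes; in practice these would be the learned 32-d codes.
shape_codes = np.random.default_rng(0).normal(size=(1000, 32))
pca = PCA(n_components=4).fit(shape_codes)

mean_code = shape_codes.mean(axis=0)
direction = pca.components_[0]                  # first principal direction
for step in (-2.0, -1.0, 0.0, 1.0, 2.0):
    perturbed = mean_code + step * direction    # feed each perturbed code to the NeRF to render it
```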

Jonathan Laserson (24:00): What does the space actually look like? This is a t-SNE plot 00:24:03 of the shape space. These are 32 dimensions compressed into two dimensions. We colored the classes of the chairs with these different colors. You have armchairs, office chairs, dining chairs, and sofas. They are clearly clustered, but even within a cluster you can see, for example, that we have these two types of armchairs. One of them has wooden armrests, and the other one has extended armrests. You can see very clearly that if our user wanted chairs with an armrest, it would be really easy to find them by sampling points from this space. It's going to be really easy to train a classifier that will figure out whether a vector would have a wooden armrest or a padded armrest.
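
A minimal sketch of producing such a 2D map, assuming the shape codes and class labels are available (the data below are stand-ins): compress the 32-d codes with t-SNE and color the points by chair class.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
shape_codes = rng.normal(size=(1000, 32))          # stand-in learned shape codes
classes = rng.integers(0, 4, size=1000)            # armchair / office / dining / sofa labels

embedding_2d = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(shape_codes)
# embedding_2d is (1000, 2); plot it colored by `classes` to see the clusters.
```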

Jonathan Laserson (24:50): Here you're seeing two types of sofas. They're completely different, even though they're all originally from the same class. They're completely different sofas, and they're naturally clustered separately. Even here, in the office chairs, you're seeing two types of chairs. One of them has a one-part seat, the other a two-part seat, and you can see that there is a clear difference between them. We can also use the latent space to do similarity search. If we are interested in chairs that look like this chair here, and want to replace it with a similar shape, we can do the search by finding assets that are nearby in the shape space. These are the assets that are the closest to the asset on the left. The same thing works in the texture space, if you want to replace a chair with a chair that has a similar style.
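
A hedged sketch of that similarity search: build a nearest-neighbor index over the shape codes and query it with the code of the chair to replace. The codes are random stand-ins here; the same idea works on the texture codes for style search.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

shape_codes = np.random.default_rng(0).normal(size=(1000, 32))   # stand-in codes
index = NearestNeighbors(n_neighbors=6).fit(shape_codes)

query = shape_codes[17:18]                     # the chair we want to replace
_, neighbor_ids = index.kneighbors(query)      # its 5 nearest look-alikes (plus itself)
```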

Jonathan Laserson (25:52): To summarize, this work showed us how we can put all our assets in the same space and treat them as vectors, so that we can gain insights on our assets not by opening up very heavy Blender files and looking at them manually, but by running operations on top of these vectors: by training simple classifiers on top of these vectors and then applying them to over a million assets. This representation is not yet ready to generate new assets, but hopefully, as the field progresses, we're going to actually be able to generate assets just by using these codes. Even before we do that, we still have a lot of uses for this type of cataloging of our assets. We can, of course, add tags and meta-labels, we can search by example and find similar assets, and we can even identify sparse regions in our shape space and ask our artists to generate more assets in these sparse regions.

Jonathan Laserson (27:00): Also, if we look at an image of a chair that was taken in the real world, we can find the latent vector that is closest to that picture, and that will allow us to provide an asset that is the closest to the one we took the picture of. One of our goals at Datagen is to actually help our users build better models. If the user's model runs on our assets, then we can use this latent representation of these assets to figure out what's in common to all the images and assets that the user's model fails on 00:27:41. This can give us great insight into the user's weak spots. This would have huge value for the user, if we can figure out the types of chairs that the user's algorithm doesn't work well on. Of course, because we generate synthetic data, we can generate exactly the chairs that are needed to make the user's model better once they train on them.
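
A hedged sketch of that first step, fitting a real photo into the latent space: freeze the trained decoder and optimize only a latent code so the rendered output matches the photo. The toy linear decoder and image size below are stand-ins for the real NeRF and renderer.

```python
import torch

torch.manual_seed(0)
frozen_decoder = torch.nn.Linear(64, 3 * 16 * 16)   # stand-in for the frozen, trained NeRF
frozen_decoder.requires_grad_(False)
real_photo = torch.rand(3 * 16 * 16)                # stand-in for the real-world picture

code = torch.zeros(64, requires_grad=True)          # the latent code we search for
optimizer = torch.optim.Adam([code], lr=1e-2)

for _ in range(200):
    rendered = torch.sigmoid(frozen_decoder(code))
    loss = ((rendered - real_photo) ** 2).mean()    # photometric loss against the photo
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# `code` now points at the region of the latent space closest to the photographed chair,
# which can then be matched to the nearest library asset.
```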

Jonathan Laserson (28:07): As I said before, in the future you can imagine that we will be able to build our entire synthetic data pipeline using only these latent codes. We'll have a latent code for the object, for the shape, for the texture, for the pose of the camera, for the pose of the object itself. We'll have a latent code for the background, and we can even do videos and add a latent code for the dynamics of the scene. These things are not science fiction, they're already being done today, and you can actually generate a short video just by changing the latent code of the objects in the video.

Jonathan Laserson (28:47): If we take it a little bit further, once we're able to do that, the objects that we generate are going to be generated in a differentiable way. When we do that, we can actually take the data generator and the user's model and train them together. Think about being able to train the user's model and, at the same time, generate the examples that would benefit the training of the user's model the most. This is the future that we're seeing at Datagen Technologies, and I think that the community is going to go there. We're very excited to make our users' models better by incorporating this way of generating data with their trained models. I hope you enjoyed the talk, thank you very much, and enjoy the rest of the conference.

