At TransformX, we brought together a community of leaders, visionaries, practitioners, and researchers across industries to explore the shift from research to reality within Artificial Intelligence (AI) and Machine Learning (ML).
In this session, Dr. Li shares how vision is critical for first perceiving the physical world and then interacting with it. She explores how recent advances in AI research help machines perceive the environment around them and then engage with it, to perform both short-horizon and long-horizon tasks.
Dr. Fei-Fei Li is the Sequoia Professor of Computer Science at Stanford University and Denning Co-Director of the Stanford Institute for Human-Centered AI (HAI). Her research includes cognitively inspired AI, machine learning, deep learning, computer vision, and AI in healthcare.
The pace of innovation in AI is accelerating, but there remain fundamental capabilities that machines need to develop further. These are critical to performing tasks that may seem trivial to a human, like folding laundry or clearing the table, yet are beyond what can be easily automated today. While most robots today can perform short-horizon, skills-based tasks, more work needs to be done to train robots that can execute complex, long-horizon tasks.
By understanding the evolutionary origins of human perception and action, we can understand how to ‘evolve’ machines to perform these more complex tasks.
Dr. Li presents vision as key to understanding intelligence and building intelligent machines. Vision is fundamental to how we perceive and interact with the world. So it must also be, for machines. Vision is the ‘cornerstone of intelligence’, according to Dr. Li.
Technically, machines “see” images as binary-encoded numbers. However, making sense of these ones and zeros in order to understand an environment takes a lot of intelligent heavy lifting. Within the field of object recognition, Dr. Li refers to a foundational dataset known as PASCAL VOC, which contains objects in twenty different categories, such as dog or person. However, there are significantly more than twenty object classes in the real world. For machines to identify everything a human can, they need to be trained with data that closely resembles real-world class distributions.
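To make the "images are just numbers" point concrete, here is a minimal sketch in plain Python. The tiny 2x3 image and the helper names `to_bits` and `luminance` are illustrative assumptions, not anything shown in the talk; the brightness weights are the commonly used Rec. 601 coefficients.

```python
# A tiny 2x3 RGB "image": each pixel is three 8-bit intensities (0-255).
# The values are made up for illustration.
image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(0, 0, 0), (128, 128, 128), (255, 255, 255)],
]


def to_bits(value: int) -> str:
    """Return the 8-bit binary encoding of one channel intensity."""
    return format(value, "08b")


def luminance(pixel) -> float:
    """Approximate perceived brightness with the Rec. 601 weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


# What the machine actually stores for the first (pure red) pixel.
first_pixel_bits = [to_bits(channel) for channel in image[0][0]]

# One number per pixel: still no notion of "dog" or "person" anywhere.
brightness = [[round(luminance(p)) for p in row] for row in image]

print(first_pixel_bits)
print(brightness)
```

Nothing in these numbers says what the picture contains; recovering an object label from them is the job of the recognition model.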
It’s not enough to simply see an object. Once we see the world around us, we must reason about what we are seeing, before interacting with our environment.
Dr. Li cites rapid serial visual presentation (RSVP) experiments, which show that humans can easily pick out novel or unique objects in an image or scene, even when each frame is shown for only about 100 milliseconds. For example, in a forest of trees, you might easily detect a person.
There is another layer to understanding. It’s not always enough to understand what we are looking at. Often we also need to understand what might be happening at that point in time. Often this means answering the question, who is doing what to whom?
Dr. Li uses the example of a scene with a person and a llama. A machine or robot might correctly perceive and identify these entities. But does it also matter what we understand of their relationship to each other at that moment in time? Was the human guiding the llama, or was the llama chasing the human? Either scenario might require a different response from a robot.
To solidify visual understanding, you also must understand the relationships between the objects you perceive. For machines to comprehend these relationships, we have to encode them. Images are full of relationships between objects. To illustrate this, Dr. Li and her students created the Visual Genome dataset. In her presentation she showed a scene graph for an image of a bride feeding a groom. The two people are the focal point of the image, but according to Visual Genome’s scene graph, the image also contains many other objects, attributes, and relationships.
This one example shows the complexity of relationships even in the most basic image of two people. How can we use these for computer vision? Scene graphs offer one way to predict visual relationships, including novel ones that were rarely or never seen in training. This can extend to understanding videos as well, which Dr. Li's lab addressed with a dataset called Action Genome.
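A scene graph can be pictured as a small data structure of objects, attributes, and (subject, predicate, object) relationships. The sketch below is an illustrative assumption, not the Visual Genome schema or API; the class and field names are hypothetical. It encodes the two llama scenes described earlier, which contain the same objects but different relationships.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class Relationship:
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneGraph:
    objects: Set[str] = field(default_factory=set)
    attributes: Dict[str, List[str]] = field(default_factory=dict)
    relationships: List[Relationship] = field(default_factory=list)

    def add_relationship(self, subject: str, predicate: str, obj: str) -> None:
        """Record 'subject predicate obj' and register both objects."""
        self.objects.update([subject, obj])
        self.relationships.append(Relationship(subject, predicate, obj))

    def describe(self) -> List[str]:
        """Answer 'who is doing what to whom?' as readable strings."""
        return [f"{r.subject} {r.predicate} {r.obj}" for r in self.relationships]


# Two scenes with identical objects but different relationships.
scene_a = SceneGraph()
scene_a.add_relationship("person", "guiding", "llama")

scene_b = SceneGraph()
scene_b.add_relationship("llama", "chasing", "person")

print(scene_a.describe())
print(scene_b.describe())
```

An object detector alone would report the same two labels for both scenes; only the relationship edges tell them apart.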
A passive understanding of the world isn’t enough. It’s not enough to simply see the world.
We must be able to learn through interacting with the world, in much the same way that a child learns through play. This “action” part of visual understanding also helps with robot learning. Learning through interaction is key to executing complex tasks.
Dr. Li explored how robots can learn by exploring their own environment, just as a young child might. To plan complex tasks, machines often need to know about an environment in a way that most often comes from directly exploring and engaging with that environment autonomously.
Robots are already prevalent in everyday life, from floor-cleaning to medicine-carrying bots in hospitals. These are generally more suitable for skill-level and short-horizon tasks, such as opening doors or grasping objects.
An aspirational goal is to help robots perform long-horizon tasks.
According to an MIT paper, “Long-horizon manipulation tasks involve joint reasoning over a sequence of discrete actions and their associated continuous control parameters.” At a high level, this means using reasoning to plan and execute a longer sequence of steps. Examples of long-horizon tasks for robots are assembling furniture, organizing an office desk, sorting objects into boxes, or cleaning up tabletops. By further developing explorative learning, robots can learn to fulfill these long-horizon tasks.
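The quoted definition can be sketched as a plan whose steps pair a discrete action with continuous parameters. This is a minimal illustration under assumed names (`Step`, `plan_clear_table`, `is_executable`) and made-up coordinates; it is not taken from the cited paper or from Dr. Li's work.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Step:
    action: str               # discrete choice, e.g. "pick" or "place"
    params: Dict[str, float]  # continuous control parameters for that action


def plan_clear_table(objects: Dict[str, Tuple[float, float]],
                     bin_xy: Tuple[float, float]) -> List[Step]:
    """Build a pick-then-place sequence that moves every object to the bin."""
    plan: List[Step] = []
    for x, y in objects.values():
        plan.append(Step("pick", {"x": x, "y": y}))
        plan.append(Step("place", {"x": bin_xy[0], "y": bin_xy[1]}))
    return plan


def is_executable(plan: List[Step]) -> bool:
    """Check the discrete ordering: the gripper holds at most one object."""
    holding = False
    for step in plan:
        if step.action == "pick" and not holding:
            holding = True
        elif step.action == "place" and holding:
            holding = False
        else:
            return False
    return not holding


# Hypothetical tabletop with three objects at (x, y) positions in metres.
table = {"cup": (0.2, 0.5), "plate": (0.4, 0.1), "fork": (0.7, 0.3)}
plan = plan_clear_table(table, bin_xy=(1.0, 1.0))

print(len(plan), is_executable(plan))
```

Even this toy task needs six correctly ordered steps; a single grasp, by contrast, is one step, which is the gap between skill-level and long-horizon tasks.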
Robots aren’t there yet. To help them get there, Dr. Li suggests an ecological approach to perception and robotic learning. This approach includes setting benchmarks for robots to achieve, such as ecological comprehension or performing more complex, multi-step tasks.
To study this, Dr. Li and her collaborators created a benchmark called BEHAVIOR, short for Benchmark for Everyday Household Activities in Virtual, Interactive, and ecOlogical enviRonments. A simulation environment called iGibson 2.0 enables BEHAVIOR. As an AI benchmark, BEHAVIOR characterizes how well a machine can emulate complex human-like behavior and understanding.
In 2017-2018, Dr. Li was a Vice President at Google and Chief Scientist of AI/ML at Google Cloud. She is also a co-founder and chairperson of the national non-profit, AI4ALL, which focuses on increasing inclusion and diversity in AI education. Dr. Li received her Ph.D. in electrical engineering from the California Institute of Technology. She is an elected member of the National Academy of Engineering.
For our next speaker, we are honored to welcome Dr. Fei-Fei Li. Dr. Fei-Fei Li is the Sequoia Professor of Computer Science at Stanford University, and Denning Co-Director of the Stanford Institute for Human-Centered AI (HAI). Her research includes cognitively inspired AI, machine learning, deep learning, computer vision, and AI in healthcare. Before co-founding HAI, she served as Director of Stanford's AI Lab. Dr. Li was a Vice President at Google and Chief Scientist of AI/ML at Google Cloud. Dr. Li is co-founder and chairperson of the national non-profit AI4ALL, which is increasing inclusion and diversity in AI education. Please enjoy Dr. Li's keynote.
Hi everyone. Good morning, good afternoon, and good evening, wherever in the world you are. My name is Fei-Fei Li. I'm a professor at Stanford Computer Science department, and also co-director of Stanford's Institute for Human Centered AI. Today, I'm going to share with you some of the latest work from my lab, and the title of the talk is From Seeing to Doing: Understanding and Interacting with the Real World. I want to take you back 540 million years ago. What was the world like? Most animals, actually all animals, lived in the primordial soup of life, and there aren't that many species on earth. They mostly float in the water and catch a dinner whenever they float by. But something really mysterious happened around 540 million years ago. In a very short period of time, a matter of 10 million years, fossil studies have revealed that the number of animal species just exploded. Zoologists call this Cambrian explosion or the Big Bang of Evolution.
So, what made the number of animals, the types of animals, just increase exponentially? That has been a mystery for zoologists and biologists for a long time. There's one really prominent theory that emerged in the last couple of decades. And it's a theory that has inspired a lot of my own work. This was proposed by a zoologist from Australia named Andrew Parker. He says that the Cambrian explosion was triggered by the sudden evolution of vision, which set off an evolutionary arms race, where animals either evolved or died. Basically, the ability to see the world, to see light, and to see dinner, is the driving force, or one of the major driving forces, of evolution. Animals, from that point on, evolved in all kinds of shapes and forms in order to survive, as well as to reproduce. From that point on to today, essentially all the animals in the world have some kind of vision. And not only did vision come in, animals started to develop intelligence.
The nervous system developed more and more complicated apparatus. And now we have humans with one of the most complicated brains in the history of our world. That is a very, very, very brief history of vision. And that's how I think about my research. I view vision as a cornerstone of intelligence, whether it's biological or artificial. And in my work in AI and Computer Vision, I try to use vision to understand intelligence and to build intelligent machines. For the rest of the talk, I want to share with you what vision means. To me, it means two very important things. One is to understand the real world. The other is for doing things, interacting and acting in the real world. Let's just start by the first, understanding. Psychologists have told us, and use studies to show that human vision is remarkable.
Humans are capable of perceiving real world objects and things in a really phenomenal way. In this very early study in the '70s, cognitive scientist Irving Biederman showed that the ability to recognize a bicycle in two different pictures, one coherent and one incoherent, was dramatically different. Humans are better at seeing bicycles in a coherent picture, even though the bicycle itself hasn't changed locations. Concurrently, Molly Potter and some of her colleagues have shown that humans have a remarkable ability to detect novel objects. In this video you'll see there's one frame that contains a person. Even though you've never seen this video, you have no problem detecting where the person is, roughly what he or she is ... The location on the screen and the gestures. And keep in mind, every frame is only presented for a hundred milliseconds. So, the frame change is at 10 Hertz, yet our visual system is very good at detecting these novel objects.
Back in 1996, about 25 years ago, neurophysiologist Simon Thorpe and his colleagues showed, through a brain EEG study, that as early as 150 milliseconds after a picture is shown, our brain shows a differential signal that can tell apart a picture with animals versus a picture without animals. And here we're talking about all kinds of animals, among all kinds of [inaudible 00:06:51] images. So, it's quite a remarkable ability of human vision. I myself, about 15 years ago, did an experiment when I was a graduate student, where we put human subjects in front of a computer screen and flashed to them real world photos masked by a wallpaper-looking structure. And we asked human subjects to type what they see. And you can see some of these images are flashed in a really, really fast way. Yet humans are very good at seeing what these things are. If the picture is presented for 500 milliseconds, it's like eternity, people can write novels if you pay them enough.
So, there's something special about our visual system. We can use it to understand the world. In fact, Alan Turing, one of the most inspiring people in the history of computer science and an inspiration to the field of AI, conjectured that we could take a machine and teach it to understand the real world. And this is what I think seeing is for. Seeing is for understanding, is for making sense of what this visual world is about. So, back to this experiment. We see that humans are able to understand and make sense, and perceive the visual world. But what are the key elements or building blocks of this? If you look at what humans type, when presented with a picture like this, they talk about objects, like men, fist, face, grass, helmet, clothing, trees, dogs, or other things.
So indeed, object recognition is a building block of visual understanding, or of vision. And for those of you who are not familiar with this, what is object understanding? It's defined by the task of showing a visual system, whether it's a biological visual system like our own, or a computer, a picture, and the system is able to identify what is the main object in the picture. For example, this is a wombat in the picture. Why is it hard? Or maybe it is not, because humans can do this easily. It turns out it's actually quite a difficult task for computers. For one thing, computers have to see this in just numbers, you know, color numbers or luminance numbers. But going from numbers to the understanding that there is a wombat takes a lot of computation. In fact, objects, even though they can be the same object, can come in many different kinds of shapes and forms, and environments, not to mention there's a 3D world that renders these objects in an infinite number of possibilities.
In fact, understanding objects, or object recognition, has been a quest for more than half a century in Computer Vision. Early days, people tried to use hand design models to configure geometric shapes, to try to express objects in a mathematical language. And there was some heroic efforts in the '60s, '70s about object recognition. But as we fast forward, shortly before the turn of the century, Machine Learning as a field became a really important mathematical tool for Computer Vision and AI. And computer scientists learned that we don't have to hand design models. We can learn models and the parameters, but we have to rely on hand design features. So, we input features, whether it's patches of images, or some kind of encoding of pixels, and then we try to learn through data and through learning models, how these features configure. And there were a lot of great works that can come out of this.
As we start to push towards solving the problem of object recognition, one important aspect of the work, or research, came about, and that is the design of datasets and benchmarks. In the early days of object recognition, one of the most prominent datasets was the European PASCAL VOC dataset, focused on 20 object categories. And it was released annually between 2006 and 2012 to encourage the field of Computer Vision. All the labs, worldwide, benchmarked against the testing data of this dataset, to assess the progress of the field. But the truth is, the world is a lot larger than 20 categories. In fact, psychologists have estimated tens of thousands, if not hundreds of thousands, of categories of objects. Here, I want to bring you a quote from one of the most important psychologists that has influenced my thinking in terms of how to work in AI, and that's J. J. Gibson. Gibson has said, or a psychologist has paraphrased Gibson by saying, "Ask not what's inside your head, but what your head is inside of." This is a really important concept, encouraging us to think about an ecological approach to perception. So, when we are working on, say, object recognition, we know that it's a building block for understanding the world. We really need to put emphasis on the scale of the real world.
Inspired by this concept, around 2007, my students and I were looking at the size of the datasets for training object recognition models. And we were deeply unsatisfied because they hovered around thousands, if not tens of thousands, of images, but were truly small compared to the visual world that we experience. This is when we built together ImageNet, a dataset of 15 million images across 22,000 object categories. The goal of ImageNet is to really establish object recognition as one of the most important North Stars in Computer Vision, and use the benchmark dataset of ImageNet to encourage training with real world scale, and understanding with real world scale. Of course, a lot of you are already familiar with the rest of the history. ImageNet put together an international challenge annually between 2010 and 2017. And our testing dataset became a benchmark dataset for the Computer Vision object recognition research community.
In 2012, the winner of the ImageNet Challenge, especially the Object Classification Challenge, was a convolutional neural network model. And that was the beginning of Deep Learning's revolution. Since then, we have seen a lot of different models built upon and benchmarked against ImageNet, and the field has made tremendous progress. Here's another way to show how the ImageNet accuracy has evolved based on different models. So, a lot of progress has been engendered. But the world is more than just discrete object classes. In fact, there's a lot more than recognizing different objects. Here, I show two images where object detectors will tell you the same objects exist in both, the animal llama and the person. They look similar. One picture looks like this. But if you look at the other picture, you realize these are two very, very different pictures, because of the relationship between the objects.
In fact, psychologists have long conjectured that to characterize a scene, or to understand a visual scene, the real visual world, relationships between objects must be coded in addition to the identities of objects. And this brings us to follow-up work to ImageNet by my students and collaborators on scene graph representation, where we look at not only object identities in an image, but also the attributes of objects, like the colors and expressions, and so on, as well as the relationships. In fact, every image is full of different relationships. We put together this dataset called Visual Genome, which contains 100,000 images, 3.8 million objects, 2.3 million relationships, and also 5.4 million textual descriptions of the scenes. Our following work looked at how we can predict visual relationships using scene graphs, and be able to achieve relationship recognition, for example, creating a model that can take a picture like this and call it Person Riding Horse, or Person Wearing Hat.
In fact, our model can also do a zero shot learning by looking at new relationships, such as Horse Wearing Hat, which is really rare in real world things, but with this compositional representation using scene graph, we're able to achieve this kind of zero shot learning on novel relationships. And some quantitative numbers show that our scene graph model for relationship estimation, as well as zero shot ... For relationship estimation beats the, back then, state of the art algorithms. Of course, the community has done a lot more interesting work since then, based on our scene graph representation here, I just list a few work by other labs on all kinds of scene graph modeling. And we have also extended this beyond static scenes into videos, and created a new dataset, a benchmark called Action Genome, and using spatial temporal scene graph to represent actions, and use this to perform tasks like recognition, or few shot recognition.
In fact, we have gone one more step further and been inspired by Alan Turing's words, that understanding the real world scene might connect the machine to also speaking English, in this case. So, we have worked on a series of models, where you can take a picture and perform image captioning or dense image captioning, as well as paragraph captioning. So, that was a very quick overview of one part of visual intelligence, which is the perception part. The perception part takes the pixels of the real world, feed it into the AI agent, and the agent is able to do important tasks, like Object Recognition, Visual Relationship Prediction, captioning, and so on. We introduced two data sets. One is ImageNet, one is Visual Genome, and a representation called Scene Graph. But our lab has done work around the problem of perception and around both benchmarks learning, representation, and connecting it to language.
But I want to now shift gears and ask the question, "Is just passive understanding of the world enough for visual intelligence?" My answer would be no. I bring you to Plato's allegory of the cave, where he describes the passive perception of the world as prisoners tied to chairs. They're forced to watch only what is in front of them, while a play is on full display behind their heads. What they see are the shadows of the play, and they need to make sense of the real world. So, in fact, if we only look at this world in a passive way, we're a little bit like the prisoners of the allegory of the cave, and that would limit important functions of our visual experience. For example, we won't be able to fully understand how to interact with these objects, especially if we view them in angles that won't enable us to interact effectively.
In fact, real visual experience is extremely dynamic. You and I move around all the time, and animals move around all the time. And they do a lot of things. And that is what I think visual intelligence is about. Here, I'll share with you one favorite quote of mine, which is by philosopher Peter Godfrey-Smith, who says the original and fundamental function of the nervous system is to link perception with action. And this is a very famous experiment done on two kittens, back in the 1960s, with newborn kittens, where one is allowed to be an active kitten, and one is allowed only to be a passive kitten. The active kitten drives the yoke to explore visually what the world is like. Whereas the passive kitten is not allowed to explore by its own proactiveness. It only sees the world as the active kitten moves around. And a few weeks later, it was demonstrated that the active kitten has a much better developed perceptual visual system than the passive kitten. Not only do we find this evidence in kittens, we also find evidence in monkeys and humans. We have neurons, called mirror neurons, that are responsible for looking at other people's movement, and responding to that. So, in a way, we're hardwired to perceive movements and want to do the same.
This brings me to the second half of the talk, which is seeing is for doing, in the world. And we complete our little schema of the world, where the agent now not only perceives the world, but acts in it. And what are the critical ingredients of acting in the world? I think there are several. One is that it should be embodied.
Moving around in the world is both explorative, as well as exploitative. It's most likely multimodal. A lot of times it's multitasking and it's really important we allow the agent to be able to generalize, and oftentimes it's social and interactive with other agents. This brings me to a far reaching dream of AI, which is to create robots that can perform a lot of complex human behavior, human tasks. This is Rosie, the robot. Of course, we're not there yet, but the rest of my talk I want to share with you some of our efforts towards robotic learning, using vision in real world things. Like I said, learning in active agent or embodied agent, is both explorative and exploitative. Let me just start with explorative, which is really learning to play. There is a huge body of literature in this. I won't be able to do it justice.
Some of my favorite work comes from Allison [inaudible 00:24:44] and Liz [inaudible 00:24:47], and many others, where we imitate human newborns or human children, where they spend a lot of time playing without a purpose, yet they're learning and exploring the world. There are different flavors of this kind of Explorative Learning. There is the novelty based motivation, there is the skill based motivation, and the world model based motivation. And that's where our work is mostly anchored to. This is also related to previous work on predicting things, what to expect in future frames of videos and dynamics, but I won't get into the details. Basically, the work my colleagues and collaborators have done is to create a model that works on two models. There is a world model network that predicts the consequences of actions, of the actions of the embodied agent exploring a world.
And then there is a self model network that predicts errors of the world model, and tries to correct those errors. So, the intrinsic reward is a policy mechanism, where we choose actions that maximize world model loss, predicted by self model. And this is to maximize the exploration. Putting the self model and world model together is our intrinsically motivated self-aware agent. And we use that to explore a simulated world, 3D world, with objects. And you can see that the agent is able to explore, in a similar way, the blue line, like human babies, they start with the self motion, Ego motion, and then they start to look at one single object, and then they start to look at two objects. And this lower right panel shows you that the model learned through this self-exploration, or self motivation, can be able to do Downstream Object Recognition tasks better than a Random Policy model.
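The interplay described here (a world model that predicts outcomes, a self model that predicts the world model's error, and a policy that picks the action with the highest predicted error) can be illustrated with a deliberately tiny tabular sketch. The talk's models are neural networks acting in a 3D simulator; everything below, including the two actions, the toy `environment` function, and the learning rate, is an assumption made for illustration.

```python
ACTIONS = ["stay_still", "push_object"]
LEARNING_RATE = 0.5


def environment(action: str, t: int) -> float:
    """Toy world: staying still is trivially predictable, pushing is not."""
    if action == "stay_still":
        return 0.0
    return float(t % 4)  # outcome keeps changing, so it is harder to predict


# World model: predicted outcome per action.
world_model = {a: 0.0 for a in ACTIONS}
# Self model: predicted world-model loss per action (optimistic start,
# so every action looks worth trying at first).
self_model = {a: 1.0 for a in ACTIONS}
counts = {a: 0 for a in ACTIONS}

for t in range(20):
    # Intrinsic motivation: choose the action the self model expects
    # the world model to be worst at predicting.
    action = max(ACTIONS, key=lambda a: self_model[a])
    counts[action] += 1

    outcome = environment(action, t)
    loss = (outcome - world_model[action]) ** 2

    # Update both models toward what was just observed.
    world_model[action] += LEARNING_RATE * (outcome - world_model[action])
    self_model[action] += LEARNING_RATE * (loss - self_model[action])

print(counts)
```

Once the boring action becomes predictable, its expected loss drops and the agent spends its time on the action it still cannot predict, which is the curiosity-like behavior the talk describes.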
So, that was an example of Explorative Learning. Let's go to Exploitative Learning, which is much more goal-based. I'm going to just very briefly remind everybody that our eventual goal is to make robots do long horizon tasks. But most of the work in today's robotic learning is on very short, punctual, skill-level tasks. So, we need to try to close the gap by encouraging robots to do longer horizon tasks, like cleaning up tabletops in a longer horizon way. Here with my students and collaborators, we put together a Neural Task Programming model. We were actually inspired by Computer Vision research, enabling robotic learning to be compositional through skill-level tasks that are hierarchically stacked together. I'm not going to get into the details of this compositional representation, but here is a result showing that our robots are able to perform longer horizon tasks better than a state of the art result. And we can perform multiple tasks, not just color block stacking, but also sorting. In fact, we can also resist some interruptions. Here, the experimenter is going to disrupt what this color block task is. And the robot is able to compose the task automatically by itself and reset its goals, and complete the task.
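The idea of stacking skill-level tasks hierarchically can be pictured with a simple recursive expansion. This is only a hand-written lookup table under assumed task names; it is not the learned model from the talk, which infers the decomposition rather than reading it from a dictionary.

```python
from typing import Dict, List

# Hypothetical primitive skills the robot can execute directly.
PRIMITIVES = {"pick", "place"}

# Hypothetical composite tasks, each defined in terms of smaller tasks.
PROGRAMS: Dict[str, List[str]] = {
    "move_block": ["pick", "place"],
    "stack_blocks": ["move_block", "move_block"],
    "sort_blocks": ["move_block", "move_block", "move_block"],
}


def expand(task: str) -> List[str]:
    """Recursively unfold a task into the primitive skills it is built from."""
    if task in PRIMITIVES:
        return [task]
    steps: List[str] = []
    for subtask in PROGRAMS[task]:
        steps.extend(expand(subtask))
    return steps


print(expand("stack_blocks"))
print(len(expand("sort_blocks")))
```

Because stacking and sorting reuse the same `move_block` building block, a new long-horizon task only needs a new top-level composition, not new low-level skills.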
So, again, I showed you one example of Exploitative Learning towards long horizon tasks. Let me just say that this has been something that my lab has been focusing on from various angles. In a newer line of work, we continue to look at long horizon tasks and their generalization by training a robot through curriculum learning, where we know the target task, but we know it's really hard for the robot to learn at the beginning. So, we generate a series of simpler target tasks to guide the robot. And this is related to a lot of generative models we have recently seen in the AI community. I'm going to skip the workflow of this model to show you that our robots are capable of learning different kinds of long horizon tasks by this curriculum training.
And in fact, they even generalize to a different simulated desktop. So, that was three examples of Robotic Learning, but we're still not there yet in achieving this kind of real world task. There is a key missing piece. And that key missing piece brings us back to the fact that today's robotic tasks are still mostly skill level tasks with short horizon goals. Even if we try to do some longer horizon tasks, they tend to be small scale and anecdotal. They're experimenter-picked tasks and lack standard metrics. Either some of the tasks are in artificially simple environments, or if we bring the previously trained robot to a real experiment, it just fails miserably. Here's an example of that. This brings us back to J. J. Gibson, that we need an ecological approach to perception and Robotic Learning.
And we have seen great progress in vision and NLP, and other areas of AI. We hope that in Robotic Learning, we can also work towards benchmarks that are large scale and diverse, ecological in general, complex, and have standardized evaluation metrics. This is our latest work called Behavior. Behavior is a benchmark for everyday household activities in virtual, interactive, and ecological environments. Behavior is enabled by a simulation environment called iGibson 2.0. It's an object-centric environment for Robotic Learning of everyday household tasks. I'll just go over very quickly what iGibson is. iGibson is an environment that's very much inspired by a lot of concurrent work like Habitat, ThreeDWorld, SAPIEN, and AI2-THOR, and its goal is to be realistic in Object Modeling, photorealistic in rendering, with simulation for both kinematic and non-kinematic state changes, and fully physically simulated action execution, as well as allowing a VR interface for human demonstration. I'll just skip the details of iGibson.
You can find the details of the iGibson work on our Stanford website. iGibson enables Behavior. With this benchmark, as we said, we want to build an embodied AI benchmark that is large scale, ecological, complex, and has standardized evaluation metrics. Behavior so far has a hundred different tasks. They are gathered through the American Bureau of Labor Statistics, by sampling what Americans do in their daily life, and we put together this dataset of 100 tasks. So, in terms of statistics, Behavior is a lot wider ranging compared to other datasets, which focus on just a narrower band of tasks, and the statistics of Behavior track the general statistics of the [inaudible 00:33:53] tasks. It's also ecological, in general. Here, we show you, with one example of clearing a table, very different object positions, environments, renderings of objects, and textures. We have done extensive statistical analysis showing the diversity of objects and things.
It's also long horizon and complex. We show you that an average behavior task length is measured by 300 to 20,000 steps. Whereas other task benchmarks are mostly smaller than 100 steps, or between 100 and 1,000 steps. Behavior is really going towards real life complexity in terms of tasks. Last but not least, it tries to standardize evaluation metrics by allowing a logic based representation to score the end state, compared to the initial state. I'm just going to skip the details of this and move on. Last but not the least, we also allow human VR demo in our Behavior benchmark, and we can use that for benchmarking against efficiency of execution. What excites me the most in this graph, is Behavior is really, really hard. We benchmarked task performance of Behavior against a couple of state of the art algorithms.
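Scoring an end state with a logic-based representation can be sketched as checking which goal predicates hold before and after the episode. The predicate format, the table-clearing goal, and the function names below are illustrative assumptions; this is not the benchmark's actual definition language or scoring code.

```python
from typing import Dict, FrozenSet, List, Tuple

Predicate = Tuple[str, str, str]  # (relation, object, target)


def fraction_satisfied(state: FrozenSet[Predicate],
                       goal: List[Predicate]) -> float:
    """Share of goal predicates that are true in the given state."""
    return sum(cond in state for cond in goal) / len(goal)


def evaluate(initial: FrozenSet[Predicate],
             final: FrozenSet[Predicate],
             goal: List[Predicate]) -> Dict[str, object]:
    """Score the end state and report progress relative to the initial state."""
    before = fraction_satisfied(initial, goal)
    after = fraction_satisfied(final, goal)
    return {"success": after == 1.0, "score": after, "progress": after - before}


# Hypothetical "clearing the table" episode: only the plate was moved.
goal = [("inside", "plate", "sink"), ("inside", "cup", "sink")]
initial = frozenset({("on_top", "plate", "table"), ("on_top", "cup", "table")})
final = frozenset({("inside", "plate", "sink"), ("on_top", "cup", "table")})

result = evaluate(initial, final, goal)
print(result)
```

Because the goal is expressed as logical conditions rather than exact poses, the same check applies no matter where the objects started or which scene was used, which is what makes the metric standardizable.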
And I want you to look at this left most bar, where we use a default behavior without giving privileged information. You can see that the performance is close to zero. This is where I think we're starting to be on the journey of creating robotic embodied agents that can do really complex household activities and can be benchmarked against Behavior dataset. And for those of you who are interested, you can visit our website to learn more. So, in short, in seeing is for doing I've shared with you the iGibson environment that enabled the Behavior challenge, or Behavior dataset. I've also shared with you some of our earlier Robotic Learning work in curiosity based explorative learning, as well as long horizon task driven learning. We have done more work that you can find on our website. And I want to conclude by reminding all of us that vision is a cornerstone of intelligence. It enables us to understand and to do things in this real world. And our research is formulated for these two goals, especially inspired by J. J. Gibson's ecological approach to perception and Robotic Learning. Thank you everybody. This is my awesome team at Stanford with so many great students and collaborators, some of them are not even on this photo, but thank you so much. Bye.