Building Better Reinforcement Learning With World Models & Self-Attention Methods
David is a Research Scientist at Google Brain. His research interests include Recurrent Neural Networks, Creative AI, and Evolutionary Computing. Prior to joining Google, he worked at Goldman Sachs as a Managing Director, where he co-ran the fixed-income trading business in Japan. He obtained undergraduate and graduate degrees in Engineering Science and Applied Mathematics from the University of Toronto.
Internal mental models, as well as consciousness and the concept of mind modeling, are major themes in neuroscience and psychology. However, we do not understand them well enough to create conscious artificial intelligence. In this talk, David Ha, Research Scientist at Google Brain, explores building "world models" for artificial agents. Such world models construct an abstract representation of the agent's environment that allows the agent to navigate it. David discusses artificial agents' use of world models and self-attention as a kind of bottleneck, connecting these ideas with methods from computational evolution and artificial life. The goal of the presentation is to motivate scientists working toward conscious machines to build artificial life that includes an internal mental model.
Speaker 1 (00:15): Next up, we're excited to welcome David Ha. David is a research scientist at Google Brain. His research includes recurrent neural networks, creative AI, and evolutionary computing. Prior to joining Google, he worked at Goldman Sachs as a managing director, where he co-ran the fixed-income trading business in Japan. He has undergraduate and graduate degrees in engineering science and applied mathematics from the University of Toronto. David, over to you.
David Ha (00:47): Hi, everyone. I'm really honored to be giving a talk at TransformX. My name is David Ha and I'm a research scientist at Google Brain, and today my talk is going to be about world models and attention for reinforcement learning. I'm currently working on the Google Brain team inside Google. It's a research team formed in the early 2010s with a focus on combining open-ended machine learning research with large-scale computing resources. The team has pioneered many important deep learning innovations that we all rely on, and has also created tools like TensorFlow, JAX, and TPUs for research and development.
David Ha (01:35): When I first started at Brain a few years ago, my research interests at the time were sequential models, and also the concept of doodling, because there are analogies with how we develop our abstract concepts of everyday things. This eventually led me to explore the space of combining generative models with reinforcement learning, and the line of work related to world models and self-attention for RL, which I'll be discussing in this talk. My work on world models was inspired by the cognitive neuroscience domain. Our mental models, made up of our unique experiences, determine how we interpret the world. They also affect our actions, how we assess opportunities, and how we solve problems in our everyday lives. In a sense, we look at the world through the lens of our mental world models.
David Ha (02:32): So in my work, I look at building generative models of visual game environments, and I train agents to look at their world through the lens of their own generative world models. Yann LeCun, one of the pioneers of deep learning, had this nice quote on the problem of reinforcement learning. He said, "If you use pure reinforcement learning to train an agent to drive a car, it's going to have to crash into a tree 40,000 times before it figures out that it's a bad idea. So instead they need to learn their own internal models of the world, so they can simulate the world faster than real time." He likes to say that reinforcement learning is just the cherry on the cake. In consciousness studies, the global workspace theory proposes that conscious processing revolves around a communication bottleneck between selected parts of the brain, which are called upon when addressing the current task.
David Ha (03:39): The cognitive scientist Stanislas Dehaene also wrote that consciousness has a precise role to play in the computational economy of the brain: it selects, amplifies, and propagates relevant information and thoughts. So in my work, I view world models as a means to enforce this communication bottleneck for the agent, and tools like latent space representations and attention can help select, amplify, and propagate the relevant information for the agent. Many researchers have been inspired by the book Thinking, Fast and Slow by the economist Daniel Kahneman, where the key idea is that our thinking is divided into System 1, or instinctive thinking, which captures our raw animal instincts and our ability to react, and System 2, which captures higher-order thinking involving planning or imagination. As I'll describe later on, I think world models are not only associated with System 2, but are actually an integral part of both systems.
David Ha (04:49): When I started this work, I looked at using only the very basic building blocks to build simple generative models for RL agents. A generative model has to have an abstract representation of time and space. Here, I use a simple variational autoencoder to learn a low-dimensional representation of pixel images, like these celebrity photos, but here we're going to apply it to video games. For the representation of time, I use a recurrent neural network trained to predict the future. Much like the Sketch-RNN model used to predict the future strokes of a pen when drawing doodles, we can use a recurrent neural network to predict the distribution of future latent vectors from the variational autoencoder. And by sampling from this distribution, we can get the model to imagine different possibilities.
David Ha (05:53): I'll first go over my thoughts on how world models can be used for System 1 type instinctive behavior for our agents. In RL environments, we are typically provided with an image observation frame at every time step. Our agent starts by collecting sequences of observations from random actions, and we first train a variational autoencoder to develop a representation of space for the images collected, training the VAE on all the images in the dataset. We can then train a recurrent neural network on the dataset to predict the future, that is, to predict the next latent vector for each of these collected sequences. So we have a representation of space in the VAE, and for the representation of time, we can simply use the hidden state vector of the recurrent neural network.
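To make this pipeline concrete, here is a minimal sketch of the two training stages in PyTorch. The sizes, module names, and the single-Gaussian prediction head are my own illustrative simplifications; the actual work uses a convolutional VAE and a mixture-density RNN.

```python
# Minimal sketch of the two-stage world model training described above.
# Sizes and architectures are illustrative assumptions, not the original code.
import torch
import torch.nn as nn

Z_DIM, H_DIM, ACTION_DIM = 32, 256, 3  # assumed latent, RNN hidden, and action sizes

class VAE(nn.Module):
    """Representation of space: encode a 64x64 RGB frame into a latent vector z."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 512), nn.ReLU())
        self.mu, self.logvar = nn.Linear(512, Z_DIM), nn.Linear(512, Z_DIM)
        self.dec = nn.Sequential(nn.Linear(Z_DIM, 512), nn.ReLU(),
                                 nn.Linear(512, 64 * 64 * 3), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar, z

class RNNModel(nn.Module):
    """Representation of time: predict the distribution of the next latent z."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(Z_DIM + ACTION_DIM, H_DIM, batch_first=True)
        self.head = nn.Linear(H_DIM, 2 * Z_DIM)  # mean and log-std of z_{t+1}

    def forward(self, z_seq, a_seq, state=None):
        h_seq, state = self.rnn(torch.cat([z_seq, a_seq], dim=-1), state)
        mu, log_std = self.head(h_seq).chunk(2, dim=-1)
        return mu, log_std, h_seq, state

# Stage 1: train the VAE on frames from random rollouts (reconstruction + KL loss).
# Stage 2: freeze the VAE, encode each rollout into a sequence of z vectors, and train
#          the RNN to maximize the likelihood of z_{t+1} given (z_t, a_t).
```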
David Ha (06:54): We can then feed the concatenation of this low-dimensional latent vector z, and also h, the representation of time, into a neural network controller. The controller cannot actually see the pixel images; it can only see the abstract representations of space and time. This controller, a linear controller in our case, will output an action that affects the environment, which will then cause the environment to produce the next frame. And we train this entire system to enable the controller to perform the task, in this case to drive the car around the track. We applied this method to the car racing game in the OpenAI Gym environments. Back in 2018, this was the first method able to solve the task and achieve an average score above the required performance.
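The controller itself is just a single linear layer acting on the concatenation of z and h. A minimal sketch, reusing the assumed sizes from the previous snippet:

```python
import numpy as np

Z_DIM, H_DIM, ACTION_DIM = 32, 256, 3  # same assumed sizes as above

def controller(z, h, W, b):
    """Linear policy: map the world model features [z, h] to an action.
    The controller never sees pixels, only the abstract representation."""
    x = np.concatenate([z, h])
    return np.tanh(W @ x + b)  # squash into a valid steering/throttle/brake range

# The full parameter vector is tiny compared to the world model:
n_params = (Z_DIM + H_DIM) * ACTION_DIM + ACTION_DIM
print(n_params)  # 867 parameters, i.e. "less than a thousand"
```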
David Ha (07:55): You can notice that while the VAE and the RNN combined have millions of model parameters, the actual linear controller making the decisions has fewer than a thousand parameters, kind of resembling the cherry on the cake. Because the parameter count is so low, we can use evolution strategies, or even random search, to train the controller, while using backpropagation and TPUs to train the world model from the data collected. Here we show that the agent can still drive around the track even if we only give it the representation of space from the VAE and hide the RNN's information, although the agent tends to wobble a lot. When we give the agent back the representations of both time and space, we see that it learns a good policy to solve the task and can attack the sharp corners on the track, taking advantage of the temporal information from the RNN.
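Because the controller is so small, a simple population-based evolution strategy is enough to train it; the original work uses CMA-ES, but the plainer variant sketched below illustrates the idea. Here, `fitness_fn` is assumed to roll out the policy in the environment and return the cumulative reward.

```python
import numpy as np

def evolve(fitness_fn, n_params, pop_size=64, sigma=0.1, lr=0.03, generations=200):
    """Simple evolution strategy over the controller's parameter vector."""
    theta = np.zeros(n_params)
    for _ in range(generations):
        noise = np.random.randn(pop_size, n_params)            # population of perturbations
        rewards = np.array([fitness_fn(theta + sigma * n) for n in noise])
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        theta += lr / (pop_size * sigma) * noise.T @ advantages  # move toward better samples
    return theta
```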
David Ha (08:54): So far, we have shown that the world model can be used as a bottleneck and also as a useful feature extractor for an RL agent. But since the world model is a generative model that can predict the future, its representations allow it to generate a version of the environment that can actually be used to train an agent. This brings us to using world models for a System 2 type approach and trying to train agents entirely inside their own imaginary worlds. If we go back to the previous example, the setup still requires pixel observations from the actual environment. But since our model can predict future observations in latent space, we can simply remove the actual environment and the pixel images from the setup entirely, use the world model to sample possible latent space paths of the future, and then train our agent entirely inside the latent space environment imagined by its world model. We experimented with a simple Doom environment where the objective for the agent is to avoid fireballs shot by monsters inside a room and survive for as long as possible.
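In this imagination mode, the learned RNN takes the place of the environment: it steps forward purely in latent space, and model predictions replace the game's own reward and termination signals. A rough sketch of one imagined rollout, reusing the assumed RNNModel from the earlier snippet; the temperature knob mentioned below and the way survival time is counted are simplifications for illustration.

```python
import torch

def dream_rollout(rnn, policy, z0, h_dim=256, max_steps=1000, temperature=1.0):
    """Roll out a trajectory entirely inside the world model's latent space."""
    z, state, survived = z0, None, 0
    for _ in range(max_steps):
        h = torch.zeros(h_dim) if state is None else state[0].view(-1)
        a = policy(z, h)                                  # act on latent features only
        mu, log_std, _, state = rnn(z.view(1, 1, -1), a.view(1, 1, -1), state)
        std = torch.exp(log_std) * temperature            # temperature scales the uncertainty
        z = (mu + std * torch.randn_like(std)).view(-1)   # sample the imagined next frame's z
        survived += 1
        # In the Doom experiment, a predicted "done" flag (another model output, omitted
        # here) would end the imagined episode; survival time acts as the reward.
    return survived
```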
David Ha (10:14): The result is a neural simulator of this specific Doom game that we can actually play ourselves. Back then I thought the idea of being able to play inside an imaginary neural network world was so interesting that I spent some time porting this model over to JavaScript and making a web browser demo, which allowed me to interactively try the neural simulator inside a web browser. It's probably the first neural simulator of a visual environment that runs in a webpage, and it's probably the last. I put some knobs into the simulator, like being able to control the temperature, an uncertainty parameter, which allowed me to make a noisier version of the environment, and that turned out to be useful for this research. As these models are just an approximation of the real environment, they're not perfect. Here, we see that when we train an agent inside the world model, it learns to move in a weird way so that none of the monsters in the room will ever fire a single fireball.
David Ha (11:20): So it learned a cheat code that's only available in the neural simulator, and hence it can live forever in the simulator. But such a policy takes advantage of imperfections of the neural network simulator and will obviously fail to transfer to the real game. Instead, if we increase the uncertainty parameter of the world model and have it imagine a more chaotic version of the game, our agent is able to learn a policy inside this noisier, more difficult version of reality, and that policy is able to transfer back to the actual game.
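In the original setup, the RNN actually predicts a mixture of Gaussians for the next latent vector, and the temperature knob works by flattening or sharpening the mixture weights and scaling the noise. A standalone sketch of that sampling step, with shapes assumed for illustration:

```python
import numpy as np

def sample_next_z(logit_pi, mu, log_sigma, temperature=1.0):
    """Sample the next latent vector from a mixture-of-Gaussians prediction.
    Higher temperature -> flatter mixture weights and larger noise -> a more chaotic
    imagined environment that is harder for the agent to exploit.
    Assumed shapes: logit_pi, mu, log_sigma are all (n_mixtures, z_dim)."""
    logits = logit_pi / temperature
    pi = np.exp(logits - logits.max(axis=0))          # softmax over the mixture components
    pi /= pi.sum(axis=0)
    z = np.empty(mu.shape[1])
    for d in range(mu.shape[1]):                       # sample each latent dimension
        k = np.random.choice(pi.shape[0], p=pi[:, d])
        sigma = np.exp(log_sigma[k, d]) * np.sqrt(temperature)
        z[d] = mu[k, d] + sigma * np.random.randn()
    return z
```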
David Ha (11:59): World models are trained on data collected from random agents, so they cannot extrapolate well beyond what the agent has experienced so far. For most environments that require some form of exploration, we need the agent to continually update its world model with new data collected whenever it interacts with the real environment. For example, even in a simple task like the cart-pole swing-up from pixels environment, the world model doesn't have a good idea of what happens when the pole swings upwards. But after the agent learns a policy from imagination and is redeployed to the actual environment to collect more data, the new data can be used to improve the world model. After 20 iterations or so, as shown here on the right, the world model learns more accurate dynamics of the environment, and the agent is able to learn the task by training inside the world model.
David Ha (13:01): This simple concept of iterative training has been explored in more detail in a subsequent work that presented the SimPLe algorithm, which proceeds in three steps. First, the agent interacts with the real environment. Second, the collected observations are used to update the current world model. And third, the agent updates the policy by learning inside the world model. In that work, they applied this SimPLe iterative training, using a scaled-up version of the world model optimized for pixel prediction, to the Atari domain. Back in 2019, when this work was first published, this model-based method achieved pretty good results for sample efficiency when learning several Atari games, because most of the learning can be done in imagination mode, while the interactions with the actual Atari game are only for collecting data and measuring the real performance, as opposed to traditional RL methods where the learning takes place in the real environment.
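The iterative loop can be written down almost verbatim from those three steps. Here is a schematic version; the three callables are stand-ins for the components described above, not real library functions.

```python
def iterative_training(real_env, world_model, policy,
                       collect_rollouts, update_world_model, train_policy_in_model,
                       n_iterations=15):
    """Schematic SimPLe-style loop: act in the real environment only to gather data,
    and do most of the learning inside the world model."""
    dataset = []
    for _ in range(n_iterations):
        dataset += collect_rollouts(real_env, policy)    # 1. interact with the real environment
        update_world_model(world_model, dataset)         # 2. fit the model to all data so far
        train_policy_in_model(policy, world_model)       # 3. learn the policy in imagination
    return policy
```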
David Ha (14:14): The world models learned for these Atari games are so good that it is actually difficult to tell the difference between the actual environment and the one generated by the neural network, as you can see here. Because the predicted frames have some stochasticity or randomness, we can compare the actual rollout from the Atari game with the generated version and see the differences over time. While differences exist, both environments tend to be internally self-consistent. These latent space world models have also been combined with a planning algorithm for the policy, rather than using a neural network controller. In this paper, we showed that planning in latent space allowed controllers to solve problems from only pixel observations and learn policies with better data efficiency compared to previous methods that rely on state observations. An improved version of this work from Danijar Hafner learns a discrete latent space representation of the world.
David Ha (15:25): And there, they showed good results in both Atari and continuous control domains. For instance, this MuJoCo humanoid task was once considered a really difficult task, even for a controller that had access to the actual state information, like positions, velocities, and so on. Here, the controller must output a few dozen commands to control each of the individual motor joints of this humanoid. The agent learns behaviors purely from predictions in a compact latent space and can control the humanoid from raw video footage. It's kind of like operating a remote-controlled humanoid robot by looking at it from a third-person perspective. You can see the trajectories in the real environment and the ones sampled in the latent space imagined environment in the figure on the right.
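Planning in latent space is typically done with a simple shooting method such as the cross-entropy method: sample candidate action sequences, roll each one forward in the learned latent dynamics, and refine the search around the best ones. A generic sketch, assuming stand-in `dynamics(z, a)` and `reward(z)` functions learned by the world model:

```python
import numpy as np

def cem_plan(z0, dynamics, reward, action_dim, horizon=12,
             n_candidates=500, n_elites=50, n_iters=5):
    """Cross-entropy-method planner over imagined latent rollouts.
    dynamics(z, a) -> next latent state; reward(z) -> predicted reward."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences around the current plan.
        actions = mean + std * np.random.randn(n_candidates, horizon, action_dim)
        returns = np.zeros(n_candidates)
        for i in range(n_candidates):
            z = z0
            for t in range(horizon):
                z = dynamics(z, actions[i, t])
                returns[i] += reward(z)
        # Refit the sampling distribution to the best-performing sequences.
        elites = actions[np.argsort(returns)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]  # execute only the first action in the real environment, then replan
```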
David Ha (16:26): Another interesting world model neural simulation project is this neural driving simulator from a team at Nvidia, which uses high-quality GANs as part of the dynamics of the model, trained to mimic a dataset collected from US roads. They've built an interactive simulator here so that people can drive inside of it, kind of like the Doom one I did earlier, and it can do things like control the weather, the time of day, and other driving conditions by manipulating the latent space of the neural network. Another project, called Pathdreamer, trains a world model on a large dataset of indoor scenes. Its neural simulator allows the user to imagine indoor scenes when given a single image to start it off, letting us navigate around these hallucinated buildings in 360-degree views.
David Ha (17:32): So far, we have discussed world models that use latent space representations. I now want to discuss other types of bottlenecks that one might want to explore. In latent space models, like we just discussed, we typically have a generative model that encodes the observations into a low-dimensional latent vector with some prior distribution, and have the policy search work within this bottleneck. But we can also use other forms of bottlenecks, such as visual attention. In this work led by my colleague, Yujin Tang, we trained an agent to first identify a small set of patches in the observation frame using a self-attention bottleneck, and then decide on the action only from the information obtained from this small set of patches. We found that agents trained with a self-attention bottleneck can not only solve these tasks from pixel inputs with only a few thousand parameters, which allows us to train these non-differentiable policy networks using evolution strategies, but are also better at generalization.
David Ha (18:45): So let me explain. This idea is inspired by the work on selective attention in psychology. In this experiment, viewers are asked to count how many times the players in white pass the ball. Most participants fail to notice the gorilla walking around because their attention is focused on the task at hand. In our work, we enforce this idea of inattentional blindness as a bottleneck in our agents. While the agent receives the full input, we force it to see its world through the lens of a self-attention bottleneck, which picks only 10 patches from the input, as seen here in the middle column. The controller's decision is therefore based only on these patches, as shown in the right column.
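The bottleneck itself is small: the frame is cut into patches, each patch is projected into key and query spaces, and only the patches receiving the most attention are passed on. A simplified numpy sketch of that top-K selection; the actual agent uses overlapping patches, learned projections, and feeds the patch locations rather than their pixels to the controller, so treat the details below as illustrative.

```python
import numpy as np

def attend_top_k(frame, patch=8, k=10, d=4, rng=np.random.default_rng(0)):
    """Split an HxWx3 frame into patches, score them with self-attention,
    and return the indices of the k most-attended patches."""
    H, W, C = frame.shape
    patches = (frame.reshape(H // patch, patch, W // patch, patch, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, patch * patch * C))             # (n_patches, patch_dim)
    Wq = rng.normal(size=(patches.shape[1], d))                  # learned in practice,
    Wk = rng.normal(size=(patches.shape[1], d))                  # random here for the sketch
    scores = (patches @ Wq) @ (patches @ Wk).T / np.sqrt(d)      # patch-to-patch attention
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)                # softmax over each row
    votes = weights.sum(axis=0)                                  # attention each patch receives
    return np.argsort(votes)[-k:]                                # the k most important patches

important = attend_top_k(np.random.rand(96, 96, 3))              # e.g. a 96x96 game frame
```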
David Ha (19:40): Our experiments showed that agents with this inattentional blindness have better generalization ability, perhaps simply due to their inability to see things that can confuse them. The agent is able to drive in brighter or dimmer conditions, even when there's an unexpected sidebar on the screen or a red blob stuck in the observations. In the Doom experiments, it can operate in a room with, for example, a higher wall, a different color floor, or a blob with text in the middle of the screen. All of these variations would cause RL agents trained with conventional methods, even with the world models method discussed earlier, to fail. But we still have a long way to go. There is a lot of work to be done in out-of-domain generalization, and our agent here still fails when we change the background for the car racing task, for example when we change it to a video game scene or pure noise.
David Ha (20:44): We also trained these attention agents on other Atari games and on this slime volleyball game I made, and we can see that they learn to identify the important patches of the screen needed to perform the task. This approach might be useful when we want to get a better understanding of, and interpret, the agent's actions. Lastly, I want to introduce a recent work where we used the concept of shuffling as a type of bottleneck constraint that the agent must overcome. Here, we pose a challenging problem where we want to train an agent to accomplish a task even when the observation space can be arbitrarily shuffled, even many times during an episode or over its lifetime.
David Ha (21:35): For example, on the right, in this car racing game, we can divide the observation frames from the car racing environment into a large grid of patches and shuffle them, and occasionally reshuffle them during the episode, so that the order of the patches keeps changing. The question is, can we get a system to still be able to solve the task even when its observations consist only of these shuffled puzzle pieces? In this recent work, we have been experimenting with ideas from self-organization and self-attention to solve this type of problem. The key idea here is to treat each sensory neuron as an individual recurrent neural network agent and have each neuron process an arbitrary input stream.
David Ha (22:32): Each sensory neuron integrates the information from its individual input stream over time, and can also broadcast a message out into a global latent code using an attention mechanism. The combination of the attention messaging and the temporal information allows each sensory neuron to figure out roughly the context of its input signal, and the attention mechanism allows coordination between the sensory neurons to produce a coherent policy. This work is inspired by some of the classic experiments involving upside-down goggles and also the left-right reversed bicycle, where what you see gets flipped vertically, or in the bicycle case, you have to steer right to go left. The subjects are actually able to learn to remap their sensory inputs meaningfully when things get switched around, although some of these remappings are really hard.
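A rough PyTorch sketch of this permutation-invariant idea: every input channel is fed to the same small RNN (one copy of its state per sensory neuron), each neuron emits a key and a message, and a fixed set of learned queries attends over those messages to form the global latent code. Because the neurons share weights and the attention sums over them, shuffling the inputs leaves the pooled code unchanged. The sizes here are my own illustrative choices, not the published architecture.

```python
import torch
import torch.nn as nn

class SensoryNeuronLayer(nn.Module):
    """Permutation-invariant encoder: a shared RNN per input channel plus
    attention pooling into a fixed-size global latent code."""
    def __init__(self, hidden=8, key_dim=8, msg_dim=16, n_queries=16):
        super().__init__()
        self.hidden = hidden
        self.rnn = nn.LSTMCell(1, hidden)            # shared weights for every sensory neuron
        self.key = nn.Linear(hidden, key_dim)
        self.msg = nn.Linear(hidden, msg_dim)
        self.queries = nn.Parameter(torch.randn(n_queries, key_dim))

    def forward(self, obs, state=None):
        x = obs.view(-1, 1)                          # each scalar input is one neuron's stream
        if state is None:
            state = (torch.zeros(x.shape[0], self.hidden),
                     torch.zeros(x.shape[0], self.hidden))
        h, c = self.rnn(x, state)                    # every neuron updates its own memory
        attn = torch.softmax(self.queries @ self.key(h).T, dim=-1)  # (n_queries, n_inputs)
        code = attn @ self.msg(h)                    # order-invariant pooled message
        return code.flatten(), (h, c)

layer = SensoryNeuronLayer()
obs = torch.randn(28)                                # e.g. the PyBullet Ant's 28 observations
code, state = layer(obs)
code_shuffled, _ = layer(obs[torch.randperm(28)])    # same pooled code from a fresh state
```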
David Ha (23:43): For example, it may take weeks for us to adapt our concept of riding a bicycle and master this left-right reversed bicycle. The difference in this work is that we want agents to rapidly, almost instantaneously, adapt to changing environments; the question is whether we can do better than what we currently do in our biological brains. There is also the idea of sensory substitution, where the pioneer Paul Bach-y-Rita showed that blind people are able to recover a form of visual experience by transmitting a camera's feed as a two-dimensional grid of pokes on the person's back, and later on through an electrode array attached to the person's tongue. In these experiments, the subjects are able to learn to remap visual information received through other sensory organs.
David Ha (24:46): One of the nice quotes from Paul Bach-y-Rita is, "We don't really see with our eyes or taste with our tongue, but everything is done in the brain." For most RL agents, the observation space is rigidly defined, and it's expected that the meaning of each observation will not change during the agent's life. So here we ask whether we can explicitly train an RL agent to adapt to a changing observation space. In a variation of Atari Pong called Puzzle Pong, the agent must still be able to play the game while its input space is constantly reshuffled, as we see here on the right. Self-organization and self-attention allow the agent to treat the observation space as an unordered, variable-length set of inputs.
David Ha (25:45): The system can be trained to play Puzzle Pong to some extent even if it only sees 30% of the available puzzle pieces, and what is remarkable is that when we give it more puzzle pieces later on, performance increases. We can also apply this concept to traditional environments with state observation spaces, rather than pixels. Here, this PyBullet Ant's 28 observations are reshuffled every hundred steps or so, and the RNN sensory neurons quickly adapt to the new observation space. Lastly, we found some benefits of training agents to work with a shuffled observation space, since we usually want to show that what we're doing is actually useful. In the car racing environment, we find that these permutation-invariant agents are able to generalize and perform well on out-of-distribution tasks, like this one where the background has changed, even when the agent has never seen that background before. These agents are trained on a single green grass background, but then we test them on new backgrounds never seen during training.
David Ha (27:10): We think this type of random-order bottleneck, in a sense, forces the agent to learn the essence of the task and the environment, giving it some robustness and generalization properties in addition to being able to work with a shuffled observation space. This concludes my talk today. Before I finish, I would like to introduce my research group in Tokyo. We try to explore AI topics beyond deep learning and look at areas like artificial life, evolution, complex systems, and so on. We also want to explore machine learning for arts and culture. Thank you.