Build Better Reinforcement Learning with World Models and Self-Attention
AI agents learn more quickly when given a world model to work from. David Ha explains how.
Adding world models and self-attention to reinforcement learning can produce robust systems that require very few parameters, said David Ha, a research scientist at Google Brain, during a presentation at the TransformX conference. This approach to artificial intelligence partially mimics the way human consciousness works.
Our mental models, made up of our unique experiences, determine how we interpret the world. —David Ha
Ha has experimented with building generative models of video game environments and with training agents that perceive those environments through their world models.
Why Use World Models?
Pure reinforcement learning needs an assist, Ha said, to help AI agents learn more quickly. Ha quoted deep learning pioneer Yann LeCun, who said, “If you use pure reinforcement learning to train an agent to drive a car, it’s going to have to crash into a tree 40,000 times before it figures out it’s a bad idea. Instead, they need to learn their own models of the world so they can simulate the world faster than real time.”
Ha described how his team used world models to achieve instinct-level behavior in AI agents: the agents collected sequences of observations, and the team trained a variational autoencoder on them to develop a representation of space. The team then trained a recurrent neural network on the same dataset to predict the future, amounting to a representation of time. These two representations, space and time, were then fed into a neural network controller, which based its decisions on them using fewer than 1,000 parameters.
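A minimal sketch in Python shows how those pieces could fit together. The sizes here (a 32-dimensional latent code, a 256-unit hidden state, three actions) and the random stand-ins for the trained VAE and RNN are assumptions for illustration, not Ha's actual code, but they make clear why the controller stays under 1,000 parameters:

```python
import numpy as np

# Assumed sizes for illustration: 32-dim latent code from the VAE ("space"),
# 256-dim hidden state from the RNN ("time"), 3 continuous actions.
Z_DIM, H_DIM, ACTION_DIM = 32, 256, 3

def controller(z, h, W, b):
    """Tiny linear controller: action = tanh(W @ [z, h] + b)."""
    return np.tanh(W @ np.concatenate([z, h]) + b)

W = np.random.randn(ACTION_DIM, Z_DIM + H_DIM) * 0.1
b = np.zeros(ACTION_DIM)
# (32 + 256) * 3 + 3 = 867 parameters -- under 1,000, as Ha notes.
print("controller parameters:", W.size + b.size)

# Random stand-ins for the outputs of the trained world model:
z = np.random.randn(Z_DIM)   # latent "space" representation
h = np.random.randn(H_DIM)   # recurrent "time" representation
print("action:", controller(z, h, W, b))
```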
“Because the parameter count is so low, we can use evolution strategies or even random search to train the controller, while using back-propagation and graphics processing units to train the world model from the data collected,” Ha said. “For most environments, that requires some form of exploration. We need the agent to continually update its world model with new data collected whenever it interacts with the real environment. Lastly, the agent updates the policy by learning inside of the world model.”
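A basic evolution-strategies loop of the kind Ha alludes to might look like the following sketch. The fitness function here is a toy stand-in for cumulative reward from environment rollouts, and the population size, noise scale, and learning rate are illustrative assumptions:

```python
import numpy as np

def evolve(fitness_fn, n_params, pop_size=64, sigma=0.1, lr=0.03, iters=300):
    """Simple evolution-strategies loop (a sketch, not Ha's training code).

    Perturb the controller parameters with Gaussian noise, score each
    perturbation, and step the parameters along the reward-weighted noise.
    """
    theta = np.zeros(n_params)
    for _ in range(iters):
        noise = np.random.randn(pop_size, n_params)
        rewards = np.array([fitness_fn(theta + sigma * eps) for eps in noise])
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        theta += lr / (pop_size * sigma) * (noise.T @ rewards)
    return theta

# Toy fitness standing in for cumulative reward from rollouts of the
# 867-parameter controller in the (real or learned) environment.
target = np.ones(867)
best = evolve(lambda p: -np.sum((p - target) ** 2), n_params=867)
print("distance to optimum:", np.linalg.norm(best - target))
```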
In the end, Ha said, his team found they could then train agents entirely inside the latent space environment imagined by its world model.
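Conceptually, a rollout in that imagined environment only ever queries the learned model. The sketch below assumes a hypothetical `world_model(z, a, h)` interface that returns the next latent state, a predicted reward, a done flag, and the next hidden state; the dummy stand-ins simply make it runnable:

```python
import numpy as np

Z_DIM, H_DIM = 32, 256  # assumed sizes, as before

def dream_rollout(controller, world_model, z0, h0, steps=100):
    """Roll out a policy entirely inside the learned latent-space model.

    No real environment is queried: the world model itself predicts each
    next latent state and reward.
    """
    z, h, total_reward = z0, h0, 0.0
    for _ in range(steps):
        a = controller(z, h)
        z, r, done, h = world_model(z, a, h)
        total_reward += r
        if done:
            break
    return total_reward

# Dummy stand-ins so the sketch runs end to end; in practice these would be
# the evolved controller and the trained world model.
dummy_controller = lambda z, h: np.tanh(np.random.randn(3))
def dummy_world_model(z, a, h):
    return np.random.randn(Z_DIM), float(np.random.rand()), False, np.random.randn(H_DIM)

print("imagined cumulative reward:", dream_rollout(
    dummy_controller, dummy_world_model, np.zeros(Z_DIM), np.zeros(H_DIM)))
```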
Ha used world models of Atari games as environments in his research, and he said it was difficult to tell the difference between the actual game environments and those generated by the neural networks. “While there are differences, both environments tend to be internally self-consistent,” he said. Controllers could solve problems solely from pixel observations and learn policies with better data efficiency than previous methods that rely on state observations, he said.
Adding Self-Attention
Ha said that human consciousness enhances the computational economy of the brain by selecting, amplifying, and propagating relevant information. Following that example, his team trained agents to decide on an action only after attending to a small set of pixel patches in the observation frame. Agents trained with such self-attention bottlenecks could perform well with only a few thousand parameters, he said.
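One way to picture such a bottleneck: score image patches against one another with a small query/key projection and keep only the most attended-to patches. The sketch below does this with random projections; the patch size, projection width, and voting rule are assumptions for illustration, not the team's exact method:

```python
import numpy as np

def top_k_patches(frame, patch=7, k=10, d=4, seed=0):
    """Self-attention bottleneck sketch: score image patches against each
    other and keep only the k most "voted-for" patches; the agent would then
    act only on the locations of the selected patches."""
    rng = np.random.default_rng(seed)
    H, W = frame.shape
    # Cut the frame into non-overlapping patches and flatten each one.
    patches = np.array([
        frame[i:i + patch, j:j + patch].ravel()
        for i in range(0, H - patch + 1, patch)
        for j in range(0, W - patch + 1, patch)
    ])
    # Tiny query/key projections (random here; learned in practice).
    Wq = rng.normal(size=(patch * patch, d))
    Wk = rng.normal(size=(patch * patch, d))
    scores = (patches @ Wq) @ (patches @ Wk).T / np.sqrt(d)
    # Softmax over keys, then sum each column: how much attention a patch receives.
    att = np.exp(scores - scores.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)
    votes = att.sum(axis=0)
    return np.argsort(votes)[-k:]        # indices of the k most attended patches

frame = np.random.rand(84, 84)           # stand-in for a game frame
print("selected patch indices:", top_k_patches(frame))
```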
Meanwhile, intentionally limiting the agent’s vision to just 10 patches of pixels produced better generalization than in the earlier research, apparently because the agent couldn’t see things that might confuse it. When playing the Doom video game, for instance, the agent was not confused by changed walls or the addition of a text balloon, Ha said.
All of these variations would cause traditional reinforcement learning agents trained with conventional methods—or even with the world-models methods discussed earlier—to fail. —David Ha
Handling Shuffled Input
By using self-organization and self-attention, Ha said, his team has also been working on training agents to accomplish a task even when the observation space is subject to ongoing, arbitrary shuffling.
The idea is to treat each sensory neuron as an individual, recurrent neural network agent and make each neuron process an arbitrary input stream. With each sensory neuron figuring out roughly the context of its input signal, the attention mechanism allows coordination between each sensory neuron [in order] to produce a coherent policy. —David Ha
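The permutation-invariance idea can be illustrated with a much-simplified sketch in which every observation channel emits a key and a value, and a shared query pools them with attention. Unlike Ha's recurrent sensory neurons, this toy version is feed-forward and its weights are random, but it shows why shuffling the inputs leaves the pooled message essentially unchanged:

```python
import numpy as np

def permutation_invariant_encode(obs, Wk, Wv, q):
    """Each observation channel produces its own key and value; a shared
    query attends over all channels, so the attention-weighted sum does not
    depend on the order of the inputs."""
    keys = np.tanh(np.outer(obs, Wk))      # one key per input channel
    values = np.tanh(np.outer(obs, Wv))    # one value per input channel
    att = np.exp(keys @ q)
    att /= att.sum()
    return att @ values                    # attention-weighted pooling

rng = np.random.default_rng(0)
Wk, Wv, q = rng.normal(size=8), rng.normal(size=16), rng.normal(size=8)
obs = rng.normal(size=32)
shuffled = rng.permutation(obs)
a = permutation_invariant_encode(obs, Wk, Wv, q)
b = permutation_invariant_encode(shuffled, Wk, Wv, q)
print("max difference after shuffling inputs:", np.abs(a - b).max())
```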
The system was able to train on Atari's Puzzle Pong even when seeing only 30% of the puzzle pieces, and it could successfully play other games even with backgrounds it hadn’t trained on, he said. “We think this type of random-order bottleneck is in a sense forcing the agent to learn the essence of the task and the environment, giving it some robustness and generalization properties.”
Ha’s Tokyo-based team is expanding on this research, he said, and is exploring topics including complex systems and machine learning in arts and culture.
Learn More
For more details about how Google Brain is using world models and self-attention methods in its research, watch Ha’s October talk, “Building Better Reinforcement Learning with World Models and Self-Attention Methods.”
Image courtesy of loop_oh/Flickr.