The Week in AI is a roundup of key AI/ML research and news to keep you informed in the age of high-tech velocity. From a single generalist agent multitasker to AI detecting heart disease via wearable device data, here are this week’s highlights.
Researchers at DeepMind trained Gato to complete 604 tasks with a single set of model weights, including engaging in dialogue, captioning images, playing Atari games, and stacking blocks with a robot arm.
Artificial general intelligence (AGI) is considered by some in the industry to be the ultimate AI achievement. AGI promises to possess the ability to learn, plan, reason, represent knowledge, and communicate in natural language.
With Gato, DeepMind took a step toward this goal. As with other AI models, Gato learns by example, ingesting billions of words and images from real-world and simulated environments, including button presses, joint torques, and more in the form of tokens. These tokens represent data in a way that Gato can understand, giving it the ability, for example, to figure out which combination of words in a sentence might make grammatical sense.
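The serialization idea above can be sketched in a few lines. This is a toy illustration only, not DeepMind's actual tokenizer (which uses subword encoding for text and finer-grained discretization for continuous values); the vocabulary and bin count below are invented for the example.

```python
# Toy sketch: flatten mixed modalities into one integer token stream.
NUM_BINS = 16          # hypothetical bucket count for continuous data
TEXT_VOCAB = {"stack": 0, "the": 1, "red": 2, "block": 3}
CONT_OFFSET = len(TEXT_VOCAB)  # continuous tokens come after text tokens

def tokenize_text(words):
    """Map each word to an integer id from the toy vocabulary."""
    return [TEXT_VOCAB[w] for w in words]

def tokenize_continuous(values, low=-1.0, high=1.0):
    """Discretize values in [low, high] into NUM_BINS buckets."""
    tokens = []
    for v in values:
        frac = (min(max(v, low), high) - low) / (high - low)
        bucket = min(int(frac * NUM_BINS), NUM_BINS - 1)
        tokens.append(CONT_OFFSET + bucket)
    return tokens

# One "episode": a text instruction followed by joint-torque readings,
# flattened into a single sequence the model reads left to right.
sequence = tokenize_text(["stack", "the", "red", "block"]) \
         + tokenize_continuous([0.12, -0.53, 0.97])
print(sequence)  # [0, 1, 2, 3, 12, 7, 19]
```

Once everything is a token, button presses, torques, and words can share one transformer, which is the core of Gato's generalist design.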
Gato doesn’t necessarily perform these tasks better than specialized models do. When chatting with a person, for example, the system may give an incorrect answer, such as “Marseille” in response to “What is the capital of France?” When captioning pictures, Gato sometimes mislabels people’s genders. And when stacking blocks with a real-world robot arm, it succeeds only 60% of the time.
However, DeepMind claims that on more than 450 of the 604 tasks, Gato reaches over half of the expert score. From an architectural standpoint, Gato shares characteristics with OpenAI’s GPT-3 in that it’s a transformer model.
In parameter count, however, it is orders of magnitude smaller than single-domain systems such as GPT-3: Gato has just 1.2 billion parameters, while GPT-3 has 175 billion. The researchers say this small size is what allows Gato to control a robot arm in real time.
Lastly, Gato has another limitation: the amount of information the system can remember in the context of a given task. Like other transformers, it has a limited context window; when writing a lengthy essay or book, key details eventually fall outside that window, and the model loses track of the plot.
This forgetting affects any task, whether writing or controlling a robot arm, and is a weakness some experts have called the Achilles’ heel of machine learning. It raises questions about whether Gato is truly a path to general-purpose AI after all.
A research team from Rikkyo University and AnyTech Co. published a new paper about Sequencer, an architectural alternative to Vision Transformers (ViTs) that uses traditional long short-term memory (LSTM) rather than self-attention layers.
Sequencer reduces memory costs by mixing spatial information with memory-economical and parameter-saving LSTM, while achieving ViT-competitive performance on long sequence modeling, the researchers said.
In the less than two years since their introduction, ViTs have revolutionized the computer vision field. They leverage transformer architectures' powerful self-attention mechanism to eliminate the need for convolutions and reach state-of-the-art image classification performance levels.
However, recent approaches such as MLP-Mixer and carefully redesigned convolutional neural networks have achieved performance comparable to ViTs, and researchers continuously aim to discover the optimal architectural design for computer vision tasks.
Sequencer’s architecture consists of bidirectional LSTMs inspired by the Vision Permutator: it processes the vertical (top/bottom) and horizontal (left/right) directions in parallel, which shortens the sequences each LSTM must handle and improves both accuracy and efficiency. The Sequencer block has two subcomponents: 1) a BiLSTM layer that mixes spatial information globally and memory-economically, and 2) a multilayer perceptron (MLP) for channel mixing. As in existing architectures, the output of the last block is sent through a global average pooling layer to a linear classifier.
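The data flow through one Sequencer block can be sketched at the shape level. In this stdlib-only illustration, a simple bidirectional running average stands in for the trained BiLSTM cells and the channel MLP is omitted; only the routing of information (vertical and horizontal scans over short sequences, concatenated per position) reflects the paper's design.

```python
# Shape-level sketch of one Sequencer block's data flow (placeholders
# stand in for the trained BiLSTM and channel MLP).

def scan(seq, alpha=0.5):
    """Simple recurrent mixer over a 1-D sequence of channel vectors."""
    out, state = [], [0.0] * len(seq[0])
    for vec in seq:
        state = [alpha * s + (1 - alpha) * v for s, v in zip(state, vec)]
        out.append(state)
    return out

def bidirectional(seq):
    """Concatenate forward and backward scans channel-wise (BiLSTM stand-in)."""
    fwd = scan(seq)
    bwd = list(reversed(scan(list(reversed(seq)))))
    return [f + b for f, b in zip(fwd, bwd)]

def sequencer_block(x):
    """x: H x W x C feature map -> H x W x 4C after spatial mixing.

    The vertical scan runs over columns and the horizontal scan over
    rows, so each direction sees sequences of length H or W rather than
    H*W -- the shorter sequences behind Sequencer's efficiency claim.
    A channel-mixing MLP (omitted here) would follow.
    """
    H, W = len(x), len(x[0])
    cols = [bidirectional([x[i][j] for i in range(H)]) for j in range(W)]
    rows = [bidirectional(x[i]) for i in range(H)]
    # Concatenate vertical and horizontal outputs per position: 2C + 2C = 4C.
    return [[cols[j][i] + rows[i][j] for j in range(W)] for i in range(H)]

feature_map = [[[1.0, 2.0] for _ in range(3)] for _ in range(3)]  # 3x3, C=2
out = sequencer_block(feature_map)
print(len(out), len(out[0]), len(out[0][0]))  # 3 3 8
```

Splitting the image into row and column scans is what keeps the recurrence tractable: an LSTM over the full flattened image would face a sequence of H*W steps instead of H or W.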
When compared to CNNs, ViTs, and MLP models on the ImageNet-1K benchmark dataset, Sequencer achieved an impressive 84.6% top-1 accuracy in evaluations, surpassing ConvNeXt-S and Swin-S by 0.3% and 0.2%, respectively.
The researchers hope to improve understanding of the role of inductive biases in computer vision with this new paper, and hope to inspire the discovery of optimal architectural designs for image classification tasks.
At its recent I/O conference, Google unveiled a preview of its cloud business’s latest—and the world’s largest—publicly available machine learning hub. It’s located in the company’s Oklahoma data center. The cluster can not only reach 9 exaflops at peak performance, but does so while using 90% carbon-free energy.
The tensor processing unit (TPU) V4 Pod is at the heart of the new cluster, allowing researchers to use the framework of their choice, whether TensorFlow, JAX, or PyTorch. The V4 TPU has already enabled breakthroughs at Google Research in areas including language understanding, computer vision, and speech recognition.
Other potential customer data workloads are expected to be in the fields of natural language processing, computer vision algorithms, and recommendation systems. Customers can access clusters in slices ranging from four chips (one TPU virtual machine) to 1,000, which can be consumed via an on-demand, preemptible, or committed use discount price model.
A slice with at least 64 chips will use three-dimensional torus links, resulting in higher bandwidth for collective communication operations.
The V4 chip is equipped with 32 gigabytes of memory, twice that of Google’s previous generation, and doubles acceleration speed when training large-scale models. In line with Google’s sustainability goal of running the entire business on carbon-free energy by 2030, the V4 TPU is more energy efficient than its predecessor, delivering three times the flops per watt of the V3 chip.
AI researchers from the Mayo Clinic in Rochester, Minnesota, used an AI algorithm to spot people whose weakening heart may be having difficulties pumping blood out to the rest of the body. They did so by analyzing electrocardiogram (ECG) data taken from an Apple Watch.
The condition, known as low ejection fraction, is named for the low percentage of blood pushed out of the heart with each beat.
Affecting 2% to 3% of people globally, low ejection fraction is linked to worsening heart failure and can go either undetected or be associated with shortness of breath or blood pooling in the legs. Previously, researchers demonstrated that AI could detect the condition using a hospital-based ECG with 12 leads and multiple electrodes wired to the chest.
The fine-tuned model now reduces the number of required leads to one. It also eliminates the need for expensive, sophisticated image testing such as the echocardiogram, which is currently used to detect the condition. Researchers collected more than 125K ECG logs over six months from participants in 11 countries who signed up for the study by email.
The tested model demonstrated an area under the curve (AUC) of 0.88, a measure of prediction accuracy comparable to that of a treadmill-based cardiac stress test. A result like this could spare patients a trip to the clinic for advanced diagnostics. The researchers presented the study's findings at the annual conference of the Heart Rhythm Society.
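AUC has a concrete interpretation: it is the probability that a randomly chosen positive case (here, a patient with low ejection fraction) receives a higher risk score than a randomly chosen negative case. The sketch below computes it from invented scores and labels, purely to show the measure, not study data.

```python
# Toy illustration of the area under the ROC curve (AUC).
def auc(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic (ties count as 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0, 0, 1]      # 1 = condition present (invented)
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]  # model risk scores
print(auc(labels, scores))  # 0.9375
```

An AUC of 1.0 would mean the model ranks every positive case above every negative one, while 0.5 is chance; the study's single-lead model landed at 0.88.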
As a single generalist agent inspired by the progress of large-scale language modeling, Gato points beyond text outputs toward more general, multimodal tasks. Because one set of weights serves every task, data quality remains a central lever for improving its performance.
Moreover, LSTM’s challenge to ViTs on long-sequence modeling shows that researchers refuse to settle for the status quo in the ongoing search for the optimal computer vision architecture. And Google’s new ML hub will ensure that these AI workloads run with efficiency and sustainability in mind.
Lastly, pairing AI with wearable devices such as the Apple Watch to detect heart disease is a compelling case for computing at the edge that could save lives.
Until next time, stay informed and get involved!