AI Research and Industry Trends
April 13, 2022

The Week in AI: Advances in PaLM, Robot Art, DALL·E 2, and GSLM

Tags: DALL-E, Image Processing, Generative Spoken Language Model (GSLM), The Week in AI

PaLM acquires multimodal capabilities, Ai-Da paints like a human, DALL·E 2 adds image editing skills, and GSLM uses voice data for training.

Greg Coquillo

The Week in AI is a roundup of key AI/ML research and news to keep you informed in the age of high-tech velocity. From Ai-Da artistry to a scaled up PaLM, here are the week's highlights.

Google’s Pathways Language Model (PaLM) Achieves Breakthrough Performance

Last year, Google Research announced Pathways, a vision for a single model that could generalize across domains and tasks with great efficiency. To realize this vision, Google developed a new Pathways system that provides asynchronous distributed dataflow for machine learning tasks. This year, Google researchers published “PaLM: Scaling Language Modeling with Pathways,” which introduces a 540-billion-parameter, dense, decoder-only transformer model trained on the Pathways system.

PaLM is a single model that can train efficiently across multiple TPU v4 Pods. It was evaluated on hundreds of language understanding and generation tasks, achieving state-of-the-art few-shot performance on most of them, in many cases by significant margins. In recent years, large language models (LLMs) such as GPT-3, GLaM, LaMDA, Gopher, and Megatron-Turing NLG achieved state-of-the-art few-shot results on diverse tasks by scaling model size, using sparsely activated modules, and training on larger datasets from more diverse sources.

Despite these efforts, however, the capabilities that emerge from scaling few-shot learning had not yet been fully explored. That opportunity led to the birth of PaLM. Trained on a combination of English and multilingual datasets that included high-quality web documents, books, Wikipedia, conversations, and GitHub code, PaLM is the first large-scale use of the Pathways system. The 6,144 TPU v4 chips it was trained on represent the largest TPU-based system configuration used for training to date.

In addition, data parallelism helped PaLM achieve a hardware FLOPs utilization of 57.8% during training, as well as breakthrough capabilities on language understanding and generation, reasoning, and code-related tasks.
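To put that efficiency figure in context, hardware FLOPs utilization is simply the ratio of the floating-point operations a training run actually executes per second to the theoretical peak of the chips it runs on. The sketch below uses an assumed per-chip peak and a hypothetical sustained throughput chosen only so the ratio lands near the reported range; neither number is taken from the PaLM paper.

```python
# Back-of-the-envelope sketch of hardware FLOPs utilization (HFU):
# FLOPs actually executed per second divided by the hardware's theoretical peak.
# Both the per-chip peak and the sustained throughput below are assumed
# placeholder values, not PaLM's published measurements.

num_chips = 6_144                  # TPU v4 chips used to train PaLM
peak_flops_per_chip = 2.75e14      # assumed per-chip peak throughput, for illustration only
observed_flops_per_sec = 9.77e17   # hypothetical sustained training throughput

peak_flops_per_sec = num_chips * peak_flops_per_chip   # ~1.69e18
hfu = observed_flops_per_sec / peak_flops_per_sec
print(f"Hardware FLOPs utilization: {hfu:.1%}")        # ~57.8% with these assumed numbers
```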

However, the progress seen with LLMs does not come without potential risks. In the recent Google Research blog “Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance,” software engineers Sharan Narang and Aakanksha Chowdhery propose the use of model cards and datasheets to document potential undesirable risks, as well as information on the intended use and testing of LLMs.

According to them, “domain- and task-specific analysis is essential to truly calibrate, contextualize, and mitigate possible harms.” And in addition to understanding the risks and benefits of these models, we will need “scalable solutions that can guard against malicious uses of language models.”

OpenAI’s DALL·E 2 Adds Image Editing Skills

In January 2021, OpenAI introduced DALL·E, a text-to-image generation program. One year later, it introduced DALL·E 2, which generates more realistic and accurate images with four times the resolution of its predecessor, at lower latency. DALL·E 2 can also edit existing images through a feature called “inpainting.”

With inpainting, users can start with an existing picture, select an area, and ask the model to edit it. For example, upon request, the model can replace a painting on a living room wall with a different picture. It can fill or remove objects while accounting for the directions of shadows in a room. 
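Conceptually, inpainting regenerates only the pixels inside a user-selected mask and leaves everything outside it untouched, while conditioning the new content on the text prompt and the surrounding scene. The toy sketch below illustrates just the masking arithmetic; the “generator” is a random-noise placeholder, not OpenAI’s model or API.

```python
import numpy as np

# Toy sketch of the inpainting idea: pixels inside the user-selected mask are
# regenerated, everything outside the mask is copied from the original image.
# In a real system the regeneration is conditioned on the text prompt and the
# surrounding scene (shadows, lighting, perspective); here it is random noise.
rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(64, 64, 3))   # stand-in for the original photo
mask = np.zeros((64, 64, 1))
mask[16:48, 16:48] = 1.0                          # region the user asked to edit

def placeholder_generator(shape):
    """Stand-in for the text-conditioned generative model."""
    return rng.uniform(0.0, 1.0, size=shape)

edited = mask * placeholder_generator(image.shape) + (1.0 - mask) * image
assert np.allclose(edited[:16], image[:16])       # pixels outside the mask are unchanged
```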

As with previous OpenAI resources, the tool hasn’t been released to the public. However, developers can sign up to preview the system while they wait for it to become available to third-party apps. DALL·E 2 builds on CLIP, a computer vision system that OpenAI created last year.

CLIP is able to look at images and summarize their content the way a human would. However, OpenAI iterated on this process to create “unCLIP,” an inverted version that allows DALL·E 2 to start with a text description to work its way toward an image. 

In other words, unCLIP helps DALL·E 2 learn the relationship between images and the text used to describe them. According to OpenAI, DALL·E 2 “uses a process called ‘diffusion,’ which starts with a pattern of random dots and gradually alters that pattern towards an image when it recognizes specific aspects of that image.” 
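The loop below is a minimal sketch of that idea: start from random noise and repeatedly nudge it toward the image a denoiser predicts. The “denoiser” here is a trivial placeholder that already knows the target; in DALL·E 2 it is a learned network conditioned on an embedding of the text prompt.

```python
import numpy as np

# Minimal sketch of diffusion-style generation: begin with a pattern of random
# dots and gradually alter it toward the image the (placeholder) denoiser predicts.
rng = np.random.default_rng(0)

target = np.zeros((8, 8))                 # stand-in for "the image the model recognizes"
target[2:6, 2:6] = 1.0

x = rng.normal(size=(8, 8))               # step 0: pure random noise
for step in range(50):
    predicted_clean = target              # placeholder for the learned denoising prediction
    x = x + 0.1 * (predicted_clean - x)   # take a small step toward the prediction

print(np.round(x, 2))                     # after many steps, x closely approximates the target
```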

In addition, unCLIP is resistant to typographic attacks, a trick people used to fool CLIP’s identification capabilities by purposefully mislabeling objects with handwritten notes. 

Although DALL·E 2 may seem to be one of a kind, other developers have created tools with similar functionality over the past year. Wombo’s Dream mobile app, which generates pictures of anything users describe in a variety of art styles, gained popularity as a mainstream application for synthetic media.

As it prepares to add DALL·E 2 to the list of APIs available as paid services, OpenAI is collaborating with vetted partners that will help mitigate safety risks by 1) preventing harmful generations of violent, hateful, or adult images; 2) curbing misuse through the filtering of violent or political content; and 3) using a phased approach to deployment based on learning from a limited number of external experts.

Ai-Da Paints Like an Artist

In a small room at London’s British Library, Ai-Da (referred to as she/her) has become the world’s first robot painter. AI algorithms prompt Ai-Da, named after the English mathematician and writer Ada Lovelace, to interrogate, select, and make decisions, ultimately creating a painting over an average of five hours while avoiding duplicate work.

According to an interview in The Guardian, Ai-Da creator Aidan Meller raises the question of whether humans really want robots to make art. After all, “painter” is a title that’s been occupied by humans for centuries.

Ai-Da was devised in Oxford more than two years ago by Meller and a team of programmers, roboticists, art experts, and psychologists, who update the robot as AI technology improves. Ai-Da’s recent demonstrations centered on “her” sketching capabilities and poem creation. The world premiere of Ai-Da’s solo exhibition will be at the 2022 Venice Biennale, which opens to the public on April 22.

Titled “Leaping into the Metaverse,” the exhibition will explore the interface between human experience and AI technology, from Alan Turing to the metaverse. It will draw on Dante’s concepts of purgatory and hell to explore the future of humanity in a world where AI technology continues to encroach on everyday human life. 

During the interview with The Guardian’s Caroline Davies, Meller said that AI algorithms “are going to know you better than you do” because of the amount of data we freely give about ourselves as we talk to our phones, computers, cars, and even kitchen appliances. Meller believes that we are entering a world where it will be difficult to distinguish between a human and a machine, and that humans should evaluate whether they are comfortable with this eventual reality.

Meta’s Generative Spoken Language Model (GSLM) Advances Expressive AI Speech

Meta Platforms’ AI research team unveiled a more realistic AI-generated speech system that uses “textless natural language processing” to model expressive vocalizations, such as laughter, yawning, and cries, in addition to spontaneous chit-chat in real time. Meta’s Generative Spoken Language Models (GSLMs), which are breakthrough natural language processing models, make it possible to build speech recognition systems without the use of transcribed audio data for training purposes. 

According to Meta’s blog, traditional AI systems learn only from written text and are limited in their ability to capture rich, expressive, nonverbal signals in speech. These nonverbal signals can include intonations, emotional expressions, pauses, accents, and rhythms, and all of them play a significant role in human communication. 

GSLMs not only capture what people say, but also how they say it, modeling the full expressive nature of oral language. This increases the models’ performance due to the added context. 

Meta intends to use this powerful capability for downstream applications that don’t rely on resource-intensive text labels, or as a generative tool for creating language from an audio prompt. By modeling expressive vocalizations, AI systems can now convey nuances of communicative intent and sentiment, such as boredom, irony, or irritation.
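A conceptual sketch of that textless pipeline: speech is mapped to discrete units, a language model operates on those units instead of text, and a vocoder turns generated units back into audio. Every name and function body below is an illustrative placeholder, not Meta’s actual GSLM code.

```python
from typing import List

# Conceptual GSLM-style pipeline with toy placeholders: in the real system the
# encoder is a self-supervised speech model whose features are quantized into
# discrete units, the unit LM is a transformer trained on those units, and the
# vocoder is a neural network that synthesizes a waveform from units.

def speech_to_units(waveform: List[float]) -> List[int]:
    """Placeholder acoustic encoder + quantizer: audio samples -> discrete unit IDs."""
    return [int(abs(sample) * 100) % 100 for sample in waveform]

def unit_language_model(prompt_units: List[int], n_new: int) -> List[int]:
    """Placeholder language model over units: continues a spoken prompt, no text involved."""
    return prompt_units + [(u + 7) % 100 for u in prompt_units[:n_new]]

def units_to_speech(units: List[int]) -> List[float]:
    """Placeholder vocoder: discrete units -> waveform samples."""
    return [u / 100.0 for u in units]

prompt_audio = [0.12, 0.33, -0.21, 0.48]             # stand-in for an audio prompt
units = speech_to_units(prompt_audio)
continued_units = unit_language_model(units, n_new=4)
generated_audio = units_to_speech(continued_units)   # generated speech, text-free end to end
```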

A potential use case for Meta’s new AI speech system is speech-to-speech translation of movies. The current process requires the audio to be transcribed into text and then translated before being converted back into audio. This extremely complicated method removes expressivity and mistranslates idiomatic expressions. 

Meta’s GSLMs remove the need for text-based translation, generating far more realistic audio translations with reduced latency.

As an added benefit, the advancement of textless NLP can help make AI more inclusive, because the models can train on oral speech from hundreds of languages that lack standardized writing systems, such as Swiss German and dialectal Arabic. 

Why These Stories Matter

LLMs such as Google’s PaLM and Meta’s GSLM represent a new generation of technology that enables AI-based applications to deliver humanlike interactions to users around the world.

DALL·E 2 and Ai-Da offer interesting perspectives on how creative we can be when assisted by an intelligent system. But we must remain cautious about how much we share with these systems, which have become better than we are at learning our own behaviors at scale.

After all, despite the exciting advancements in AI, developers, policy makers, analysts, advocates, and users must continue to collaborate to build inclusive systems, as well as to create the guardrails that prevent bad actors from harming society. 

Until next time, stay informed and get involved!
