The Week in AI is a roundup of high-impact AI/ML research and news to keep you up to date in the fast-moving world of enterprise machine learning. From a trained neural network that hallucinates images in order to translate languages accurately to an AI system that uses traffic lights to reduce car congestion, here are this week’s highlights.
Researchers from MIT, IBM, and the University of California, San Diego, recently launched a new ML model called Valhalla: a trained neural network that reads a source sentence in one language, "hallucinates," or visualizes, an image of what the sentence describes, and then uses both to translate into a target language.
The team, whose research was presented at the IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), discovered that pairing hallucinated images with text during inference improved the accuracy of machine translation compared to current state-of-the-art techniques, which use text-only data. It also provided an additional boost for use cases with long sentences, under-resourced languages, and instances where part of the source sentence is inaccessible to the machine translator.
The researchers designed Valhalla (the name derives from "visual hallucination") using an encoder-decoder architecture with two transformers, a type of neural network suited to sequence-dependent data such as language because it can attend to the key words in, and the semantics of, a sentence. One transformer generates a visual hallucination, and the other performs multimodal translation using the outputs of the first.
During training, the researchers used two streams of translation: a source sentence paired with a ground-truth image, and the same source sentence paired with a hallucinated image to form a text-image pair. For testing, the team pitted Valhalla against other state-of-the-art multimodal and text-only methods across 13 tasks, spanning well-resourced English translations (such as English to French), under-resourced English translations (such as English to Romanian), and non-English translations (such as Spanish to German).
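The published training objective is more involved, but the two-stream idea can be illustrated with a toy NumPy sketch. Everything below (the embedding size, the random-projection "transformers," and the consistency loss) is an illustrative assumption, not the actual Valhalla architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED = 8  # toy embedding size (assumption)

def hallucination_transformer(src_embedding):
    """Stand-in for the transformer that predicts a visual embedding
    from the source-sentence embedding (a random projection here)."""
    W = rng.standard_normal((EMBED, EMBED)) * 0.1
    return src_embedding @ W

def translation_transformer(src_embedding, visual_embedding):
    """Stand-in for the multimodal translator: it consumes text plus a
    real or hallucinated visual embedding."""
    return np.tanh(src_embedding + visual_embedding)

def training_step(src_embedding, ground_truth_image_embedding):
    # Stream 1: source sentence paired with the ground-truth image.
    out_real = translation_transformer(src_embedding,
                                       ground_truth_image_embedding)
    # Stream 2: same sentence paired with a hallucinated image embedding.
    hallucinated = hallucination_transformer(src_embedding)
    out_hall = translation_transformer(src_embedding, hallucinated)
    # A consistency term pushes the hallucinated stream toward the real
    # one, so no ground-truth image is needed at inference time.
    return float(np.mean((out_real - out_hall) ** 2))

src = rng.standard_normal(EMBED)   # toy sentence embedding
img = rng.standard_normal(EMBED)   # toy ground-truth image embedding
loss = training_step(src, img)
```

The key design point the sketch captures is that only the text branch is required once training ends: the hallucination transformer supplies the visual context that the ground-truth image provided during training.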
The team also found that Valhalla's advantage over other methods grew as sentences became longer, thanks to its ability to handle more ambiguous words. Despite this strong performance, the model has limitations, especially the need to annotate pairs of sentences with an image for training, an expensive and time-consuming process that frequently requires a human in the loop.
Valhalla is also considered a black box because the usefulness of hallucinated images rests on assumptions made by the researchers. To validate those assumptions, the team plans to investigate what and how the model is learning. To further advance machine translation, the team will also explore other types of multimodal information, such as speech, video, and touch, which could benefit low-resource languages spoken around the world.
Scientists from the University of Pennsylvania have developed an on-chip photonic deep neural network that can classify images in less than 570 picoseconds, comparable to a single clock cycle in state-of-the-art microchips. According to the study, published in the journal Nature, the microchip can classify 2 billion images per second.
As a point of reference, traditional video frame rates range between 24 and 120 frames per second. This means if you speed up a two-hour, high-quality movie to last only 1 second, the microchip could still classify all 864K frames.
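The arithmetic behind that movie example can be checked in a few lines of Python:

```python
# A two-hour film at the top of the traditional frame-rate range (120 fps).
frames = 2 * 60 * 60 * 120           # 864,000 frames in the movie
chip_rate = 2_000_000_000            # images the chip classifies per second

# Even compressed into one second of viewing, classifying every frame
# takes the chip well under a millisecond.
time_needed = frames / chip_rate     # ~0.00043 seconds
```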
Deep neural networks are designed to mimic the human brain to power computer vision and speech-recognition systems, among others. However, they are limited by the hardware used to implement them and so face several challenges. First, they are usually deployed on digital-clock-based platforms such as graphics processing units (GPUs), which limits their computation speed to the clock frequency, below 3 GHz for most state-of-the-art GPUs. Second, unlike biological neurons, which can both compute and store data, conventional electronics separate memory and processing units.
Running data back and forth between these components wastes both time and energy. By contrast, the photonic device, at just 9.3 square millimeters in size, could analyze images without the need for a separate processor and memory unit.
Scientists tested the microchip's accuracy by having it identify handwritten letters. In the first set of tests, the 9-neuron device classified 216 letters as either "p" or "d"; in the second set, it classified 432 letters as "p," "d," "a," or "t." This yielded 93.8% and 89.8% accuracy, respectively. In comparison, a 190-neuron, traditional deep neural network implemented in Python using the Keras library achieved 96% accuracy on the same images.
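The report describes the Keras baseline only by its neuron count, but for intuition, a one-hidden-layer dense network of that size can be sketched in plain NumPy. The 30-pixel input size, the letter encoding, and the random (untrained) weights below are assumptions for illustration, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical shapes: binarized letter images of 30 pixels,
# 190 hidden neurons, four classes ("p", "d", "a", "t").
N_PIXELS, N_HIDDEN, N_CLASSES = 30, 190, 4

W1 = rng.standard_normal((N_PIXELS, N_HIDDEN)) * 0.1
b1 = np.zeros(N_HIDDEN)
W2 = rng.standard_normal((N_HIDDEN, N_CLASSES)) * 0.1
b2 = np.zeros(N_CLASSES)

def classify(images):
    """Forward pass: dense layer + ReLU, dense layer + softmax."""
    hidden = np.maximum(images @ W1 + b1, 0.0)
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

# A batch matching the second test set's size: 432 letter images.
batch = rng.integers(0, 2, size=(432, N_PIXELS)).astype(float)
probs = classify(batch)
```

A trained version of this network would learn `W1`, `b1`, `W2`, and `b2` from labeled letter images; only then would accuracy figures like the quoted 96% be meaningful.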
Next, researchers will attempt to classify videos and 3D objects with the photonic device and will use larger chips with more pixels and neurons to classify higher-resolution images. They also plan to convert signals such as audio and speech to the optical domain so they can be classified almost instantaneously by the photonic device.
Researchers at Aston University, United Kingdom, revealed a first-of-its-kind AI system that can scan live video footage of street traffic and adjust the lights to compensate, keeping traffic moving and reducing congestion. The system leverages deep reinforcement learning, a method in which the software tries a new approach when it recognizes it is not doing well and keeps refining its current approach when it is making progress.
Researchers describe this reinforcement learning environment as a traffic control game, or reward system, in which the program earns a "reward" for getting a car through a junction. Every time a car has to wait or a jam forms, the program receives a penalty instead.
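A minimal sketch of such a reward function makes the "traffic control game" concrete. The coefficients below are illustrative assumptions, not the researchers' actual values:

```python
def traffic_reward(cars_through: int, cars_waiting: int, jam: bool) -> float:
    """Toy reward in the spirit of the described traffic control game:
    positive for each car cleared through the junction, negative for
    each car left waiting, and a larger penalty when the junction jams.
    All coefficients are hypothetical."""
    reward = 1.0 * cars_through      # reward cleared cars
    reward -= 0.25 * cars_waiting    # penalize queued cars
    if jam:
        reward -= 5.0                # extra penalty for a jam
    return reward
```

A reinforcement learning agent would adjust signal timings to maximize the cumulative value of such a reward over time; the emergency-vehicle feature mentioned below would amount to adding a bonus term for clearing those vehicles quickly.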
Inadequate traffic signal timing, which is often the result of manually designed phase transitions, is a major cause of congestion. Also, current forms of traffic light automation used at junctions depend on magnetic induction wires that sit on the road and register cars passing over. A program counts the number of cars, then reacts to the data by changing the traffic lights among green, yellow, and red.
By comparison, the AI system "sees" high traffic volume before the cars reach the lights and adjusts the signals in advance. In testing, the AI system outperformed traditional traffic-control devices.
To train their new program, the researchers built a state-of-the-art photorealistic traffic simulator called Traffic 3D. They taught the AI system to handle different traffic and weather scenarios. And despite being trained entirely on simulation, the model successfully adapted to real traffic intersections, proving to be effective in many real-world settings.
In the near future, the researchers will remove the need to give the AI system specific instructions, allowing it to learn autonomously; it can be set up to view any traffic junction, real or simulated. As an added feature, the reward system could be tuned to encourage the program to let emergency vehicles through quickly. Testing on real roads is set for this year.
There are more than 7,000 known languages in the world, so it’s imperative that scalable machine translations perform at the highest possible level of accuracy to help people understand one another. But there’s a big challenge: Many known languages have limited or low resources, which makes it difficult for ML algorithms to find training data for translation tasks. Valhalla’s hallucination method is a step in the right direction, because it promises to deliver highly accurate translations by preserving the semantic meanings behind words and sentences.
Meanwhile, the design of a photonic deep neural network microchip suggests that ML designers may no longer have to choose between speed and accuracy. But researchers have more work to do on the tradeoff between input data size and memory usage before the door opens to new possibilities in the way machine learning is applied across vision and language use cases.
Until next time, stay informed and get involved!