Articles

How Long Ouyang and his team at OpenAI trained InstructGPT to follow human instructions

How fine-tuning with reinforcement learning from human feedback can produce better results with less data at a lower cost

How alignment can unleash untapped potential in existing models

Long Ouyang is a research scientist at OpenAI working on human-in-the-loop machine learning. He helped build InstructGPT, a variant of GPT-3 with improved ability to follow human instructions, and continues to explore ways in which human feedback can be leveraged "to make GPT-3 more helpful, truthful, and harmless."

Recently, Long joined us as part of our AI Exchange program to discuss his work on InstructGPT. Scale's Aerin Kim, an Engineering Manager heading up a team of, in her words, "driven, passionate ML engineers who deeply care about their work," facilitated the conversation and led an engaging Q&A to delve further into the details. You can watch the full talk here or continue reading for key takeaways.

Make sure that ML models are optimizing functions that we care about

Large language models like the original GPT-3 may boast impressive capabilities, but they're not well aligned with user intent. As Long explains, "we want to use these models to perform interesting and valuable cognitive tasks...but they're not really fit for that purpose. They're trained to produce the next word." This frequently results in undesirable model behavior such as failing to do what is asked, hallucinating facts, or generating harmful or toxic content.

From what Long has observed, "what we'd really like to do is treat language models like assistants." The original GPT-3 model has a frustrating tendency to parrot instructions rather than respond in a meaningful way. Although it produces coherent text, it simply doesn't get that it's meant to accomplish a particular task. InstructGPT, on the other hand, knows that you're trying to give it a task and it makes its best effort to perform it, which is more aligned with assistant-like behavior.

InstructGPT (output on right) responds more meaningfully to human instruction than GPT-3 (output on left).

Use reinforcement learning to improve alignment by mimicking human preference

So how were Long and his colleagues able to make this happen? The secret is something they call "reinforcement learning from human feedback" or RLHF. This approach crucially relies on quality human-labeled data at multiple stages. For InstructGPT, OpenAI used a combination of freelancers they hired themselves and specialized labelers recruited by Scale AI.

The first step was to assemble a data set of prompts or instructions with appropriate (human-generated) responses and use this to fine-tune the output of GPT-3. This allowed them to generate a range of possible outputs for the same prompt which were presented to labelers to rank according to preference. This ranking, in turn, was used to train a reward model. The final flourish was to use reinforcement learning (OpenAI's Proximal Policy Optimization algorithm or PPO) to optimize the fine-tuned language model's output against the reward model.
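The reward-model step described above can be illustrated with a pairwise ranking loss. This is a minimal sketch, not OpenAI's code: it assumes each labeler ranking is broken into (preferred, rejected) pairs and that a hypothetical reward model has already assigned a scalar score to each response. The function names and the example scores are made up for illustration.

```python
import math

def pairwise_ranking_loss(score_preferred, score_rejected):
    """Negative log-likelihood that the preferred response outranks the rejected one.

    Uses the logistic form -log(sigmoid(r_preferred - r_rejected)).
    """
    margin = score_preferred - score_rejected
    # log(1 + exp(-margin)), written to avoid overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def ranking_to_pairs(ranked_scores):
    """Turn scores listed best-to-worst into all (preferred, rejected) pairs."""
    return [(ranked_scores[i], ranked_scores[j])
            for i in range(len(ranked_scores))
            for j in range(i + 1, len(ranked_scores))]

# Hypothetical reward scores for three responses, ordered by labeler preference.
scores_best_to_worst = [2.0, 0.5, -1.0]
pairs = ranking_to_pairs(scores_best_to_worst)
mean_loss = sum(pairwise_ranking_loss(w, l) for w, l in pairs) / len(pairs)
```

The loss shrinks as the reward model scores the preferred response further above the rejected one, which is what lets a ranking from labelers act as a training signal.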

Aerin wants to know how they landed on this approach amongst the available options for fine-tuning. "How did you get to the point where you wanted to try RL?"

"We had been working on this as just a research direction for a while before GPT-3 was commercialized," Long admits. It was at the point when OpenAI was about to launch the API that one of his team members suggested developing a more user-friendly way to interact with the models. "You actually have to do a surprising amount of work to get the original models to do what you want to the kind of robustness levels that might be acceptable for, say, commercial applications. And sometimes you may not even be able to get to that point."

Long and his team set to work on the problem of getting GPT-3 to understand human instructions and found some promising signs from reinforcement learning early on. "Once we saw those signs of life we decided to put a lot more energy into it."

"We still do use supervised learning as a tool," he hastens to add. "It's just not the main one."

Invest in human feedback rather than bigger models

By comparing the performance of InstructGPT with the original GPT-3, Long and his team discovered just how powerful the judicious use of human feedback data could be. The RLHF approach enhanced model performance far more significantly than increased model size, and considerably more than prompt engineering and supervised fine-tuning on human-generated prompt-output pairs alone.

Aerin's impressed. "InstructGPT with only 1.3B parameters did a lot better at following prompts than the original GPT-3 with 175B parameters." That's less than a hundredth of the parameters. What this means, Long explains, is that it could make more sense to allocate budget to human feedback data than to compute.

Alignment has the potential to "unlock" model capabilities

As if these results weren't already impressive enough, Long and his colleagues also discovered some intriguing side effects of RLHF. InstructGPT's instruction-following ability generalized to other languages, despite the language model being trained overwhelmingly on English, and it was even able to do some basic coding tasks although none of the labelers were programmers. Aerin wants to know if it's due to the ML method used, but Long's not sure. "I suspect that the generalization just comes from the diversity of different use cases that our customers have for these language model assistants." This diversity was reflected in the fine-tuning data.

However, although the generalizations are an unexpected bonus, Long cautions that there's still plenty of room for improvement. Hallucination is reduced but remains an issue for InstructGPT, particularly when questions assume false premises, and nothing stops the model from following harmful instructions. It also displays some odd behavior, such as inappropriately hedging its answers, and isn't very good at handling yes/no questions. The safety issue is a particular focus area for Long and his team: although InstructGPT shows improved alignment in terms of helpfulness and truthfulness, it will take further work, and likely more human feedback data, to improve harmlessness.

InstructGPT models are now the default language models on OpenAI's API. To learn more about how they were developed, check out OpenAI's blog post or the paper published by Long and his team. You can also use OpenAI's playground to experiment with the models themselves.

Using human feedback to align language models with user intent


OpenAI's InstructGPT

Guides

Learn how to implement a vision transformer in Python and speed up its operations by adding a TokenLearner layer.

A vision transformer learns from image data by treating image patches as tokens. Adding a token learning layer to a vision transformer speeds up its operations.

How to Build a Faster Vision Transformer for Supervised Image Classification

Google researchers showed back in 2017 that Transformer models could generate convincing text based on a prompt. In 2020, an overlapping team successfully classified images with a Vision Transformer (ViT). In early 2022, OpenAI released DALL·E 2, demonstrating a model that could reliably generate high-resolution, sophisticated images in a wide variety of learned styles from a simple text prompt. The OpenAI team described three different ways they used data selection to reduce certain harmful content. They used it to:

filter out training data to remove graphic content;

re-weight the remaining dataset to reduce bias; and

use clustering to remove duplicate images to avoid memorization/regurgitation of images.

This is a great framework for generally improving content moderation of large ML models. Read on to learn how each of those three steps could be supplemented with synthetic data to improve the results even further.

DALL·E 2’s developers use an active learning approach to create classifiers that will find the examples of the image categories they would like to label—for example, images not suitable for work. They start with small datasets for both positive and negative examples (a few hundred of each), but typically, accurate, robust models require closer to 1,000 samples of each.

At this stage, procedurally generated synthetic positive and negative examples would make these classification models far more robust before the active learning stage even begins. In an ideal scenario, the classifiers start out more accurate, so the active learning stage can yield the same results in fewer iterations, faster and at lower cost.

Furthermore, humans are perhaps more intuitively creative at “defining” categories that aren’t work-suitable than they are adept at searching a large database of images and finding the NSFW examples in the haystack. With active learning, humans can reinforce the algorithm’s signal as to what is in bounds and what is out: Additional examples of positive and negative classification help the model improve in accuracy, eliminating the need to assess images one at a time from the (massive) dataset that OpenAI trained DALL·E 2 on. OpenAI’s active learning approach consisted of two main steps:

The harmful/benign binary classifier’s threshold hyperparameters were tuned such that recall was nearly 100%, but with an initially problematic, high false-positive rate. In this way, OpenAI’s annotation team was mostly labeling (or confirming) truly negative cases. This technique helped reduce the overall time required to label images, but it failed to expand the search space of the model to encompass new classes or clusters of harmful images not previously captured by the classifier.

The second step was to run many-fold cross-validation (also known as “n-fold”) to find positive samples in the existing labeled dataset that the model tended to misclassify as negative. This involved multiple training runs with different train-validation splits. Then, the team scanned their large remaining dataset of unlabeled images for nearest neighbors of these samples in a perceptual feature space. Only then were human labelers assigned to classify this new set of discovered images.
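The two steps above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not OpenAI's pipeline: classifier scores and labels are plain lists, "perceptual feature space" is stood in for by small Euclidean vectors, and every function name here is hypothetical.

```python
import math

def threshold_for_recall(scores, labels, target_recall=0.99):
    """Return the highest threshold whose recall on positives meets the target.

    scores: classifier scores (higher = more likely harmful).
    labels: 1 = harmful, 0 = benign. An item is flagged when score >= threshold.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive example")
    # Number of positives that must be flagged to reach the target recall.
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]

def false_positive_rate(scores, labels, threshold):
    """Fraction of benign items that the threshold flags anyway."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    flagged = sum(1 for s in negatives if s >= threshold)
    return flagged / len(negatives) if negatives else 0.0

def nearest_neighbors(query, candidates, k=2):
    """Indices of the k candidates closest to query by Euclidean distance."""
    order = sorted(range(len(candidates)), key=lambda i: math.dist(query, candidates[i]))
    return order[:k]

# Step 1: a recall-first threshold accepts a high false-positive rate.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
threshold = threshold_for_recall(scores, labels, target_recall=1.0)
fpr = false_positive_rate(scores, labels, threshold)

# Step 2: look for unlabeled items near a positive the model tended to miss.
missed_positive = (0.0, 1.0)
unlabeled = [(0.1, 0.9), (5.0, 5.0), (0.0, 1.2), (-3.0, 0.0)]
to_review = nearest_neighbors(missed_positive, unlabeled, k=2)
```

Only the items returned by the neighbor search would go to human labelers, which is how the second step widens coverage without a scan of the full dataset.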

Even with this exploration through cross-validation, however, the clusters identified in feature space might have been well suited to synthetic generation. Generating synthetic examples for those clusters might have reduced the need for human labelers to search exhaustively through sexual or violent data.

Re-Weighting the Remainder of the Dataset to Reduce Unintended Bias

The OpenAI team next assigned a loss score to every image in the training set. They then calculated the ratio of the likelihood that the image is from the unfiltered dataset versus the filtered dataset. If the ratio is high, the filtered dataset likely underrepresents the cluster that image belongs to, so they weighted that sample's loss more heavily. (In their blog post, they mentioned that women were underrepresented in their filtered dataset.)
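One way to picture the re-weighting is with a classifier that guesses which dataset an image came from. The sketch below is an assumption-laden illustration rather than OpenAI's implementation: `p_unfiltered` is the output of a hypothetical classifier trained on balanced samples from both datasets, and the `max_weight` clip is an added safeguard that does not come from the article.

```python
def likelihood_ratio_weight(p_unfiltered, max_weight=10.0, eps=1e-6):
    """Loss weight = P(unfiltered | image) / P(filtered | image), clipped.

    With a classifier trained on balanced samples, the ratio of these
    posteriors equals the ratio of the two likelihoods.
    """
    p = min(max(p_unfiltered, eps), 1.0 - eps)
    return min(p / (1.0 - p), max_weight)

def weighted_mean_loss(losses, p_unfiltered_list):
    """Average per-image losses using the likelihood-ratio weights."""
    weights = [likelihood_ratio_weight(p) for p in p_unfiltered_list]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

# An image that looks typical of the unfiltered data (p = 0.8) counts four
# times as much as one the classifier cannot place (p = 0.5).
example_weights = [likelihood_ratio_weight(p) for p in (0.5, 0.8)]
```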

However, rather than artificially inflating the loss of a specific image, synthetic data could be generated around high-ratio samples to create additional points in a nearby cluster. Next, we could create a probability from this ratio and use it to choose whether or not to sample one of the nearest neighbors to this image from a synthetic dataset. This would reduce the risk of overfitting to particular examples with very high loss, and also help debias the model by providing more examples of the underrepresented class.
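A minimal sketch of that sampling idea follows. The mapping from ratio to probability, `ratio / (1 + ratio)`, is one simple choice assumed here for illustration; the article does not specify one, and the function names are hypothetical.

```python
import random

def sampling_probability(ratio):
    """Map a likelihood ratio in [0, inf) to a probability in [0, 1)."""
    return ratio / (1.0 + ratio)

def maybe_sample_synthetic(ratio, synthetic_neighbors, rng):
    """With probability ratio / (1 + ratio), return one synthetic neighbor; else None."""
    if synthetic_neighbors and rng.random() < sampling_probability(ratio):
        return rng.choice(synthetic_neighbors)
    return None

rng = random.Random(0)  # seeded so the example is repeatable
# Placeholder identifiers for synthetic images near a high-ratio real image.
neighbors = ["synthetic_a", "synthetic_b", "synthetic_c"]
picked = maybe_sample_synthetic(4.0, neighbors, rng)
```

Because the probability saturates below 1, even an extreme ratio adds at most one synthetic neighbor per draw instead of multiplying a single image's loss.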

The next step the OpenAI team took was to deduplicate their dataset. Using random sampling, they examined subsets of the dataset and created clusters based on five parameter sets. Those clusters were then sampled for duplicate images. However, simply discarding the duplicate images within the dataset isn’t necessarily ideal. It should be feasible to expand the dataset by replacing duplicates with synthetic equivalents: Assess the class of the image, use a random seed, and procedurally generate an image of that class. Even if the classification is of low certainty, replacing each removed duplicate with a new, synthetic image should have double benefits. The resulting dataset would contain only unique images, and it would maintain its original size. Additionally, setting up a generative adversarial network (GAN) to synthesize images that the trained model thinks are in-class versus an impostor might further resolve the boundaries between safe and unsafe content.
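The cluster-then-compare idea behind deduplication can be sketched as follows. This is a toy under assumptions, not OpenAI's procedure: images are represented by short feature vectors, the centroids are given instead of learned, and `max_distance` is an arbitrary illustrative threshold.

```python
import math
from collections import defaultdict

def find_duplicates(features, centroids, max_distance=0.05):
    """Return indices of images to drop as near-duplicates of an earlier image.

    Each image is assigned to its nearest centroid, and pairwise comparisons
    happen only inside a cluster, so not every pair in the dataset is compared.
    """
    clusters = defaultdict(list)
    for idx, vec in enumerate(features):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))
        clusters[nearest].append(idx)

    duplicates = set()
    for members in clusters.values():
        kept = []
        for idx in members:
            if any(math.dist(features[idx], features[k]) <= max_distance for k in kept):
                duplicates.add(idx)
            else:
                kept.append(idx)
    return sorted(duplicates)

features = [(0.0, 0.0), (0.01, 0.0), (1.0, 1.0), (1.0, 1.02), (0.5, 0.0)]
centroids = [(0.0, 0.0), (1.0, 1.0)]
to_replace = find_duplicates(features, centroids)
```

The indices in `to_replace` are the slots that could be refilled with synthetic images of the same class. One limitation of this simple version is that near-duplicates falling on opposite sides of a cluster boundary are never compared.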

Looking back, OpenAI’s DALL·E 2 was a breakthrough for highly convincing image synthesis, and perhaps more importantly, it introduced three groundbreaking techniques to improve harmful content filtering via tweaks or maybe even “hacks” to their training data:

Filter out training data to remove graphic content.

Re-weight the remaining dataset to reduce bias.

Use clustering to remove duplicate images to avoid memorization/regurgitation of images.

What’s somewhat surprising about these three particular innovations is that synthetic data provides an opportunity to improve on all three of them. Since, generally speaking, large machine learning models can be manipulated or even enhanced with the data you feed them, synthetic data is a cheap way to modify model performance to suit your needs. And particularly for “unsafe” classes, synthetic data reduces the need for human annotators to curate or otherwise be exposed to these harmful images.

Suggestions for making DALL·E 2 give even better results by extending the training dataset with synthetic data

The DALL·E 2 team at OpenAI recently described an intriguing way to limit certain sensitive training images from their image-generation model. We hypothesize that you could improve on this technique by expanding the training dataset for DALL·E 2 with synthetic data.

How to Improve Content Moderation on Inputs to Large ML Models with Synthetic Data

This year’s Computer Vision and Pattern Recognition Conference (CVPR 2022) came and went, but the research presented there is here to stay, and it sets the stage for the year ahead. One theme that showed up in many papers was “data-centric AI,” meaning AI research that focuses not so much on the model architectures as on the quality of the data used to train those models. While massive strides have been made in improving the performance of deep learning models over the past decade, research into how to improve the quality of the data underlying these models is relatively new. 

Much of this research focuses on unique ways to curate, generate, or label data in resource-constrained environments, using techniques such as generative modeling or weak supervision. Because deep neural networks (DNNs) produce staggering improvements on various tasks but require a large amount of labeled data to train, these methods are becoming a priority in AI. 

Real-world data—especially edge-case data—is usually unannotated and can be expensive and difficult to gather and label, which makes synthetic data generation an appealing avenue for creating or augmenting the large datasets you need to train DNNs. Read on to learn about four papers, presented at CVPR 2022, that exemplify this new data-centric AI approach and how you can use the insights they bring in your day-to-day engineering and research work.

SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation

Autonomous driving is one of the most exciting applications of AI, and the world in which autonomous vehicles operate is richly multivalent and ever-changing. Self-driving cars routinely encounter new environments they may not have seen before in their training data, such as a foggy country road in the pre-dawn morning or the dimly lit interior of a parking garage. It can be incredibly challenging to deploy cameras in the world to capture every scenario, but techniques such as synthetic data generation can alleviate some of the burdens of manual data collection and curation.

This is the approach that SHIFT takes. By generating synthetic images of eight different driving locations, SHIFT provides a highly accessible and useful dataset for manufacturers that want to train their autonomous vehicles to operate in various environments. The dataset provides a collection of synthetically generated sensor data, including views from an RGB camera set with five different cameras and a LiDAR sensor. It also “supports 13 perception tasks for multi-task driving systems: semantic/instance segmentation, monocular/stereo depth regression, 2D/3D object detection, 2D/3D multiple object tracking (MOT), optical flow estimation, point cloud registration, visual odometry, trajectory forecasting and human pose estimation,” the paper says.

While it’s not the first synthetic driving dataset, SHIFT notably improves on what its predecessors offered by providing continuous environmental data (see Figure 1). Previous synthetic datasets captured environmental conditions at discrete points in time. For example, a dataset might contain two images of a two-lane highway: one in the morning and one at night.

Figure 1: The SHIFT dataset is designed to capture both discrete domain shifts (e.g., complete changes of scene) and continuous domain shifts (e.g., a gradual transition from day to night within a single video sequence).

However, the authors of SHIFT realized that this approach is insufficient for training vehicles that can smoothly adapt to changes in their environment. Day fades gradually into night—there is no on-off switch that immediately extinguishes the sun. Similarly, fog may gradually roll in during early-morning hours to obscure a country road. SHIFT models these continuous phenomena by providing sensor captures of these states at time-varying positions during their onset and cessation, providing autonomous vehicles with a more nuanced understanding of their surrounding world.

LiDAR Snowfall Simulation for Robust 3D Object Detection

While SHIFT provides continuously varying synthetic data for common environmental conditions, manufacturers and researchers alike need to collect data on the most extreme weather conditions that an autonomous vehicle might encounter. A self-driving car that fails to navigate properly during a hurricane evacuation or that skids off the road during a blizzard could cause harm to drivers or pedestrians. Furthermore, it can be difficult to acquire sufficient real-world training data for these rare situations, since severe storms are both unpredictable and relatively infrequent. Synthetic data generation is especially well suited to this task.

Although SHIFT samples data from many sensors, the snowfall simulation paper focuses exclusively on LiDAR data, which provides some of the most robust 3D information available to autonomous vehicles. Because LiDAR uses a pulsed laser to measure the distance to objects, it is vulnerable to environmental factors that scatter or refract the laser beam, in particular rain, snow, and fog. This results in extremely noisy data that often leads to inaccurate estimates of object distances, which can be catastrophic for the vehicle as it navigates its environment.

By using advanced mathematical and physical modeling that amounts to treating snowflakes as individual spheres, the researchers provided a robust ground-truth signal that LiDAR-based autonomous vehicles can incorporate into their modeling systems during periods of intense snowfall (see Figure 2). The model also considers the wetness of roads, reducing the disruptive effect on traditional LiDAR systems. When used to train several 3D object detection models, the researchers’ simulated data provided a performance lift of up to 2.1% over the best existing detection model—a significant improvement.

Figure 2: For these three road scenes under heavy snowfall, the four columns to the right of the images show the performance of the model as trained using different data augmentations. The “snow + wet” augmentation is the only method that successfully trains the model to produce correct object detections for each of the three scenes.
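To give a feel for what treating snowflakes as individual spheres means geometrically, here is a toy ray-sphere intersection check. It is not the paper's simulation, which models far more of the physics; it only asks whether an idealized, infinitely thin beam from a sensor at the origin hits a spherical particle before reaching its original target. All names and numbers are illustrative assumptions.

```python
import math

def ray_sphere_hit_distance(direction, center, radius):
    """Distance along a unit-direction ray from the origin to a sphere, or None if missed."""
    # Project the sphere center onto the ray.
    t_closest = sum(d * c for d, c in zip(direction, center))
    if t_closest <= 0:
        return None  # particle is behind the sensor
    dist_sq = sum(c * c for c in center) - t_closest ** 2
    if dist_sq > radius ** 2:
        return None  # beam passes beside the particle
    return max(0.0, t_closest - math.sqrt(radius ** 2 - dist_sq))

def first_return(direction, target_range, particles):
    """Range of the nearest thing the beam hits: a particle or the original target."""
    hits = [ray_sphere_hit_distance(direction, c, r) for c, r in particles]
    hits = [h for h in hits if h is not None and h < target_range]
    return min(hits, default=target_range)

beam = (1.0, 0.0, 0.0)                    # unit vector along x
particles = [((5.0, 0.001, 0.0), 0.002),  # 2 mm sphere almost on the beam axis
             ((8.0, 1.0, 0.0), 0.002)]    # same size, 1 m off axis
clear_range = first_return(beam, 20.0, [])
snowy_range = first_return(beam, 20.0, particles)
```

In this toy, one particle near the beam axis replaces a 20 m return with a spurious return at roughly 5 m, which is the kind of noise the paragraph above describes.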

3D Common Corruptions and Data Augmentation

Neural networks are quite sensitive to small perturbations of their inputs, which is why they are vulnerable to adversarial examples and data privacy attacks. One way to improve the robustness and accuracy of deep learning models is to train them using data augmentations in which filters or corruptions are applied to samples in the training set (see Figure 3). This makes the model better at extracting the core features essential to its semantic understanding of the image while leaving it less sensitive to extraneous noise.

Figure 3: Examples of the 3D augmentations and corruptions proposed in the paper.

For computer vision models, the augmentations applied to images have traditionally been 2D in nature: for example, applying image blur, adding Gaussian noise to the pixels, changing the lighting of a scene, or occluding specific objects. This paper takes things a step further by introducing 3D augmentations. The idea is to make computer vision models more robust to the 3D geometry of a scene. A model, for example, should be able to detect the presence of a couch in a living room, regardless of how the camera is oriented relative to the couch, the depth of field, the lighting in the room, and so on.
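For reference, the traditional 2D augmentations mentioned above are simple per-pixel operations. The sketch below uses only the Python standard library and represents a grayscale image as a list of rows with values in [0, 1]; a real pipeline would use an array or imaging library instead, and the function names are illustrative.

```python
import random

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to each pixel and clamp to [0, 1]."""
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]

def box_blur(image):
    """Blur with a 3x3 mean filter, averaging only over in-bounds neighbors."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(window) / len(window))
        out.append(row)
    return out

def occlude(image, top, left, height, width, fill=0.0):
    """Overwrite a rectangular patch with a constant value."""
    return [[fill if top <= y < top + height and left <= x < left + width else px
             for x, px in enumerate(row)]
            for y, row in enumerate(image)]

image = [[0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0]]
rng = random.Random(0)
noisy = add_gaussian_noise(image, 0.1, rng)
blurred = box_blur(image)
masked = occlude(image, top=1, left=1, height=1, width=1)
```

None of these operations know anything about scene geometry, which is the gap the 3D augmentations are meant to fill.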

Specifically, the paper introduces 20 three-dimensional corruptions relating to scene attributes such as depth of field, camera motion, lighting, video, weather, view changes, semantics, and noise. Most of the corruptions require only an RGB camera image and some notion of scene depth to be applied, although a few also require a 3D mesh. For datasets that do not have these attributes, many of the corruptions can still be applied using approximation techniques. This paper points to an interesting direction for 3D robustness research by demonstrating the usefulness of 3D corruptions for model benchmarking and training.
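As an example of a corruption that needs only an RGB image plus per-pixel depth, the sketch below adds fog with the standard atmospheric scattering model, in which distant pixels fade toward a constant haze color. It is an assumption-based illustration, not the paper's implementation; `beta` and `airlight` are made-up parameters.

```python
import math

def add_fog(rgb, depth, beta=0.1, airlight=(0.8, 0.8, 0.8)):
    """Blend each pixel toward a constant airlight color based on its depth.

    transmission = exp(-beta * depth)
    output = color * transmission + airlight * (1 - transmission)
    """
    out = []
    for rgb_row, depth_row in zip(rgb, depth):
        row = []
        for color, d in zip(rgb_row, depth_row):
            t = math.exp(-beta * d)
            row.append(tuple(c * t + a * (1.0 - t) for c, a in zip(color, airlight)))
        out.append(row)
    return out

# One row of two red pixels: one at the camera, one very far away.
rgb = [[(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
depth = [[0.0, 1000.0]]
foggy = add_fog(rgb, depth)
```

The near pixel keeps its color while the far pixel is almost entirely haze, an effect a purely 2D filter cannot reproduce because it has no access to depth.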

Kubric: A Scalable Dataset Generator

Many synthetic dataset generation techniques are task-specific. Kubric aims to change this. This scalable Python library interfaces between Blender, the popular open-source 3D modeling tool, and PyBullet, a physics simulation engine, to allow for the rapid generation of random, photorealistic 3D images, videos, and scenes (see Figure 4). By leveraging freely available 3D assets and textures, users of Kubric can generate terabytes of synthetic training data with just a few simple Python commands.

Figure 4: Example scene created and rendered with Kubric, along with some of the automatically generated annotations.

Perhaps just as important as the software itself is the suite of benchmark datasets and tasks the paper’s researchers introduced for computer vision models using data generated by the Kubric tool. The four tasks are object discovery from video, optical flow, texture-structure approximation using NeRF (neural radiance fields), and pose estimation. The richly annotated and large-scale datasets associated with each of these tasks give researchers new challenges to tackle, with much more supervised data than is likely to be available from a hand-labeled dataset.

The benchmark tasks show the power of Kubric to rapidly scale data in order to empower the next generation of deep learning models. Kubric’s developers plan to add more advanced capabilities in the future, including the ability to “include volumetric effects like fog or fire, soft-body and cloth simulations, and advanced camera effects such as depth of field and motion blur.” They also hope to incorporate more freely available 3D assets.

Data-Centric AI Underpins Next-Generation Deep Learning Applications

Data-centric AI is a rapidly growing subset of machine learning research and promises new ways to harness high-quality data for more accurate and generalizable deep learning models. At CVPR 2022, researchers demonstrated the rapid rate at which this new paradigm is catching on and showcased an incredible diversity of research into this emerging subfield. Synthetic data generation is at the heart of these applications, providing fine-grained control over the training dataset.

Synthetic data generation allows ML practitioners to design datasets that match their application’s unique needs, address problematic edge cases, and scale datasets to train large networks. Recent improvements in generative modeling techniques such as GANs, NeRFs, flow-based, and diffusion models have made large-scale, photorealistic synthetic image generation possible. These papers, as well as their freely available code and datasets, should be useful to researchers and engineers developing the next generation of deep learning applications.

Data-centric AI focuses less on the model architectures and more on the quality of the data used to train those models.

New research on synthetic data investigates ways to curate, generate, or label data in resource-constrained environments, using techniques such as generative modeling or weak supervision. Meeting deep neural networks' need for labeled training data is a priority in AI.

CVPR Research Roundup: Four Papers on Synthetic Data that Will Have an Immediate Impact on Your Data-Centric AI Workflows
