November 18, 2022

OpenAI's InstructGPT


Using human feedback to align language models with user intent

Lucy Andresen

In this article you'll learn:

  1. How Long Ouyang and his team at OpenAI trained InstructGPT to follow human instructions
  2. How fine-tuning with reinforcement learning from human feedback can produce better results with less data at a lower cost
  3. How alignment can unleash untapped potential in existing models

Long Ouyang is a research scientist at OpenAI working on human-in-the-loop machine learning. He helped build InstructGPT, a variant of GPT-3 with improved ability to follow human instructions, and continues to explore ways in which human feedback can be leveraged "to make GPT-3 more helpful, truthful, and harmless."

Recently, Long joined us as part of our AI Exchange program to discuss his work on InstructGPT. Scale's Aerin Kim, an Engineering Manager heading up a team of, in her words, "driven, passionate ML engineers who deeply care about their work," facilitated the conversation and led an engaging Q&A to delve further into the details. You can watch the full talk here or continue reading for key takeaways.

Make sure that ML models are optimizing functions that we care about

Large language models like the original GPT-3 may boast impressive capabilities, but they're not well aligned with user intent. As Long explains, "we want to use these models to perform interesting and valuable cognitive tasks...but they're not really fit for that purpose. They're trained to produce the next word." This frequently results in undesirable model behavior such as failing to do what is asked, hallucinating facts, or generating harmful or toxic content.
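To make "trained to produce the next word" concrete, here is a minimal sketch of that pretraining objective in PyTorch (the function name and tensor shapes are illustrative assumptions, not OpenAI's code). Notice that maximizing the likelihood of each next token says nothing about whether the continuation actually follows the user's instruction, which is the gap alignment work tries to close.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Standard language-modeling loss: predict token t+1 from tokens 0..t.

    logits: (batch, seq_len, vocab_size) model outputs
    tokens: (batch, seq_len) integer token ids
    """
    preds = logits[:, :-1, :]   # predictions for positions 0..n-2
    targets = tokens[:, 1:]     # the tokens that actually came next
    return F.cross_entropy(
        preds.reshape(-1, preds.size(-1)),
        targets.reshape(-1),
    )
```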

From what Long has observed, "what we'd really like to do is treat language models like assistants." The original GPT-3 model has a frustrating tendency to parrot instructions rather than respond in a meaningful way. Although it produces coherent text, it simply doesn't get that it's meant to accomplish a particular task. InstructGPT, on the other hand, knows that you're trying to give it a task and it makes its best effort to perform it, which is more aligned with assistant-like behavior.


InstructGPT (output on right) responds more meaningfully to human instruction than GPT-3 (output on left).

Use reinforcement learning to improve alignment by mimicking human preference

So how were Long and his colleagues able to make this happen? The secret is something they call "reinforcement learning from human feedback" or RLHF. This approach crucially relies on quality human-labeled data at multiple stages. For InstructGPT, OpenAI used a combination of freelancers they hired themselves and specialized labelers recruited by Scale AI.

The first step was to assemble a dataset of prompts or instructions paired with appropriate (human-written) responses and use it to fine-tune GPT-3 with supervised learning. The fine-tuned model was then used to generate a range of possible outputs for the same prompt, which were presented to labelers to rank according to preference. These rankings, in turn, were used to train a reward model. The final flourish was to use reinforcement learning (OpenAI's Proximal Policy Optimization algorithm, or PPO) to optimize the fine-tuned language model's output against the reward model.
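To sketch how those pieces fit together: the reward model is trained with a pairwise ranking loss over the labelers' comparisons, and the PPO stage optimizes the reward model's score while a KL penalty keeps the policy close to the supervised fine-tuned model. The snippet below is a minimal illustration of those two ingredients, following the formulation in the InstructGPT paper; the function names and the KL coefficient value are assumptions for illustration, not OpenAI's implementation.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Train the reward model so the labeler-preferred response scores higher
    than the rejected one: -log(sigmoid(r_preferred - r_rejected))."""
    return -F.logsigmoid(r_preferred - r_rejected).mean()

def shaped_reward(rm_score: torch.Tensor,
                  logprob_policy: torch.Tensor,
                  logprob_sft: torch.Tensor,
                  kl_coef: float = 0.02) -> torch.Tensor:
    """Reward fed to PPO: the reward model's score minus a per-token KL penalty
    that discourages the policy from drifting too far from the supervised
    fine-tuned model (kl_coef here is an illustrative value, not OpenAI's setting)."""
    return rm_score - kl_coef * (logprob_policy - logprob_sft)
```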


Aerin wants to know how they landed on this approach amongst the available options for fine-tuning. "How did you get to the point where you wanted to try RL?"

"We had been working on this as just a research direction for a while before GPT-3 was commercialized," Long admits. It was at the point when OpenAI was about to launch the API that one of his team members suggested developing a more user-friendly way to interact with the models. "You actually have to do a surprising amount of work to get the original models to do what you want to the kind of robustness levels that might be acceptable for, say, commercial applications. And sometimes you may not even be able to get to that point."

Long and his team set to work on the problem of getting GPT-3 to understand human instructions and found some promising signs from reinforcement learning early on. "Once we saw those signs of life we decided to put a lot more energy into it."

"We still do use supervised learning as a tool," he hastens to add. "It's just not the main one."

Invest in human feedback rather than bigger models

By comparing the performance of InstructGPT with the original GPT-3, Long and his team discovered just how powerful the judicious use of human feedback data could be. The RLHF approach enhanced model performance far more significantly than increased model size, and considerably more than prompt engineering and supervised fine-tuning on human-generated prompt-output pairs alone.

Aerin's impressed. "InstructGPT with only 1.3B parameters did a lot better at following prompts than the original GPT-3 with 175B parameters." That's less than one-hundredth the number of parameters. What this means, Long explains, is that it could make more sense to allocate budget to human feedback data than compute.

Alignment has the potential to "unlock" model capabilities

As if these results weren't already impressive enough, Long and his colleagues also discovered some intriguing side effects of RLHF. InstructGPT's instruction-following ability generalized to other languages, despite the language model being trained overwhelmingly on English, and it was even able to do some basic coding tasks although none of the labelers were programmers. Aerin wants to know if it's due to the ML method used, but Long's not sure. "I suspect that the generalization just comes from the diversity of different use cases that our customers have for these language model assistants." This diversity was reflected in the fine-tuning data.

However, although the generalizations are an unexpected bonus, Long cautions that there's still plenty of room for improvement. Hallucination is reduced but remains an issue for InstructGPT, particularly when questions assume false premises, and nothing stops the model from following harmful instructions. It also displays some odd behavior, such as inappropriately hedging its answers, and isn't very good at handling yes/no questions. The safety issue is a particular focus area for Long and his team: although InstructGPT shows improved alignment in terms of helpfulness and truthfulness, it will take further work, and likely more human feedback data, to improve harmlessness.

InstructGPT models are now the default language models on OpenAI's API. To learn more about how they were developed, check out OpenAI's blog post or the paper published by Long and his team. You can also use OpenAI's playground to experiment with the models themselves.
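If you want to experiment outside the playground, a minimal API call at the time of writing looked roughly like the following, using the pre-1.0 openai Python client. The model name is just one of the InstructGPT-series models, and your own API key is assumed; treat this as a sketch rather than a definitive recipe.

```python
import openai

openai.api_key = "sk-..."  # replace with your own API key

# "text-davinci-002" is one of the InstructGPT-series models available around
# the time of this talk; any instruct-style model you have access to works here.
response = openai.Completion.create(
    model="text-davinci-002",
    prompt="Explain the moon landing to a 6 year old in a few sentences.",
    max_tokens=100,
    temperature=0.7,
)

print(response["choices"][0]["text"].strip())
```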
