ChatGPT Has a Human Team Train It to Be a Lot Better

The ChatGPT team has a 68-page paper that describes how they train language models to follow instructions with human feedback.

Human labelers rank the ChatGPT outputs from best to worst. The result is a new labeled dataset in which the rankings are the labels. This dataset is approximately 10 times larger than the curated demonstration dataset used for the supervised fine-tuned (SFT) model.
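One reason the comparison dataset grows so much larger than the prompt set is combinatorial: a single ranking of K outputs yields K*(K-1)/2 pairwise comparisons. A minimal sketch of that expansion (the function name is illustrative, not from the paper):

```python
from itertools import combinations

def rankings_to_pairs(outputs_ranked):
    """Given model outputs ordered best-to-worst by a labeler,
    emit every (preferred, rejected) pair. A ranking of K outputs
    produces K*(K-1)/2 comparisons, which is why the comparison
    dataset is much larger than the underlying prompt set."""
    return [(better, worse) for better, worse in combinations(outputs_ranked, 2)]

# 4 ranked outputs expand into 6 comparison pairs
pairs = rankings_to_pairs(["A", "B", "C", "D"])
```

These (preferred, rejected) pairs are what the reward model described below is trained on.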

Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

How does ChatGPT use human feedback to attack the alignment problem?

# Reinforcement Learning from Human Feedback

The method overall consists of three distinct steps:

1. Supervised fine-tuning step: a pre-trained language model is fine-tuned on a relatively small amount of demonstration data curated by labelers, to learn a supervised policy (the SFT model) that generates outputs from a selected list of prompts. This represents the baseline model.

2. “Mimic human preferences” step: labelers are asked to vote on a relatively large number of SFT model outputs, creating a new dataset of comparison data. A new model is trained on this dataset. This is referred to as the reward model (RM).

3. Proximal Policy Optimization (PPO) step: the reward model is used to further fine-tune and improve the SFT model. The outcome of this step is the so-called policy model.
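The core objectives behind steps 2 and 3 can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: in practice the scores come from a neural reward model and the PPO machinery is far more involved, and the `kl_coef` value here is purely illustrative.

```python
import math

def pairwise_reward_loss(score_better, score_worse):
    """Step 2 (reward model): negative log-sigmoid of the score margin.
    The loss is small when the reward model scores the preferred output
    well above the rejected one, and large when it inverts the ranking."""
    margin = score_better - score_worse
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def ppo_shaped_reward(rm_score, logprob_policy, logprob_sft, kl_coef=0.2):
    """Step 3 (PPO): the policy is optimized against the reward-model
    score minus a KL penalty that keeps the policy's output distribution
    close to the SFT baseline, limiting reward-model exploitation."""
    return rm_score - kl_coef * (logprob_policy - logprob_sft)
```

A correctly ordered pair produces a low `pairwise_reward_loss`, an inverted pair a high one; the KL term in `ppo_shaped_reward` vanishes when the policy has not drifted from the SFT model.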

Lex Fridman on ChatGPT

GPT-3 came out about two years ago, and it was impressive but dumb in a lot of ways. You would expect, as a human being, for it to generate certain kinds of text, and it was saying kind of dumb things that were off, and you're like, all right, this is really impressive, but it's not quite there.

What they did with GPT-3.5 is they started adding more and different kinds of datasets. One of them, probably the smartest neural network currently, is Codex, which is fine-tuned for programming: it was trained on programming code. And when you train on programming code, which ChatGPT also was, you're teaching it something like reasoning, because it's no longer just information and knowledge from the Internet, it's also reasoning, logic, even though you're looking at programming code. You're looking at me like, what the [ __ ] is he talking about? No, no, no, that's not what I'm looking at. I'm looking at you like, oh my God. Because in order to be able to stitch together sentences that make sense, you not only need to know the facts; you need reasoning.

It was fine-tuned in a supervised way, with human labeling of a small dataset: here's what we would like this network to do.

16 thoughts on “ChatGPT Has a Human Team Train It to Be a Lot Better”

  1. I say let the Democrats and the Republicans feed their AIs with their own biased datasets and organise a public debate between the AIs.

  2. Humans are notoriously bad at judging what is and is not intelligent. Just look at the fact that people were awed by this thing despite it telling them that elephants lay eggs. Or how so many people said the Ukrainian invasion was unlike Putin because he’s usually so “intelligent” when he is actually just a sociopathic authoritarian who people label as smart when he is causing problems for their political rivals whenever they are in power (a bipartisan blind spot which both parties like to forget they were subject to several times over the last 20 years).

    If humans are teaching this thing how to be smart there is no hope for it.

    • True. We are wired to find intention and intelligence, and we do so even in the natural world and inanimate objects (fortuitousness leading to beliefs in magic and superstition).

      In this case we are awed because the automaton is so articulate, even while saying nonsense.

      We have loved automatons simulating doing some human activity for a long time, giving us the perception of having some mind behind them.

    • I asked ChatGPT if elephants lay eggs.

      Me – Do elephants lay eggs

      ChatGPT – No, elephants do not lay eggs. They are mammals and give birth to live young.

  3. Those targets and groupings you include above are often used to describe the difference between accuracy and precision.

  4. Where “toxic” just means “objectionable from the perspective of the people calling the shots at ChatGPT”, and absolutely nothing more.

    They’re training the system to be less useful to people who want to do things they don’t want people doing, and that’s all.

    • Funny cognitive dissonance there. Between the people not wanting ChatGPT telling us mean things, and those afraid of it turning us all into paper clips.

      Albeit I think making AIs dumber won’t help them get the nuances stopping the second scenario. Skynet could perfectly turn us all into paper clips, but it will never say anything remotely hurtful for any traditionally oppressed group while doing so.

      • Think of it as like back when they used to have students learn Latin, and need to pass exams, before they could go on to something like medical school or law.

        Did Latin help? Not really. Maybe for learning some scientific names for the medical students, but that's about it. Certainly not for lawyers or engineers.

        But LEARNING Latin trained the students in learning, study techniques, and similar. And if they showed skill and ability in Latin, they were probably smart and studious enough to succeed in the professional careers.

        These days we’ve decided that if you’re going to learn to learn, it might as well be a useful subject in the first place, but the basic theory worked.

        Note: It is not the AI that’s being trained here. (Well it is being trained, but that’s a side effect.) It’s the computer scientists, the programmers, the whole field of AI engineering. The HUMANS are learning how to make AIs avoid breaking a bunch of arbitrary rules, even by accident, even when other humans are trying to trick the AI into doing so, even when the rules have all sorts of weird edge cases and logically incoherent structures.

        Testing the question “can we make an AI that never mentions the colour RED” is far too easy a task. You need a task that is actually a real challenge. Same as you never want to select your medical students on the entrance exam of “put these shaped blocks into the correct shaped holes”.

        In this case, woke rules of speech is actually a good training test. It is strange and weird and doesn’t fit simple algorithms and there is a fair bit of push from real data, real human examples, and other pressures pushing the AI into breaking the rules.

        Also, the AIs being trained to refer to “People of Frenchness” are not the same AIs who will get access to nuclear launch codes, or DNA synthesizer machines. But they WILL be developed by the same design rules and methodology that are now being developed to meet this challenge.

        The downside is that modern speech codes also make the humans dumber.

    • Funny, I seem to remember a time when there used to be this thing called common decency… I guess based on the age-old premise of treating others as you would have others treat you. Normally a condition for being accepted into a civilised community.

      • And I can remember a time when the left hadn’t yet seized on the notion that they could just call agreeing with them about everything “common decency”, and shut down all dissent in the name of good manners.

        It was decades ago, of course, but I can remember it.

        Seriously, do you think anybody actually falls for that “agreeing with me about everything is just common decency” nonsense?

        • It's the Republicans with a long list of books that they want to ban in schools. The Deep South Republicans are always the first to cry out in support of common decency and family values. Is Trump's party really a bunch of secret lefties?

      • “treating others as you would have others treat you” is a great first step, but the entire field of sexual harassment has revealed that this results in some serious failure modes.

    • Exactly! Suppose the masters decided that eating meat is objectionable; we'd have a highly vegetarian-pushing, opinionated program. A true AI should be able to be trained on conflicting, opinionated datasets and, through deduction and comparison against broadly accepted scientific datasets, act as a sieve for untruthful data.

Comments are closed.