
Cognitive Collective: RLHF Is Not a Magic Wand for Alignment


    The world of Generative AI is moving fast. We’ve found the best way to keep up with the change is to go directly to AI experts and hear their thoughts, free of hype, hyperbole, and marketing spin. So, in this first of many posts, we invite the best minds in AI to share their perspectives and experience in this exciting field.

    Christopher Rytting is a 5th-year Ph.D. student in computer science at Brigham Young University. He studies large pre-trained language models’ ability to simulate humans, both for direct study and as research aids in social science.

    By Christopher Michael Rytting

    Language models trained via Reinforcement Learning from Human Feedback (also known as RLHF) are having their time in the academic sun after ChatGPT made its splashy debut during machine learning’s biggest conference, NeurIPS 2022. There is talk of ChatGPT replacing Google, Twitter is athrob with claims of the end of prompt engineering via some general alignment in GPT-4, and lots of the prospective credit is going to RLHF. This hype feels like a fever, verging on the mystic, and I would like to both explore and dispel it.

    Why we love RL

    We can begin with the paper Reinforcement Learning: A Survey, published in the Journal of Artificial Intelligence Research over a quarter-century ago, which summarizes reinforcement learning and calls its promise “beguiling–a way of programming agents by reward and punishment without needing to specify how the task is to be achieved.” The idea that reward and punishment alone could get us to artificial intelligence is attractive. Why? Because it resembles a very popular interpretation of natural selection itself, that we ourselves–and all intelligent life forms–were programmed via nothing but reward and punishment in the ultimate pursuit of procreation. The theory goes that every characteristic of evolved creatures (their intelligence, their joy, their pain, every scrap of performance or experience) stems from genetic variations more or less conducive to each creature’s reproductivity. Reinforcement learning invokes the sacrosanct elegance of the evolutionary myth to justify its own promise.

    Another reason RL is so compelling is that it can be seen as a rejection of the scientific intuition that systems should be–even can be–understood, a rejection that many would consider warranted by the last fifty years of AI research. Observing and considering phenomena have, historically in science, revealed causal mechanisms. Students of those phenomena can then understand and, in some cases, control them, as with nitrogen boosting crop growth or lift flying an airplane. This intuition, call it the value of understanding, was a designing principle of the symbolic, good old-fashioned AI (GOFAI) of the twentieth century. We would think about thinking, profile it, and teach our agents accordingly by writing their minds, rule by rule. However, excitement gave way to weariness as any terminus to this growing body of rules stayed out of sight, always blocked by a long line of edge cases precluding any notion of robust “intelligence.” The failures of GOFAI exhausted us, funding dried up, and an AI winter fell.

    What snatched us from that period was a change of approach, away from human-crafted and towards data-trained. In domain after domain (most notably games, natural language processing, and vision), training AI how to do (via simulation and real-world data) beat training AI what to do (via writing rules and logic). In 2019 (even before GPT-3, maybe the most striking example of this minimal supervision approach) Rich Sutton, one of RL’s central figures, summed up this shift and published The Bitter Lesson. The very title captures how difficult it was for scientists to accept their proper role, suppressing their deepest impulse (to understand) and ceding control to the learning algorithms that they would barely design before unleashing on data. An analogous piece by DeepMind is named Reward is Enough, a phrase that you can imagine either exclaimed or sighed.

    Reinforcement Learning, then, can stand conceptually, almost ideologically, for evolution, or for what worked over what did not, despite our wishes and priors. Rejecting or questioning its value on the merits can feel like heresy–even though it is famously finicky, unstable, difficult, and thus far incapable of producing generally intelligent agents.

    RL in Natural Language Processing

    One place where Reinforcement Learning has worked well is in Natural Language Processing (NLP). A language model can be considered a policy function–taking text as input and outputting text that follows. A different language model can be trained as a “reward” model–identifying the better of two completions–using human feedback. These two can be paired and unleashed (the policy generating pairs of candidate texts, the reward model identifying the better of the two, and the policy being encouraged accordingly), generating and grading in tandem with the ebb and flow of reward. Because we’ve called this reinforcement learning, and because there are two models playing off each other, this can evoke several powerful memories. For example, it resembles AlphaZero–descended from the system that taught itself to play the game Go and defeated one of the world’s top players, Lee Sedol. Or it can bring to mind generative adversarial networks (GANs), where one of two dueling models is responsible for generating realistic images and fooling the other model, a discriminator trained to tell real from fake. Human absence in these training regimes, which have resulted in some of AI’s greatest achievements, can feel like an eerie portent of the singularity.
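    For readers who want to see the reward model’s side of this loop in concrete terms, here is a minimal sketch. It assumes a Bradley-Terry style pairwise loss, which is one common way to train a reward model on “which of these two is better” labels; the function name and the example scores are hypothetical, chosen only for illustration, and nothing here is taken from any particular paper’s code.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)).

    Assumption: a Bradley-Terry style objective. The loss is small when the
    reward model scores the human-preferred completion above the other one.
    """
    margin = reward_chosen - reward_rejected
    # Two algebraically equal forms, picked for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores for two completions of the same prompt.
agrees_with_labeler = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_labeler = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees_with_labeler < disagrees_with_labeler)
```

    The only signal in this objective is the human’s choice between the two completions, so whatever the labeler meant by “better” is what the reward model learns to score.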

    There are some sensible reasons (with more on the way, I’d imagine) that RL might work better in the regime of human feedback than in broader RL, and best in conjunction with Large Language Models (LLMs). Models don’t have to exhaust the search space before finding sparse rewards, but rather rely on human cues about which search directions are most promising. This is especially interesting with respect to The Bitter Lesson, because one potential takeaway from that piece (perhaps crude) is that human intervention is bad. But in the case of RLHF, maybe the more refined takeaway is that models must learn for themselves, gently steered by human designers rather than entirely built by them. Furthermore, these RL models don’t start at all from scratch, since they begin not just with a pre-trained language model, but with a fine-tuned one. I’m confident that a fine-tuned model is necessary; training with RLHF from a vanilla LLM would have been a neater (and thus more compelling) story than that of all the recent successes, which first fine-tune a policy before doing anything with a reward model.

    Philosophical and Practical Problems

    However, I’m afraid that the broader mystique around RL, along with the recent concrete successes of RLHF, will lull us into complacency on important alignment and approach questions. There are both fundamental philosophical reasons and more practical reasons for this.

    First, we don’t know how to define intelligence, consciousness, alignment, or even truth. Even the team at OpenAI acknowledges this limitation, saying “During RL training, there’s currently no source of truth.” Many AI practitioners, and, indeed, anyone who wants to accomplish anything concrete in the world before perfectly understanding it, might reasonably roll their eyes at my making this point. But so long as we don’t have definitions for these things, we cannot speak about models possessing them or not. I don’t mean this normatively, but descriptively. Any claims about “alignment” in a model must be balanced upon the unwieldiness of the term.

    Alignment means something like “does what we want.” When a model could do what we want but hasn’t been properly coaxed into doing so, we call it misaligned. This is a big field of study since LLMs are both powerful and wild, exhibiting promise but misbehaving all the time. But when pursuing that fashionable goal, we should acknowledge that humans are themselves unaligned–in preferences, in behavior, in ideology. They are also dynamic, hour to hour and year to year. So we must ask: who and when and what is the target for alignment? Someone, be it the C-suite or a team of research scientists, will have to answer that question by choosing–from many possible candidates–a set of priorities, a formalized objective, and a training regime. From there, an actual training plan will be hatched.

    If the plan’s target for alignment is an open question, then surely things get more complicated in the plan’s execution, which involves actual labelers churning out the Human Feedback part of RLHF. Do we know who these humans are, and how they will steer the ship? Are they trustworthy, qualified to supervise the model, and willing to do so carefully? What is the difference between the cohort hired to train ChatGPT and the one from GopherCite, or between either of these now versus six months from now? Somehow, we’re satisfied to simply call them humans. Each of them is a person who might be what you would call smart or dumb, focused or distracted, grounded or excitable, nuanced or incautious, and will continue changing indefinitely. Yet papers describing RLHF will only refer to them as humans, a homogeneous group of soldiers fit for the duty of ranking text completions.

    A good example of all this is InstructGPT, the predecessor to ChatGPT. OpenAI says they decided to “prioritize helpfulness to the user… However, in our final evaluations we asked labelers prioritize [sic] truthfulness and harmlessness (since this is what we really care about).” Of course, they easily could have prioritized many other sets of characteristics and the objective that would have entailed. Given this one that they happened to choose, though, they trained 40 human data labelers to score completions accordingly. Once the training is out of the hands of the architects and into the hands of the labelers, each labeler will have their own definitions of what is “true” (e.g., whether Donald Trump won/lost the 2020 election), what is “helpful” (e.g., liking or not liking some generated take on the ills of 2023 social media, or lack thereof), what is “harmless” (e.g., taking offense–or not–to a generated slur).

    None of this is meant to say that OpenAI did the wrong thing, or even that they could have done better, in training these labelers; rather, they themselves realize that there is no such thing as alignment to “broader human values.” In this sense, no model could ever be trained with the same RLHF twice – even when the training configuration looks the same. I wonder (in full speculation) whether we trust the human in the loop because somehow, madly, we fantasize our own selves into the trainers’ chairs and minds.
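    To make the “never the same RLHF twice” point concrete, here is a toy sketch. Everything in it is hypothetical: the two cohorts, their votes, and the majority-vote rule are assumptions for illustration, not a description of how any lab actually aggregates labels.

```python
from collections import Counter


def majority_preference(votes: list[str]) -> str:
    """Return "A" or "B", whichever completion more labelers preferred, or "tie"."""
    counts = Counter(votes)
    if counts["A"] == counts["B"]:
        return "tie"
    return "A" if counts["A"] > counts["B"] else "B"


# Hypothetical votes from two labeler cohorts on the SAME pair of completions.
cohort_1_votes = ["A", "A", "B", "A", "B"]
cohort_2_votes = ["B", "B", "A", "B", "B"]

label_1 = majority_preference(cohort_1_votes)
label_2 = majority_preference(cohort_2_votes)
print(label_1, label_2)
```

    Same prompt, same completions, same procedure, and the reward model would be handed opposite training labels depending on who happened to be in the room.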

    Whatever the reasons we like RLHF, it probably has to do with the fact that “human” is in the name. This suggests an adult being present who before was absent, as if they–we–are taking back the wheel of a self-driving car on the cusp of a crash. But the truth is that humans were driving all along, despite ideas you might get from The Bitter Lesson that we backed ourselves out of the process. Besides humans needing to have developed the transformer architecture in the first place and tuned its hyperparameters just right, every token of natural language in any GPT-X training set was written by a human being.

    Evaluating models one at a time

    So what does all this mean? For one, I’m not suggesting that we need to do away with any ambiguity in the definitions of our words before we can make progress on the problems they describe. Nor am I suggesting that we need to describe every detail about the state of the world in our papers. These are impractical, if not intractable, propositions. But it is nearly impossible to define alignment, let alone achieve it. Human beings, in all their variance, capacity, and folly, make up the entire training pipeline, and we take such comfort in the fact that we call them all by the same name. But every human is individual, and models will change as their trainers do. Thus, they will succeed and break differently. As such, anyone using these models needs to find a balance between (a) simply accepting the risk that a candidate model will mess up and (b) evaluating a candidate model as deeply as possible. This is true in all of machine learning, and a big point of this essay is that RLHF won’t change that.

    Evaluating a model deeply means reading the papers and understanding how models were trained, how they were tested, what the results were, and the other details that will have bearing on their performance and generalization. For example, with InstructGPT, it’s worth going to the appendix and noting that there were 40 labelers who were “quite young (75% less than 35 years old), fairly balanced between male and female genders, and mostly come from the US or Southeast Asia”. If you consider that a younger generation has a different idea than an older generation about what makes a text harmful, truthful, or helpful, then you might be more nervous deploying that model to an older group of users. Or if you consider obtaining a terminal degree to be important for some task–say, conducting literature reviews or answering questions about academic papers–then you should know that 0% of the labelers had a Ph.D.

    To use models, or talk about them authoritatively enough to get by with the vast majority of entrepreneurs or investors, is much easier than deeply understanding their training methodologies. The latter takes humility, hard work, and patience; insomuch as it’s even possible, it’s required for assessing or estimating performance, generalization, and robustness of these models. In this sense, RLHF looks a lot more like the rest of love, business, and life itself than “general alignment” or “the end of prompt engineering.” But it does make quite the chatbot!
