Reinforcement Learning with Human Feedback

I haven't even used Hugging Face that much

I would love to think of the ICML 2023 conference in Honolulu as a pivotal event for my Ph.D. research. During that time I was kind of talking to Reagan and having the time of my life. I remember the conference had a workshop session on sampling over discrete spaces. It felt serendipitous, like the universe was pointing me toward somewhere significant, much like how my brilliant colleague Tao pointed to the bright future here, haha.

The session was ultimately about how to make use of massive structured discrete data, which has been a major interest of companies such as Google. The reason behind this was the killer success story of ChatGPT, which popularized a bunch of subroutines such as SFT and RLHF (reinforcement learning with human feedback). Back at school, I told my advisor I wanted to write a paper about leveraging user preference data to improve user satisfaction, and about how significant the theoretical contribution could be. His immediate reaction was ‘‘you do not have the data, and those companies will do astronomically better than you’’, no shit bro.

RLHF Fundamentals

Why Alignment?

Alignment is all about ensuring that LLMs behave the way you want. Think of Asimov’s Three Laws of Robotics, only instead of robots, we’re taming text generators. Sometimes you want the model to obey. Other times, to be creative. And often, to just not hallucinate.


What is RLHF?

RLHF happens in three acts:

  1. Supervised Fine-Tuning (SFT):
    Train the model on quality prompt–response pairs $\{ (x^{(i)}, y^{*(i)}) \}_{i=1}^N$ (a minimal code sketch of this step and of step 3 follows this list). Maximize:
    $$ \sum_{i=1}^{N} \sum_{t} \log p_\theta\!\left( y_t^{*(i)} \,\middle|\, y_{<t}^{*(i)},\, x^{(i)} \right) $$

  2. Reward Model Training:
    Build a reward model $r_\phi(x, y)$ that learns what humans prefer by ranking outputs. You use pairwise data like:
    $$ D_{\text{pref}} = \{ (x, y_A, y_B, \text{pref}_{AB}) \} $$

  3. Policy Optimization (usually PPO):
    Use RL (e.g., Proximal Policy Optimization) to update the model:
    $$ \max_{\pi_\theta} \; \mathbb{E}_{x,\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_\phi(x, y) \right] - \lambda\, D_{\text{KL}}\!\left( \pi_\theta \,\|\, \pi_{\text{ref}} \right) $$
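
To make steps 1 and 3 a bit more concrete, here is a minimal PyTorch-style sketch, assuming a Hugging Face-style causal LM whose forward pass returns `.logits`. `sft_loss` is the masked next-token objective from step 1, and `kl_penalized_reward` is a single-sample estimate of step 3's KL-penalized objective; all names and shapes are placeholders of mine, not anyone's official implementation.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, prompt_lens):
    """Step 1: cross-entropy on the gold response tokens only.

    input_ids:   (B, T) prompt + gold response y*, already tokenized
    prompt_lens: list of prompt lengths, one per example
    """
    logits = model(input_ids).logits                  # (B, T, V)
    shift_logits = logits[:, :-1, :]                  # predict token t from tokens < t
    shift_labels = input_ids[:, 1:].clone()
    for i, p in enumerate(prompt_lens):               # mask prompt positions so only y* counts
        shift_labels[i, : p - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

def kl_penalized_reward(reward, policy_logp, ref_logp, lam=0.05):
    """Step 3: r_phi(x, y) - lambda * KL, using log pi_theta(y|x) - log pi_ref(y|x)
    as a single-sample estimate of the KL term for a sampled response y."""
    return reward - lam * (policy_logp - ref_logp)
```

In most PPO-style implementations the KL penalty is actually applied per token rather than per sequence, but the shape of the objective is the same.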

Basically, before you use RL as an end-to-end method to obtain a good generation policy, you need to prepare a reward model for it from all the discrete preference data. You typically have two options:

  • Bradley–Terry (BT) Model: This models pairwise binary preferences (see the training sketch after this list):
    $$ \Pr(y_A \succ y_B | x) = \frac{\exp(r_\phi(x, y_A))}{\exp(r_\phi(x, y_A)) + \exp(r_\phi(x, y_B))} $$

  • Plackett–Luce (PL) Model: This generalizes to full rankings.
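
For the BT option, here is a minimal training-loss sketch, assuming a hypothetical `reward_model` that maps tokenized (prompt + response) sequences to one scalar score per sequence:

```python
import torch.nn.functional as F

def bt_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry negative log-likelihood on a batch of preference pairs.

    chosen_ids / rejected_ids: (B, T) tokenized prompt + preferred / dispreferred response.
    reward_model(...) is assumed to return a (B,) tensor of scalar scores.
    """
    r_chosen = reward_model(chosen_ids)       # r_phi(x, y_A)
    r_rejected = reward_model(rejected_ids)   # r_phi(x, y_B)
    # Pr(y_A > y_B | x) = sigmoid(r_A - r_B), so minimize -log sigmoid(r_A - r_B)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The PL case just swaps the pairwise sigmoid for a softmax likelihood over full rankings.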

But here’s the thing—reward models can be wrong. And LLMs are great at gaming the reward (a.k.a. reward hacking).

  • Reward Model Misalignment:
    Your $ r_\phi $ may not match your actual preferences.

  • Covariate Shift:
    The policy drifts, and $r_\phi$ hasn’t seen those new outputs.

  • Sample Inefficiency:
    PPO takes forever and needs a lot of human annotations.

You can skip the reward model. Go straight for the preferences. Direct Preference Optimization (DPO) keeps the frozen SFT model around as a reference policy $\pi_{\text{ref}}$ and directly optimizes: $$ L_{\text{DPO}}(\theta) = - \sum_{(x, y^+, y^-)} \log \sigma\!\left(\beta \left[ \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\text{ref}}(y^+ \mid x)} - \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\text{ref}}(y^- \mid x)} \right] \right) $$

Where $\sigma(z) = \frac{1}{1 + e^{-z}}$. DPO is not RL, not at all, but you may interpret the log-ratio as a latent reward:
$$ r_\theta(x, y) := \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}, $$ and that is where people started to use titles like ‘‘your xxx is a latent xxx’’ for some academic bullshit. Yeah, my PhD life is actually a huge waste of my life, but I had no better things to do with it anyways.
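
Here is a minimal sketch of that loss computed from precomputed sequence log-probabilities; `beta` and all the variable names are placeholders, with the chosen/rejected log-probs assumed to come from the trainable policy and a frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_pos, policy_logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO loss on a batch, from (B,) tensors of sequence log-probabilities.

    policy_logp_* : log pi_theta(y|x) under the trainable policy
    ref_logp_*    : log pi_ref(y|x)   under the frozen reference model
    """
    # implicit rewards: r_theta(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
    chosen_reward = beta * (policy_logp_pos - ref_logp_pos)
    rejected_reward = beta * (policy_logp_neg - ref_logp_neg)
    # -log sigma(r(y+) - r(y-)), averaged over the batch
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```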

Nash Learning from Human Feedback (NLHF)

Apparently, once you have a preference model, you can also go game-theoretic. In NLHF, we don’t just want a good policy; we want a Nash equilibrium in the space of preferences. This Nash equilibrium defines a strategy that cannot be beaten by an ‘‘alternative self’’.

We define:

  • A preference model $P(y \succ y' \mid x) \in [0, 1]$
  • Policy preference:
    $$ P(\pi \succ \pi') = \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x),\, y' \sim \pi'(\cdot \mid x)}\big[ P(y \succ y' \mid x) \big] $$
  • Goal: find $\pi^*$ such that:
    $$ P(\pi^* \succ \pi') \geq \frac{1}{2}, \quad \forall \pi' $$

This can be solved with Mirror Descent or Policy Gradients: $$ \pi_{n+1} = \arg\max_\pi \; \langle \nabla_\pi J(\pi_n, \pi_n'), \pi - \pi_n \rangle - \frac{1}{\eta} D_{\text{KL}}(\pi \,\|\, \pi_n). $$ Games like this are notoriously solvable via no-regret methods, so you can come up with all kinds of statistical bounds that ultimately boil down to $\mathcal{O}(\sqrt{T})$ (or maybe a little different), or play with the reward/training structure a little more to come up with fancier algorithms, and test them on some Hugging Face datasets.
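
Here is a toy tabular sketch of that update: with the KL regularizer, the mirror descent step is just an exponentiated-gradient (multiplicative-weights) update, and averaging the iterates is the usual no-regret trick for extracting an approximate Nash policy. The preference matrix, step size, and iteration count below are made up for illustration.

```python
import numpy as np

def nash_via_mirror_descent(P, pi0, eta=0.1, n_iters=5000):
    """Self-play mirror descent on a tabular preference game.

    P[i, j] = Pr(y_i beats y_j | x), with P[i, j] + P[j, i] = 1 off the diagonal.
    Returns the averaged iterate, an approximate (symmetric) Nash policy.
    """
    pi = np.asarray(pi0, dtype=float)
    avg = np.zeros_like(pi)
    for _ in range(n_iters):
        winrate = P @ pi                 # P(y_i > pi_n) = sum_j pi_n[j] * P[i, j]
        pi = pi * np.exp(eta * winrate)  # KL-regularized mirror descent step
        pi /= pi.sum()                   # renormalize onto the simplex
        avg += pi
    return avg / n_iters

# toy example: a cyclic (rock-paper-scissors-like) preference matrix,
# whose symmetric Nash policy is uniform over the three responses
P = np.array([[0.5, 0.9, 0.1],
              [0.1, 0.5, 0.9],
              [0.9, 0.1, 0.5]])
print(nash_via_mirror_descent(P, pi0=[0.7, 0.2, 0.1]))   # roughly [1/3, 1/3, 1/3]
```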

Evaluations on tasks like TL;DR summarization show that:

  • NLHF beats RLHF in win-rate by ~5%
  • Using direct preference models improves ranking accuracy:
    • Gemma-2B: 74.2% → 80.7%
    • LLaMA3-8B: 87.8% → 94.8%

And… maybe it converges with fewer samples.

We are going to come back to this with a discussion of whether RLHF is actually useful. Now it’s the fall of 2025, and I already find myself so naive for wanting to do LLM alignment by myself; the resources and effort required would have overwhelmed me, no doubt. A bigger question is: out of the many papers that sprang up over the last few years, what value can individual researchers actually create?