Reinforcement Learning from Human Feedback (RLHF)
Jan 06, 2025 · Last modified Jan 06, 2025
It’s been quite a while since my days tinkering with ROS and OpenAI Gym environments. Reading Lilian’s recent post on reward hacking brought back memories of those late nights debugging RL algorithms and watching agents learn (or, often, fail to learn) their tasks. It also sent me back through my old survey paper and my mutual information experiment.
The More Things Change…
What strikes me most is how the fundamental challenges in RL remain remarkably consistent. While we’ve seen RL make impressive strides in language model training through RLHF[1], the core ideas and algorithms haven’t changed dramatically. The same principles of exploration vs exploitation, reward shaping, and policy optimization that we grappled with years ago are still central to today’s applications.
Lessons from Reward Hacking
Perhaps the most profound insight I’ve gained over these past four years is how reward hacking in RL mirrors human behavior and cognitive biases[2]. Just as RL agents can exploit flaws in reward functions, we humans often optimize for proxy metrics that don’t truly align with our goals - whether in business, education, or personal development.
The Human Parallel
The parallel becomes especially clear when we consider how chasing metrics can distort our sense of quality. Social media likes, quarterly profits, or even academic citations - these proxies can lead us astray from true value, much like an RL agent learning to game its reward function[1].
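To make that gap concrete, here is a deliberately tiny sketch (the action names and numbers below are invented purely for illustration, not taken from Lilian’s post): an agent that greedily maximizes the proxy metric looks great on paper while delivering little of what we actually wanted.

```python
# Toy sketch of the proxy-vs-true-objective gap. Action names and numbers
# are made up for illustration. Each action has a proxy reward (the metric
# we measure) and a true value (the thing we actually care about).
ACTIONS = {
    "chase_likes": {"proxy": 10.0, "true": 1.0},
    "deep_work":   {"proxy": 2.0,  "true": 8.0},
}

def greedy_proxy_choice():
    """Pick whichever action scores highest on the proxy metric."""
    return max(ACTIONS, key=lambda a: ACTIONS[a]["proxy"])

if __name__ == "__main__":
    choice = greedy_proxy_choice()
    print(f"chosen action: {choice}")                    # chase_likes
    print(f"proxy reward:  {ACTIONS[choice]['proxy']}")  # 10.0 -- the metric looks great
    print(f"true value:    {ACTIONS[choice]['true']}")   # 1.0  -- the actual value is low
```

Real reward hacking is subtler than a two-entry lookup table, of course, but the failure mode is the same: the optimizer is only ever as good as the signal it is given.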
The Value of Exploration
This is why the concept of exploration in RL feels increasingly relevant to human learning. Just as we introduce randomization and exploration bonuses to prevent agents from getting stuck in local optima, we too need to maintain our capacity to explore and question our metrics[1]. It’s about keeping that balance between exploiting what we know works and exploring new possibilities that might better align with our true objectives.
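For anyone who hasn’t touched this machinery in a while, the mechanics are almost embarrassingly simple. Here’s a minimal sketch assuming a tabular setting (all names and constants are illustrative): a little randomization, plus a bonus that decays as an action gets visited more often.

```python
# Rough sketch of the exploration mechanics mentioned above: epsilon-greedy
# randomization plus a count-based exploration bonus. Tabular setting assumed;
# every name and constant here is illustrative.
import math
import random
from collections import defaultdict

EPSILON = 0.1        # fraction of the time we act randomly
BONUS_SCALE = 1.0    # how strongly we favor under-visited actions

value_estimates = defaultdict(float)  # running value estimate per action
visit_counts = defaultdict(int)       # how many times each action was tried

def select_action(actions):
    """Epsilon-greedy selection with a bonus that shrinks as an action is visited more."""
    if random.random() < EPSILON:
        return random.choice(actions)  # explore: pure randomization
    def score(action):
        bonus = BONUS_SCALE / math.sqrt(visit_counts[action] + 1)
        return value_estimates[action] + bonus  # exploit, nudged toward novelty
    return max(actions, key=score)
```

The bonus term is what keeps a greedy learner from settling permanently on whatever happened to score well first.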
The whole phenomenon reminds me of how addiction works: the brain’s reward system gets hijacked, optimizing for immediate rewards while losing sight of longer-term wellbeing[1]. It’s a sobering reminder of how powerful and potentially misleading reward signals can be, whether in artificial or biological systems.
As I reflect on these parallels, I find myself oddly comforted. The challenges we face in RL - aligning proxy rewards with true objectives, balancing exploration and exploitation - aren’t just technical problems. They’re deeply human problems that we’re still learning to navigate, both in AI and in life.
Citations: [1] https://lilianweng.github.io/posts/2024-11-28-reward-hacking/