source: "raw/articles/state-of-rl-for-reasoning-llms-or-a-weers.md"
Summary: State of RL for reasoning LLMs
TL;DR: A comprehensive overview of reinforcement learning methods for improving LLM reasoning (2024-2026), tracing the evolution from PPO through critic-free methods such as GRPO and RLOO to recent refinements targeting trust regions and loss aggregation.
Key Points
- Historical progression: PPO dominated first-generation RLHF, but second-generation methods removed the memory-intensive critic component
- GRPO breakthrough: Replaced learned value functions with group-relative baselines, cutting memory usage ~50% while maintaining performance
- Core insight: Unlike traditional RL, LLM fine-tuning starts from a pre-trained model rather than a randomly initialized one, making PPO's critic-based variance reduction largely unnecessary
- Normalization bias: Standard deviation normalization (dividing by σ) consistently hurts performance by over-weighting nearly-solved problems
- Loss aggregation matters: Sequence-level rewards combined with sample-level averaging create a length bias that favors verbose incorrect responses
- Trust region evolution: Methods increasingly target softer/smarter trust regions rather than PPO's hard ratio clipping
- Sample efficiency challenge: Current methods require 8-64 rollouts per prompt to form relative baselines, which becomes expensive when verification is costly
- Domain limitations: Most progress is limited to math and code, where verification is cheap; extension to subjective domains remains difficult
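The group-relative baseline, the σ-normalization bias, and the length bias from sample-level averaging can be illustrated with a small sketch. It is not taken from the article: the function names (`group_advantages`, `per_token_weights`, `total_signal`) and the binary example rewards are hypothetical, and the code only computes the weights each rollout or token would receive, not a full training loss.

```python
from statistics import mean, pstdev


def group_advantages(rewards, normalize_std=False, eps=1e-6):
    """Advantages for one prompt: each reward minus the group's mean reward.

    With normalize_std=True the centered rewards are also divided by the
    group's standard deviation (the normalization the summary calls harmful).
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize_std:
        return centered
    sigma = pstdev(rewards)
    return [c / (sigma + eps) for c in centered]


def per_token_weights(advantages, lengths, mode):
    """Weight applied to every token of each response under two aggregations.

    mode="sample": average over tokens within a response, then over responses.
    mode="token":  average over all tokens in the group at once.
    """
    if mode == "sample":
        group_size = len(advantages)
        return [a / (group_size * n) for a, n in zip(advantages, lengths)]
    if mode == "token":
        total_tokens = sum(lengths)
        return [a / total_tokens for a in advantages]
    raise ValueError(f"unknown mode: {mode}")


def total_signal(advantages):
    """Sum of absolute advantages: a rough proxy for a prompt's update size."""
    return sum(abs(a) for a in advantages)


# Hypothetical binary rewards for 8 rollouts of two prompts.
nearly_solved = [1, 1, 1, 1, 1, 1, 1, 0]
half_solved = [1, 1, 1, 1, 0, 0, 0, 0]

for normalize in (False, True):
    easy = total_signal(group_advantages(nearly_solved, normalize))
    mid = total_signal(group_advantages(half_solved, normalize))
    print(f"normalize_std={normalize}: nearly-solved / half-solved = {easy / mid:.3f}")

# Two equally wrong responses (same negative advantage), one ten times longer.
wrong = [-0.5, -0.5]
lengths = [100, 1000]
print("sample-level:", per_token_weights(wrong, lengths, "sample"))
print("token-level: ", per_token_weights(wrong, lengths, "token"))
```

In this toy setup, dividing by σ raises the nearly-solved prompt's share of the update relative to the half-solved one, and sample-level averaging gives each token of the longer wrong response a smaller penalty than each token of the shorter one, while token-level averaging penalizes both equally per token.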
Concepts Covered
- REINFORCE — foundational policy gradient method, essentially weighted SFT
- PPO — dominant first-generation method with importance sampling and trust regions
- GRPO — group-relative baseline replacing critic, memory-efficient breakthrough
- RLOO — leave-one-out baseline, pure REINFORCE without clipping
- Dr. GRPO — fixes length bias and standard deviation normalization issues
- DAPO — asymmetric clipping, token-level aggregation, dynamic sampling
- CISPO — clips weights not gradients, preserves learning on high-information tokens
- DPPO — divergence-based trust regions instead of probability ratios
- MaxRL — interpolates between RL and maximum likelihood, improves pass@k
- ScaleRL — large-scale validation showing asynchronous training and FP32 benefits
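Two of the mechanisms named above, the leave-one-out baseline and asymmetric ratio clipping, are small enough to sketch directly. This is a minimal illustration under stated assumptions rather than any paper's reference implementation: the function names are hypothetical, the clipping function is the standard PPO-style clipped surrogate for a single token (which RLOO itself does not use), and the bounds 0.2 and 0.28 are illustrative values only.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages: each rollout is compared against the
    mean reward of the *other* rollouts for the same prompt."""
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two rollouts")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """PPO-style clipped objective for one token.

    ratio is pi_new(token) / pi_old(token). Setting eps_high > eps_low gives
    an asymmetric range that leaves more room to raise token probabilities.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical rewards: one correct rollout out of four.
print(rloo_advantages([1, 0, 0, 0]))

# A token whose probability rose by 25% and has positive advantage.
print(clipped_surrogate(1.25, 1.0))                  # symmetric range
print(clipped_surrogate(1.25, 1.0, eps_high=0.28))   # wider upper bound
```

With the symmetric range the objective for this token is capped at the upper clip bound, so it contributes no further gradient; with the wider upper bound the same token is still inside the trust region and keeps contributing.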
Images
- Figure 1 (img-0.jpg): REINFORCE algorithm illustration showing policy gradient flow