source: "raw/articles/state-of-rl-for-reasoning-llms-or-a-weers.md"

Summary: State of RL for reasoning LLMs

TL;DR: An overview of reinforcement learning methods for improving LLM reasoning (2024-2026), tracing the evolution from PPO to critic-free methods such as GRPO and RLOO, and on to recent refinements that target trust regions and loss aggregation.

Key Points

  • Historical progression: PPO dominated first-generation RLHF, but second-generation methods removed the memory-intensive critic component
  • GRPO breakthrough: Replaced learned value functions with group-relative baselines, cutting memory usage ~50% while maintaining performance
  • Core insight: LLM fine-tuning differs from traditional RL because the model starts from a pre-trained policy rather than a random one, which makes PPO's critic-based variance reduction largely unnecessary
  • Normalization bias: Standard deviation normalization (dividing by σ) consistently hurts performance by over-weighting prompts whose rewards have low variance, such as nearly-solved problems
  • Loss aggregation matters: Combining sequence-level rewards with sample-level averaging creates a length bias that favors verbose incorrect responses
  • Trust region evolution: Methods increasingly target softer/smarter trust regions rather than PPO's hard ratio clipping
  • Sample efficiency challenge: Current methods require 8-64 rollouts per prompt to form relative baselines, which is expensive when verification is costly
  • Domain limitations: Most progress is limited to math and code, where verification is cheap; extending to subjective domains remains difficult
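
The normalization and loss-aggregation points above can be made concrete with a small sketch. It assumes binary (0/1) rewards and a plain REINFORCE-style loss. The function names `group_advantages` and `per_token_weights` are hypothetical, chosen for illustration, and the code is not taken from the article or from any library.

```python
from statistics import mean, pstdev


def group_advantages(rewards, normalize_std=False, eps=1e-6):
    """Group-relative advantages for the rollouts of ONE prompt.

    normalize_std=True also divides by the group's standard deviation
    (the normalization the summary says hurts performance);
    normalize_std=False keeps only the mean-centering step.
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize_std:
        return centered
    sigma = pstdev(rewards)  # population std of the group
    return [c / (sigma + eps) for c in centered]


def per_token_weights(advantages, lengths, mode="sample"):
    """Weight that each token's log-prob gradient receives in the loss.

    mode="sample": average over tokens inside each response, then
                   average over responses (sample-level averaging).
    mode="token":  a single average over every token in the batch.
    """
    n = len(advantages)
    if mode == "sample":
        return [a / (length * n) for a, length in zip(advantages, lengths)]
    if mode == "token":
        total_tokens = sum(lengths)
        return [a / total_tokens for a in advantages]
    raise ValueError(f"unknown mode: {mode}")


# Toy groups (assumed data): 7 of 8 rollouts correct vs. 4 of 8 correct.
nearly_solved = [1, 1, 1, 1, 1, 1, 1, 0]
half_solved = [1, 1, 1, 1, 0, 0, 0, 0]

# Two wrong answers (100 and 1000 tokens) and two right ones (100 tokens each).
advs = [-0.5, -0.5, 0.5, 0.5]
lens = [100, 1000, 100, 100]
sample_w = per_token_weights(advs, lens, mode="sample")
token_w = per_token_weights(advs, lens, mode="token")
```

With these toy groups, dividing by σ raises the nearly-solved prompt's share of total advantage magnitude relative to the half-solved prompt. Under sample-level averaging, each token of the 1000-token wrong answer receives a tenth of the penalty applied to each token of the 100-token wrong answer, which is the length bias described above. Token-level averaging gives both the same per-token weight.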

Concepts Covered

  • REINFORCE — foundational policy gradient method, essentially weighted SFT
  • PPO — dominant first-generation method with importance sampling and trust regions
  • GRPO — group-relative baseline replacing critic, memory-efficient breakthrough
  • RLOO — leave-one-out baseline, pure REINFORCE without clipping
  • Dr. GRPO — fixes length bias and standard deviation normalization issues
  • DAPO — asymmetric clipping, token-level aggregation, dynamic sampling
  • CISPO — clips weights not gradients, preserves learning on high-information tokens
  • DPPO — divergence-based trust regions instead of probability ratios
  • MaxRL — interpolates between RL and maximum likelihood, improves pass@k
  • ScaleRL — large-scale validation showing asynchronous training and FP32 benefits
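
Two of the mechanisms listed above are small enough to sketch directly: the leave-one-out baseline behind RLOO and the ratio clipping used by PPO, with an optional asymmetric range in the spirit of DAPO. This is a minimal illustration in plain Python rather than any paper's reference implementation. The function names are hypothetical, and the clip values (0.2 and 0.28) are illustrative assumptions.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for the rollouts of ONE prompt.

    The baseline for rollout i is the mean reward of the OTHER rollouts,
    so no learned critic is needed.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two rollouts per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """PPO-style per-token objective (to be maximized).

    ratio is pi_new(token) / pi_old(token). Setting eps_high > eps_low
    widens the upper bound only, giving an asymmetric clip range.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Taking the minimum keeps the pessimistic (lower) estimate.
    return min(ratio * advantage, clipped * advantage)


# Toy data (assumed): four rollouts, two of them correct.
loo = rloo_advantages([1, 0, 0, 1])

# A token whose probability rose by 25% under a positive advantage.
symmetric = clipped_surrogate(1.25, 1.0)
asymmetric = clipped_surrogate(1.25, 1.0, eps_high=0.28)
```

For rewards [1, 0, 0, 1] the leave-one-out advantages are ±2/3, which is k/(k-1) times the mean-centered advantage. A ratio of 1.25 with a positive advantage is cut to 1.2 under the symmetric range, so its gradient vanishes, but it passes unclipped when `eps_high` is 0.28.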

Images

  • Figure 1 (img-0.jpg): REINFORCE algorithm illustration showing policy gradient flow

Related Concepts