source: "raw/articles/state-of-rl-for-reasoning-llms-or-a-weers.md"

Summary: State of RL for reasoning LLMs

TL;DR: A comprehensive overview of reinforcement learning methods for improving LLM reasoning (2024-2026), tracing the evolution from PPO through critic-free methods such as GRPO and RLOO to recent refinements targeting trust regions and loss aggregation.

Key Points

Historical progression: PPO dominated first-generation RLHF, but second-generation methods removed the memory-intensive critic component
GRPO breakthrough: Replaced learned value functions with group-relative baselines, cutting memory usage ~50% while maintaining performance
Core insight: LLM fine-tuning differs from traditional RL: models start from a pre-trained policy rather than a random one, making PPO's critic-based variance reduction largely unnecessary
Normalization bias: Standard deviation normalization (dividing by σ) consistently hurts performance by over-weighting low-variance prompts such as nearly-solved problems
Loss aggregation matters: Sequence-level rewards combined with sample-level averaging create a length bias favoring verbose incorrect responses
Trust region evolution: Methods increasingly target softer/smarter trust regions rather than PPO's hard ratio clipping
Sample efficiency challenge: Current methods require 8-64 rollouts per prompt to form relative baselines, which is expensive when verification is costly
Domain limitations: Most progress limited to math/code with cheap verification; extension to subjective domains remains difficult
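
The critic-free baselines in the key points above can be compared on a toy example. The following is a minimal sketch in plain Python, not code from the article: the function names and the binary rewards are hypothetical, and `eps` is an assumed small constant that avoids division by zero when all rollouts receive the same reward.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages with std normalization (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group of rollouts
    return [(r - mu) / (sigma + eps) for r in rewards]


def mean_centered_advantages(rewards):
    """Subtract the group mean only, with no division by sigma (Dr. GRPO-style)."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def rloo_advantages(rewards):
    """Leave-one-out baseline: compare each rollout with the mean of the others."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Hypothetical binary rewards for 8 rollouts of two prompts.
nearly_solved = [1, 1, 1, 1, 1, 1, 1, 0]
half_solved = [1, 1, 1, 1, 0, 0, 0, 0]

for name, rewards in [("nearly solved", nearly_solved), ("half solved", half_solved)]:
    print(name)
    print("  std-normalized:", [round(a, 3) for a in grpo_advantages(rewards)])
    print("  mean-centered: ", [round(a, 3) for a in mean_centered_advantages(rewards)])
    print("  leave-one-out: ", [round(a, 3) for a in rloo_advantages(rewards)])
```

With std normalization, the squared advantages of any mixed-outcome group sum to roughly the group size, so a low-variance prompt is scaled up relative to plain mean-centering. The leave-one-out advantage equals the mean-centered one multiplied by n/(n-1), so the two differ only by a constant factor for a fixed group size.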

Concepts Covered

REINFORCE — foundational policy gradient method, essentially weighted SFT
PPO — dominant first-generation method with importance sampling and trust regions
GRPO — group-relative baseline replacing critic, memory-efficient breakthrough
RLOO — leave-one-out baseline, pure REINFORCE without clipping
Dr. GRPO — fixes length bias and standard deviation normalization issues
DAPO — asymmetric clipping, token-level aggregation, dynamic sampling
CISPO — clips weights not gradients, preserves learning on high-information tokens
DPPO — divergence-based trust regions instead of probability ratios
MaxRL — interpolates between RL and maximum likelihood, improves pass@k
ScaleRL — large-scale validation showing asynchronous training and FP32 benefits
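
Two of the refinements listed above, loss aggregation and asymmetric clipping, can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions rather than any paper's full objective: the function names are hypothetical, the clipping function is the PPO-style clipped surrogate for a single token with separate lower and upper bounds, and the epsilon values are illustrative defaults.

```python
def per_token_weights(lengths, advantages, mode):
    """Weight that each token of each response receives in the group loss.

    mode="sample": average tokens within a response, then average responses.
    mode="token":  average over all tokens in the group at once.
    """
    n = len(lengths)
    total_tokens = sum(lengths)
    if mode == "sample":
        return [a / (n * length) for a, length in zip(advantages, lengths)]
    if mode == "token":
        return [a / total_tokens for a in advantages]
    raise ValueError(f"unknown mode: {mode}")


def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """PPO-style clipped objective for one token.

    Setting eps_high > eps_low gives asymmetric clipping, which leaves more
    room to raise the probability of tokens with positive advantage.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


# Two incorrect responses with the same negative advantage but different lengths.
lengths = [100, 1000]
advantages = [-1.0, -1.0]
print("sample-level:", per_token_weights(lengths, advantages, "sample"))
print("token-level: ", per_token_weights(lengths, advantages, "token"))

# Same probability ratio, symmetric versus asymmetric upper bound.
print("symmetric:   ", clipped_surrogate(1.25, 1.0))
print("asymmetric:  ", clipped_surrogate(1.25, 1.0, eps_high=0.28))
```

Under sample-level averaging, each token of the longer incorrect response is penalized ten times less than each token of the shorter one, which is the length bias noted in the key points. Token-level averaging gives every token in the group the same weight for a given advantage.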

Images

Figure 1 (img-0.jpg): REINFORCE algorithm illustration showing policy gradient flow
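
The description of REINFORCE as weighted SFT can be made concrete with the standard policy-gradient estimator. The notation below is assumed for illustration and is not taken from the article: x is a prompt drawn from a dataset D, y is a response sampled from the policy, R is a scalar reward, and b is an optional baseline.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \Big[ \big( R(x, y) - b(x) \big)\, \nabla_\theta \log \pi_\theta(y \mid x) \Big]
```

If the weight R(x, y) - b(x) is fixed to 1 and y is a fixed reference response instead of a sample, the term inside the expectation is the SFT log-likelihood gradient. REINFORCE differs in that it samples responses from the current policy and weights each one by its baseline-adjusted reward.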
