← Library
source: "raw/articles/frontier-rl-is-cheaper-than-you-think.md"
Summary: Frontier RL Is Cheaper Than You Think
TL;DR: Reinforcement learning at frontier scale is more affordable than assumed because weight updates between RL checkpoints are 98%+ sparse, enabling delta compression that reduces cross-region transfers by ~94% and makes distributed rollout fleets practical.
Key Points
- Traditional mega-cluster approach assumes you must ship full 1TB checkpoints on every policy update, but this is unnecessary
- Between consecutive RL checkpoints, >98% of weights remain bit-equivalent in bf16 format due to small learning rates and sparse RL signals
- Delta compression reduces average transfer from 1024 GiB to ~20.3 GiB (1.98% of full model), cutting bandwidth by 94%
- Asynchronous RL tolerates a few minutes of policy staleness in exchange for much better compute efficiency
- Multi-region rollout capacity becomes usable when weight updates are small and routine rather than stop-the-world events
- Fireworks supported Cursor's Composer 2 training across 3-4 clusters worldwide using this architecture
- Approach works best for frontier-scale models where trainer and rollout can't fit on one compact cluster
Concepts Covered
- Delta Compression — core technique enabling 98%+ reduction in checkpoint transfer sizes
- Asynchronous RL — training pattern that trades slight policy staleness for compute efficiency
- Multi-region Deployment — distributing rollout capacity across geographic regions and cloud providers
- Policy Staleness — the acceptable lag between trainer updates and rollout fleet policy versions
- Rollout Fleet Architecture — infrastructure pattern separating training from inference sampling
- Weight Update Sparsity — empirical observation that most weights remain unchanged between RL steps
Images and Figures
- Checkpoint Cadence diagram showing periodic full checkpoints with delta updates in between
- Delta-Compressed Weight Updates flowchart illustrating the 3-step process: identify changed weights, package tensors, reconstruct and swap
- Policy Freshness Timeline comparing async updates vs full restarts and their impact on serving gaps