source: "raw/articles/generalizable-end-to-end-tool-use-rl-with-synthetic-codegym.md"

Summary: Generalizable End-to-End Tool-Use RL with Synthetic CodeGym

TL;DR: Researchers introduce CodeGym, a scalable framework that converts coding problems into interactive multi-turn tool-use environments for training LLM agents via reinforcement learning, achieving significant out-of-distribution generalization improvements.

Key Points

  • CodeGym Framework: Transforms static coding problems into interactive RL environments by extracting atomic functions/logic into callable tools
  • Scale: Contains 13,116 environments with 86,165 task configurations, the largest among agent training benchmarks
  • Environment Design: Uses a POMDP formulation with sparse binary rewards; environments average 6.52 tools each, and solutions average 44.07 steps
  • Verification Process: Two-stage pipeline (synthesis + verification) with pass@10 strategy to ensure environment solvability
  • Training Results: Models from 7B to 72B parameters show consistent improvements across tool-use, multi-turn interaction, and reasoning tasks
  • Performance Gains: Qwen2.5-32B-Instruct achieves an 8.7-point accuracy improvement on τ-Bench
  • Generalization: RL training significantly outperforms supervised fine-tuning on out-of-distribution benchmarks
  • Quality Control: Filtering by tool-use complexity (10-256 tool calls) and difficulty (≤25% base model pass rate)
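
To make the environment design and quality-control points above concrete, here is a minimal sketch of what a binary-search problem might look like once its atomic operations are exposed as tools. Everything in it is an illustrative assumption, not the CodeGym API: the class BinarySearchEnv, the tool names length, compare, and submit, and the helpers oracle_solve and keep_task are hypothetical. Only the sparse binary reward, the 10 to 256 tool-call range, and the 25% base-model pass-rate threshold come from the summary.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class BinarySearchEnv:
    """Hypothetical tool-use environment: locate `target` in a hidden sorted list."""
    data: List[int]
    target: int
    max_steps: int = 256
    steps: int = 0
    done: bool = False

    def step(self, tool: str, **kwargs: Any) -> Tuple[Any, float, bool]:
        """Call one tool and return (observation, reward, done).

        The agent never sees `data` directly (partial observability); it only
        sees tool outputs. Reward is sparse and binary: 0.0 on every call
        except a correct `submit`, which yields 1.0.
        """
        if self.done:
            raise RuntimeError("episode already finished")
        self.steps += 1
        if tool == "submit":
            self.done = True
            expected = self.data.index(self.target) if self.target in self.data else -1
            return None, float(kwargs.get("answer") == expected), True
        if tool == "length":
            obs: Any = len(self.data)
        elif tool == "compare":
            value = self.data[kwargs["index"]]
            # -1 if the element is below the target, 0 if equal, 1 if above.
            obs = (value > self.target) - (value < self.target)
        else:
            obs = f"unknown tool: {tool}"
        if self.steps >= self.max_steps:
            self.done = True  # step budget exhausted without a submission
        return obs, 0.0, self.done


def oracle_solve(env: BinarySearchEnv) -> float:
    """Reference policy: a standard binary search expressed purely as tool calls."""
    n, _, _ = env.step("length")
    lo, hi, answer = 0, n - 1, -1
    while lo <= hi:
        mid = (lo + hi) // 2
        cmp, _, _ = env.step("compare", index=mid)
        if cmp == 0:
            answer = mid
            break
        if cmp < 0:
            lo = mid + 1
        else:
            hi = mid - 1
    _, reward, _ = env.step("submit", answer=answer)
    return reward


def keep_task(oracle_tool_calls: int, base_pass_rate: float) -> bool:
    """Quality filter as summarized: 10-256 tool calls and at most a 25% base-model pass rate."""
    return 10 <= oracle_tool_calls <= 256 and base_pass_rate <= 0.25


env = BinarySearchEnv(data=list(range(0, 200, 2)), target=42)
reward = oracle_solve(env)
```

The oracle's step count (env.steps) could stand in for the "tool calls" measure used by the filter, though how CodeGym actually counts calls and estimates base-model pass rates is not specified in this summary.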

Concepts Covered

Figures and Images

  • Figure 1: Overview showing transformation of coding problems into interactive environments for RL training
  • Figure 2: CodeGym generation pipeline with synthesis and verification stages
  • Figure 3: Example environment showing binary search problem converted to tool-use task
  • Figure 4: Statistics showing distribution of tools (6.52 avg) and steps (44.07 avg) per environment
  • Figure 5: Distributed RL training architecture with environment server and rollout workers
  • Figure 6: Training curves showing performance progression across model sizes
  • Figure 7: Evolution of tool call behavior during training, increasing toward oracle solutions

Related Concepts