source: "raw/articles/generalizable-end-to-end-tool-use-rl-with-synthetic-codegym.md"
Summary: Generalizable End-to-End Tool-Use RL with Synthetic CodeGym
TL;DR: Researchers introduce CodeGym, a scalable framework that converts coding problems into interactive multi-turn tool-use environments for training LLM agents via reinforcement learning, achieving significant out-of-distribution generalization improvements.
Key Points
- CodeGym Framework: Transforms static coding problems into interactive RL environments by extracting atomic functions/logic into callable tools
- Scale: Contains 13,116 environments with 86,165 task configurations, reported as the largest among agent training benchmarks
- Environment Design: Uses a POMDP formulation with sparse binary rewards; environments average 6.52 tools each and solutions average 44.07 steps
- Verification Process: Two-stage pipeline (synthesis + verification) with a pass@10 strategy to ensure environment solvability
- Training Results: Models from 7B to 72B parameters show consistent improvements across tool-use, multi-turn interaction, and reasoning tasks
- Performance Gains: Qwen2.5-32B-Instruct achieves an 8.7-point accuracy improvement on the τ-Bench benchmark
- Generalization: RL training significantly outperforms supervised fine-tuning on out-of-distribution benchmarks
- Quality Control: Filtering by tool-use complexity (10-256 tool calls) and difficulty (≤25% base model pass rate)
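To make the environment design and quality-control points concrete, here is a minimal Python sketch of how a binary search problem might be exposed as a tool-use environment with a sparse binary reward. It is an illustration under stated assumptions, not CodeGym's actual code: the class `BinarySearchEnv`, the tool names `get_length`, `get_value`, and `submit`, the step budget, and the helper `keep_environment` are all hypothetical. Only the filter thresholds (10-256 tool calls, at most 25% base model pass rate) come from the summary above; how those quantities are measured is assumed.

```python
from dataclasses import dataclass, field


@dataclass
class BinarySearchEnv:
    """Hypothetical tool-use environment derived from a binary search problem."""
    values: list[int]      # hidden state: sorted, distinct; never shown directly
    target: int            # value to locate (assumed to be part of the task prompt)
    max_steps: int = 64    # illustrative step budget
    steps: int = field(default=0, init=False)
    done: bool = field(default=False, init=False)

    # Tools the agent may call (names are illustrative).
    def get_length(self) -> int:
        return len(self.values)

    def get_value(self, index: int) -> int:
        return self.values[index]

    def submit(self, index: int) -> float:
        """Terminal tool: reward is 1.0 only for the correct index (-1 if absent)."""
        self.done = True
        expected = self.values.index(self.target) if self.target in self.values else -1
        return 1.0 if index == expected else 0.0

    def step(self, tool: str, **kwargs):
        """Execute one tool call and return (observation, reward, done)."""
        if self.done or self.steps >= self.max_steps:
            self.done = True
            return None, 0.0, True
        self.steps += 1
        if tool == "submit":
            return None, self.submit(**kwargs), True
        if tool not in ("get_length", "get_value"):
            return f"unknown tool: {tool}", 0.0, False
        # Intermediate tool calls always yield zero reward (sparse signal).
        return getattr(self, tool)(**kwargs), 0.0, False


def run_oracle(env: BinarySearchEnv, target: int) -> tuple[float, int]:
    """Scripted binary search via tool calls; returns (reward, tool calls used)."""
    n, _, _ = env.step("get_length")
    lo, hi, answer = 0, n - 1, -1
    while lo <= hi:
        mid = (lo + hi) // 2
        value, _, _ = env.step("get_value", index=mid)
        if value == target:
            answer = mid
            break
        if value < target:
            lo = mid + 1
        else:
            hi = mid - 1
    _, reward, _ = env.step("submit", index=answer)
    return reward, env.steps


def keep_environment(tool_calls: int, base_pass_rate: float) -> bool:
    """Quality filter using the thresholds listed above (measurement assumed)."""
    return 10 <= tool_calls <= 256 and base_pass_rate <= 0.25
```

The agent observes only tool outputs, never the underlying list, which is what makes the setting partially observable. Reward arrives only at `submit`, so credit for the intermediate calls has to be inferred by the RL algorithm.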
Concepts Covered
- Reinforcement Learning — Core training methodology, using the GRPO algorithm to train tool-use agents
- Tool-Augmented Language Models — LLMs extended with external tool capabilities for enhanced problem-solving
- Partially Observable Markov Decision Process — Environment formalization with hidden states and sparse rewards
- Out-of-Distribution Generalization — Key evaluation metric showing transfer to unseen domains and workflows
- Multi-Turn Interaction — Sequential tool calling patterns that mirror real-world agent workflows
- Synthetic Environment Generation — Automated creation of training environments from coding problems
- Chain-of-Thought Reasoning — Short-CoT and long-CoT models are compared across different training configurations
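As a rough illustration of how GRPO can turn sparse binary rewards into a learning signal, the sketch below computes group-relative advantages: several rollouts are sampled for the same task, and each rollout's reward is normalized by the group's mean and standard deviation. The function name, the `eps` term, and the use of the population standard deviation are assumptions made for this example; the summary does not specify the paper's exact normalization.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its sampling group.

    rewards: one scalar reward per rollout of the same task (e.g. 0.0 or 1.0).
    eps: small constant (assumed) that avoids division by zero when all
         rollouts in the group receive the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; a choice made for this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# One successful rollout out of four gets a positive advantage,
# while the three failures share the same negative advantage.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

When every rollout in a group fails or every rollout succeeds, all advantages are zero and the group contributes no gradient. That is one plausible reason for filtering environments by base model pass rate, since tasks that are always solved or never solved carry little signal.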
Figures and Images
- Figure 1 (raw/articles/generalizable-end-to-end-tool-use-rl-with-synthetic-codegym.md): Overview showing transformation of coding problems into interactive environments for RL training
- Figure 2: CodeGym generation pipeline with synthesis and verification stages
- Figure 3: Example environment showing binary search problem converted to tool-use task
- Figure 4: Statistics showing distribution of tools (6.52 avg) and steps (44.07 avg) per environment
- Figure 5: Distributed RL training architecture with environment server and rollout workers
- Figure 6: Training curves showing performance progression across model sizes
- Figure 7: Evolution of tool call behavior during training, increasing toward oracle solutions