source: "raw/articles/from-self-evolving-synthetic-data-to-verifiable-reward-rl-post-training-multi-tu.md"
Summary: From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents
TL;DR: A unified framework (AReaL-SEA) that combines self-evolving synthetic data generation with verifier-based reinforcement learning to train interactive tool-using agents that handle multi-turn conversations with humans and external environments.
Key Points
- Problem: Interactive tool-using agents face two bottlenecks: acquiring multi-turn tool-use dialogue data at scale, and running RL training when the user simulator produces noisy signals
- Solution: AReaL-SEA (self-evolving data synthesis) + verifier-based RL with user model fine-tuning
- Performance: On τ²-bench, Qwen3-235B achieves 73.0% pass@1 on Airline and 98.3% on Telecom, matching or exceeding frontier models
- Data Engine: Hierarchical multi-agent system with orchestration layer (designs workflows, writes prompts) and execution layer (synthesizes tasks, trajectories, and verification functions)
- Self-Evolution: Reflection loop analyzes failures and updates both synthesis and evaluation plans iteratively
- RL Innovation: GRPO with trajectory-level group-relative advantages, dynamic filtering, and fine-tuned user simulators
- User Model Issue: Off-the-shelf models behave unstably when simulating tool-using users, so the user model requires supervised fine-tuning (SFT) before it is used in RL training
- Results: SFT alone yields consistent improvements (e.g., Telecom 28.5% → 85.4%), and RL adds further gains on top (85.4% → 95.6%)
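The "RL Innovation" point above can be illustrated with a minimal sketch of trajectory-level group-relative advantages plus dynamic filtering. This is not the paper's implementation. It assumes one scalar reward per whole trajectory, the commonly used normalization (reward minus group mean, divided by group standard deviation), and a filter that drops groups whose rewards are all identical because they carry no learning signal. The function names, the `eps` parameter, and the task ids are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one group's trajectory rewards to zero mean and unit scale.

    Assumption: every trajectory in the group was sampled for the same task,
    and each reward scores the whole multi-turn trajectory, not a single turn.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def filter_and_score(groups, eps=1e-6):
    """Drop zero-variance groups, then compute advantages for the rest.

    Assumption: "dynamic filtering" is read here as skipping groups where
    every rollout got the same reward (all pass or all fail).
    """
    kept = {}
    for task_id, rewards in groups.items():
        if max(rewards) == min(rewards):
            continue  # no relative signal inside this group
        kept[task_id] = group_relative_advantages(rewards, eps)
    return kept


# Hypothetical binary verifier rewards for three tasks, four rollouts each.
groups = {
    "task_a": [1.0, 0.0, 0.0, 1.0],  # mixed outcomes: kept
    "task_b": [0.0, 0.0, 0.0, 0.0],  # all failed: filtered out
    "task_c": [1.0, 1.0, 1.0, 1.0],  # all passed: filtered out
}
advantages = filter_and_score(groups)
```

In this sketch every token of a trajectory would share that trajectory's advantage, which is one way to read "trajectory-level"; the source summary does not spell out the exact credit assignment.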
Concepts Covered
- Multi-turn Tool-using Agents — core focus on agents that interact with humans and external environments across multiple conversation turns
- Self-evolving Synthetic Data — automated data generation system that improves through reflection loops and failure analysis
- Verifiable Reward RL — reinforcement learning using executable verification functions rather than human feedback
- Interactive Agent Training — specialized techniques for training agents that must collaborate with users throughout task completion
- User Simulation — challenges and solutions for creating reliable user simulators in RL training pipelines
- Group Relative Policy Optimization — GRPO adaptation for multi-turn interactive settings with trajectory-level advantages
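To make the "Verifiable Reward RL" concept concrete, here is a small sketch of an executable verification function turning a finished trajectory into a binary reward. Everything in it is illustrative: the `Environment` class, the `cancel_reservation` tool, and the `make_verifier` helper are hypothetical stand-ins, and the assumption that verification inspects the final environment state is ours, not a detail confirmed by the summary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Environment:
    """Toy stand-in for a tool environment: reservation records keyed by id."""
    reservations: Dict[str, dict] = field(default_factory=dict)

    def cancel_reservation(self, reservation_id: str) -> str:
        """Hypothetical tool call an agent could issue during a dialogue."""
        record = self.reservations.get(reservation_id)
        if record is None:
            return "error: unknown reservation"
        record["status"] = "cancelled"
        return "ok"


def make_verifier(reservation_id: str) -> Callable[[Environment], bool]:
    """Build a task-specific check on the final environment state."""
    def verify(env: Environment) -> bool:
        record = env.reservations.get(reservation_id)
        return record is not None and record["status"] == "cancelled"
    return verify


def reward(env: Environment, verify: Callable[[Environment], bool]) -> float:
    """Binary reward: 1.0 if the verifier passes, otherwise 0.0."""
    return 1.0 if verify(env) else 0.0


env = Environment({"R1": {"status": "active"}, "R2": {"status": "active"}})
verify = make_verifier("R1")

reward_before = reward(env, verify)  # task not yet completed
env.cancel_reservation("R1")         # the agent's tool call
reward_after = reward(env, verify)   # task completed
```

Because the check is code rather than a human or model judgment, the same function can score every rollout of a task deterministically, which is what makes the reward usable inside the RL loop.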