source: "raw/articles/from-self-evolving-synthetic-data-to-verifiable-reward-rl-post-training-multi-tu.md"

Summary: From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents

TL;DR: A unified framework (AReaL-SEA) that combines self-evolving synthetic data generation with verifier-based reinforcement learning to train interactive tool-using agents that handle multi-turn conversations with humans and external environments.

Key Points

Problem: Interactive tool-using agents face two bottlenecks: scalable data acquisition for multi-turn tool-use dialogues and RL training with noisy user simulation signals
Solution: AReaL-SEA (self-evolving data synthesis) + verifier-based RL with user model fine-tuning
Performance: On τ²-bench, achieves 73.0% pass@1 on Airline and 98.3% pass@1 on Telecom with Qwen3-235B, matching or exceeding frontier models
Data Engine: Hierarchical multi-agent system with orchestration layer (designs workflows, writes prompts) and execution layer (synthesizes tasks, trajectories, and verification functions)
Self-Evolution: Reflection loop analyzes failures and updates both synthesis and evaluation plans iteratively
RL Innovation: GRPO with trajectory-level group-relative advantages, dynamic filtering, and fine-tuned user simulators
User Model Issue: Off-the-shelf models behave unstably when simulating tool-using users, so the user model is fine-tuned with SFT before it is used in RL
Results: Consistent improvements from SFT alone (e.g., Telecom 28.5% → 85.4%), further boosted by RL (85.4% → 95.6%)
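
The trajectory-level group-relative advantage and dynamic filtering mentioned under RL Innovation can be illustrated with a minimal sketch. It assumes binary verifier rewards, one scalar advantage per whole trajectory, and a filtering rule that drops task groups whose rollouts all received the same reward (a common reading of "dynamic filtering"; the summary does not specify the exact rule). The function names and the `eps` constant are hypothetical, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Return one advantage per trajectory: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the rollout group
    return [(r - mu) / (sigma + eps) for r in rewards]


def dynamic_filter(groups):
    """Keep only task groups with mixed outcomes (assumed filtering rule).

    A group where every rollout got the same reward has zero advantage
    everywhere, so it contributes no gradient signal.
    """
    return {task: rs for task, rs in groups.items() if len(set(rs)) > 1}


# Hypothetical rollout groups: task id -> verifier rewards for each trajectory.
groups = {
    "task_a": [1.0, 0.0, 0.0, 1.0],  # mixed outcomes, kept
    "task_b": [1.0, 1.0, 1.0, 1.0],  # all solved, dropped
    "task_c": [0.0, 0.0, 0.0, 0.0],  # all failed, dropped
}

kept = dynamic_filter(groups)
advantages = {task: group_relative_advantages(rs) for task, rs in kept.items()}
```

In this sketch every token of a trajectory would share that trajectory's advantage, which is what "trajectory-level" is taken to mean here.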

Concepts Covered

Multi-turn Tool-using Agents — core focus on agents that interact with humans and external environments across multiple conversation turns
Self-evolving Synthetic Data — automated data generation system that improves through reflection loops and failure analysis
Verifiable Reward RL — reinforcement learning using executable verification functions rather than human feedback
Interactive Agent Training — specialized techniques for training agents that must collaborate with users throughout task completion
User Simulation — challenges and solutions for creating reliable user simulators in RL training pipelines
Group Relative Policy Optimization — GRPO adaptation for multi-turn interactive settings with trajectory-level advantages
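
The idea of an executable verification function as a reward source can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes the verifier compares the final environment state against expected field values and returns a binary reward. The state fields (`reservation_status`, `refund_issued`) and the helper `make_verifier` are hypothetical.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]


def make_verifier(expected: State) -> Callable[[State], float]:
    """Build a verifier that returns 1.0 only if every expected field matches."""

    def verify(final_state: State) -> float:
        ok = all(final_state.get(key) == value for key, value in expected.items())
        return 1.0 if ok else 0.0

    return verify


# Hypothetical airline-style task: the reservation must end up rebooked, with no refund.
verify_rebooking = make_verifier(
    {"reservation_status": "rebooked", "refund_issued": False}
)

# Extra fields in the final state are ignored; only expected fields are checked.
success = verify_rebooking(
    {"reservation_status": "rebooked", "refund_issued": False, "notes": "seat 12A"}
)
failure = verify_rebooking({"reservation_status": "cancelled", "refund_issued": True})
```

Because the reward comes from running a check rather than from a learned or human judgment, the same function can score any number of rollouts for a task at no labeling cost.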
