source: "raw/articles/from-self-evolving-synthetic-data-to-verifiable-reward-rl-post-training-multi-tu.md"

Summary: From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents

TL;DR: AReaL-SEA is a unified framework that combines self-evolving synthetic data generation with verifier-based reinforcement learning to train interactive tool-using agents for multi-turn conversations with human users and external environments.

Key Points

  • Problem: Interactive tool-using agents face two bottlenecks: scalable data acquisition for multi-turn tool-use dialogues and RL training with noisy user simulation signals
  • Solution: AReaL-SEA (self-evolving data synthesis) + verifier-based RL with user model fine-tuning
  • Performance: On τ²-bench, achieves 73.0% pass@1 on Airline and 98.3% pass@1 on Telecom with Qwen3-235B, matching or exceeding frontier models
  • Data Engine: Hierarchical multi-agent system with orchestration layer (designs workflows, writes prompts) and execution layer (synthesizes tasks, trajectories, and verification functions)
  • Self-Evolution: Reflection loop analyzes failures and updates both synthesis and evaluation plans iteratively
  • RL Innovation: GRPO with trajectory-level group-relative advantages, dynamic filtering, and fine-tuned user simulators
  • User Model Issue: Off-the-shelf models behave unstably when simulating tool-using users, so the user simulator is first fine-tuned with SFT
  • Results: Consistent improvements from SFT alone (e.g., Telecom 28.5% → 85.4%), further boosted by RL (85.4% → 95.6%)
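
The RL Innovation point above can be illustrated with a minimal sketch. It assumes the standard GRPO formulation, in which each trajectory's verifier reward is normalized by the mean and standard deviation of its sampling group, and it assumes "dynamic filtering" means dropping groups whose rewards are all identical (a common reading, not confirmed by this summary). The function names, task IDs, and reward values are hypothetical and only serve the illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Return one scalar advantage per trajectory, normalized within its group.

    Assumption: the usual GRPO normalization (r - mean) / (std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dynamic_filter(groups):
    """Keep only task groups whose rewards are not all identical.

    Assumption: groups with uniform rewards (all pass or all fail) yield
    zero advantages and therefore no learning signal, so they are dropped.
    """
    return {task: rs for task, rs in groups.items() if len(set(rs)) > 1}


# Hypothetical verifier rewards: 1.0 = task verified as solved, 0.0 = failed.
groups = {
    "airline_task_1": [1.0, 0.0, 0.0, 1.0],
    "telecom_task_7": [1.0, 1.0, 1.0, 1.0],  # uniform, filtered out
    "telecom_task_9": [0.0, 0.0, 1.0, 0.0],
}

kept = dynamic_filter(groups)
advantages = {task: group_relative_advantages(rs) for task, rs in kept.items()}

for task, adv in advantages.items():
    print(task, [round(a, 3) for a in adv])
```

In this sketch every token of a multi-turn trajectory would share that trajectory's single advantage value, which is one way to read "trajectory-level".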

Concepts Covered

Related Concepts