source: "raw/articles/ui-tars-2-technical-report-advancing-gui-agent-with-multi-turn-reinforcement-lea.md"
Summary: UI-TARS-2 Technical Report
TL;DR: UI-TARS-2 introduces a framework for training GUI-centered agents that combines multi-turn reinforcement learning, a data flywheel, and hybrid environments, and it improves on its predecessor across GUI, mobile, browser, and game benchmarks.
Key Points
- Data Flywheel Architecture: Iterative system in which the model generates new trajectories that are filtered and redistributed among the continual pre-training, supervised fine-tuning, and reinforcement learning stages
- Multi-Turn RL Framework: Stabilized training using asynchronous rollouts, streaming updates, enhanced PPO with reward shaping, adaptive advantage estimation, and value pretraining
- All-in-One Sandbox Environment: Unified platform supporting GUI actions, file systems, terminals, and external tools across cloud VMs, browser sandboxes, and mobile environments
- Strong Benchmark Performance: Achieves 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, 73.3 on AndroidWorld
- Game Performance: Mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance), outperforming OpenAI CUA and Claude Computer Use by 2.4× and 2.8×, respectively
- Parameter Interpolation: Merges domain-specialized agents through parameter interpolation rather than costly joint training
- Training Dynamics: Detailed analysis showing rising entropy during training (unlike reasoning RL), consistent reward improvements, and effective inference-time scaling
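The parameter interpolation point above can be illustrated with a minimal sketch. It assumes the merge is a convex combination of the weights of agents fine-tuned from the same starting checkpoint (non-negative coefficients that sum to 1). The function name `interpolate_parameters`, the plain-dict parameter format, and the toy values are all hypothetical and chosen for illustration; this is not the report's implementation.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a model checkpoint: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def interpolate_parameters(models: Sequence[Params], weights: Sequence[float]) -> Params:
    """Merge checkpoints by a weighted average of each parameter (assumed convex combination)."""
    if not models or len(models) != len(weights):
        raise ValueError("need one weight per model")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("weights must be non-negative and sum to 1")

    names = models[0].keys()
    for model in models:
        if model.keys() != names:
            raise ValueError("all models must share the same parameter names")

    merged: Params = {}
    for name in names:
        size = len(models[0][name])
        # Element-wise weighted sum across all specialized models.
        merged[name] = [
            sum(w * model[name][i] for model, w in zip(models, weights))
            for i in range(size)
        ]
    return merged


# Toy usage: two "specialized agents" with identical parameter shapes.
gui_agent = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
game_agent = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged = interpolate_parameters([gui_agent, game_agent], [0.5, 0.5])
```

Because the merge is a single pass over the weights, it needs no gradient steps, which is the sense in which it avoids the cost of joint training.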
Concepts Covered
- GUI Agents — Native agent formulation with unified perception, reasoning, action, and memory
- Multi-Turn Reinforcement Learning — PPO-based training with specialized enhancements for long-horizon interactive tasks
- Data Flywheel — Self-reinforcing data generation and model improvement cycle
- Vision-Language Models — 532M parameter vision encoder with 23B active parameter MoE LLM
- Interactive Environments — Cloud VM and browser sandbox infrastructure for agent training
- Agent Memory Systems — Hierarchical memory with working memory and episodic memory components
- Reward Design — Verifiable rewards for deterministic tasks and generative outcome reward models for open-ended scenarios
- Parameter Interpolation — Method for merging specialized models without additional training cost
- Computer Use — GUI interaction through screenshots and human-like actions (clicks, typing, scrolling)
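To make the PPO-based training and advantage estimation concepts above more concrete, here is a sketch of standard Generalized Advantage Estimation (GAE) over a single multi-turn episode. It shows only the textbook baseline, not the adaptive variants the report describes; the function name `gae_advantages`, the default `gamma` and `lam` values, and the toy episode are assumptions made for illustration.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Standard GAE for one episode.

    `values` must hold one more entry than `rewards`: the final entry is the
    bootstrap value of the state after the last step (0.0 for a finished episode).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy episode: three turns, reward only at the end (a sparse task-success signal).
rewards = [0.0, 0.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0]  # untrained critic predicting zero everywhere
advantages = gae_advantages(rewards, values, gamma=0.5, lam=1.0)
```

With a sparse terminal reward, the discount factors decide how much credit reaches early turns, which is one reason long-horizon GUI tasks motivate adjustments to plain GAE.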