source: "raw/articles/ui-tars-2-technical-report-advancing-gui-agent-with-multi-turn-reinforcement-lea.md"

Summary: UI-TARS-2 Technical Report

TL;DR: UI-TARS-2 introduces a framework for training GUI-centered agents that combines a data flywheel, multi-turn reinforcement learning, and hybrid environments, and it improves significantly on its predecessor, UI-TARS-1.5, across desktop, mobile, browser, and game benchmarks.

Key Points

Data Flywheel Architecture: An iterative loop in which the model generates new trajectories that are filtered and routed among the continual pre-training, supervised fine-tuning, and reinforcement learning stages
Multi-Turn RL Framework: Stabilized training using asynchronous rollouts, streaming updates, enhanced PPO with reward shaping, adaptive advantage estimation, and value pretraining
All-in-One Sandbox Environment: Unified platform supporting GUI actions, file systems, terminals, and external tools across cloud VMs, browser sandboxes, and mobile environments
Strong Benchmark Performance: Achieves 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, 73.3 on AndroidWorld
Game Performance: Mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance), outperforming OpenAI CUA and Claude Computer Use by factors of 2.4× and 2.8×, respectively
Parameter Interpolation: Merges domain-specialized agents through parameter interpolation rather than costly joint training
Training Dynamics: Detailed analysis showing entropy that rises during training (unlike in reasoning-focused RL, where it typically falls), consistent reward improvements, and effective inference-time scaling
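
The Parameter Interpolation point above can be illustrated with a minimal Python sketch. It assumes that every specialized agent was fine-tuned from the same initialization, so parameters line up by name, and it uses plain Python lists in place of real tensors. The function name `interpolate_params`, the toy parameter dictionaries, and the 0.5/0.5 weights are hypothetical and are not taken from the report.

```python
from typing import Dict, List, Sequence

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def interpolate_params(models: Sequence[Params], weights: Sequence[float]) -> Params:
    """Return the weighted average of several parameter sets.

    Assumption: all models share the same parameter names and shapes,
    and the weights are non-negative and sum to 1.
    """
    if not models or len(models) != len(weights):
        raise ValueError("need one weight per model")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")

    names = models[0].keys()
    for model in models:
        if model.keys() != names:
            raise ValueError("models must share the same parameter names")

    merged: Params = {}
    for name in names:
        size = len(models[0][name])
        if any(len(model[name]) != size for model in models):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise weighted sum across the specialized models.
        merged[name] = [
            sum(w * model[name][i] for model, w in zip(models, weights))
            for i in range(size)
        ]
    return merged


# Hypothetical example: merge a GUI-specialized and a game-specialized agent.
gui_agent = {"layer.weight": [1.0, 2.0]}
game_agent = {"layer.weight": [3.0, 6.0]}
merged_agent = interpolate_params([gui_agent, game_agent], [0.5, 0.5])
```

With real checkpoints the same loop would run over tensors rather than lists; the merge itself needs no gradient steps, which is why it is cheaper than joint training.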

Concepts Covered

GUI Agents: Native agent formulation with unified perception, reasoning, action, and memory
Multi-Turn Reinforcement Learning: PPO-based training with specialized enhancements for long-horizon interactive tasks
Data Flywheel: Self-reinforcing data generation and model improvement cycle
Vision-Language Models: 532M-parameter vision encoder paired with a Mixture-of-Experts (MoE) LLM that has 23B active parameters
Interactive Environments: Cloud VM and browser sandbox infrastructure for agent training
Agent Memory Systems: Hierarchical memory with working memory and episodic memory components
Reward Design: Verifiable rewards for deterministic tasks and generative outcome reward models for open-ended scenarios
Parameter Interpolation: Method for merging specialized models without additional training cost
Computer Use: GUI interaction through screenshots and human-like actions (clicks, typing, scrolling)
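
For the PPO-based multi-turn RL entry above, it may help to see how per-step advantages are usually computed from a sparse end-of-episode reward. The sketch below is standard Generalized Advantage Estimation (GAE), not the report's adaptive variant, whose exact form is not given in this summary. The function name `compute_gae`, the three-step episode, and the `gamma` and `lam` values are illustrative assumptions.

```python
from typing import List, Sequence


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Standard GAE over one episode.

    `values` must hold one more entry than `rewards`: the value estimate
    for every state visited plus a bootstrap value for the final state
    (0.0 if the episode terminated).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the accumulated future term.
    for t in reversed(range(len(rewards))):
        td_error = rewards[t] + gamma * values[t + 1] - values[t]
        running = td_error + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical 3-step GUI episode with a single success reward at the end
# and an untrained (all-zero) value function.
episode_rewards = [0.0, 0.0, 1.0]
episode_values = [0.0, 0.0, 0.0, 0.0]
adv_full_credit = compute_gae(episode_rewards, episode_values, gamma=1.0, lam=1.0)
adv_decayed = compute_gae(episode_rewards, episode_values, gamma=1.0, lam=0.5)
```

With `lam=1.0` every step receives the full terminal reward as its advantage, while a smaller `lam` discounts credit for earlier steps. That trade-off is one reason long-horizon agent training tends to adjust the estimator rather than use a single fixed setting.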
