source: "raw/articles/ui-voyager-a-self-evolving-gui-agent-learning-via-failed-experience.md"
Summary: UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience
TL;DR: A 4B-parameter mobile GUI agent that reaches an 81.0% success rate on AndroidWorld by learning from failed trajectories through fork point detection and self-distillation, exceeding both much larger models and the reported human success rate (80.0%).
Key Points
- Performance: 4B model achieves 81.0% Pass@1 success rate on AndroidWorld, exceeding human performance (80.0%) and all baseline methods including much larger models (235B parameters)
- Two-stage training framework:
  - Stage 1: Rejection Fine-Tuning (RFT) for automatic data-model co-evolution
  - Stage 2: Group Relative Self-Distillation (GRSD) for learning from failed trajectories
- Fork point detection: Uses SSIM-based matching to identify critical decision points where successful and failed trajectories diverge
- Credit assignment solution: Addresses the sparse-reward problem in long-horizon GUI tasks by providing dense step-level supervision
- Self-corrective learning: Transforms failed trajectories into high-quality training data without manual annotation
- Evaluation: Tested on 116 diverse AndroidWorld tasks across real-world mobile applications
- Comparison: GRSD significantly outperforms standard RL methods (GRPO, PPO), which plateau around 76%
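The fork point idea in the list above can be illustrated with a small sketch. This is not the paper's implementation: it assumes a fork point is the first step in a failed trajectory whose screen matches some step of a successful trajectory while the chosen action differs. The `Step` type, the `find_fork_point` helper, the 0.95 threshold, and the single-window SSIM (computed over the whole image rather than with a sliding window) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    screen: List[float]  # flattened grayscale screenshot, values in [0, 255]
    action: str          # hypothetical action label, e.g. "tap_bluetooth"


def global_ssim(x: List[float], y: List[float], dynamic_range: float = 255.0) -> float:
    """Single-window SSIM over two equally sized images (no sliding window)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


def find_fork_point(
    failed: List[Step], succeeded: List[Step], threshold: float = 0.95
) -> Optional[Tuple[int, int]]:
    """Return (failed_index, succeeded_index) of the first matching screen
    where the two trajectories chose different actions, or None."""
    for i, f_step in enumerate(failed):
        for j, s_step in enumerate(succeeded):
            same_screen = global_ssim(f_step.screen, s_step.screen) >= threshold
            if same_screen and f_step.action != s_step.action:
                return i, j
    return None


# Toy 4-pixel "screens" standing in for real screenshots (assumed data).
home = [10.0, 200.0, 10.0, 200.0]
settings = [200.0, 10.0, 200.0, 10.0]
wifi = [50.0, 50.0, 220.0, 220.0]

succeeded = [Step(home, "open_settings"), Step(settings, "tap_bluetooth")]
failed = [Step(home, "open_settings"), Step(settings, "tap_wifi"), Step(wifi, "scroll")]

fork = find_fork_point(failed, succeeded)  # both runs agree on step 0, diverge on step 1
```

One plausible use of the result, again as an assumption rather than the paper's stated recipe, is to pair the failed trajectory's state at the fork with the action from the successful trajectory, which yields a step-level training target without manual annotation.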
Concepts Covered
- Multimodal Large Language Models — backbone architecture using Qwen3-VL-4B-Instruct
- Reinforcement Learning — addresses credit assignment challenges in sparse reward environments
- Group Relative Policy Optimization — baseline RL method that GRSD outperforms
- AndroidWorld Benchmark — evaluation environment with 116 diverse mobile GUI tasks
- Rejection Sampling — filtering mechanism for high-quality trajectory collection
- Self-Distillation — knowledge transfer from successful to failed trajectories
- SSIM Image Matching — computer vision technique for identifying equivalent screen states
- Mobile GUI Automation — target application domain for autonomous phone operation
- Credit Assignment Problem — fundamental RL challenge addressed by fork point detection
- Self-Evolving Training — iterative improvement without manual data annotation
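To make the credit assignment concept concrete, here is a minimal sketch of the group-relative advantage used in GRPO-style methods, where each rollout's reward is normalized by the mean and standard deviation of its group. The function name, the epsilon, and the use of the population standard deviation are choices made for this illustration and are not taken from the UI-Voyager paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize trajectory-level rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this
    return [(r - mu) / (sigma + eps) for r in rewards]


# Binary success rewards for four rollouts of the same task (assumed data).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# When every rollout fails, all advantages are zero and the group
# carries no learning signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Each trajectory receives a single scalar here, so every step inside it shares the same credit regardless of which action caused the failure. That is the gap that dense step-level supervision is meant to close.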
Images and Figures
- Figure 1 (raw/articles/2603.24533v1/x1.png): Performance comparison showing UI-Voyager achieving 81.0% vs other models
- Figure 2 (raw/articles/2603.24533v1/x2.png): Training pipeline overview showing RFT and GRSD stages
- Figure 3 (raw/articles/2603.24533v1/x3.png): Fork point detection illustration with successful/failed trajectory comparison
- Figure 4 (raw/articles/2603.24533v1/x4.png): RFT performance improvements and RL baseline comparisons
- Figure 5 (raw/articles/2603.24533v1/x5.png): BrowserMaze task example showing fork point at step 12
- Figure 6 (raw/articles/2603.24533v1/x6.png): SystemBluetoothTurnOff task example with fork point at step 0
- Figure 7 (raw/articles/2603.24533v1/x7.png): Self-corrective sample construction process
- Figure 8 (raw/articles/2603.24533v1/x8.png): GRSD vs GRPO/PPO training performance comparison
- Figure 9 (raw/articles/2603.24533v1/x9.png): Performance on low-success-rate tasks