source: "raw/articles/ui-voyager-a-self-evolving-gui-agent-learning-via-failed-experience.md"

Summary: UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience

TL;DR: A mobile GUI agent that reaches an 81.0% success rate on AndroidWorld by learning from its own failed trajectories through fork point detection and self-distillation, surpassing both larger models and human performance.

Key Points

  • Performance: A 4B model achieves an 81.0% Pass@1 success rate on AndroidWorld, exceeding human performance (80.0%) and all baseline methods, including much larger models (235B parameters)
  • Two-stage training framework:
    • Stage 1: Rejection Fine-Tuning (RFT) for automatic data-model co-evolution
    • Stage 2: Group Relative Self-Distillation (GRSD) for learning from failed trajectories
  • Fork point detection: Uses SSIM-based matching to identify critical decision points where successful and failed trajectories diverge
  • Credit assignment solution: Addresses the sparse-reward problem in long-horizon GUI tasks by providing dense step-level supervision
  • Self-corrective learning: Transforms failed trajectories into high-quality training data without manual annotation
  • Evaluation: Tested on 116 diverse AndroidWorld tasks across real-world mobile applications
  • Comparison: GRSD significantly outperforms standard RL methods (GRPO, PPO), which plateau around 76%
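
Illustrative sketch: The fork point idea from the Key Points can be sketched in a few lines of Python. This is a minimal illustration and not the paper's implementation. It assumes screenshots are flattened grayscale pixel lists, uses a simplified single-window (global) SSIM instead of the usual sliding-window SSIM, and treats the first failed step whose best-matching successful screen shows a different action as the fork point. The names Step, global_ssim, find_fork_point, and the 0.95 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Step:
    """One trajectory step (hypothetical structure for this sketch)."""
    screen: Sequence[float]  # flattened grayscale screenshot, values in [0, 255]
    action: str              # action the agent took on this screen


def global_ssim(x: Sequence[float], y: Sequence[float],
                data_range: float = 255.0) -> float:
    """Single-window SSIM over the whole image (a simplification of
    the standard locally windowed SSIM)."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("screens must be non-empty and the same size")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Stabilizing constants with the conventional K1=0.01, K2=0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def find_fork_point(failed: List[Step], success: List[Step],
                    threshold: float = 0.95) -> Optional[Tuple[int, int]]:
    """Return (failed_index, success_index) for the first failed step whose
    most similar successful screen passes the threshold but carries a
    different action. Return None if no such step exists."""
    if not success:
        return None
    for i, f_step in enumerate(failed):
        # Find the successful step whose screen is most similar to this one.
        scores = [global_ssim(f_step.screen, s.screen) for s in success]
        j = max(range(len(success)), key=lambda k: scores[k])
        if scores[j] >= threshold and f_step.action != success[j].action:
            return i, j
    return None
```

Under these assumptions, the action of the matched successful step (success[j].action) can serve as the corrected label for the failed step's screen, which is one way to turn a failed trajectory into step-level supervision without manual annotation.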

Concepts Covered

Images and Figures

  • Figure 1 (raw/articles/2603.24533v1/x1.png): Performance comparison showing UI-Voyager achieving 81.0% vs other models
  • Figure 2 (raw/articles/2603.24533v1/x2.png): Training pipeline overview showing RFT and GRSD stages
  • Figure 3 (raw/articles/2603.24533v1/x3.png): Fork point detection illustration with successful/failed trajectory comparison
  • Figure 4 (raw/articles/2603.24533v1/x4.png): RFT performance improvements and RL baseline comparisons
  • Figure 5 (raw/articles/2603.24533v1/x5.png): BrowserMaze task example showing fork point at step 12
  • Figure 6 (raw/articles/2603.24533v1/x6.png): SystemBluetoothTurnOff task example with fork point at step 0
  • Figure 7 (raw/articles/2603.24533v1/x7.png): Self-corrective sample construction process
  • Figure 8 (raw/articles/2603.24533v1/x8.png): GRSD vs GRPO/PPO training performance comparison
  • Figure 9 (raw/articles/2603.24533v1/x9.png): Performance on low-success-rate tasks

Related Concepts