source: "raw/articles/gui-libra-training-native-gui-agents-to-reason-and-act-with-action-aware-supervi.md"
Summary: GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
TL;DR: GUI-Libra addresses two training challenges for native GUI agents, grounding degradation under long chain-of-thought SFT and partially verifiable step-wise rewards in RL, through action-aware supervised fine-tuning and conservative RL, yielding consistent gains on mobile and web navigation benchmarks.
Key Points
- Problem: Open-source GUI agents lag behind closed-source systems on long-horizon navigation due to scarcity of high-quality action-aligned reasoning data and suboptimal post-training pipelines
- Core Issues Identified:
  - Standard SFT with long chain-of-thought (CoT) reasoning hurts grounding accuracy
  - Step-wise RL faces "partial verifiability" where multiple valid actions exist but only one is verified, causing reward ambiguity
- Solution Components:
  - Released GUI-Libra-81K: curated dataset with 210 avg thought tokens per step (vs 56-85 in existing datasets)
  - Action-aware SFT (ASFT): mixes reasoning-then-action and direct-action supervision with token reweighting
  - Conservative RL with KL regularization and success-adaptive negative gradient scaling
- Results: Consistent improvements across GUI-Libra-3B/4B/7B/8B models
  - AndroidWorld: +15.6% (4B) and +12.2% (8B) over base models
  - Online-Mind2Web: +4.0% (4B) and +8.7% (8B)
  - WebArena-Lite-v2: +12.5% (4B) and +11.3% (8B)
- Theoretical Analysis: Proves offline-to-online predictability depends on occupancy mismatch C(π) and off-demo validity mass η̄π
- Data Efficiency: Uses only existing open-source trajectory data, with no costly online environment interaction during training
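The token reweighting idea behind action-aware SFT can be sketched as a weighted negative log-likelihood over one response. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `weighted_token_nll`, the boolean action mask, and the specific weights are hypothetical and chosen only to show how reasoning tokens can be downweighted relative to action and grounding tokens.

```python
import math


def weighted_token_nll(token_logprobs, is_action_token,
                       reasoning_weight=0.5, action_weight=1.0):
    """Weighted mean negative log-likelihood for one response.

    token_logprobs:  log-probability the model assigns to each target token.
    is_action_token: True for action/grounding tokens, False for reasoning tokens.
    The weights are illustrative assumptions, not values from the paper.
    """
    if len(token_logprobs) != len(is_action_token):
        raise ValueError("token_logprobs and is_action_token must align")
    weights = [action_weight if is_action else reasoning_weight
               for is_action in is_action_token]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one token must carry non-zero weight")
    weighted_sum = sum(w * lp for w, lp in zip(weights, token_logprobs))
    return -weighted_sum / total_weight


# Toy response: two reasoning tokens followed by two action tokens.
logprobs = [math.log(0.5), math.log(0.5), math.log(0.25), math.log(0.25)]
action_mask = [False, False, True, True]

uniform = weighted_token_nll(logprobs, action_mask, 1.0, 1.0)
reweighted = weighted_token_nll(logprobs, action_mask, 0.5, 1.0)
print(f"uniform NLL:    {uniform:.4f}")
print(f"reweighted NLL: {reweighted:.4f}")
```

In this sketch a direct-action training sample is simply the case where every target token is marked as an action token, so mixing the two supervision formats only changes the mask, not the loss function.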
Concepts Covered
- Native GUI Agents — single end-to-end models that directly map instructions to executable actions
- Action-aware Supervised Fine-tuning — training method that emphasizes action and grounding tokens over reasoning tokens
- Partially Verifiable Rewards — RL setting where multiple actions may be correct but only one demonstrated action is verified
- Occupancy Mismatch — distribution shift between offline evaluation states and online policy-induced states
- KL Regularization — constraint mechanism that controls policy drift and improves offline-to-online predictability
- Success-adaptive Negative Gradient Scaling — method to downweight unreliable negative updates in ambiguous reward settings
- GUI Grounding — task of localizing target UI elements by predicting interaction coordinates
- Chain-of-thought Reasoning — explicit step-by-step reasoning traces that can interfere with precise spatial execution
- GRPO — Group Relative Policy Optimization algorithm used for RL training
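To make the interaction between GRPO and partially verifiable rewards concrete, the sketch below computes standard group-relative advantages and then shrinks the negative ones. The normalization in `group_relative_advantages` follows the usual GRPO form (reward minus group mean, divided by group standard deviation). The scaling rule in `scale_negative_advantages`, including the `min_scale` parameter and its linear dependence on the group success rate, is a hypothetical stand-in: the summary only states that unreliable negative updates are downweighted, not how.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def scale_negative_advantages(advantages, rewards, min_scale=0.2):
    """Shrink negative advantages by a factor tied to the group success rate.

    ASSUMPTION: this linear rule is illustrative only and is not taken from
    the paper. A sample with reward 0 may still be a valid action that merely
    differs from the single demonstrated one, so its negative update is
    treated as less trustworthy than a positive one.
    """
    success_rate = mean([1.0 if r > 0 else 0.0 for r in rewards])
    negative_scale = min_scale + (1.0 - min_scale) * success_rate
    return [a * negative_scale if a < 0 else a for a in advantages]


# One group of four sampled actions; only the first matches the demonstration.
rewards = [1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
conservative = scale_negative_advantages(advantages, rewards)
for r, a, c in zip(rewards, advantages, conservative):
    print(f"reward={r:.0f}  advantage={a:+.3f}  scaled={c:+.3f}")
```

Positive advantages pass through unchanged, so the verified demonstrated action keeps its full learning signal while the penalty on unverified alternatives is reduced.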
Images and Figures
- Figure 1: Overview of GUI-Libra framework showing data curation, action-aware SFT, and conservative RL pipeline
- Figure 2: Example data format showing CoT reasoning traces paired with structured JSON actions
- Figure 3: Data source and action type distributions for SFT/RL datasets
- Figure 4: Analysis showing negative correlation between response length and grounding accuracy
- Figure 5: Overall training framework schematic
- Figures 6-17: Benchmark evaluations, ablation studies, trajectory examples, and case studies