source: "raw/articles/gui-libra-training-native-gui-agents-to-reason-and-act-with-action-aware-supervi.md"

Summary: GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

TL;DR: GUI-Libra addresses key training challenges for native GUI agents through action-aware supervised fine-tuning and conservative RL, improving over base models by up to 15.6% on AndroidWorld, 8.7% on Online-Mind2Web, and 12.5% on WebArena-Lite-v2.

Key Points

Problem: Open-source GUI agents lag behind closed-source systems on long-horizon navigation due to scarcity of high-quality action-aligned reasoning data and suboptimal post-training pipelines
Core Issues Identified:
- Standard SFT with long chain-of-thought (CoT) reasoning hurts grounding accuracy
- Step-wise RL faces "partial verifiability": multiple valid actions may exist at a step, but only the demonstrated one is verified, which makes the reward ambiguous
Solution Components:
- Released GUI-Libra-81K: curated dataset with 210 avg thought tokens per step (vs 56-85 in existing datasets)
- Action-aware SFT (ASFT): mixes reasoning-then-action and direct-action supervision with token reweighting
- Conservative RL with KL regularization and success-adaptive negative gradient scaling
Results: Consistent improvements across GUI-Libra-3B/4B/7B/8B models
- AndroidWorld: +15.6% (4B) and +12.2% (8B) over base models
- Online-Mind2Web: +4.0% (4B) and +8.7% (8B)
- WebArena-Lite-v2: +12.5% (4B) and +11.3% (8B)
Theoretical Analysis: Proves offline-to-online predictability depends on occupancy mismatch C(π) and off-demo validity mass η̄π
Data Efficiency: Uses only existing open-source trajectory data, with no costly online environment interaction during training
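
The token reweighting idea behind ASFT (listed under Solution Components above) can be illustrated with a minimal sketch. It assumes the loss is a weighted mean of per-token negative log-likelihoods, with action and grounding tokens weighted more heavily than reasoning tokens. The function name `weighted_token_nll`, the weight values, and the toy log-probabilities are all hypothetical; the paper's actual weighting scheme and mixing ratio may differ.

```python
import math


def weighted_token_nll(log_probs, is_action, action_weight=1.0, reasoning_weight=0.5):
    """Weighted mean negative log-likelihood over one target sequence.

    log_probs: per-token log-probabilities of the reference tokens.
    is_action: parallel booleans, True for action/grounding tokens.
    Both weights are illustrative assumptions, not values from the paper.
    """
    if len(log_probs) != len(is_action):
        raise ValueError("log_probs and is_action must have the same length")
    weights = [action_weight if a else reasoning_weight for a in is_action]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one token must have non-zero weight")
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / total_weight


# Reasoning-then-action sample: two thought tokens followed by two action tokens.
cot_sample = weighted_token_nll(
    log_probs=[math.log(0.5), math.log(0.5), math.log(0.25), math.log(0.25)],
    is_action=[False, False, True, True],
)

# Direct-action sample: action tokens only, no reasoning trace.
direct_sample = weighted_token_nll(
    log_probs=[math.log(0.25), math.log(0.25)],
    is_action=[True, True],
)

# Mixed batch loss: plain average over the two supervision formats (an assumption).
batch_loss = (cot_sample + direct_sample) / 2
```

With `reasoning_weight` below `action_weight`, errors on action tokens dominate the loss even when the reasoning trace is much longer than the action, which is the intuition behind keeping long CoT from hurting grounding.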

Concepts Covered

Native GUI Agents — single end-to-end models that directly map instructions to executable actions
Action-aware Supervised Fine-tuning — training method that emphasizes action and grounding tokens over reasoning tokens
Partially Verifiable Rewards — RL setting where multiple actions may be correct but only one demonstrated action is verified
Occupancy Mismatch — distribution shift between offline evaluation states and online policy-induced states
KL Regularization — constraint mechanism that controls policy drift and improves offline-to-online predictability
Success-adaptive Negative Gradient Scaling — method to downweight unreliable negative updates in ambiguous reward settings
GUI Grounding — task of localizing target UI elements by predicting interaction coordinates
Chain-of-thought Reasoning — explicit step-by-step reasoning traces that can interfere with precise spatial execution
GRPO — Group Relative Policy Optimization algorithm used for RL training
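
To make the GRPO and negative-gradient-scaling concepts above concrete, here is a rough sketch. The first function is the standard group-relative advantage (reward minus group mean, divided by group standard deviation). The second is a hypothetical scaling rule, assumed for illustration only: negative advantages are multiplied by the group's success rate, so that when few samples match the demonstrated action, penalties on the non-matching samples (which may be valid alternatives) are damped. The summary does not give the paper's exact rule, and `scale_negative_advantages` and `success_threshold` are made-up names.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: normalize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def scale_negative_advantages(advantages, rewards, success_threshold=1.0):
    """Hypothetical success-adaptive rule (an assumption, not the paper's formula).

    Negative advantages are multiplied by the fraction of samples in the group
    whose reward reaches success_threshold; positive advantages are unchanged.
    """
    success_rate = sum(r >= success_threshold for r in rewards) / len(rewards)
    return [a if a >= 0 else a * success_rate for a in advantages]


# One group of four sampled actions; only the first matches the verified action.
rewards = [1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
scaled = scale_negative_advantages(advantages, rewards)
```

In this toy group the success rate is 0.25, so each negative advantage shrinks to a quarter of its size while the positive one is left as is.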

Images and Figures

Figure 1: Overview of GUI-Libra framework showing data curation, action-aware SFT, and conservative RL pipeline
Figure 2: Example data format showing CoT reasoning traces paired with structured JSON actions
Figure 3: Data source and action type distributions for SFT/RL datasets
Figure 4: Analysis showing negative correlation between response length and grounding accuracy
Figure 5: Overall training framework schematic
Figures 6-17: Benchmark evaluations, ablation studies, trajectory examples, and case studies
