source: "raw/articles/gui-libra-training-native-gui-agents-to-reason-and-act-with-action-aware-supervi.md"

Summary: GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

TL;DR: GUI-Libra addresses key training challenges for native GUI agents through action-aware supervised fine-tuning and conservative RL, improving over base models by up to +15.6% on AndroidWorld, +8.7% on Online-Mind2Web, and +12.5% on WebArena-Lite-v2.

Key Points

  • Problem: Open-source GUI agents lag behind closed-source systems on long-horizon navigation because high-quality, action-aligned reasoning data is scarce and post-training pipelines are suboptimal
  • Core Issues Identified:
    • Standard SFT with long chain-of-thought (CoT) reasoning hurts grounding accuracy
    • Step-wise RL faces "partial verifiability": several actions may be valid at a step, but only the demonstrated one can be verified, so the reward for any other action is ambiguous
  • Solution Components:
    • GUI-Libra-81K: a released, curated dataset averaging 210 thought tokens per step (vs. 56-85 in existing datasets)
    • Action-aware SFT (ASFT): mixes reasoning-then-action and direct-action supervision with token reweighting
    • Conservative RL with KL regularization and success-adaptive negative gradient scaling
  • Results: Consistent improvements across GUI-Libra-3B/4B/7B/8B models
    • AndroidWorld: +15.6% (4B) and +12.2% (8B) over base models
    • Online-Mind2Web: +4.0% (4B) and +8.7% (8B)
    • WebArena-Lite-v2: +12.5% (4B) and +11.3% (8B)
  • Theoretical Analysis: Proves that offline-to-online predictability depends on the occupancy mismatch C(π) and the off-demo validity mass η̄_π
  • Data Efficiency: Uses only existing open-source trajectory data, with no costly online environment interaction during training
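
The two training-side ideas listed under Solution Components can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration, not the paper's formulation: the function names `weighted_token_nll` and `scale_negative_advantages`, the weights `action_weight` and `thought_weight`, the `floor` parameter, and the linear scaling rule are all hypothetical. The first function shows one plausible form of token reweighting (action tokens count more than thought tokens in the loss); the second shows one plausible form of success-dependent damping of negative advantages, motivated by the fact that under partial verifiability a "wrong" action may in fact be valid.

```python
def weighted_token_nll(token_logprobs, is_action_token,
                       action_weight=2.0, thought_weight=1.0):
    """Weighted mean negative log-likelihood of one response.

    Hypothetical token reweighting: tokens flagged as action tokens get
    `action_weight`, all other (thought) tokens get `thought_weight`.
    """
    if len(token_logprobs) != len(is_action_token):
        raise ValueError("logprobs and mask must have the same length")
    weights = [action_weight if is_action else thought_weight
               for is_action in is_action_token]
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(w * lp for w, lp in zip(weights, token_logprobs))
    return -weighted_sum / total_weight


def scale_negative_advantages(advantages, success_rate, floor=0.2):
    """Shrink negative advantages by a success-dependent factor.

    Assumed rule (not taken from the paper): the factor grows linearly
    from `floor` at success_rate = 0 to 1.0 at success_rate = 1.
    Positive advantages are returned unchanged.
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must lie in [0, 1]")
    scale = floor + (1.0 - floor) * success_rate
    return [a * scale if a < 0 else a for a in advantages]


if __name__ == "__main__":
    # Two thought tokens followed by one poorly predicted action token.
    logprobs = [-1.0, -1.0, -3.0]
    action_mask = [False, False, True]
    print(weighted_token_nll(logprobs, action_mask))
    print(weighted_token_nll(logprobs, action_mask, action_weight=1.0))
    print(scale_negative_advantages([1.0, -1.0], success_rate=0.5))
```

In the toy call, the badly predicted action token raises the loss from about 1.67 with uniform weights to 2.0 with `action_weight=2.0`, which is the intended effect of emphasizing actions over long reasoning traces. The KL regularization term mentioned above is omitted here; in practice it would be added to the RL objective as a penalty against drifting from a reference policy.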

Images and Figures

  • Figure 1: Overview of GUI-Libra framework showing data curation, action-aware SFT, and conservative RL pipeline
  • Figure 2: Example data format showing CoT reasoning traces paired with structured JSON actions
  • Figure 3: Data source and action type distributions for SFT/RL datasets
  • Figure 4: Analysis showing negative correlation between response length and grounding accuracy
  • Figure 5: Overall training framework schematic
  • Figures 6-17: Benchmark evaluations, ablation studies, trajectory examples, and case studies