source: "raw/articles/infiniteweb-scalable-web-environment-synthesis-for-gui-agent-training.md"

Summary: InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training

TL;DR: InfiniteWeb automatically generates functional websites with tasks and evaluators at scale to train GUI agents, addressing consistency through unified specifications, correctness through test-driven development, and diversity through design image guidance.

Key Points

  • Core Problem: Training GUI agents is limited by a scarcity of suitable environments. Existing benchmarks such as WebArena and OSWorld are manually constructed and cover only tens to hundreds of applications
  • Three Key Challenges: Consistency (LLMs generate incompatible implementations across pages), correctness (functional bugs compound in long-horizon tasks), and diversity (repetitive patterns risk agent overfitting)
  • Performance Results: Surpasses commercial coding agents on WebGen-Bench (85.6% vs. 81.2% for Codex) and improves GUI agent performance on OSWorld from 24.5% to 31.4% with 600 training tasks
  • System Architecture: Four-stage pipeline with unified specification, task-centric backend using test-driven development, design-guided frontend, and automatic evaluator generation
  • Dense Rewards: Generates verifiable evaluators enabling 4.4× more discriminative training tasks through partial credit for intermediate steps
  • Transfer Learning: Training on synthetic web environments improves performance on both web tasks (Online-Mind2Web) and desktop applications (OSWorld)
  • Generation Cost: About $1.93 per website using GPT-5, with a median generation time of 20 minutes
  • Validation: 95% of generated tasks pass human verification for quality and correctness
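
The Dense Rewards point above can be illustrated with a minimal sketch of a partial-credit evaluator. It is not the evaluator format InfiniteWeb generates: the `Checkpoint` class, the state keys, and the shopping task are all hypothetical, chosen only to show how scoring intermediate steps separates trajectories that a pass/fail check would score identically.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, object]  # hypothetical snapshot of the website's state


@dataclass
class Checkpoint:
    """One verifiable intermediate step of a task (hypothetical structure)."""
    name: str
    check: Callable[[State], bool]
    weight: float = 1.0


def dense_reward(state: State, checkpoints: List[Checkpoint]) -> float:
    """Fraction of total checkpoint weight satisfied by the final state."""
    total = sum(c.weight for c in checkpoints)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in checkpoints if c.check(state))
    return earned / total


def sparse_reward(state: State, checkpoints: List[Checkpoint]) -> float:
    """All-or-nothing baseline: 1.0 only if every checkpoint passes."""
    return 1.0 if all(c.check(state) for c in checkpoints) else 0.0


# Hypothetical task: search for a laptop, add it to the cart, place the order.
checkpoints = [
    Checkpoint("searched_product", lambda s: s.get("last_search") == "laptop"),
    Checkpoint("added_to_cart", lambda s: "laptop-15" in s.get("cart", [])),
    Checkpoint("order_placed", lambda s: s.get("order_status") == "placed"),
]

# A trajectory that stopped just before checkout.
partial_state: State = {
    "last_search": "laptop",
    "cart": ["laptop-15"],
    "order_status": None,
}

print(dense_reward(partial_state, checkpoints))   # two of three steps done
print(sparse_reward(partial_state, checkpoints))  # no credit under pass/fail
```

Under the all-or-nothing baseline, this trajectory and one that did nothing at all both score zero. The partial-credit version tells them apart, which is the sense in which a task becomes more discriminative for training.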

Concepts Covered

Images and Figures

  • Figure 1: Shows GUI agent performance scaling with InfiniteWeb training data (24.5% → 31.4% on OSWorld)
  • Figure 2: System overview diagram showing four-stage pipeline with parallel backend/frontend generation
  • Figure 3: Unified Specification Stage workflow from website seed to shared interface design
  • Figure 4: Parallel Task-Centric Backend and Design-Guided Frontend processes
  • Figure 5: LLM-as-Judge visual quality comparison win rates (69-85% vs baselines)
  • Figures 6-7: Ablation studies showing the importance of task-centric test-driven development (TCTDD) and the benefits of dense rewards
  • Figures 8-10: Case studies demonstrating cross-domain transfer capabilities (exploration persistence, flow completeness, loop avoidance)
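
The pipeline shape described for Figures 2 and 4 (a shared specification first, then backend and frontend built in parallel, then evaluators) can be sketched roughly as follows. This is only an illustration of the control flow. Every function here is a hypothetical stub standing in for an LLM-driven stage, and the interface names are invented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def build_spec(seed: str) -> Dict[str, object]:
    # Stage 1 (stub): one shared specification that both sides must follow.
    return {"seed": seed, "interfaces": ["searchProducts", "addToCart"]}


def build_backend(spec: Dict[str, object]) -> Dict[str, str]:
    # Stage 2 (stub): one implementation per interface named in the spec.
    return {name: f"impl:{name}" for name in spec["interfaces"]}


def build_frontend(spec: Dict[str, object]) -> Dict[str, List[str]]:
    # Stage 3 (stub): pages that call only interfaces named in the spec.
    return {"calls": list(spec["interfaces"])}


def build_evaluators(backend: Dict[str, str]) -> List[str]:
    # Stage 4 (stub): one check per backend interface.
    return [f"check:{name}" for name in backend]


def generate_site(seed: str) -> Dict[str, object]:
    spec = build_spec(seed)
    # Backend and frontend depend only on the spec, so they can run in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        backend_future = pool.submit(build_backend, spec)
        frontend_future = pool.submit(build_frontend, spec)
        backend = backend_future.result()
        frontend = frontend_future.result()
    # Consistency check: the frontend may only call what the backend provides.
    missing = [c for c in frontend["calls"] if c not in backend]
    if missing:
        raise ValueError(f"frontend calls undefined interfaces: {missing}")
    return {
        "spec": spec,
        "backend": backend,
        "frontend": frontend,
        "evaluators": build_evaluators(backend),
    }


site = generate_site("online electronics store")
print(site["evaluators"])
```

Because both parallel stages read the same specification, a mismatch between pages and backend shows up as a simple membership check, not as a runtime failure deep inside a long task.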

Related Concepts