source: "raw/articles/webgym-scaling-training-environments-for-visual-web-agents-with-realistic-tasks.md"
Summary: WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
TL;DR: WebGym introduces the largest open-source training environment for visual web agents with ~300k tasks and an asynchronous rollout system that achieves 4-5x speedup, enabling RL-trained agents to reach 42.9% success rate on out-of-domain tasks.
Key Points
- Scale: Contains 292,092 training tasks spanning 127,645 websites, 3x larger than previous environments like Test-Time-Interaction (TTI)
- Task Construction: Uses structured rubric-based evaluation with fact groups to decompose tasks into varying difficulty levels (1-10+ facts)
- Performance: Qwen3-VL-8B-Instruct trained with WebGym achieves 42.9% success rate, outperforming GPT-4o (27.1%) and GPT-5-Thinking (29.8%)
- System Innovation: Asynchronous rollout architecture eliminates synchronization bottlenecks, achieving 4-5x speedup over naive implementations
- Evaluation: Uses GPT-4o as a rubric-guided judge that focuses on evidence-bearing screenshots and reaches 80% agreement with human evaluators
- Training Insights: Memory prompts essential for long-horizon tasks; uniform difficulty sampling outperforms hard-biased curricula; shorter training horizons improve efficiency
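The cost of a synchronization barrier can be illustrated with a toy timing model. This is a minimal sketch, not WebGym's rollout system: it assumes the naive baseline advances all episodes in lockstep (one barrier per step), that every episode has its own worker, and that the per-step latencies are made-up numbers. The function names are hypothetical.

```python
from typing import List


def synchronous_time(latencies: List[List[float]]) -> float:
    """Lockstep rollout: at each step index, all active episodes wait for the slowest one."""
    horizon = max(len(episode) for episode in latencies)
    total = 0.0
    for t in range(horizon):
        # Only episodes that have not finished yet take part in step t.
        active = [episode[t] for episode in latencies if len(episode) > t]
        total += max(active)
    return total


def asynchronous_time(latencies: List[List[float]]) -> float:
    """No barrier: each episode runs independently, so wall time is the slowest episode."""
    return max(sum(episode) for episode in latencies)


# Illustrative per-step latencies in seconds (assumed values, not measurements).
latencies = [
    [1.0, 5.0, 1.0],
    [5.0, 1.0, 1.0],
    [1.0, 1.0, 5.0],
]

sync_s = synchronous_time(latencies)    # 5 + 5 + 5 = 15.0
async_s = asynchronous_time(latencies)  # max(7, 7, 7) = 7.0
print(f"synchronous: {sync_s}s, asynchronous: {async_s}s")
```

In this toy case every step contains one slow episode, so the lockstep schedule pays the worst-case latency at every step while the barrier-free schedule pays it only once per episode. The ratio here is specific to the invented numbers and says nothing about the 4-5x figure reported above.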
Concepts Covered
- Visual Web Agents — agents that observe screenshots rather than accessibility trees for web interaction
- Reinforcement Learning — REINFORCE-style training with binary terminal rewards from successful trajectories
- Task Decomposition — systematic breakdown of complex tasks into fact groups for curriculum learning
- POMDP — modeling web navigation as a partially observable Markov decision process, which requires memory mechanisms
- Asynchronous Systems — server/client architecture eliminating synchronization barriers in multi-step rollouts
- Rubric-based Evaluation — structured evaluation using fact groups rather than simple task descriptions
- Out-of-Distribution Generalization — testing on entirely unseen websites to measure true generalization capability
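To make the REINFORCE entry concrete, the sketch below computes a plain REINFORCE loss when the only reward is a binary success signal at the end of the trajectory. It is an assumption-laden illustration rather than the paper's exact objective: there is no baseline or normalization, the log-probabilities are supplied as plain numbers instead of coming from a model, and `reinforce_loss` is a hypothetical name.

```python
from typing import List, Tuple

# One trajectory: the log-probabilities of the actions taken, plus a terminal reward in {0, 1}.
Trajectory = Tuple[List[float], int]


def reinforce_loss(trajectories: List[Trajectory]) -> float:
    """Negative mean of reward * sum of action log-probs over a batch of trajectories."""
    total = 0.0
    for log_probs, reward in trajectories:
        # With a binary terminal reward, failed trajectories (reward 0) contribute nothing.
        total += reward * sum(log_probs)
    return -total / len(trajectories)


batch = [
    ([-0.5, -0.25], 1),  # successful trajectory
    ([-1.0, -2.0], 0),   # failed trajectory, ignored by the loss
]
print(reinforce_loss(batch))  # 0.375
```

Because failed trajectories are multiplied by zero, minimizing this loss amounts to raising the likelihood of the actions in successful trajectories, which matches the "from successful trajectories" wording above.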
Images and Figures
- 2601.02439v5/logo/webgym.png — WebGym logo
- 2601.02439v5/assets/teaser.png — Figure 1: Comparison of agent rollouts showing WebGym vs TTI performance on complex tasks
- 2601.02439v5/x1.png — Figure 2: Task decomposition system showing how fact groups create subtasks
- 2601.02439v5/x6.png — Figure 5: Human-evaluator agreement comparison showing rubric improvements
- 2601.02439v5/x7.png and 2601.02439v5/x8.png — Figure 6: Asynchronous vs synchronous rollout system comparison
- 2601.02439v5/x9.png and 2601.02439v5/x10.png — Figure 7: Rollout system benchmarking results