source: "raw/articles/arxiv-260406126.md"

Summary: Gym-Anything: Turn any Software into an Agent Environment

TL;DR: Researchers introduce Gym-Anything, a framework that automatically converts software into interactive computer-use environments. They use it to build CUA-World, a benchmark of 10K+ tasks across 200+ software applications covering all 22 SOC occupation groups.

Key Points

  • Introduces the Gym-Anything framework, which automates environment creation through a multi-agent creation-audit loop in which agents write setup scripts and verify the resulting configurations
  • Creates CUA-World: 10,000+ interactive tasks across 200+ software applications with GDP-grounded selection covering all 22 SOC occupation groups
  • Includes CUA-World-Long: 200 challenging long-horizon tasks requiring 500+ steps, where even GPT-5.4 achieves only a 27.5% pass rate
  • Selects software using U.S. GDP data to prioritize economically valuable applications across healthcare, engineering, finance, and scientific domains
  • Distillation experiments show that a 2B model trained on CUA-World outperforms models 2× its size
  • Introduces Test-Time Auditing, which improves Gemini-3-Flash performance on long-horizon tasks from 11.5% to 14.0% (a gain of 2.5 percentage points)
  • Uses checklist-based VLM verification with privileged information extracted from setup scripts for more reliable evaluation
  • Performance scales log-linearly with training data (both software count and task count)
  • Cross-software generalization is limited (22-27% recovery vs 65-87% for seen software)
  • Framework supports Linux, Windows, and Android environments with containerized execution
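
The creation-audit loop from the first key point can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `creator`, `auditor`, `AuditReport`, and `max_rounds` are hypothetical names, and the summary does not say how the agents exchange feedback or when the loop stops.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AuditReport:
    """Hypothetical result of one audit pass over a setup script."""
    passed: bool
    issues: List[str] = field(default_factory=list)


def creation_audit_loop(
    creator: Callable[[str, List[str]], str],
    auditor: Callable[[str], AuditReport],
    software: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Alternate between a creator agent and an auditor agent.

    Returns the first setup script the auditor accepts, or None if no
    script passes within max_rounds (an assumed stopping rule).
    """
    feedback: List[str] = []
    for _ in range(max_rounds):
        script = creator(software, feedback)  # creator writes or revises the script
        report = auditor(script)              # auditor checks the configuration
        if report.passed:
            return script
        feedback = report.issues              # issues drive the next revision
    return None


# Toy stand-ins for the two agents, purely for demonstration.
def toy_creator(software: str, feedback: List[str]) -> str:
    script = f"install {software}"
    if "missing sample data" in feedback:
        script += " && load-sample-data"
    return script


def toy_auditor(script: str) -> AuditReport:
    if "load-sample-data" not in script:
        return AuditReport(passed=False, issues=["missing sample data"])
    return AuditReport(passed=True)


result = creation_audit_loop(toy_creator, toy_auditor, "example-app")
print(result)
```

In this toy run the first script fails the audit, the reported issue is passed back to the creator, and the revised script is accepted on the second round. In a real system both callables would presumably wrap model calls and the auditor would inspect a live environment.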
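
Log-linear scaling, as mentioned in the key points, means performance grows roughly as `score ≈ a + b · log10(N)`, where `N` is the amount of training data (software count or task count). The sketch below fits that form by ordinary least squares. The data points are synthetic and chosen only to illustrate the fit; they are not results from the paper, and `fit_log_linear` is a hypothetical helper name.

```python
import math
from typing import List, Tuple


def fit_log_linear(sizes: List[float], scores: List[float]) -> Tuple[float, float]:
    """Least-squares fit of score = a + b * log10(size); returns (a, b)."""
    xs = [math.log10(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx              # points gained per 10x increase in data
    a = mean_y - b * mean_x    # extrapolated score at size 1
    return a, b


# Synthetic illustration only: each 10x increase in data adds 5 points.
sizes = [100, 1000, 10000]
scores = [10.0, 15.0, 20.0]
a, b = fit_log_linear(sizes, scores)
print(f"intercept={a:.2f}, slope per decade={b:.2f}")
```

Under a log-linear trend, each tenfold increase in data buys the same absolute improvement, so doubling the data yields only about `b · log10(2) ≈ 0.3 · b` additional points.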

Concepts Covered

Related Concepts