source: "raw/articles/arxiv-260406126.md"
Summary: Gym-Anything: Turn any Software into an Agent Environment
TL;DR: Researchers introduce Gym-Anything, a framework that automatically converts arbitrary software into interactive computer-use environments. Its output is CUA-World, a benchmark of 10K+ tasks across 200 software applications covering all major occupational groups.
Key Points
- Introduces the Gym-Anything framework, which automates environment creation through a multi-agent creation-audit loop: creation agents write setup scripts and audit agents verify the resulting configurations
- Creates CUA-World: 10,000+ interactive tasks across 200+ software applications with GDP-grounded selection covering all 22 SOC occupation groups
- Includes CUA-World-Long: 200 challenging long-horizon tasks requiring 500+ steps, where even GPT-5.4 achieves only 27.5% pass rate
- Software selection based on U.S. GDP data to prioritize economically valuable applications across healthcare, engineering, finance, and scientific domains
- Distillation experiments show that a 2B model trained on CUA-World outperforms models 2× its size
- Introduces Test-Time Auditing that improves Gemini-3-Flash performance from 11.5% to 14.0% on long-horizon tasks
- Uses checklist-based VLM verification with privileged information extracted from setup scripts for more reliable evaluation
- Performance scales log-linearly with training data (both software count and task count)
- Cross-software generalization is limited (22-27% recovery vs 65-87% for seen software)
- Framework supports Linux, Windows, and Android environments with containerized execution
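
The log-linear scaling point above can be read as score ≈ a + b·ln(N), where N is the amount of training data (software count or task count). The following is a minimal sketch of fitting such a curve with ordinary least squares. The function names and all data points are hypothetical placeholders chosen for illustration; none of the numbers come from the paper.

```python
import math

def fit_log_linear(sizes, scores):
    """Least-squares fit of score = a + b * ln(size)."""
    xs = [math.log(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict(a, b, size):
    """Predicted score at a given training-set size under the fitted curve."""
    return a + b * math.log(size)

# Synthetic example data (NOT results from the paper): training-set sizes
# and pass rates that happen to lie exactly on a log-linear curve.
sizes = [100, 1000, 10000]
scores = [10.0, 15.0, 20.0]

a, b = fit_log_linear(sizes, scores)
# Under a log-linear law, every 10x increase in data adds a constant gain.
gain_per_10x = b * math.log(10)
```

Under this reading, each tenfold increase in data yields the same additive gain (`gain_per_10x`), which is why returns diminish in absolute data terms even though the curve looks linear on a log axis.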
Concepts Covered
- Computer-Use Agents — Autonomous agents that interact with software through GUI interfaces
- Multi-Agent Environment Creation — Using multiple AI agents (creation, audit, memory summarization) to build software environments
- GDP-Grounded Software Selection — Methodology for selecting software based on economic impact using O*NET occupational data
- Privileged Information Verification — Evaluation technique using ground-truth data from setup scripts that agents don't see during task execution
- Test-Time Auditing — Independent agent reviews completed trajectories and provides feedback on missing work
- Long-Horizon Task Planning — Tasks requiring hundreds of steps and extended planning capabilities
- Trajectory Distillation — Training smaller models using successful trajectories from larger teacher models
- Creation-Audit Loop — Iterative process where creation agents build environments and audit agents verify quality
- Contamination Filtering — Systematic approach to prevent data leakage between training and test sets
- Behavioral Pattern Analysis — Automated analysis of agent trajectories to identify success and failure patterns
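
The creation-audit loop listed above can be sketched as a simple control loop in which audit feedback is passed back to the creation step until the audit passes or a round budget runs out. This is a rough illustration under stated assumptions, not the framework's actual implementation: `AuditReport`, `creation_audit_loop`, `max_rounds`, and the toy `toy_create` / `toy_audit` stand-ins are all hypothetical names, and in the real system both steps would be performed by agents rather than plain functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class AuditReport:
    """Hypothetical audit result: pass/fail plus a list of issues found."""
    passed: bool
    issues: List[str] = field(default_factory=list)

def creation_audit_loop(
    create: Callable[[List[str]], str],
    audit: Callable[[str], AuditReport],
    max_rounds: int = 3,
) -> Tuple[str, bool, int]:
    """Alternate creation and audit until the audit passes or rounds run out.

    Returns the last setup script, whether it passed, and rounds used.
    """
    feedback: List[str] = []
    script = ""
    for round_no in range(1, max_rounds + 1):
        script = create(feedback)   # creation step sees prior audit issues
        report = audit(script)      # audit step checks the result
        if report.passed:
            return script, True, round_no
        feedback = report.issues
    return script, False, max_rounds

# Toy stand-ins for the creation and audit agents (illustrative only).
def toy_create(feedback: List[str]) -> str:
    lines = ["install_app"]
    if "missing seed data" in feedback:
        lines.append("load_seed_data")
    return "\n".join(lines)

def toy_audit(script: str) -> AuditReport:
    if "load_seed_data" not in script:
        return AuditReport(False, ["missing seed data"])
    return AuditReport(True)

script, ok, rounds = creation_audit_loop(toy_create, toy_audit)
```

In this toy run the first draft fails the audit, the issue is fed back, and the second draft passes. Keeping the audit separate from creation means the check does not depend on the creator's own assessment of its work.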