source: "raw/articles/arxiv-260406126.md"

Summary: Gym-Anything: Turn any Software into an Agent Environment

TL;DR: Researchers introduce Gym-Anything, a framework that automatically converts any software into an interactive computer-use environment. The result is CUA-World, a benchmark of 10K+ tasks across 200 software applications covering all major occupational groups.

Key Points

Introduces the Gym-Anything framework, which automates environment creation through a multi-agent creation-audit loop in which agents write setup scripts and verify configurations
Creates CUA-World: 10,000+ interactive tasks across 200+ software applications with GDP-grounded selection covering all 22 SOC occupation groups
Includes CUA-World-Long: 200 challenging long-horizon tasks requiring 500+ steps, where even GPT-5.4 achieves only 27.5% pass rate
Selects software using U.S. GDP data to prioritize economically valuable applications across healthcare, engineering, finance, and scientific domains
Distillation experiments show that a 2B model trained on CUA-World outperforms models 2× its size
Introduces Test-Time Auditing that improves Gemini-3-Flash performance from 11.5% to 14.0% on long-horizon tasks
Uses checklist-based VLM verification with privileged information extracted from setup scripts for more reliable evaluation
Performance scales log-linearly with training data (both software count and task count)
Cross-software generalization is limited (22-27% recovery vs 65-87% for seen software)
The framework supports Linux, Windows, and Android environments with containerized execution
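
The log-linear scaling point above can be read as score ≈ a + b·log10(N), where N is the amount of training data (software count or task count). The following is a minimal sketch of fitting that form with ordinary least squares. The function name `fit_log_linear` and the data points are illustrative assumptions, not code or results from the paper.

```python
import math


def fit_log_linear(sizes, scores):
    """Least-squares fit of score = a + b * log10(size)."""
    xs = [math.log10(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    # Slope is the covariance of (x, y) divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic values chosen for illustration only; NOT numbers from the paper.
sizes = [100, 1000, 10000]
scores = [10.0, 15.0, 20.0]
a, b = fit_log_linear(sizes, scores)
print(f"score = {a:.2f} + {b:.2f} * log10(N)")
```

Under such a fit, every tenfold increase in data adds a constant `b` points, so absolute gains per added task shrink as the dataset grows.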

Concepts Covered

Computer-Use Agents — Autonomous agents that interact with software through GUI interfaces
Multi-Agent Environment Creation — Using multiple AI agents (creation, audit, memory summarization) to build software environments
GDP-Grounded Software Selection — Methodology for selecting software based on economic impact using O*NET occupational data
Privileged Information Verification — Evaluation technique using ground-truth data from setup scripts that agents don't see during task execution
Test-Time Auditing — Independent agent reviews completed trajectories and provides feedback on missing work
Long-Horizon Task Planning — Tasks requiring hundreds of steps and extended planning capabilities
Trajectory Distillation — Training smaller models using successful trajectories from larger teacher models
Creation-Audit Loop — Iterative process where creation agents build environments and audit agents verify quality
Contamination Filtering — Systematic approach to prevent data leakage between training and test sets
Behavioral Pattern Analysis — Automated analysis of agent trajectories to identify success and failure patterns
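
Checklist-based verification with privileged information, as described above, can be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation: `ChecklistItem`, `verify`, and `stub_judge` are assumed names, the substring match stands in for a VLM query, and the rule that every item must pass is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ChecklistItem:
    description: str  # what the verifier should check
    expected: str     # privileged ground truth from the setup script, hidden from the agent


def verify(
    items: List[ChecklistItem],
    final_state: str,
    judge: Callable[[ChecklistItem, str], bool],
) -> Tuple[bool, List[bool]]:
    """Judge each checklist item against the final environment state.

    Assumption: a task passes only if every item passes.
    """
    results = [judge(item, final_state) for item in items]
    return all(results), results


def stub_judge(item: ChecklistItem, final_state: str) -> bool:
    # Stand-in for a VLM call: a plain substring match on a text dump of the state.
    return item.expected in final_state


# Hypothetical task: the setup script knows the correct total and status.
checklist = [
    ChecklistItem("Invoice total is recorded", "total=1250.00"),
    ChecklistItem("Invoice is marked as sent", "status=sent"),
]
passed, per_item = verify(checklist, "total=1250.00; status=draft", stub_judge)
print(passed, per_item)
```

Keeping the judge as a pluggable callable separates the checklist logic from the model used to evaluate each item, so the same checklist could be scored by a different verifier.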
