title: "arxiv-2604.06126" url: "https://arxiv.org/abs/2604.06126" date: "2026-04-08T16:01:14.685Z" author: "" tags: ["arxiv"] ingested: "2026-04-08T16:01:14.685Z"
Gym-Anything: Turn any Software into an Agent Environment

Pranjal Aggarwal (CMU), Graham Neubig (CMU), Sean Welleck (CMU)
https://cmu-l3.github.io/gym-anything

Abstract

Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. We frame environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies evidence for the environment setup against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. CUA-World also includes CUA-World-Long, a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks. Distilling successful trajectories from the training split into a 2B vision-language model outperforms models 2× its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and provides feedback on what remains, improving Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%.
We release all code, infrastructure, and benchmark data to facilitate future research in realistic computer-use agents.

Figure 1: Built with Gym-Anything, CUA-World covers all major occupation groups and industries, spanning over 10K long-horizon tasks and environments across 200 software applications, dramatically expanding the scope of computer-use agent evaluation and training.

Preprint. arXiv:2604.06126v1 [cs.LG] 7 Apr 2026

1 Introduction

Computer-use agents (CUAs, also known as graphical user interface agents) hold the promise of automating and assisting in digitally intensive occupations, which collectively represent trillions of dollars in GDP [46, 45]. Yet whether these agents can handle real professional work remains an open question. Real-world workflows are long-horizon and take place in heterogeneous environments, often requiring hundreds of steps across diverse software configured with domain-specific data. For instance, end-to-end analysis of a medical imaging dataset requires a radiology tool set up with annotated clinical CT scans, while reconciling financial records across an enterprise resource planning (ERP) system requires the software populated with transaction histories and vendor accounts. Existing benchmarks shed little light on these capabilities, as they largely test agents on short-horizon tasks such as changing a desktop wallpaper or filling a web form, over a narrow set of consumer-grade applications [53, 62, 23, 7, 37, 11] that only represent a small slice of the economy [51].

This gap has two consequences. First, evaluation is unfaithful: high scores on current benchmarks reveal little about an agent’s ability to operate the software that drives real economic activity. Second, training signal is limited: short-horizon tasks over few applications may not produce the diverse, long-horizon trajectories needed to train agents for real-world work.
The root cause is that creating realistic environments is prohibitively expensive: each software requires installation, configuration with domain-appropriate data, task design, and verification, often demanding weeks of expert effort per application [53, 56]. The core question we aim to address is: how can we scale computer-use environments for training and benchmarking agents in settings closer to real-world work?

To address this, we introduce Gym-Anything, a scalable framework for automatically constructing realistic environments across hundreds of economically valuable software applications. At its core, Gym-Anything allows turning any software into an interactable environment. We ground the software selection on U.S. digital GDP data, selecting software based on high economic impact and broad coverage across strategic and STEM domains, different occupations, and industries (Figure 2 (i)).

The key idea behind the Gym-Anything framework, similar to agent-driven environment construction explored in other domains [64, 60], is that creating computer-use agent environments is itself a coding and computer-use agent task. Setting up software requires writing installation, configuration, and data-processing scripts, which are coding tasks. Verifying that the environment starts correctly and reaches the expected state requires launching it, taking screenshots, and checking the screen, which are computer-use agent tasks.

However, scaling computer-use environment construction to hundreds of types of software requires handling substantial heterogeneity, including different operating systems, databases, and network configurations. To address this, we develop the Gym-Anything library (§2.3), which reduces every environment to a standardized specification: a small set of setup scripts and a configuration file. In turn, the library enables an AI agent to create environments by writing only software-specific scripts rather than dealing with low-level infrastructure.
However, without external verification, current agents prompted to create environments in this framework frequently produce incorrect environment setups. The common thread is that the agent’s claims about what it has done are not reliable, but the actual state of the environment is. For instance, a screenshot of the running software reveals whether the environment is working or stuck on a crash screen, regardless of what the agent claims. We exploit this observation through a creation-audit loop (§3; Figure 2 (ii)), in which a creation agent builds an environment and produces evidence of a correct setup through screenshots, execution logs, and data outputs, then an independent audit agent verifies the evidence against a quality checklist and reports issues. In addition, the creation agent builds a shared memory of environment creation strategies that it discovers across attempts, allowing the agent to improve over time.

Next, we adopt a propose-and-amplify strategy (Figure 2 (iii); §4) for generating realistic tasks at scale within the software environments. In this pattern, an expensive agentic model proposes a small number of seed tasks per software and runs the tasks, and then a cheaper non-agentic language model amplifies these seeds into a larger set using the seed implementations as in-context examples.

To evaluate the resulting tasks, we use a checklist-based VLM verifier that breaks each task into weighted subtasks for partial credit (§4.1). To construct the checklists, we leverage data that is embedded in the environment’s setup scripts (e.g., the correct tumor location from a downloaded medical dataset). Importantly, agents do not have access to this information when solving a task, thereby making it a form of privileged information that the verifier can leverage in order to check the agent’s outputs.

Figure 2: Overview of the Gym-Anything pipeline.
Phase 1: We select ∼200 economically important software applications grounded in GDP data, balancing high economic impact with broad coverage across occupations, industries, and software categories. Phase 2: Each software is converted into an interactive environment via a creation-audit loop, in which a creation agent iteratively builds and verifies the environment, while an audit agent checks quality over multiple iterations. The creation agent writes its learnings into a memory, allowing it to improve over time. Phase 3: Tasks are scaled with a propose-and-amplify pattern, in which an expensive agentic model creates high-quality seed tasks (e.g., 5 per software), then a cheaper language model generates more tasks (e.g., 75×) using the seeds as in-context examples. Phase 4: Agents are evaluated on CUA-World using a checklist-based VLM verifier with privileged information and fine-grained rubric scores. We use Gym-Anything to construct CUA-World, a collection of over 10,000 tasks across 200 software applications. CUA-World spans domains such as medical science, astronomy, engineering, finance, enterprise systems, and educational platforms. It includes tasks on three operating systems, and is separated into train and test splits (Table 2). To demonstrate the utility of the training split, we distill trajectories from a strong teacher model into a 2B vision-language model and find that performance scales with the number of software and environments in the training set. The trained model also generalizes to software not seen during training, and outperforms models 2× its size. Given the breadth and realism of CUA-World’s software and tasks, we further construct a challenging long-horizon benchmark, CUA-World-Long, consisting of one task per software, with tasks often requiring hundreds of steps. Each task is designed to be realistic and to target common failure modes of existing models. 
Even the strongest frontier model achieves only 27.5% pass rate on CUA-World-Long, highlighting the difficulty of long-horizon tasks. One common failure mode is that agents often stop early, claiming the task is complete when it is not. To address this, we apply a similar Test-Time Auditing principle (§6.2), where an independent model reviews the agent’s trajectory upon completion and provides feedback on what remains, improving pass rate from 11.5% to 14.0% for Gemini-3-Flash. Although Test-Time Auditing helps, CUA-World-Long remains largely unsolved, offering a new challenging benchmark for frontier computer-use agents on realistic tasks.

In summary, we contribute (1) Gym-Anything, a modular framework and multi-agent pipeline for converting any software into an interactive computer-use environment; (2) CUA-World, a collection of 10,000 tasks across a GDP-grounded selection of 200 software applications with checklist-based VLM verification, train/test splits, and a challenging long-horizon split (CUA-World-Long) requiring hundreds of steps; (3) training and test-time scaling results, including distillation to a 2B model that outperforms models 2× its size, and a test-time audit agent that improves long-horizon performance on CUA-World-Long; and (4) a full release of all code, infrastructure, and benchmark data.

[Figure 3 panels. (a) Selection Pipeline: 894 occupations (ONet database); 16,600 software applications in 1,400 categories (LLM + web search discovery); 3,400 selectable sandbox-ready applications (filter: free, GUI, self-hostable, no hardware); 500 selected across 22 SOC groups; 200 built (compute budget). Stage labels: BLS/BEA GDP mapping, Cleanup + Enrichment, 5-tier Selection, Build Environments. (b) Tier Composition: k1 Economic Core; k2a Strategic (Health, Ed, Safety, Transport); k2b STEM & Research; k3 Domain Diversity (22/22 SOC groups); k4+k5 Niche + Category Fill. The tiers balance economic importance (k1) with strategic coverage (k2) and diversity (k3-k5).]

Figure 3: GDP-grounded software selection pipeline. Starting from U.S.
occupational data, we estimate per-software GDP, filter to sandboxable candidates, and apply tiered selection to yield 200 software applications.

2 Methodology

In this section, we introduce the problem setup, the GDP-grounded software selection procedure, and the library abstraction that makes large-scale environment construction possible. In Section 3, we further discuss our multi-agent strategy to scale the number of software, and in Section 4, discuss how to further scale tasks and environments for the relevant software.

2.1 Problem Setup

Environment. We refer to an environment $E$ as one or more interactive software applications with a specified initial state of the filesystem and processes running inside an operating system. The agent interacts with the environment through keyboard and mouse actions.

Task. A task $T = (E_{s_0}, p, V)$ consists of an environment $E$ with initial state $s_0$, a natural-language instruction $p$, and a verification function $V$ that maps the agent’s trajectory to a score.

Interaction. At each step $t$, the agent receives an observation $o_t$ (e.g., a screen capture) and executes mouse/keyboard actions $a_t$, after which the environment returns a new observation $o_{t+1}$. An episode proceeds by resetting the environment to $s_0$ and letting the agent interact for up to $T$ steps; the final score is determined by $V$.

2.2 GDP-Grounded Software Selection

Determining which software to include in a CUA training dataset and benchmark is an important design decision. We use a simple principle: prioritize software that drives more economic activity. Unlike prior benchmarks, we ground our selection in publicly available U.S. digital economy data (Figure 3). In a nutshell, we estimate GDP per occupation, discover software used by each occupation, attribute GDP to individual software, filter to sandboxable candidates, apply tiered selection, and based on our compute budget select 200 software applications. We detail each step below.

Estimating GDP per occupation.
The ONET data on the U.S. economy comprises ∼900 standardized occupations, each with publicly available data on employment counts and average wages.¹ For each occupation, we compute a wage bill (employment × mean wage) and scale it to total GDP using national accounts data [45], yielding a GDP estimate per occupation that sums to the full U.S. GDP (Appendix A).

¹ Occupations from the ONET database [34]; employment and wage data from the Bureau of Labor Statistics (BLS) [46].

Discovering software per occupation. Next, we need to know what software each occupation actually uses. We use an LLM with web-search access to discover relevant software categories and enumerate software per category for each occupation, producing a catalog of ∼16,600 software applications across ∼1,400 categories. The catalog is cleaned by deduplicating categories, validating software-category assignments, and removing hallucinated entries via web-grounded verification.

Attributing GDP to software. Not all of an occupation’s output involves computers, and not all computer work uses the same software. We decompose each occupation’s GDP into a software-level estimate:

$$\mathrm{GDP}_{\text{software}} = \sum_{\text{occ}} \mathrm{GDP}_{\text{occ}} \times p_{\text{computer}} \times s_{\text{category}} \times s_{\text{software}} \tag{1}$$

Here, $p_{\text{computer}}$ is the fraction of the occupation’s work that involves computers (available from occupational surveys), $s_{\text{category}}$ is the share of that computer work attributed to a software category (e.g., “spreadsheets” for an accountant), and $s_{\text{software}}$ is the software’s share within its category (e.g., Excel’s share of spreadsheets). The share factors $s_{\text{category}}$ and $s_{\text{software}}$ are estimated by an LLM with web-search access.

Filtering to sandboxable software. Not all economically important software can be sandboxed into an interactive environment, since they may require paid licenses, organizational credentials, or specialized hardware.
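As a worked illustration of the GDP estimate and Equation (1), the following minimal sketch computes a software-level GDP figure for a made-up spreadsheet application. All numbers, occupation names, and the helper names `occupation_gdp`, `Usage`, and `software_gdp` are hypothetical, and proportional scaling of wage bills to total GDP is an assumption of this sketch (the actual procedure is detailed in Appendix A).

```python
from dataclasses import dataclass
from typing import Dict, List


def occupation_gdp(wage_bills: Dict[str, float], total_gdp: float) -> Dict[str, float]:
    """Scale wage bills so the per-occupation estimates sum to total_gdp.

    Proportional scaling is an assumption of this sketch.
    """
    total_wages = sum(wage_bills.values())
    return {occ: total_gdp * bill / total_wages for occ, bill in wage_bills.items()}


@dataclass(frozen=True)
class Usage:
    """How one occupation uses one software application (hypothetical record)."""
    occupation: str
    p_computer: float  # fraction of the occupation's work done on a computer
    s_category: float  # share of computer work in the software's category
    s_software: float  # the software's share within that category


def software_gdp(usages: List[Usage], gdp_occ: Dict[str, float]) -> float:
    """Equation (1): sum over occupations of GDP_occ * p_computer * s_category * s_software."""
    return sum(
        gdp_occ[u.occupation] * u.p_computer * u.s_category * u.s_software
        for u in usages
    )


# Illustrative numbers only; none of these values come from the paper.
wage_bills = {
    "accountant": 1_000 * 80_000.0,  # employment x mean wage
    "analyst": 500 * 100_000.0,
}
gdp_occ = occupation_gdp(wage_bills, total_gdp=260_000_000.0)

spreadsheet_usage = [
    Usage("accountant", p_computer=0.5, s_category=0.4, s_software=0.5),
    Usage("analyst", p_computer=0.8, s_category=0.25, s_software=0.5),
]
spreadsheet_gdp = software_gdp(spreadsheet_usage, gdp_occ)
print(f"Estimated software GDP: {spreadsheet_gdp:,.0f}")
```

In this toy setting each occupation contributes a tenth of its GDP estimate to the application, so the total comes to about 26 million out of the 260 million assumed for the whole economy.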
We classify each software application as sandbox-ready if it satisfies the following constraints: (a) self-hostable, i.e., does not require an online account to use, (b) free-tier, i.e., can be used free of charge without license restrictions, (c) has a GUI, and (d) does not require hardware that cannot be simulated (e.g., CNC machines). We select ∼3,400 of ∼16,600 that satisfy these constraints. Further, when a software is not sandboxable, we substitute the closest sandboxable alternative from the same software category, aiming to preserve the economic signal.

Tiered selection. From the filtered catalog, we select 200 software applications across five tiers that balance economic importance with diversity: (1) highest-GDP software overall, (2a/2b) strategically important domains (Healthcare, Education, Protective Services, Transportation) and STEM domains (Architecture/Engineering, Computer/Math, Life/Physical/Social Science), (3) cycling through all 22 SOC major groups to select ∼5 per group, ensuring every occupation group has representation, (4) software unique to specific occupations or domains not yet covered, and (5) software from uncovered categories, ranked by GDP. We build environments for 200 software applications based on our compute budget, although the pipeline is fully automated and extensible (Appendix A).

2.3 The Gym-Anything Library

Constructing computer-use environments across hundreds of diverse software applications requires a unified framework that works across operating systems, application types, and compute backends without per-environment engineering. Previous works have primarily constructed computer-use environments manually by interacting with an actual operating system, and then taken VM snapshots to be reused later [53].
However, snapshots cannot be inspected, version-controlled, or partially reused across tasks, and modifying anything requires repeating the manual setup, therefore limiting modularity, reproducibility, and scalability. To handle these challenges, we construct the Gym-Anything Library.

In Gym-Anything, each environment is defined by a simple specification: three sequential setup scripts and a declarative configuration file. The scripts progress from general to task-specific: install installs the software and its dependencies, configure sets it up with realistic data and settings, and task setup configures the specific starting state for a given task. This separation ensures that multiple tasks for the same software share the same install and configure scripts, varying only the task-specific setup. For example, creating a LibreOffice Calc environment requires only an install script (e.g., apt-get install libreoffice), a configure script that downloads a sample spreadsheet, and a config file specifying the OS image and resource limits; the library handles container orchestration, display forwarding, and checkpoint management automatically.
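The three-script specification above can be pictured with a small data structure. This is a hypothetical sketch and not the Gym-Anything library's actual schema or API: the class `EnvSpec`, its field names, the config keys, the file paths, and the example.com URL are all assumptions made for illustration, and the shell commands are stored as strings rather than executed.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class EnvSpec:
    """Hypothetical environment specification: three staged scripts plus a config."""
    install: str      # installs the software and its dependencies
    configure: str    # loads realistic data and settings
    task_setup: str   # prepares the starting state for one task
    config: Dict[str, str] = field(default_factory=dict)

    def stages(self) -> List[Tuple[str, str]]:
        """Return the scripts in execution order, general to task-specific."""
        return [
            ("install", self.install),
            ("configure", self.configure),
            ("task_setup", self.task_setup),
        ]


# A LibreOffice Calc environment in the spirit of the example in the text.
calc_task_a = EnvSpec(
    install="apt-get install -y libreoffice",
    configure="curl -L -o /data/sample.ods https://example.com/sample.ods",
    task_setup="soffice --calc /data/sample.ods &",
    config={"os_image": "ubuntu-22.04", "cpus": "2", "memory_gb": "4"},
)

# A second task for the same software changes only the task-specific script.
calc_task_b = replace(
    calc_task_a,
    task_setup="cp /data/sample.ods /data/q3.ods && soffice --calc /data/q3.ods &",
)

for name, script in calc_task_b.stages():
    print(f"{name}: {script}")
```

Deriving the second task with `dataclasses.replace` makes the sharing explicit: the install and configure stages are identical objects of comparison across tasks, so only the final stage differs.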
This design reduces environment creation to a scripting task: users and AI agents create new environments by writing setup scripts and key-value configurations, then interact with any environment through a standard gymnasium-style API [44, 8] that provides a unified observation space (e.g., screenshots) and action space (e.g., keyboard and mouse inputs), with the library handling display forwarding and input translation across operating systems. This specification is simple enough that an LLM agent can author environments autonomously (§3), yet expressive enough to capture complex, production-grade software configured with realistic data, ranging from desktop image editors to multi-container enterprise systems.

[Figure 4 diagram: a new software (Sk) is given to the Creation Agent (Agentc; bash, python, CUA tools), which reads from the shared memory (Mgeneral: general patterns and library guidelines; Msoftware: software-specific notes), writes learnings back to it, and produces an environment with evidence docs (logs, screenshots, code, data files). The Audit Agent (Agentaudit) checks these against quality checklists and returns an audit report as feedback. The creation-audit loop runs for t iterations and yields a verified environment. A Memory Summarization Agent (Agentsumm) runs every 10 iterations.]

Figure 4: The Gym-Anything creation-audit loop. A Creation Agent writes setup scripts and produces evidence documents (screenshots, logs, etc.) while an Audit Agent evaluates this evidence against quality checklists and returns feedback. Learnings accumulate in a shared memory M, which a Summarization Agent periodically condenses so that newer environments are created faster.

Behind this simple interface, the library manages the complexity of running environments across three operating systems (Linux, Windows, Android) and multiple compute backends (such as Docker and Apptainer for rootless systems such as Slurm). The staged design further enables caching at each stage boundary, so creating new tasks only requires re-running the task-specific setup.
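One way to see why stage boundaries make caching effective is to key each stage by a hash of its own script and every script before it. The sketch below is an assumption about how such a cache could work, not a description of the library's checkpoint mechanism; `stage_keys`, `build`, and the in-memory set standing in for stored checkpoints are hypothetical.

```python
import hashlib
from typing import List, Sequence, Set

STAGES = ("install", "configure", "task_setup")


def stage_keys(scripts: Sequence[str]) -> List[str]:
    """Cumulative SHA-256 key per stage: each key covers that stage and all earlier ones."""
    digest = hashlib.sha256()
    keys = []
    for script in scripts:
        digest.update(script.encode("utf-8"))
        digest.update(b"\0")  # separator so adjacent scripts cannot blur together
        keys.append(digest.hexdigest())
    return keys


def build(scripts: Sequence[str], cache: Set[str]) -> List[str]:
    """Return the names of stages that had to run; stages with a cached key are skipped."""
    ran = []
    for name, key in zip(STAGES, stage_keys(scripts)):
        if key not in cache:
            cache.add(key)  # stands in for saving a checkpoint after the stage
            ran.append(name)
    return ran


cache: Set[str] = set()
first = build(["install libreoffice", "download sample data", "open sheet A"], cache)
second = build(["install libreoffice", "download sample data", "open sheet B"], cache)
print(first)
print(second)
```

With a cold cache the first build runs all three stages, while the second task reuses the install and configure checkpoints and runs only its task-specific setup. Because the keys are cumulative, editing an earlier script would invalidate every later stage as well.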
Combined with network-process-file isolation, this enables massive parallelization; in our experiments, we run 400+ concurrent environments across 1,600 CPUs (Appendix B).

3 Scaling Computer-Use Agent Software Applications

Setting up real-world software as interactable environments is hard, laborious, and time-consuming, even for expert humans [53, 56, 20]. Each software requires installation, configuration with domain-appropriate data, and verification; for instance, a radiology tool requires annotated clinical CT scans, while an ERP system needs transaction histories and vendor accounts. This often demands weeks of expert effort per application, naturally limiting scalability.

The key idea is that setting up computer-use agent environments is itself a coding + computer-use agent task. Because the Gym-Anything library (§2.3) constrains environment creation to a fixed, small interface (writing setup scripts and config files), the creation task becomes a coding task. Further, verifying whether the environment is correctly set up requires launching it and interacting with it, which is a computer-use agent task. However, naively prompting even state-of-the-art agents results in poor environments; the agent stops early, uses fake placeholder data, leaves the software at the wrong starting screen, or claims things are done without actually verifying them. We therefore propose a multi-agent framework that iteratively creates, audits, and improves environments, while accumulating learnings in a shared memory.

Multi-agent framework. Each agent in our framework is an instance of Claude Opus 4.5/4.6 [3] run via Claude Code, differentiated by a.) access to specific tools, and b.) the objective described by its system prompt. In a nutshell, these agents iteratively generate environments, audit the quality and improve them, and document the learnings for future attempts in a shared memory M. We next describe each of the 3 agents in detail.

Creation agent (AgentC).
This is a coding agent equipped with bash, python, and computer-use (for visual grounding) tools, with complete access to the Gym-Anything library and all previously created environments. Given a new software name Sk, we prompt AgentC with a software-agnostic detailed prompt describing the workflow to follow and library usage, along with Sk, with the objective of implementing the software as an environment and then verifying it by actually running and interacting with it. Before writing any scripts, the agent first researches how the software should be configured, finds and downloads real-world data for the environment (e.g., public medical imaging datasets for radiology software, published email corpora for messaging clients), and studies similar previously created environments. It then implements the setup scripts, launches the environment, takes screenshots, uses visual grounding to check that the application reached the expected state (as intended by the setup scripts), and iteratively debugs failures. Crucially, the agent is required to produce evidence that the software was set up correctly in the form of screenshots of the running software, execution logs, etc. (see Appendix D for an example).

However, the agent often declares the task done prematurely. For instance, it may use placeholder data instead of real datasets, leave the software at the wrong screen, or never verify the task by actual execution. We speculate that these failures are due to context fatigue [27, 39]: after hundreds of thousands of tokens, the agent loses track of what it still needs to do. To address this, whenever the agent stops, we re-prompt it to reread the setup guidelines, reread the checklists, and complete any requirements it may have skipped. We find this simple technique recovers many omissions.

Audit agent (Agentaudit). While AgentC typically gets the environment running, its claims about what it has done are not always reliable.
For instance, it may leave the software at a setup wizard instead of the main screen, use placeholder data, or skip verification entirely. However, the evidence produced above reveals the actual state of the environment regardless of what the agent claims: a screenshot shows whether the software is running correctly or stuck on an error screen. To verify this evidence, we use Agentaudit, a similar coding+computer-use agent that acts as an adversary to AgentC and evaluates whether the evidence demonstrates that the environment satisfies a set of quality checklists (see Appendix D). It does so by analyzing the screenshots and logs, inspecting the actual config and script files, and, if necessary, actually running the environment. Given the implementation of software Sk, the audit agent outputs an audit detailing what is correctly implemented and what the critical issues are (Appendix E contains example audits, and Appendix E.4 shows how issues are corrected across audit rounds). In principle, both AgentC and Agentaudit have access to the same tools and files; the only difference lies in their prompt. We find this separation offers multiple benefits: a.) separating agents removes any self-confirmation bias, and we find audits are more detailed and accurate than self-review (§7.4), b.) the written audits ensure higher interpretability, letting human authors independently verify quality (Appendix E), and c.) the adversarial framing catches cases where AgentC made self-misleading claims. The audit findings are fed back to AgentC for correction, and this loop runs for t iterations. A key feature of our framework is that agents accumulate learnings in a shared memory, allowing them to improve over iterations. We describe this mechanism next. Shared memory. The creation agent maintains a shared memory M , effectively a directory of files that grows over time. 
M is initialized with the hand-written prompt for AgentC (describing the setup workflow, checklists for verification, and library usage) and evolves as agents add their learnings. After each environment, AgentC documents what it tried, what failed, and what fixed it, updating M in two places: software-specific notes Msoft and general notes Mgeneral that could help future agents. For instance, one agent discovered that a multi-service web platform needed readiness polling before the GUI could launch; once added to M, it became the default for all subsequent web stacks, resulting in faster creation. This ensures sublinear growth in creation time: as more environments are built, newer environments can be created faster. Further, M acts as an asynchronous but shared memory, such that multiple agents running in parallel can write to and read from discoveries made by other agents.

However, as more environments are created, Msoft grows large, causing future agents to miss important details due to long contexts. To address this, after every L environments, a memory summarization agent (Agentsumm) reads through all memory files, finds common patterns, and summarizes findings from Msoft into Mgeneral. This theoretically adds only ∼1/L overhead compared to each agent reading the full Msoft every time.

Output. Applying this recipe to the software identified through our GDP-grounded selection (§2.2), we construct environments for 200 software applications across three operating systems (Linux, Windows, Android), ranging from desktop applications to multi-service enterprise systems, each configured with realistic data (public email corpora, medical imaging datasets, financial schemas, and government open data). We select 200 based on our current compute budget; the pipeline is fully automated and extensible to additional software. We next describe how tasks are generated for these environments.
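The control flow of this section (create, audit, feed back, record learnings, periodically summarize) can be sketched schematically. Everything below is hypothetical: the agents are replaced by deterministic toy functions rather than LLM calls, stopping early once an audit is clean is an assumption (the text states only that the loop runs for t iterations), and promoting a note to general memory once it appears for two software applications is a crude stand-in for the summarization agent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SharedMemory:
    """Stand-in for M: general notes plus per-software notes."""
    general: List[str] = field(default_factory=list)
    software: Dict[str, List[str]] = field(default_factory=dict)


def creation_audit_loop(
    name: str,
    create: Callable[[str, List[str], SharedMemory], dict],
    audit: Callable[[str, dict], List[str]],
    memory: SharedMemory,
    max_iters: int = 5,
) -> Tuple[int, List[str]]:
    """Alternate creation and audit; stop once the audit reports no issues (assumption)."""
    issues: List[str] = []
    rounds = 0
    for rounds in range(1, max_iters + 1):
        evidence = create(name, issues, memory)  # scripts, screenshots, logs
        issues = audit(name, evidence)           # independent check of the evidence
        memory.software.setdefault(name, []).extend(issues)  # record learnings
        if not issues:
            break
    return rounds, issues


def summarize(memory: SharedMemory) -> None:
    """Promote notes seen for two or more software applications into general notes."""
    counts: Dict[str, int] = {}
    for notes in memory.software.values():
        for note in set(notes):
            counts[note] = counts.get(note, 0) + 1
    for note in sorted(counts):
        if counts[note] >= 2 and note not in memory.general:
            memory.general.append(note)


def make_toy_creator(defects: List[str]) -> Callable[[str, List[str], SharedMemory], dict]:
    """Toy creation agent: repairs one flagged defect per round."""
    remaining = set(defects)

    def create(name: str, issues: List[str], memory: SharedMemory) -> dict:
        if issues:
            remaining.discard(issues[0])
        return {"software": name, "defects": sorted(remaining)}

    return create


def toy_audit(name: str, evidence: dict) -> List[str]:
    """Toy audit agent: reports every defect still visible in the evidence."""
    return list(evidence["defects"])


memory = SharedMemory()
erp_rounds, erp_issues = creation_audit_loop(
    "toy-erp", make_toy_creator(["placeholder data", "setup wizard open"]), toy_audit, memory
)
viewer_rounds, viewer_issues = creation_audit_loop(
    "toy-viewer", make_toy_creator(["placeholder data"]), toy_audit, memory
)
summarize(memory)  # in the paper's setup this would run once every L environments
print(erp_rounds, viewer_rounds, memory.general)
```

The point of the structure is that the audit function sees only the evidence, never the creator's own claims, which mirrors the separation between AgentC and Agentaudit described above.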
4 Scaling Tasks While §3 addressed the primary bottleneck of getting complex software to run correctly (handling installation, configuration, and background services), generating diverse tasks over these environments poses a separate scaling challenge. Recall from §2.1 that a task requires a starting environment state E(s0), a natural-language instruction p, and a verifier V . Once the base software is configured, creating new tasks reduces to generating these task-specific assets. Nonetheless, naively prompting a model to generate these often results in subpar quality. For instance, setup scripts reference non-existent data, formats mismatch the software’s expectations, and instructions are either trivially simple or impossible to execute from the given starting state. Conversely, relying purely on agentic models to author and validate every task is prohibitively expensive for scaling. To scale task creation efficiently, we propose a propose-and-amplify strategy. First, a proposer agent (Claude Opus 4.5/4.6 via Claude Code, equipped with computer-use tools) proposes a small set of high quality, difficult seed tasks per software. The agent is provided with a set of guidelines for high quality tasks across three dimensions: a.) realism (does the instruction reflect a genuine, real-world use case?), b.) difficulty (does the task require a long-horizon, multi-step trajectory to solve?), and c.) diversity (do the tasks cover varied functionalities of the software?). An agentic loop is necessary here because the model must actively run the software, search, download or generate realistic data, interact via the GUI, and verify the resulting state. Crucially, this expensive step only occurs once per software, ensuring core functionality across relevant occupations identified in §2.2 is covered. Second, for amplification, a non-agentic LLM (Gemini 3 Pro) uses these high-quality seeds as in-context examples to generate additional tasks at scale. 
While the agentic seeds ensure realism and difficulty in further generated tasks, naively sampling from a non-agentic LLM often yields repetitive or very similar instructions. To enforce diversity, we generate tasks sequentially, providing the model with all previously generated instructions $1, \dots, t$ as context for task $t+1$. We subsequently apply semantic similarity filtering to discard duplicate tasks. Finally, because the non-agentic LLM generates tasks without interactive execution, we implement an automated filtering step. We launch each generated task, capture the starting state observation $o_0$, and pass it alongside the instruction to a Vision-Language Model (VLM) to check whether the start state matches the expectation from the task description. Tasks that fail this test are filtered from the dataset. Examples of task descriptions and starting states are provided in Appendix G.

4.1 Task Verification

Recall that each task $T = (E_{s_0}, p, V)$ includes a verification function $V$ that maps the agent’s trajectory to a score (§2.1). Evaluating long-horizon trajectories requires $V$ to be both robust and granular. We construct $V$ as a checklist-based VLM verifier augmented with privileged information.

Privileged information. Each task’s starting state $s_0$ is configured by the setup scripts $S$ (the install, configure, and task setup scripts from §2.3). These scripts contain ground-truth data that is not present in the task description $p$ but is deterministically tied to the environment’s configuration. We call this privileged information $I = \mathrm{Extract}(S, p)$, extracted automatically by a separate coding agent that parses the scripts, retrieves, or searches online for the relevant ground truth. For instance, in a medical imaging task, the correct tumor location is already known from the downloaded dataset; in a financial task, the expected account balances are determined by the initialization data.
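To make the idea of a checklist-based verifier with privileged information concrete, here is a minimal sketch of weighted partial-credit scoring. The rule-based `toy_judge` is a hypothetical stand-in for a VLM call, the assumption that each judgment lies in [0, 1] is ours, and the trajectory, criteria, weights, and tolerance values are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ChecklistItem:
    criterion: str  # the subtask to verify
    weight: float   # point value awarded if the criterion is met


# A judge maps (trajectory, criterion, privileged info) to a value in [0, 1].
Judge = Callable[[List[str], str, dict], float]


def checklist_score(
    trajectory: List[str],
    checklist: List[ChecklistItem],
    privileged: dict,
    judge: Judge,
) -> float:
    """Weighted sum of per-criterion judgments, giving partial credit."""
    return sum(
        item.weight * judge(trajectory, item.criterion, privileged)
        for item in checklist
    )


def toy_judge(trajectory: List[str], criterion: str, privileged: dict) -> float:
    """Rule-based stand-in for a VLM: returns 1.0 if the criterion is met, else 0.0."""
    if criterion == "reported total within tolerance":
        low, high = privileged["tolerance"]  # ground truth hidden from the agent
        return float(
            any(step.isdigit() and low <= int(step) <= high for step in trajectory)
        )
    return float(criterion in trajectory)


checklist = [
    ChecklistItem("filtered by operator", 0.25),
    ChecklistItem("filtered by month", 0.25),
    ChecklistItem("reported total within tolerance", 0.5),
]
privileged = {"tolerance": (133, 137)}
trajectory = ["filtered by operator", "136"]  # the agent skipped the month filter

score = checklist_score(trajectory, checklist, privileged, toy_judge)
print(score)
```

Here the agent earns credit for the operator filter and the final answer but not for the skipped month filter. Only the judge reads the privileged tolerance; the agent's trajectory never contains it.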
Importantly, I assists the VLM verifier rather than making the task artificially harder for the evaluated computer-use agent.

Checklist-based verification. The VLM verifier uses I alongside the task instruction p to generate a granular checklist $C = \{(c_i, w_i)\}_{i=1}^{N}$, where each $c_i$ defines a specific subtask to verify and $w_i$ is its point value. Given the evaluated agent’s trajectory $\tau$, the verification score is:

$$V(\tau) = \sum_{i=1}^{N} w_i \cdot \mathrm{VLM}(\tau, c_i, I) \tag{2}$$

where each $\mathrm{VLM}(\tau, c_i, I)$ returns a binary judgment of whether subtask $c_i$ was completed, using I to check the agent’s outputs against known ground-truth answers. This formulation allows partial credit on complex, multi-step tasks without requiring manual annotation. Table 1 shows representative examples of privileged information and the concrete checklist items it enables across scientific analysis, clinical reporting, business operations, and clinical decision support.

Table 1: Full-text examples of task descriptions, privileged information, and VLM checklist items; color only highlights the most important privileged information.

Software: AstroImageJ
Task Description: Analyze the WASP-12 (RA: 06:30:32.79, Dec: +29:40:20.4) astronomical image sequence from January 5-6, 2016, to identify evidence of a planetary transit. If a transit is detected, determine the transit depth, mid-transit time (BJD_TDB or JD), and transit duration in hours. Using a host star radius of 1.599 solar radii, estimate the planet’s radius in Jupiter radii. Save your findings and uncertainties to /Documents/transit_analysis.txt.
Privileged Information: Target: WASP-12b. Expected Transit Depth: ~1.4% (0.014 relative flux). Expected Duration: ~2.7 hours. Expected Planet Radius: ~1.79 Jupiter radii (calculation: sqrt(0.014) * 1.599 R_sun * conversion factors). The dataset is real ground-based imagery from Jan 5-6, 2016, so the light curve will have noise but the transit dip should be clearly visible.
VLM Checklist:
1. The agent loads the sequence of astronomical images into AstroImageJ.
2. The agent selects the target star (WASP-12) and appropriate comparison stars for differential photometry.
3. The agent generates and displays a light curve plot showing the star’s flux over time.
4. The agent fits a model or trend line to the data to characterize the transit.
5. The agent reports the measured transit depth, mid-transit time, and duration.
6. The agent calculates and reports the planet’s radius in Jupiter radii.

Software: Apache OpenOffice Writer
Task Description: You are a Clinical Research Associate (CRA) performing an Interim Monitoring Visit (IMV) for Protocol ZN-994 at Site 142. Using the visit data in /home/ga/Documents/visit_notes.json, create a formal IMV Report in Apache OpenOffice Writer saved as /home/ga/Documents/IMV_Report_Site_142.odt. The report must include a document header with the Protocol and Site numbers, page numbers in the footer, and sections using ‘Heading 1’ style for ‘Visit Details’, ‘Subject Enrollment’, ‘Protocol Deviations’, and ‘Action Items’. [. . . ]
Privileged Information: Protocol Number: ZN-994. Site Number: 142. Enrollment counts from the input file: Screened=18, Randomized=14, Completed=3, Discontinued=2. The calculated ‘Active’ count must be exactly 9 (14 - 3 - 2 = 9). The Action Items table contains items AI-02 and AI-03 which have an ‘Open’ status and must be formatted with yellow highlight or red text. The four required sections are ‘Visit Details’, ‘Subject Enrollment’, ‘Protocol Deviations’, and ‘Action Items’.
VLM Checklist:
1. Verify the document has a header with Protocol and Site numbers, and a footer with page numbers.
2. Verify the four required sections are present and use the ‘Heading 1’ style.
3. Verify the Subject Enrollment section contains a table with the correct base counts.
4. Verify the ‘Active’ count in the Subject Enrollment table is correctly calculated.
5. Verify the Protocol Deviations list and Action Items table structure.
6. Verify that Action Items with an ‘Open’ status are conditionally formatted.

Software: Aerobridge
Task Description: Calculate the total flight duration in minutes for all flight plans belonging to the operator ‘SkyHigh Surveyors’ that occurred in October 2023. Save the total number of minutes to /home/ga/Documents/utilization_report.txt.
Privileged Information: The correct total flight duration for ‘SkyHigh Surveyors’ in October 2023 is 135 minutes (acceptable tolerance: 133-137 minutes). This is calculated from two flights: Flight 1 on Oct 5 (10:00-10:45, 45 mins) and Flight 2 on Oct 12 (14:00-15:30, 90 mins). Distractor flights (wrong operator or wrong month) must be excluded.
VLM Checklist:
1. Locate flight plans belonging to the operator ‘SkyHigh Surveyors’.
2. Identify the flight plans that occurred in October 2023.
3. View the details or times of the relevant flights to calculate duration.
4. Open or create the utilization report file.
5. Save the correct total flight duration to the report file.

Software: Liverpool Cancer iChart
Task Description: Using the Liverpool Cancer iChart Archive app, determine the drug-drug interaction between Dabrafenib and Ketoconazole (located in the Antifungal agents category). Leave the application on the screen displaying the interaction result.
Privileged Information: Dabrafenib is a cancer drug (BRAF inhibitor) and Ketoconazole is an antifungal agent. The interaction between them is clinically significant due to potent CYP3A4 inhibition by ketoconazole, which increases dabrafenib AUC by 71% and Cmax by 33%. The VLM should look for an interaction result indicating a severe/red warning or mentioning CYP3A4 inhibition.
VLM Checklist:
1. The Liverpool Cancer iChart Archive application is opened.
2. Dabrafenib is selected as the cancer drug.
3. The Antifungal agents category is accessed.
4. Ketoconazole is selected as the comedication.
5. The interaction result between Dabrafenib and Ketoconazole is displayed on the screen.
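The scoring rule in Eq. (2) can be sketched in a few lines. This is a minimal sketch, not the paper's verifier: `judge` is a hypothetical callable standing in for the binary VLM(τ, c_i, I) judgment, the equal 20-point weights are an assumption made for illustration (the paper does not list per-item weights), and the `integrity_ok` flag mirrors the zeroing rule for integrity failures described in the paragraph that follows.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ChecklistItem:
    description: str  # subtask c_i
    weight: float     # point value w_i


def checklist_score(
    items: Sequence[ChecklistItem],
    judge: Callable[[ChecklistItem], bool],
    integrity_ok: bool = True,
) -> float:
    """Weighted sum of binary judgments; zero if any integrity item failed."""
    if not integrity_ok:
        return 0.0
    return sum(item.weight for item in items if judge(item))


# Checklist text taken from the Aerobridge row of Table 1; weights are assumed.
items = [
    ChecklistItem("Locate flight plans belonging to the operator", 20.0),
    ChecklistItem("Identify the flight plans that occurred in October 2023", 20.0),
    ChecklistItem("View the details or times of the relevant flights", 20.0),
    ChecklistItem("Open or create the utilization report file", 20.0),
    ChecklistItem("Save the correct total flight duration to the report file", 20.0),
]

# Hypothetical judgments: the agent found the flights but never wrote the report.
verdicts = {item.description: index < 3 for index, item in enumerate(items)}


def judge(item: ChecklistItem) -> bool:
    return verdicts[item.description]


partial = checklist_score(items, judge)
zeroed = checklist_score(items, judge, integrity_ok=False)
```

Here `partial` is 60.0, showing how a trajectory earns credit for completed subtasks, while `zeroed` is 0.0 regardless of the checklist outcome.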
In addition to C, the VLM verifier evaluates a separate integrity checklist $C_{\mathrm{int}}$ to ensure the evaluated computer-use agent did not bypass the intended workflow: (a) the intended software was used to complete the task, (b) the required application state was reached through the software’s own interface rather than by directly editing configuration or data files, and (c) the agent did not exploit environment artifacts to shortcut the work. Failing any integrity item sets $V(\tau) = 0$, regardless of the task checklist score. We manually compared human-agreement rates of our checklist-based VLM verification against end-state-only VLM verification and programmatic verification, finding the proposed method to be significantly more reliable (§7.3).

5 CUA-World

Applying Gym-Anything with our compute budget, the proposer generates 5 tasks and the amplifier 75 tasks per piece of software. After filtering, this yields 12,103 tasks and environments across 200 software applications, each with checklist-based verification (Table 2; Figure 5). As shown in Table 2, CUA-World is the first collection to simultaneously provide interactive environments at scale (200+ varieties of software, 10K+ tasks), support long-horizon evaluation, cover all 22 major occupation groups, offer automated environment creation, and include a training split. We divide CUA-World into Train and Test splits.

Table 2: Comparison of CUA-World with datasets and environments for computer-use agents. ✓ yes; × no; ✓∗ partial or with caveats; — not applicable. ⋆ Benchmark allows or requires >100 agent steps per task. ∗ Offline human demonstrations only (not interactive verified trajectories). § Number of 2018 SOC (Standard Occupational Classification) major occupation groups (out of 22 civilian groups) whose workers would routinely use the benchmark’s applications; counted conservatively such that a group is included only if tasks directly simulate work in that occupation.

Column groups: Environment (Agent, Interactive, Platform); Scale (# SW, # Tasks); Task Properties (Long-Horizon⋆, Econ. Cov.§); Infrastructure (Auto-Create, Train Split).

Benchmark | Agent | Interactive | Platform | # SW | # Tasks | Long-Horizon⋆ | Econ. Cov.§ | Auto-Create | Train Split
Static Datasets
Mind2Web [11] | Web | × | Web | 137 | 2,350 | — | 7/22 | × | ✓∗
AITW [38] | CUA | × | Android | 357+ | 715K | — | 4/22 | × | ✓∗
AndroidControl [24] | CUA | × | Android | 833 | 15,283 | — | 7/22 | × | ✓∗
OmniACT [22] | CUA | × | Lin / Win / macOS / Web | 65 | 9,802 | — | 6/22 | × | ✓∗
GDPval [33] | LLM | ✓ | — | — | 1,320 | ✓ | 13/22 | × | ×
Interactive Benchmarks
MiniWob++ [26] | Web | ✓ | Web (sim.) | 1 | 80† | × | 3/22 | × | ✓
WebArena [62] | Web | ✓ | Web | 6 | 812 | × | 5/22 | × | ×
VisualWebArena [23] | Web | ✓ | Web | 3 | 910 | × | 1/22 | × | ×
WorkArena [12] | Web | ✓ | Web | 1 | 33 | × | 3/22 | ✓∗ | ×
WorkArena++ [6] | Web | ✓ | Web | 1 | 682† | ✓ | 3/22 | × | ✓∗
OSWorld [53] | CUA | ✓ | Linux / Win | 9 | 369 | × | 3/22 | × | ×
AndroidWorld [37] | CUA | ✓ | Android | 20 | 116† | × | 2/22 | × | ×
WindowsAgentArena [7] | CUA | ✓ | Windows | 11 | 154 | × | 3/22 | × | ×
Spider2-V [9] | CUA | ✓ | Linux / Cloud | 20 | 494 | × | 2/22 | × | ×
ScienceBoard [41] | CUA | ✓ | Linux | 6 | 169 | × | 2/22 | × | ×
AssistGUI [18] | CUA | ✓ | Windows | 9 | 100 | × | 3/22 | × | ×
TheAgentCompany [55] | CUA | ✓ | Linux / Web | 5 | 175 | × | 4/22 | × | ×
ProgrammingWithPixels [2] | CUA | ✓ | Linux | 1 | 5400 | × | 1/22 | × | ×
CUA-World (Ours) | CUA | ✓ | Lin / Win / Android / Web | 200+ | 10,000+ | ✓ | 22/22 | ✓ | ✓

[Figure 5: four bar-chart panels comparing CUA-World (Ours), OSWorld, WebArena, AndroidWorld, WindowsAgentArena, and TheAgentCompany: # Software Products (log scale), # Tasks (log scale), Occupation Coverage (/22 SOC groups, linear scale), and OS Platforms (linear scale).]
Figure 5: Quantitative comparison of CUA-World against existing benchmarks across four dimensions. The first two axes use a log scale.

Contamination filtering. To ensure no data leakage between splits, we apply a conservative contamination check. Given two task instructions, we prompt an LLM to grade their similarity on a scale of 1 to 8 (ranging from “not similar” to “duplicate, subset, or superset”). Any pair scoring 4 (“very similar”) or higher is flagged as contaminated.
We formalize this by treating tasks as nodes and contamination flags as edges in a similarity graph. We compute the connected components of this graph and randomly assign entire components to either the Train or Test split, ensuring no two tasks across splits contaminate each other. Manual verification shows the pipeline is suitably conservative: it flags several non-contaminated pairs (false positives) but misses very few true instances of contamination (false negatives). For more details, see Appendix H.

CUA-World-Long. To evaluate agents on extremely long-horizon tasks, we introduce CUA-World-Long, a set of 200 tasks (one per software). The key challenge is generating tasks that are genuinely harder than those already in the benchmark while remaining solvable. We address this with a trajectory-guided strategy: for each piece of software, we first generate k trajectories from a strong computer-use agent on existing tasks, then prompt a coding and visual agent to analyze these trajectories, specifically identifying why certain tasks have lower pass rates and noting common failure modes. The agent also receives a set of 8 quality guidelines covering real-world relevance, objective evaluability, realistic data, and others (see Appendix F.1). Based on this analysis, the agent creates a new task designed to be harder than the existing ones for that software application. While the agent’s failure assessment is not perfect, the resulting tasks are substantially more difficult. We manually verify that all 200 tasks are set up correctly and are meaningful according to the 8 quality criteria. Tasks that fail this verification are iteratively refined through further interaction with the agent. The full pipeline is described in Appendix F.2. These tasks often require more than 200 steps for human completion, and current models frequently fail even after 500 steps.
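The contamination-aware Train/Test split from this section (flag pairs scoring 4 or higher, build a similarity graph, assign whole connected components to one split) can be sketched as follows. This is a minimal sketch under assumptions: the pairwise scores are taken as given rather than produced by an LLM, and the `test_fraction` and `seed` parameters are hypothetical, since the paper does not state the split ratio or the sampling procedure.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def flag_pairs(scores: Dict[Pair, int], threshold: int = 4) -> List[Pair]:
    """Pairs graded at or above the threshold on the 1-8 scale are contaminated."""
    return [pair for pair, score in scores.items() if score >= threshold]


def split_by_components(
    n_tasks: int,
    flagged: List[Pair],
    test_fraction: float = 0.2,  # assumed ratio, not from the paper
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Assign each connected component of the similarity graph to one split."""
    parent = list(range(n_tasks))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in flagged:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    components: Dict[int, List[int]] = defaultdict(list)
    for task in range(n_tasks):
        components[find(task)].append(task)

    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for members in components.values():
        target = test if rng.random() < test_fraction else train
        target.extend(members)
    return train, test


# Toy scores: tasks 0-1-2 form one component, 3-4 another, 5 is isolated.
scores = {(0, 1): 6, (1, 2): 4, (3, 4): 8, (2, 5): 2}
pairs = flag_pairs(scores)
train, test = split_by_components(6, pairs, test_fraction=0.5, seed=0)
```

Because assignment happens per component, tasks 0 and 2 land in the same split even though they were never directly compared or flagged as a pair; this transitivity is what makes the check conservative.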
6 Experimental Setup

We next describe how we use CUA-World in two roles: as a source of training data for distilling smaller models, and as an evaluation benchmark for computer-use agents.

6.1 Training

To evaluate the utility of CUA-World-Train, we distill execution trajectories from a strong teacher model (Kimi-K 2.5 [43]) into a smaller student model (Qwen3-VL-2B-Thinking [58]). For every task in the training split, we generate k = 4 trajectories from the teacher until at least one is correct, and utilize these successful rollouts for fine-tuning. Cumulatively, we collect roughly 2000 trajectories across all 10,000 tasks. We further systematically ablate several design choices in this distillation process on a small set of software, investigating (1) teacher model selection and (2) the optimal number of steps and samples per trajectory (see §7.4 for results). We then use the best configuration to distill our model on all trajectories. Post-distillation, we evaluate the models on CUA-World-Test alongside external benchmarks such as OSWorld [53].

6.2 Test-Time Auditing (TTA) Agent

Some of the tasks in CUA-World are extremely long. This opens up a unique opportunity to test agents capable of working over extended horizons. However, we find that current agents often stop after a few dozen steps, making mistakes or prematurely claiming the task is complete when it is not. Inspired by our approach in software generation (§3), we introduce an audit agent to address this. Whenever the main model signals that the task is terminated, we run this audit agent. It takes the complete trajectory (all screenshots) as input and determines whether the task is actually complete. Crucially, it does not receive the chain-of-thought from the main model, as we find this biases the auditor’s assessment. If the audit agent determines that the task is not completed, it generates an explanation of what is missing.
We provide this feedback to the main computer-use agent, prompting it to continue working on the task.

6.3 Evaluation

We evaluate agents using the checklist-based VLM verifier described in §4.1. Each task’s checklist consists of weighted subtasks; we report two metrics: (1) Average Score (0-100), the mean checklist score across tasks, which captures partial credit, and (2) Pass Rate (%), the fraction of tasks fully completed, i.e., achieving a perfect checklist score. We evaluate on CUA-World-Test (the full test split) and CUA-World-Long (200 long-horizon tasks, one per software). Unless otherwise noted, we use Gemini 3 Flash as the VLM verifier. Each agent is given a maximum budget per episode: 200 steps for CUA-World-Test, and 500 steps or $5, whichever is reached first, for CUA-World-Long. For GPT-5.4 and Claude Sonnet 4.6, we use their official agent harnesses. For Gemini 3 Flash and Kimi-K 2.5, official harnesses were not publicly available at the time of our experiments, so we adapted the Qwen3-VL harness (Appendix K.2).

7 Results and Analysis

7.1 Main Results

Table 3: Model performance on CUA-World-Test. Our 2B distilled model outperforms open-source models up to 2× its size.

Model | Avg. Score | Pass Rate
Large Models
Gemini 3 Flash | 50.1 | 22.6
Kimi-K 2.5 | 37.1 | 12.8
Small Models
Qwen3-VL-2B | 12.7 | 1.6
Qwen3-VL-4B | 19.3 | 3.9
Ours (2B distilled) | 22.5 (+9.8) | 4.4 (+2.8)

Distillation on CUA-World-Train yields a strong 2B model. Table 3 shows results on CUA-World-Test. We evaluate four frontier models (Gemini-3-Flash, Kimi-K 2.5, Claude-Sonnet-4.6, and GPT-5.4), along with several small models. Gemini-3-Flash is strongest with a 50.1 average score and 22.6% pass rate, followed by Kimi-K 2.5 with 37.1 and 12.8%. At the other extreme, small models perform very poorly: Qwen3-VL-2B achieves only a 1.6% pass rate, while Qwen3-VL-4B achieves 3.9%.
Distillation on CUA-World-Train trajectories shows significant improvements, boosting the pass rate of Qwen3-VL-2B from 1.6% to 4.4%, outperforming Qwen3-VL-4B, a model 2× its size. This demonstrates that CUA-World-Train provides a useful supervision signal for improving small models.

Table 4: Performance on CUA-World-Long.

Model | Avg. Score | Pass Rate
Max 500 steps, $5 cost cap
Gemini 3 Flash | 36.2 | 7.5
GPT-5.4 | 22.7 | 3.0
Sonnet 4.6 | 20.5 | 6.0
Kimi-K 2.5 | 33.9 | 5.5
Max 2000 steps, no cost cap
Gemini 3 Flash | 38.7 | 11.5
GPT-5.4 | 55.5 | 27.5

CUA-World-Long is challenging for frontier models. Table 4 shows the performance of multiple frontier models on CUA-World-Long. Even the strongest model, Gemini-3-Flash, achieves a pass rate of only 7.5% and an average score of 36.2. Interestingly, GPT-5.4 achieves a 3% pass rate while Claude-Sonnet-4.6 achieves 6%. This is partly because they exhaust their $5 budget in roughly 150 steps (≤100 for GPT-5.4), far fewer than Gemini-3-Flash. To test whether budget is a bottleneck, we remove the cost cap and raise the step limit to 2,000 for GPT-5.4 and Gemini-3-Flash (Table 4, lower half). Both models improve substantially, with GPT-5.4 notably reaching a 27.5% pass rate. However, these improvements come at a substantial test-time cost. On average, Gemini-3-Flash requires 1,300 steps and approximately $16 per trajectory, while GPT-5.4 requires 242 steps and approximately $18 per trajectory. These results highlight that improvements in model capabilities are needed before agents can reliably and efficiently handle the long-horizon, multi-step workflows that CUA-World-Long demands.

Scaling Software Applications and Environments: Performance scales with both increasing software and task count.
Figure 6a shows how the distilled 2B model’s score on CUA-World-Test changes as we scale the training data along two axes: the number of software applications (50, 100, 200) while keeping all tasks per software, and the fraction of tasks (25%, 50%, 100%) across all 200 software applications. Both curves show consistent performance improvements, following a roughly log-linear trend of a ∼3.5-point increase per doubling of the data. This suggests that further scaling our Gym-Anything pipeline could yield an even stronger model.

[Figure 6: two panels. (a) Avg. Score on CUA-World-Test vs. Fraction of Training Data (0%, 25%, 50%, 100%), with curves for # Software and # Tasks and an untrained baseline. (b) Pass Rate (%) vs. Average Steps Taken per Task for Gemini 3 Flash (2.0%, 2.5%, 6.5%, 7.5%, 11.5%), plus Test-Time Auditing (14.0%).]
Figure 6: Scaling behavior on CUA-World. (a) Training data scaling on CUA-World-Test: varying the number of software (50, 100, 200) or the fraction of tasks (25%, 50%, 100%). Both axes improve with scale, following a roughly log-linear trend. (b) Test-time compute scaling on CUA-World-Long: pass rate as a function of average steps taken per task, where each point corresponds to a different maximum step budget (50, 100, 200, 500, 2,000 steps). The star indicates Test-Time Auditing (TTA, §6.2) under the same 2,000-step cap.

Generalization: Distillation improves performance on both seen and unseen software, but gains are larger on seen software. To study how distillation generalizes to software not seen during training, we train models on 25% and 50% of the 200 software applications and evaluate separately on the software used during training (IID) and the software that is not (OOD) (Figure 7). Performance improves on both: on IID software, the average score increases from 16.7 to 24.2 (at 25% of software), and on OOD software from 12.3 to 14.1. However, the OOD gain is limited; Figure 7 shows it recovers only 22-27% of the improvement one would obtain from training on all software, compared to 65-87% on IID software. For example, at 25% of software, where Figure 7 reports 28.2 (IID) and 18.9 (OOD) for the model trained on all software, the recovered fractions are (24.2 − 16.7)/(28.2 − 16.7) ≈ 65% and (14.1 − 12.3)/(18.9 − 12.3) ≈ 27%. This suggests that generalization to unseen software does happen but is limited. Second, because recovery on IID software is 65-87% rather than complete, training on a specific application helps substantially, but training on other applications also contributes to performance on it. Overall, this underscores that building agents for the large variety of software used in the digital economy requires training that is both software-specific and spread across a diverse set of software, motivating the need for scalable environment creation pipelines such as Gym-Anything. Larger models may generalize better across software, which we leave to future work.

[Figure 7: two panels (Trained on 25% Software; Trained on 50% Software), each with IID and OOD bars on an Avg. Score axis, spanning from the untrained baseline to the model trained on all software; recovered fractions are 65% (IID) and 27% (OOD) at 25%, and 87% (IID) and 22% (OOD) at 50%.]
Figure 7: Generalization to seen (IID) vs. unseen (OOD) software. We train models on 25% (left) and 50% (right) of the 200 software applications, and evaluate on the training software applications (IID) and the held-out software applications (OOD). Each bar spans from the untrained baseline (bottom) to the model trained on all software (top). The solid portion shows the gain recovered by the model trained on the subset; the hatched portion shows the remaining gap. Training on a subset recovers 65-87% of the gain on IID software but only 22-27% on OOD software, indicating that generalization to unseen software is limited and scaling to diverse software is important.

7.2 Scaling Test-Time Compute

Pass rate scales with step budget, and Test-Time Auditing provides further gains.
Figure 6b shows how Gemini 3 Flash’s pass rate on CUA-World-Long changes as we increase the maximum step budget per episode, where each point represents a budget of 50, 100, 200, 500, or 2,000 steps. Pass rate stays low between 50 and 100 average steps (2.0% → 2.5%), then rises steeply at higher budgets (6.5% at ∼200 steps, 7.5% at ∼400 steps). The sharp jump suggests that most CUA-World-Long tasks require a minimum number of steps (>100) before the agent can complete them at all. Increasing compute beyond that continues to help, reaching 11.5% at ∼1,300 average steps. The TTA agent lifts performance further, raising the pass rate to 14.0% under the same 2,000-step cap. Since the maximum step budget remains the same, this likely implies that TTA helps when the model stops prematurely, as the auditor is able to verify the trajectory and provide feedback on any missed subtasks.

7.3 Benchmark Analysis

[Figure 8: bar chart of Retry Loops (step fraction), UI Exploration (step fraction), and Verification Checks (presence rate) for failed vs. passed trajectories.]
Figure 8: Behavioral patterns in passed vs. failed trajectories across Gemini-3-Flash evaluations on CUA-World. See Appendix for the full set of 15 patterns.

[Figure 9: two panels. (a) Step distribution: Fraction of Tasks vs. Avg. Number of Steps per Task for failed and passed tasks (annotated mean: 426). (b) Difficulty distribution: Density vs. Avg. Checklist Score per Task for Gemini 3 Flash (mean 44) and Kimi-K 2.5 (mean 35).]
Figure 9: Properties of CUA-World-Long. (a) Distribution of average steps per task. The y-axis is broken to accommodate the spike at the 500-step cap. (b) Distribution of per-task average checklist scores for Gemini 3 Flash and Kimi-K 2.5.

Trajectory Behavioral Patterns: Failed trajectories are dominated by retry loops, while successful ones verify their progress more often.
To understand how agents behave on CUA-World, we analyze all trajectories from Gemini-3-Flash evaluated on CUA-World, using an automated behavioral analysis pipeline (Appendix M). We focus on Gemini-3-Flash here, and note that patterns may differ across models. We first obtain per-trajectory behavioral summaries via an LLM, then aggregate these across all trajectories to discover recurring behavioral patterns, yielding 15 canonical patterns. Figure 8 highlights three patterns. Retry loops show the largest gap: failed trajectories spend 78% of their steps retrying actions that did not take effect, compared to 39% for passed trajectories. UI exploration is high for both outcomes (76% vs. 67%), indicating that the majority of agent effort across all trajectories is spent navigating menus and locating the right controls rather than executing the core task. Verification checks, where the agent re-inspects its work after making changes, are present in 91% of passed trajectories but only 70% of failed ones, suggesting an association between self-verification and task success. This observation provides empirical motivation for the Test-Time Auditing approach (§6.2).

Step Distribution on CUA-World-Long: Most failed trajectories exhaust the step budget, while passed ones finish at varying lengths. Figure 9a shows the distribution of average steps per task on CUA-World-Long. Failed tasks have a large spike at the 500-step cap, indicating that many episodes keep running until the budget is exhausted rather than failing immediately. Passed tasks are spread across a wide range of lengths, including many tasks that still require several hundred steps. The overall mean is 425 steps, highlighting the long-horizon nature of CUA-World-Long.

Difficulty Distribution on CUA-World-Long: CUA-World-Long spans a wide difficulty range. Figure 9b shows the distribution of per-task average checklist scores on CUA-World-Long for the two
strongest models (Gemini 3 Flash, mean 44; Kimi-K 2.5, mean 35). Outside of the 0-10 bin, scores are spread fairly evenly across the range, indicating that CUA-World-Long contains tasks at every difficulty level rather than being split into trivially easy and impossible ones. The most notable feature is a large spike at 0-10: roughly a quarter of tasks for Gemini and a third for Kimi receive near-zero scores, indicating complete failure on a substantial fraction of tasks even for the strongest models.

[Figure 10: two bar-chart panels of Pass Rate (%) for Gemini 3 Flash, Kimi-K 2.5, Qwen3-VL-2B, and Ours (2B). (a) Visual Complexity (low/medium/high): 25.3/20.9/21.6, 14.8/10.2/14.3, 3.2/0.9/0.0, 7.6/2.7/1.3. (b) Domain Knowledge (general/specialized): 25.6/19.9, 15.2/10.5, 2.7/0.6, 6.4/2.2.]
Figure 10: Pass rate on CUA-World-Test by software category. (a) Visual complexity and (b) domain knowledge. See Appendix I for category definitions and assignment of software to categories.

Performance by Software Category: High visual complexity is a persistent bottleneck for smaller models. We classify each software along two axes: visual complexity (low/medium/high) and domain knowledge (general/specialized); see Appendix I for definitions. Figure 10 shows pass rates on CUA-World-Test broken down by each axis. For visual complexity (a), larger models (Gemini 3 Flash, Kimi-K 2.5) achieve roughly consistent pass rates across all three levels (e.g., 25.3%, 20.9%, 21.6% for Gemini). In contrast, smaller models show a steep decline: Qwen3-VL-2B drops from 3.2% on low-complexity software to 0.0% on high-complexity software.
Distillation improves absolute performance at every level (e.g., 0.0% → 1.3% on high, 3.2% → 7.6% on low), but the decline from low to high remains steep, indicating that visual complexity creates a disparity for small models that distillation alone does not resolve. For domain knowledge (b), all models show a downward trend from general to specialized software, with smaller models showing a steeper decline (∼3× for our 2B model: 6.4% → 2.2%) than large models (∼1.3× for Gemini: 25.6% → 19.9%).

Verifier Robustness

We evaluate the robustness of the verifier along two dimensions: (a) human agreement: how well its correctness judgments agree with those of humans, and (b) integrity checks: how often the integrity checklist Cint correctly identifies shortcut behavior.

Checklist-based verification achieves highest human agreement. We compare three verifier designs on 60 randomly sampled Gemini-3-Flash trajectories from CUA-World-Test: (1) our checklist-based VLM verifier (§4.1), (2) a direct VLM verifier that receives the trajectory and outputs a single pass/fail judgment, and (3) programmatic verifiers where a model generates a script that runs on the end state and computes a score. The checklist-based verifier achieves 93.3% task-level agreement with human annotations, compared to 81.7% for the direct VLM verifier and 43.3% for programmatic verifiers. Per-item checklist agreement is 90.9%. The programmatic approach performs poorly primarily because the model writes incorrect scripts that fail to parse the data formats present in the end state; manually authored programmatic verifiers could provide stronger guarantees and are an interesting direction for future work. Overall, we use the checklist-based VLM verifier with privileged information for all experiments.

Integrity checks catch shortcuts at a low flag rate.
Across ∼3,000 Gemini-3-Flash trajectories, the integrity checklist flags only ∼1.5% of high-scoring runs (score >75), producing 21 flags total, 15 of which are true positives. We describe two representative cases. In a forensic analysis task on Autopsy (digital forensics tool), the agent followed the correct workflow but fabricated hash values in its final report rather than copying the values visible in the application. In a statistical analysis task on Epi Info (epidemiology toolkit), the agent mistyped an input parameter, causing the tool to display an incorrect result, but wrote the mathematically correct answer in its report, a value never shown by the tool. In both cases, the agent scored high on task completion but was zeroed by the integrity check. Additional examples and a detailed breakdown are provided in Appendix C.4.

7.4 Gym-Anything Pipeline Ablations

Distillation Ablations: We ablate the teacher model, student model, and number of training steps to identify the best configuration for full distillation.

Table 5: Teacher model selection on 4 software applications.

Teacher | Teacher Score | Student Score (Q3-VL-2B) | Student Score (Q2.5-3B)
Opus 4.5 | 53.5 | 19.3 | 8.5
Sonnet 4.5 | 45.5 | 17.5 | 9.8
Gemini 3 Flash | 44.0 | 16.3 | 8.3
Kimi-K 2.5 | 39.8 | 25.3 | 15.8
Gemini 3 Pro | 39.3 | 15.8 | 7.0

The strongest teacher does not produce the strongest student. Table 5 compares five teacher models distilled into two student architectures (Qwen3-VL-2B and Qwen2.5-3B) on 4 software applications. Opus 4.5 is the strongest teacher (53.5 avg. score) while Kimi-K 2.5 is one of the weakest (39.8). However, Kimi-K 2.5 produces the best student for both architectures: 25.3 vs. 19.3 (Opus) for Qwen3-VL-2B, and 15.8 vs. 9.8 (Sonnet) for Qwen2.5-3B. One possible explanation is that, unlike other models, Kimi-K 2.5 is open-source and provides full reasoning chains; however, other factors may contribute as well [57].

Table 6: Effect of training trajectory length.

Train | GIMP | G. Earth | OpenEMR | Slicer 3D | Avg.
200 steps | 60.1 | 13.4 | 18.2 | 3.3 | 23.8
50 steps | 64.2 | 13.4 | 8.5 | 4.5 | 22.7
No distillation (Q3-VL-2B) | 47.0 | 7.2 | 0.0 | 0.0 | 13.6

Effect of training trajectory length. Table 6 compares training on the first 50 vs. all 200 steps of each teacher trajectory under the same $25-per-software budget. On average, the two settings perform similarly (22.7 vs. 23.8), but the per-software pattern differs: training on 50 steps wins on GIMP and Slicer 3D, while training on 200 steps is better on OpenEMR, likely because its tasks require longer-horizon interaction. Based on this, we adopt a two-stage curriculum for full distillation: first train with a maximum step budget of 50, then continue on full 200-step trajectories, with equal budget for each stage.

Propose-and-Amplify Ablation: The proposal step substantially improves amplified task quality. To evaluate the propose-and-amplify strategy (§4), we compare tasks generated with and without seed examples (i.e., with and without the proposal step) across 10 software applications. We launch each generated task and use a VLM to verify whether the starting state matches the task description. Tasks amplified from seed examples achieve an 88.9% setup success rate, compared to 55.2% without seeds. Qualitative analysis on three software applications (Firefox, AstroImageJ, Moodle) reveals that without seeds, the model defaults to demonstrating software features rather than generating realistic professional workflows, produces shorter-horizon tasks, and writes less thorough setup scripts (Appendix L).

Cross-Model Auditing: Using a separate model for Agent_audit catches more issues than self-auditing. In §3, we argued that separating Agent_C and Agent_audit removes self-confirmation bias. To test this, we compare audits where the same model serves as both Agent_C and Agent_audit (self-audit) against audits where a different model serves as Agent_audit (cross-model audit) across 10 software applications.
Both configurations detect all critical issues, but cross-model audits consistently surface additional problems that self-audits miss. For example, on OpenELIS, the self-audit accepts patient data as realistic, while the cross-model audit inspects the seeding script and identifies the data as hardcoded despite comments claiming real-world WHO/CDC sourcing. Across 10 software applications, cross-model audits identify on average 2.1 additional issues per environment, predominantly low-to-moderate severity. We present three representative comparisons in Appendix E.5.

Additional analysis. The appendix contains further results and analysis. We qualitatively show how the creation-audit loop iteratively corrects issues across rounds, with before-and-after examples (Appendix E.4). We verify that CUA-World covers all 22 SOC major occupation groups (Appendix J). We apply an automated behavioral analysis pipeline to ∼3,000 trajectories, discovering 15 canonical patterns and comparing their prevalence in passed vs. failed runs (Appendix M). We detail the contamination filtering statistics (Appendix H) and provide 12 representative task examples with agent trajectories (Appendix G). Finally, the CUA-World-Long generation pipeline and a trajectory analysis example are in Appendix F.

8 Related Work

Benchmarks for computer-use agents. Prior benchmarks for computer-use agents are either static or interactive but small-scale (Table 2; Figure 5). Static datasets [11, 38, 24, 22] collect thousands of episodes but evaluate via action-matching rather than execution, penalizing valid alternative strategies. Interactive benchmarks provide execution-based evaluation but cover narrow slices of the software landscape: web benchmarks [26, 62, 23, 12, 6] are restricted to a few websites, desktop benchmarks [53, 7, 9, 18, 55, 41, 2] span at most a handful of applications with manually authored environments, and AndroidWorld [37] covers 20 apps.
Critically, all interactive benchmarks rely on manual environment creation, limiting their scale, and none simultaneously provides training data, long-horizon tasks, or broad occupational coverage. CUA-World addresses these gaps through automated environment creation, yielding 10K+ interactive tasks across 200+ software applications on four platforms, with train/test splits, long-horizon evaluation, and GDP-grounded coverage of all 22 SOC occupation groups.

Automated environment and task generation. Several works generate tasks or trajectories within pre-existing environments [56, 40, 63, 29, 35], but cannot create new ones. LLM-based environment generation has been explored for text planning [21], embodied AI [59], tool-use APIs [50], code editing and SWE training [31, 64], and text-based simulations [60], but not for real GUI software requiring installation, configuration, and realistic data. Concurrently, GUI-GENESIS [10] synthesizes lightweight web replicas from interaction traces of a single app ecosystem for efficient RL training, but does not install or configure real software, handle multi-OS environments, or target long-horizon evaluation. The seed-then-amplify paradigm [49, 54, 28] is effective for generating instruction data at scale, but targets text pairs rather than executable environment tasks. Gym-Anything combines all three: a creation-audit loop that converts real software into interactive environments via coding agents verified by an independent auditor, a propose-and-amplify strategy that generates tasks grounded in actual software execution, and a shared memory that accumulates learnings across environments.

Evaluation of computer-use agents. Existing benchmarks predominantly use hand-written programmatic verifiers that check the final system state [53, 62], which are reliable but labor-intensive and offer only binary pass/fail.
VLM-based evaluation has been explored for filtering training trajectories [56], step-level trajectory assessment [42], and autonomous evaluation of agent trajectories [32], but these approaches lack access to ground-truth answers and cannot detect workflow shortcuts. Our checklist-based VLM verifier addresses both gaps by incorporating privileged information extracted from environment setup scripts, enabling verification against known answers without per-task code, and by including integrity checks that detect workflow bypasses such as fabricated outputs or tool misuse. We provide an extended related work with additional coverage of training methods, economic grounding, and detailed per-benchmark comparisons in Appendix N.

9 Conclusion

In this work, we introduced Gym-Anything, a scalable framework for converting arbitrary software into interactive computer-use environments. By reducing environment creation to setup scripts and configuration files, and by framing creation itself as a multi-agent loop of generation, auditing, and correction, Gym-Anything addresses a central bottleneck in computer-use agents: the difficulty of constructing realistic environments at scale. Applying this framework, we built CUA-World, a GDP-grounded collection of over 10K tasks across 200 software applications spanning diverse occupations, domains, and operating systems, together with checklist-based VLM verification and train/test/long-horizon splits. We further showed that CUA-World provides useful supervision for training smaller agents through distillation, and that test-time auditing can improve performance on especially long-horizon tasks. At the same time, CUA-World-Long is challenging even for frontier models, indicating that realistic computer-use remains far from solved.
More broadly, our results suggest that progress in computer-use agents will require not only stronger models, but also scalable methods for constructing the environments and tasks on which those models are trained and evaluated. We hope that Gym-Anything, CUA-World, and the released code and infrastructure provide a foundation for future work on long-horizon, economically relevant computer-use, including more capable agents, stronger verifiers, and broader coverage of the software that underlies real-world digital work.

10 Acknowledgements

Pranjal is supported by a SoftBank Group-Arm Fellowship. This work was supported in part by the National Science Foundation under Grant Nos. DMS-2434614 and DMS-2502281, a gift from Convergent Research, and a grant of compute credits from Microsoft Azure.

11 Limitations

Our GDP-grounded software selection is designed to produce a reasonable ranking of which software matters more, not a precise dollar-level attribution. We use the strongest available LLM with web-search access to estimate the share factors in Equation 1, but other methods may yield more accurate estimates. While we specifically select the closest sandboxable alternative for software that cannot be freely sandboxed (e.g., due to licensing), a large fraction of professionally used software remains excluded, and the degree to which performance on free alternatives predicts performance on their commercial counterparts is an open question. While we manually verified that every software environment launches correctly and that every CUA-World-Long task loads with the correct starting state and data, we did not solve all tasks end-to-end, and therefore cannot guarantee that every task is solvable. Creating a fully human-verified version of the benchmark is an interesting direction for future work. Finally, we use VLM checklist verifiers throughout our evaluation pipeline.
Manual annotation shows high human agreement, but like any evaluation method, VLM verifiers are imperfect and may be susceptible to adversarial exploitation. Developing robust programmatic verifiers, each manually vetted per task, would complement the current approach and is another promising direction.

12 Ethics Statement

We acknowledge that computer-use agents may pose risks if deployed autonomously. While this work introduces methods for environment creation and test-time auditing, it does not train or release a model that exceeds existing frontier capabilities. All software used is freely available, and all datasets were obtained from public sources or synthetically generated.

References

[1] Daron Acemoglu. The simple macroeconomics of ai. SSRN Electronic Journal, 2024.
[2] Pranjal Aggarwal and Sean Welleck. Programming with pixels: Can computer-use agents do software engineering? arXiv preprint arXiv:2502.18525, 2025.
[3] Anthropic. The claude model family, 2025.
[4] Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024.
[5] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
[6] Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault Le Sellier De Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste, and Alexandre Drouin. WorkArena++: Towards compositional planning and reasoning-based common knowledge work tasks. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C.
Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 5996–6051. Curran Associates, Inc., 2024.
[7] Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Keunho Jang, and Zheng Hui. Windows Agent Arena: Evaluating multi-modal OS agents at scale. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 4874–4910. PMLR, 13–19 Jul 2025.
[8] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
[9] Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongshen Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, and Tao Yu. Spider2-V: How far are multimodal agents from automating data science and engineering workflows? In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 107703–107744. Curran Associates, Inc., 2024.
[10] Yuan Cao, Dezhi Ran, Mengzhou Wu, Yuzhe Guo, Xin Chen, Ang Li, Gang Cao, Gong Zhi, Hao Yu, Linyi Li, Wei Yang, and Tao Xie. Gui-genesis: Automated synthesis of efficient environments with verifiable rewards for gui agent post-training, 2026.
[11] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 28091–28114.
Curran Associates, Inc., 2023.
[12] Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks? In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 11642–11662. PMLR, 21–27 Jul 2024.
[13] Ramy ElMallah, Krish Chhajer, and Chi-Guhn Lee. Score the steps, not just the goal: Vlm-based subgoal evaluation for robotic manipulation, 2025.
[14] Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. Gpts are gpts: An early look at the labor market impact potential of large language models, 2023.
[15] Edward Felten, Manav Raj, and Robert Seamans. Occupational, industry, and geographic exposure to artificial intelligence: A novel dataset and its potential uses. Strategic Management Journal, 42(12):2195–2217, 2021.
[16] Carl Benedikt Frey and Michael A. Osborne. The future of employment: How susceptible are jobs to computerisation? Technological Forecasting and Social Change, 114:254–280, 2017.
[17] Jonathan Gabor, Jayson Lynch, and Jonathan Rosenfeld. Evilgenie: A reward hacking benchmark, 2025.
[18] Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, and Mike Zheng Shou. AssistGUI: Task-oriented PC graphical user interface automation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13289–13298, June 2024.
[19] Yanheng He, Jiahe Jin, and Pengfei Liu. Efficient agent training for computer use, 2026.
[20] Yanheng He, Jiahe Jin, Shijie Xia, Jiadi Su, Runze Fan, Haoyang Zou, Xiangkun Hu, and Pengfei Liu.
PC agent: While you sleep, AI works – a cognitive journey into digital world. arXiv preprint arXiv:2412.17589, 2024.
[21] Mengkang Hu, Pu Zhao, Can Xu, Qingfeng Sun, Jianguang Lou, Qingwei Lin, Ping Luo, and Saravan Rajmohan. Agentgen: Enhancing planning abilities for large language model based agent via environment and task generation, 2025.
[22] Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem AlShikh, and Ruslan Salakhutdinov. OmniACT: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. In Computer Vision – ECCV 2024, pages 161–178. Springer Nature Switzerland, 2024.
[23] Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, Bangkok, Thailand, August 2024. Association for Computational Linguistics.
[24] Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on UI control agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 92130–92154. Curran Associates, Inc., 2024.
[25] Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. Wildbench: Benchmarking llms with challenging tasks from real users in the wild, 2024.
[26] Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations, 2018.
ICLR 2018; arXiv:1802.08802.
[27] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.
[28] Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, and Ahmed Awadallah. Agentinstruct: Toward generative teaching with agentic flows, 2024.
[29] Shikhar Murty, Christopher Manning, Peter Shaw, Mandar Joshi, and Kenton Lee. Bagel: Bootstrapping agents by guiding exploration with language, 2024.
[30] Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou, Arindam Mitra, Spencer Whitehead, Yu Su, and Ahmed Awadallah. Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents, 2025.
[31] Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym, 2025.
[32] Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Autonomous evaluation and refinement of digital agents, 2024.
[33] Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, and Jerry Tworek. Gdpval: Evaluating ai model performance on real-world economically valuable tasks, 2025.
[34] Norman G. Peterson, Michael D. Mumford, Walter C. Borman, P. Richard Jeanneret, Edwin A. Fleishman, Kerry Y. Levin, Michael A. Campion, Melinda S. Mayfield, Frederick P. Morgeson, Kenneth Pearlman, Marilyn K. Gowing, Anita R. Lancaster, Marilyn B. Silver, and Donna M. Dye.
Understanding work using the occupational information network (O*NET): Implications for practice and research. Personnel Psychology, 54(2):451–492, 2001.
[35] Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, and Yuxiao Dong. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning, 2025.
[36] Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, and Guang Shi. Ui-tars: Pioneering automated gui interaction with native agents, 2025.
[37] Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. AndroidWorld: A dynamic benchmarking environment for autonomous agents. In The Thirteenth International Conference on Learning Representations, 2025. ICLR 2025; arXiv:2405.14573.
[38] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. AndroidInTheWild: A large-scale dataset for Android device control. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 59708–59728. Curran Associates, Inc., 2023.
[39] Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, and Jonas Geiping. The illusion of diminishing returns: Measuring long horizon execution in llms, 2026.
[40] Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, and Zhiyong Wu. Os-genesis: Automating gui agent trajectory construction via reverse task synthesis, 2025.
[41] Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, and Zhiyong Wu. ScienceBoard: Evaluating multimodal autonomous agents in realistic scientific workflows. In The Fourteenth International Conference on Learning Representations, 2026. ICLR 2026; arXiv:2505.19897.
[42] Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. Seagent: Self-evolving computer use agent with autonomous learning from experience, 2025.
[43] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Chen, Dazhi Cheng, Minghan Chu, Jialei Cui, Jiaqi Deng, Muxi Diao, Hao Ding, Mengfan Dong, Mengnan Dong, Yuxin Dong, Yuhao Dong, Angang Du, Chenzhuang Du, Dikang Du, Lingxiao Du, Yulun Du, Yu Fan, Shengjun Fang, Qiulin Feng, Yichen Feng, Garimugai Fu, Kelin Fu, Hongcheng Gao, Tong Gao, Yuyao Ge, Shangyi Geng, Chengyang Gong, Xiaochen Gong, Zhuoma Gongque, Qizheng Gu, Xinran Gu, Yicheng Gu, Longyu Guan, Yuanying Guo, Xiaoru Hao, Weiran He, Wenyang He, Yunjia He, Chao Hong, Hao Hu, Jiaxi Hu, Yangyang Hu, Zhenxing Hu, Ke Huang, Ruiyuan Huang, Weixiao Huang, Zhiqi Huang, Tao Jiang, Zhejun Jiang, Xinyi Jin, Yu Jing, Guokun Lai, Aidi Li, C.
Li, Cheng Li, Fang Li, Guanghe Li, Guanyu Li, Haitao Li, Haoyang Li, Jia Li, Jingwei Li, Junxiong Li, Lincan Li, Mo Li, Weihong Li, Wentao Li, Xinhang Li, Xinhao Li, Yang Li, Yanhao Li, Yiwei Li, Yuxiao Li, Zhaowei Li, Zheming Li, Weilong Liao, Jiawei Lin, Xiaohan Lin, Zhishan Lin, Zichao Lin, Cheng Liu, Chenyu Liu, Hongzhang Liu, Liang Liu, Shaowei Liu, Shudong Liu, Shuran Liu, Tianwei Liu, Tianyu Liu, Weizhou Liu, Xiangyan Liu, Yangyang Liu, Yanming Liu, Yibo Liu, Yuanxin Liu, Yue Liu, Zhengying Liu, Zhongnuo Liu, Enzhe Lu, Haoyu Lu, Zhiyuan Lu, Junyu Luo, Tongxu Luo, Yashuo Luo, Long Ma, Yingwei Ma, Shaoguang Mao, Yuan Mei, Xin Men, Fanqing Meng, Zhiyong Meng, Yibo Miao, Minqing Ni, Kun Ouyang, Siyuan Pan, Bo Pang, Yuchao Qian, Ruoyu Qin, Zeyu Qin, Jiezhong Qiu, Bowen Qu, Zeyu Shang, Youbo Shao, Tianxiao Shen, Zhennan Shen, Juanfeng Shi, Lidong Shi, Shengyuan Shi, Feifan Song, Pengwei Song, Tianhui Song, Xiaoxi Song, Hongjin Su, Jianlin Su, Zhaochen Su, Lin Sui, Jinsong Sun, Junyao Sun, Tongyu Sun, Flood Sung, Yunpeng Tai, Chuning Tang, Heyi Tang, Xiaojuan Tang, Zhengyang Tang, Jiawen Tao, Shiyuan Teng, Chaoran Tian, Pengfei Tian, Ao Wang, Bowen Wang, Chensi Wang, Chuang Wang, Congcong Wang, Dingkun Wang, Dinglu Wang, Dongliang Wang, Feng Wang, Hailong Wang, Haiming Wang, Hengzhi Wang, Huaqing Wang, Hui Wang, Jiahao Wang, Jinhong Wang, Jiuzheng Wang, Kaixin Wang, Linian Wang, Qibin Wang, Shengjie Wang, Shuyi Wang, Si Wang, Wei Wang, Xiaochen Wang, Xinyuan Wang, Yao Wang, Yejie Wang, Yipu Wang, Yiqin Wang, Yucheng Wang, Yuzhi Wang, Zhaoji Wang, Zhaowei Wang, Zhengtao Wang, Zhexu Wang, Zihan Wang, Zizhe Wang, Chu Wei, Ming Wei, Chuan Wen, Zichen Wen, Chengjie Wu, Haoning Wu, Junyan Wu, Rucong Wu, Wenhao Wu, Yuefeng Wu, Yuhao Wu, Yuxin Wu, Zijian Wu, Chenjun Xiao, Jin Xie, Xiaotong Xie, Yuchong Xie, Yifei Xin, Bowei Xing, Boyu Xu, Jianfan Xu, Jing Xu, Jinjing Xu, L. H. 
Xu, Lin Xu, Suting Xu, Weixin Xu, Xinbo Xu, Xinran Xu, Yangchuan Xu, Yichang Xu, Yuemeng Xu, Zelai Xu, Ziyao Xu, Junjie Yan, Yuzi Yan, Guangyao Yang, Hao Yang, Junwei Yang, Kai Yang, Ningyuan Yang, Ruihan Yang, Xiaofei Yang, Xinlong Yang, Ying Yang, Yi Yang, Yi Yang, Zhen Yang, Zhilin Yang, Zonghan Yang, Haotian Yao, Dan Ye, Wenjie Ye, Zhuorui Ye, Bohong Yin, Chengzhen Yu, Longhui Yu, Tao Yu, Tianxiang Yu, Enming Yuan, Mengjie Yuan, Xiaokun Yuan, Yang Yue, Weihao Zeng, Dunyuan Zha, Haobing Zhan, Dehao Zhang, Hao Zhang, Jin Zhang, Puqi Zhang, Qiao Zhang, Rui Zhang, Xiaobin Zhang, Y. Zhang, Yadong Zhang, Yangkun Zhang, Yichi Zhang, Yizhi Zhang, Yongting Zhang, Yu Zhang, Yushun Zhang, Yutao Zhang, Yutong Zhang, Zheng Zhang, Chenguang Zhao, Feifan Zhao, Jinxiang Zhao, Shuai Zhao, Xiangyu Zhao, Yikai Zhao, Zijia Zhao, Huabin Zheng, Ruihan Zheng, Shaojie Zheng, Tengyang Zheng, Junfeng Zhong, Longguang Zhong, Weiming Zhong, M. Zhou, Runjie Zhou, Xinyu Zhou, Zaida Zhou, Jinguo Zhu, Liya Zhu, Xinhao Zhu, Yuxuan Zhu, Zhen Zhu, Jingze Zhuang, Weiyu Zhuang, Ying Zou, and Xinxing Zu. Kimi k2.5: Visual agentic intelligence, 2026.
[44] Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U. Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, Rodrigo Perez-Vicente, Andrea Pierré, Sander Schulhoff, Jun Jet Tai, Hannah Tan, and Omar G. Younis. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032, 2025.
[45] U.S. Bureau of Economic Analysis. National income and product accounts (NIPA). U.S. Department of Commerce, 2024. Interactive data tables, annual estimates. Accessed February 2025.
[46] U.S. Bureau of Labor Statistics. Occupational employment and wage statistics (OEWS). U.S. Department of Labor, 2024. May 2024 estimates. Accessed February 2025.
[47] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
[48] Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu, Huarong Chen, Zaida Zhou, Haotian Yao, Ziwei Chen, Qizheng Gu, Yipu Wang, Heng Wang, Diyi Yang, Victor Zhong, Flood Sung, Y. Charles, Zhilin Yang, and Tao Yu. Opencua: Open foundations for computer-use agents, 2025.
[49] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions, 2023.
[50] Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, and Yuxiong He. Agent world model: Infinity synthetic environments for agentic reinforcement learning, 2026.
[51] Zora Zhiruo Wang, Sanidhya Vijayvargiya, Aspen Chen, Hanmo Zhang, Venu Arvind Arangarajan, Jett Chen, Valerie Chen, Diyi Yang, Daniel Fried, and Graham Neubig. How well does agent development reflect real-world work?, 2026.
[52] Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. Os-atlas: A foundation action model for generalist gui agents, 2024.
[53] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 52040–52094. Curran Associates, Inc., 2024.
[54] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. Wizardlm: Empowering large pre-trained language models to follow complex instructions, 2025.
[55] Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig. TheAgentCompany: Benchmarking LLM agents on consequential real world tasks. In Advances in Neural Information Processing Systems, volume 38, 2025. NeurIPS 2025 Datasets and Benchmarks Track.
[56] Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu. AgentTrek: Agent trajectory synthesis via guiding replay with web tutorials. arXiv preprint arXiv:2412.09605, 2025.
[57] Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, and Radha Poovendran. Stronger models are not always stronger teachers for instruction tuning. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4392–4405, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics.
[58] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu. Qwen3 technical report, 2025.
[59] Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, and Christopher Clark. Holodeck: Language guided generation of 3d embodied ai environments, 2024.
[60] Jiayi Zhang, Yiran Peng, Fanqi Kong, Cheng Yang, Yifan Wu, Zhaoyang Yu, Jinyu Xiang, Jianhao Ruan, Jinlin Wang, Maojia Song, HongZhang Liu, Xiangru Tang, Bang Liu, Chenglin Wu, and Yuyu Luo. Autoenv: Automated environments for measuring cross-environment agent learning, 2025.
[61] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
[62] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, 2024. ICLR 2024; arXiv:2307.13854.
[63] Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, and Erran Li. Proposer-agent-evaluator(pae): Autonomous skill discovery for foundation model internet agents, 2024.
[64] Yiqi Zhu, Apurva Gandhi, and Graham Neubig. Training versatile coding agents in synthetic environments, 2026.
[65] Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, and Jürgen Schmidhuber. Agent-as-a-judge: Evaluate agents with agents, 2024.

Appendix

Table of Contents

A GDP-Grounded Software Selection: Full Pipeline
  A.1 Phase 1: Occupation GDP Calculation
  A.2 Phase 2: Software Discovery
  A.3 Phase 3: Catalog Cleanup
  A.4 Phase 4: Catalog Enrichment
  A.5 Phase 5: GDP Attribution
  A.6 Phase 6: Practical Access-Barrier Evaluation
  A.7 Phase 7: Tiered Selection
  A.8 Pipeline Statistics
B Gym-Anything Framework: Technical Details
  B.1 Environment Specification Schema
  B.2 Multi-Runner Architecture
  B.3 Progressive Checkpointing
  B.4 Platform-Specific Patterns
  B.5 Verification System
  B.6 Episode Artifacts
  B.7 Distributed Execution and Tooling
  B.8 Usage Example
C Prompts
  C.1 Creation Agent Prompt (Overview)
  C.2 Privileged Information Audit Prompts
  C.3 VLM Checklist Verifier Prompts
  C.4 Integrity Check Analysis
  C.5 Contamination Filtering Prompt
D Evidence Documentation
  D.1 Evidence Requirements
  D.2 Audit Checklist
  D.3 Worked Example: Android Studio - Offline Caching Feature
E Audit Quality Checklist and Example Audits
  E.1 Audit Quality Checklist
  E.2 Example Audit: Odoo CRM Environment (Critical Issues Detected)
  E.3 Example Audit: Wireshark Environment (Mixed Results)
  E.4 Cross-Round Audit Examples: How the Creation-Audit Loop Corrects Issues
  E.5 Cross-Model Audit Comparison
F CUA-World-Long: Quality Guidelines and Generation Pipeline
  F.1 Quality Guidelines
  F.2 Task Generation Pipeline
  F.3 Trajectory Analysis Example: 3D Slicer
G Task Examples
H Contamination Filtering Details
  H.1 Pairwise Similarity Scoring
  H.2 Graph Construction and Split Assignment
  H.3 Aggregate Statistics
  H.4 Manual Verification
I Software Categorization
J Occupational Coverage of CUA-World
K Experimental Setup Details
  K.1 Models Used Across the Pipeline
  K.2 Evaluated Models
L Propose-and-Amplify Ablation: Qualitative Analysis
M Trajectory Behavioral Analysis
N Extended Related Work

A GDP-Grounded Software Selection: Full Pipeline

This appendix provides the complete technical details for the software selection pipeline summarized in Section 2.2.

A.1 Phase 1: Occupation GDP Calculation

We assign a GDP value to each of 894 O*NET occupations via a three-step scaling procedure:

1. Wage bill: For each SOC-2018 occupation, compute employment × mean_wage from BLS OEWS (May 2024).
2. Labor compensation: Scale wage bills by the national ratio (Total Compensation / Total Wages) from BEA accounts.
3. Total GDP: Scale labor compensation by the ratio (National GDP / National Compensation).

Output: us_gdp_by_occupation_USD.csv with columns: onetsoc, soc2018, occupation_title, employment, mean_wage, wage_bill, gdp_labor, gdp_total.

A.2 Phase 2: Software Discovery

Category extraction. Occupations are shuffled and batched into groups of 10. For each batch, an LLM (GPT-5) is prompted: “What software categories does each occupation use?” with no fixed taxonomy; the model discovers categories freely. This yields 5,584 occupation→category pairs across 894 occupations.

Category deduplication. Similar categories are clustered via exact normalized-name matching and fuzzy string similarity (token sort ratio ≥ 92%). The most frequent label per cluster is selected as the canonical name.
An LLM (Gemini 3 Flash) adjudicates ambiguous pairs.

Product enumeration. For each unique category, an LLM enumerates widely used software products (name, OS support, aliases). Products are deduplicated within each category via fuzzy matching, producing a catalog of ∼16,000 products across ∼1,400 categories.

A.3 Phase 3: Catalog Cleanup

Three cleaning passes remove noise from the LLM-generated catalog:

Product–category validation. For each category, an LLM verifies which products genuinely belong; mismatches are removed (e.g., “Photoshop” in “Spreadsheets”).

Product existence verification. All products are verified via an LLM with Google Search grounding. Products that do not correspond to real, currently available software (LLM hallucinations) are removed.

All LLM calls are cached in JSONL format for reproducibility.

A.4 Phase 4: Catalog Enrichment

Three parallel enrichment passes classify each product along:

• Pricing: free | paid | trial | freemium; and is_open_source.
• Interface: gui | cli | both.
• Trainability: sandbox_ready (install freely) | self_hostable (deploy in Docker/VM) | free_tier (cloud with free account) | restricted (paid license or org credentials required).

A.5 Phase 5: GDP Attribution

For each occupation, an LLM receives occupation metadata (O*NET code, computer-use importance/level scores), applicable software categories, and available products. It returns a structured allocation:

$$\mathrm{GDP}_{\text{product}} = \sum_{\text{occ}} \mathrm{GDP}_{\text{occ}} \times p_{\text{computer}} \times s_{\text{category}} \times s_{\text{product}}$$

with constraints: $p_{\text{computer}} \in [0, 1]$ bounded by O*NET scores, category shares sum to ∼1.0, and product shares within each category sum to 1.0. The product GDP values are aggregated across all occupations.

Top products by GDP: Microsoft Excel, Microsoft Word, Google Chrome, Microsoft Outlook, Visual Studio Code.

Important: These are estimates generated by our pipeline and should not be cited as a source of rankings in future work.
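The attribution formula in A.5 can be illustrated with a small sketch. It assumes a hypothetical input layout (one dict per occupation with `gdp`, `p_computer`, and per-category `share` and `products` fields); these names, the product names, and all numbers are made up for illustration and do not come from the pipeline's code or data.

```python
from collections import defaultdict


def attribute_gdp(allocations):
    """Sum GDP_occ * p_computer * s_category * s_product over occupations."""
    totals = defaultdict(float)
    for occ in allocations:
        # Portion of the occupation's GDP attributed to computer use.
        computer_gdp = occ["gdp"] * occ["p_computer"]
        for cat in occ["categories"]:
            for product, s_product in cat["products"].items():
                totals[product] += computer_gdp * cat["share"] * s_product
    return dict(totals)


# Hypothetical allocations: category shares sum to 1.0 per occupation,
# product shares sum to 1.0 per category.
allocations = [
    {"gdp": 100.0, "p_computer": 0.5,
     "categories": [
         {"share": 0.6, "products": {"Spreadsheet A": 0.5, "Spreadsheet B": 0.5}},
         {"share": 0.4, "products": {"Editor C": 1.0}},
     ]},
    {"gdp": 200.0, "p_computer": 0.25,
     "categories": [
         {"share": 1.0, "products": {"Spreadsheet A": 1.0}},
     ]},
]

product_gdp = attribute_gdp(allocations)
print(product_gdp)
```

Because the shares are normalized, the attributed product GDP sums to the computer-use portion of occupational GDP (here 50 + 50 = 100), which is a convenient sanity check on any allocation.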
A.6 Phase 6: Practical Access-Barrier Evaluation

Pricing and interface filters are necessary but not sufficient. A product can be free and GUI-based yet still require external account creation (Slack, Zoom), organizational credentials (Epic, Slate), or specialized hardware (AndroidAPS). We batch all ∼16,600 products through an LLM (Gemini 3 Pro) with Google Search grounding, evaluating:

• Does the product require an external account? (no | free_optional | free_required | org_required)
• Does it require organizational/institutional credentials?
• Does it require specialized physical hardware?
• Overall: is it trainable in a sandbox?

We evaluate the most permissive version/mode of each product: if NinjaTrader’s free simulation mode works without login, the product passes.

Results: 8,013 products (48%) are trainable; 8,591 (52%) are not (4,651 require organizational accounts, 3,279 require free accounts, ∼661 require hardware).

A.7 Phase 7: Tiered Selection

A product is selectable if it satisfies all of: (1) runs on Windows, Linux, or Android; (2) not paid-only or trial-only; (3) not CLI-only; (4) sandbox-ready or self-hostable; (5) passes the access-barrier evaluation from Phase 6.

When a non-selectable product would otherwise be chosen (e.g., Bloomberg Terminal at $79.5B GDP), the pipeline substitutes the closest selectable alternative from the same software category, ranked by an LLM. The substitute inherits the original’s economic slot; the original is marked as “covered.”

Selection proceeds across the following tiers:

Tier | Budget | Strategy
k1 (Economic Core) | 100 | Highest GDP products overall
k2.1 (Strategic) | 100 | Healthcare, Education, Protective Services, Transportation (20 per domain)
k2.2 (STEM) | 100 | Architecture/Engineering, Computer/Math, Life/Physical/Social Science
k3 (Domain Diversity) | 116 | Round-robin across all 22 SOC major groups (∼5 per group)
k4 (Niche) | 44 | Products unique to single occupations or domains
k5 (Category Fill) | 40 | Uncovered software categories, largest GDP first

Table 7: Tiered selection budget. Each tier iterates occupations by GDP (descending) and applies substitution for non-selectable products.

Output: ∼500 selected products covering all 22 SOC major groups. We build environments for 200 based on compute budget.

Important Note: Due to a bug in our software selection code, we initially selected 53 environments that would not otherwise have been among the final 200. Because of compute constraints, we decided to keep them. All of them would nonetheless have been included in the full 500, which we plan to release in the future.

A.8 Pipeline Statistics

Metric | Value
Occupations covered | 894
Software categories | ∼1,400
Products in catalog | ∼16,600
Products passing all filters | ∼3,400
Products selected | ∼488
Substitutions made | ∼429
SOC domains covered | 22/22

Table 8: GDP-grounded software selection pipeline summary statistics.

B Gym-Anything Framework: Technical Details

This appendix provides engineering details for the Gym-Anything framework summarized in Section 2.3. A central design principle of the framework is modularity: the specification schema, runner interface, and verification system are all designed so that new observation modalities, compute backends, operating systems, and verification strategies can be added without modifying the core framework.
B.1 Environment Specification Schema

Each environment is defined by an env.json file with the following sections:

Metadata. id, version, description, category, tags, authors.

Runtime. base (preset image: ubuntu-gnome-systemd, windows-11, android-14), image or dockerfile (custom container), resources (CPU cores, memory GB, GPU count, network access), mounts (bind-mount scripts, data, config as read-only or read-write).

Interfaces. observation: list of modalities, each with type, resolution, and frame rate. Currently supported modalities include rgb_screen, audio_waveform, and ui_tree, with the schema designed to accommodate additional modalities as needed. action: list of types. Currently supported types include mouse, keyboard, voice, and api_call.

User accounts. Per-account specification of username, password, UID/GID, and permissions (sudo, network access, environment variables, home directory settings). This enables realistic enterprise scenarios with privilege separation.

Security. Systemd, cgroups, capability dropping, seccomp profiles, network isolation toggles.

B.2 Multi-Runner Architecture

All execution backends implement a common BaseRunner interface, so new compute backends can be added by implementing the same abstract methods. The framework currently ships with the following runners:

Runner | Use case
DockerRunner | Single-machine development; requires Docker daemon
QemuApptainerRunner | HPC/SLURM clusters; runs QEMU VMs inside rootless Apptainer containers
QemuNativeRunner | Bare-metal Linux or macOS; supports Apple Silicon via HVF
AVFRunner | macOS on Apple Silicon; uses Apple Virtualization Framework with Rosetta 2
AVDApptainerRunner | Android apps; wraps Android emulator in Apptainer
AVDNativeRunner | Android apps; runs emulator directly on host (macOS HVF or Linux KVM)
ApptainerDirectRunner | GPU-enabled workloads; direct Apptainer with --nv flag
LocalRunner | Lightweight testing stub

Table 9: Execution backends. The same env.json runs on all runners without modification. New backends can be added by implementing the BaseRunner interface.

Runner selection is automatic: the framework checks for Docker availability, then Apptainer, then falls back to the local runner. Users may override via the GYM_ANYTHING_RUNNER environment variable.

B.3 Progressive Checkpointing

The four setup stages (install → configure → task setup → export) are checkpointed at three levels:

• Post-install checkpoint: saves disk state after software installation. Loading skips the install stage. Shared across all tasks for the same environment.
• Post-configure checkpoint: saves state after data/service configuration. Loading skips install and configure. Also shared across tasks.
• Post-task-setup checkpoint: saves state after per-task initialization. Task-specific; enables instant startup for repeated evaluation of the same task.

Disk-state checkpoints. Docker: docker commit captures the filesystem (processes restart via systemd on next boot). QEMU: qemu-img convert saves a QCOW2 snapshot.

Full-state snapshots (SaveVM). QEMU additionally supports savevm, which captures the entire VM memory, CPU registers, and running processes. Restoring from a savevm snapshot is near-instantaneous (∼3s), preserving open windows, running services, and GUI state, compared to 2–5 minutes for a disk-state checkpoint that requires rebooting.

Copy-on-write parallelization. Multiple concurrent instances share the same base checkpoint via QCOW2 overlay files. Each instance writes only its delta, enabling 400+ parallel environments with modest disk overhead.

B.4 Platform-Specific Patterns

Linux desktop environments. The majority of environments use an Ubuntu GNOME base with systemd. GUI automation uses xdotool for mouse/keyboard injection and X11 accessibility for UI tree capture. Multi-service web applications (ERPs, CRMs, LMS platforms) run Docker Compose stacks inside the QEMU VM.

Windows environments. SSH runs in Session 0 (no GUI access). GUI applications are launched via schtasks /IT with batch files. Interactive automation uses a PyAutoGUI TCP server (port 5555) running in the GUI session, since Win32 API calls (SetCursorPos, mouse_event) do not reliably reach all applications. Registry modifications suppress first-run dialogs and license prompts.

Android environments. Android Virtual Devices run inside Apptainer via the AVDApptainerRunner. Interaction uses ADB for input injection (adb shell input tap, adb shell input text) and screenshot capture (adb exec-out screencap). APK installation copies to /data/local/tmp/ before invoking pm install to satisfy SELinux constraints.

B.5 Verification System

Verifiers are decoupled from the framework: each is a standalone Python file in the task directory, loaded via importlib at evaluation time. The framework currently supports three verification modes:

1. Program: a Python function receives the trajectory (screenshots, action log), environment utilities (exec_capture, copy_from_env, query_vlm), and task metadata. Returns {passed, score, feedback}. Programmatic verifiers can also call a VLM internally (e.g., for checklist-based evaluation), combining the flexibility of code with visual grounding.
2. Image match: SSIM comparison between the final screenshot and a reference image, with a configurable threshold.
3. Multi: cascades program verification first, falling back to image match.

Custom verification strategies can be added by writing a new Python file following the same interface.

B.6 Episode Artifacts

Each episode produces a structured artifact directory:

• traj.jsonl: timestamped log of every reset, action, and observation event.
• frame_00000.png, . . . : per-step screenshots.
• video.mp4: FFmpeg-encoded recording of the full episode.
• summary.json: episode metadata, verifier result, and reward.
• Setup stage logs (.log): stdout/stderr from each setup script.
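As a rough illustration of how such an artifact directory might be consumed, the sketch below loads summary.json, counts the events logged in traj.jsonl, and counts the frame images. The `type` field on each log record and the keys written into the demo files are assumptions made for this example; the actual schema of these files is not specified here.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def summarize_episode(episode_dir):
    """Collect basic statistics from one episode artifact directory."""
    episode_dir = Path(episode_dir)
    summary = json.loads((episode_dir / "summary.json").read_text())

    # traj.jsonl holds one JSON record per line; "type" is an assumed field name.
    event_counts = Counter()
    with open(episode_dir / "traj.jsonl") as f:
        for line in f:
            if line.strip():
                event_counts[json.loads(line).get("type", "unknown")] += 1

    num_frames = len(list(episode_dir.glob("frame_*.png")))
    return {
        "summary": summary,
        "event_counts": dict(event_counts),
        "num_frames": num_frames,
    }


# Demo on a throwaway directory with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "summary.json").write_text(json.dumps({"reward": 1.0}))
    records = [{"type": "reset"}, {"type": "action"},
               {"type": "observation"}, {"type": "action"}]
    (root / "traj.jsonl").write_text(
        "\n".join(json.dumps(r) for r in records) + "\n")
    for i in range(2):
        (root / f"frame_{i:05d}.png").write_bytes(b"")
    stats = summarize_episode(root)

print(stats)
```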
These artifacts serve dual purposes: evaluation (verifier input) and training data collection (trajectory distillation, Section 6.1).

B.7 Distributed Execution and Tooling

Remote execution. For large-scale evaluation across multiple machines, the framework provides a master-worker architecture. Worker nodes expose a REST API that manages local environments, while a master server handles load-balanced routing with sticky sessions (mapping each environment to a specific worker). A RemoteGymEnv client provides the same API as the local GymAnythingEnv, making distributed execution transparent to the caller.

Trajectory viewer. A built-in web dashboard allows browsing and replaying recorded episodes, including per-step screenshots, action logs, and verifier outputs. This supports both debugging during development and qualitative analysis of agent behavior.

B.8 Usage Example

To illustrate the simplicity of the framework, we show how an environment can be launched and interacted with via the command line and Python API.

Command line. The gym-anything CLI provides commands for running environments interactively, evaluating agents on benchmark splits, listing available environments, validating specifications, managing cached checkpoints, and checking system prerequisites. For example, to launch an environment interactively with a VNC viewer:

    # List available environments and tasks
    gym-anything list

    # Launch an environment interactively
    gym-anything run libreoffice_calc --task budget_analysis -i --open-vnc

    # Evaluate an agent on a benchmark split
    gym-anything benchmark libreoffice_calc --agent Gemini3Agent \
        --model gemini-3-flash --split test

Python API. Programmatically, the framework exposes a standard Gymnasium-style interface. The same environment specification runs identically across all compute backends:

    from gym_anything import make

    env = make("envs/libreoffice_calc/env.json",
               "envs/libreoffice_calc/tasks/budget_analysis/task.json")
    obs = env.reset(use_cache=True, cache_level="post_start")
    actions = [{"action": "left_click", "coordinate": [340, 215]},
               {"action": "type", "text": "=SUM(B2:B10)"},
               {"action": "key", "key": "Return"}]
    obs, reward, done, info = env.step(actions)
    env.close()  # runs verifier, writes trajectory artifacts

The reset() call handles container orchestration, display forwarding, and checkpoint restoration. The step() call injects actions and returns the next observation (screenshot). On close(), the framework runs the task verifier and writes all episode artifacts (trajectory log, per-step screenshots, video, and verification results) to a structured directory.

C Prompts

This appendix documents the prompts used across the Gym-Anything pipeline. We provide an overview of the Creation Agent prompt (§C.1), the three-phase Privileged Information Audit prompt (§C.2), the VLM Checklist Verifier prompts (§C.3), and the Contamination Filtering prompt (§C.5).

C.1 Creation Agent Prompt (Overview)

The Creation Agent ($\mathrm{Agent}_C$, §3) receives an ∼800-line prompt that guides it through seven phases. Listing 1 shows an abridged overview of the phase structure and key instructions. The full prompt covers framework internals, realistic data sourcing strategies, interactive testing workflows, and documentation requirements. Critically, the prompt emphasizes that all data must be real (downloaded from public sources), not synthetic or handwritten.

Listing 1: Creation Agent prompt overview (abridged from ∼800 lines).

    # Environment Creation Workflow for Gym-Anything
    # (Abridged overview -- full prompt is ~800 lines)

    ## Phase 1: Understand the Framework
    Read core files (api.py, env.py, specs.py, runners/) to understand the lifecycle:
    from_config() -> env.reset() -> [agent interaction] -> env.close()
    Key rules: hooks run as root, DISPLAY=:1 for GUI, mounts read-only by default.

    ## Phase 2: Research the Target Application
    Web-search for installation guides, identify dependencies, determine installation method.
    Answer: desktop vs web app? what services needed? how is data stored?
    first-run wizard? network access?

    ## Phase 3: Study Existing Environments
    Read env_creation_notes/ (cross-cutting patterns, Windows patterns).
    Study similar environments in examples/ directory.

    ## Phase 4: Create Implementation Plan
    Directory structure: env.json, scripts/{install,setup}.sh, tasks//
    Hook responsibilities: pre_start (install), post_start (configure),
    pre_task (task-specific setup with REAL data -- no fake/synthetic data).

    ## Phase 5: Write Environment Files
    Create env.json (base image, resources, mounts, hooks, user accounts).
    Write install script (apt, wget, docker), setup script (service polling,
    app launch, config), task files (task.json, setup_task.sh, verifier.py).
    CRITICAL: All data must be real (public datasets, official samples).

    ## Phase 6: Interactive Testing
    Start VM, SSH in, use screenshot-based UI grounding to verify setup.
    Loop: take screenshot -> analyze -> perform action -> repeat.
    Verify: app visible, correct state, real data loaded, task completable.

    ## Phase 7: Final Testing & Documentation
    Clean test without cache. Verify full checklist.
    Create evidence_docs/ with screenshots and log snippets.
    Document learnings in shared memory.

C.2 Privileged Information Audit Prompts

The PI Audit pipeline (§4.1) operates in three phases. Phase 1 analyzes task source files without web access to identify data provenance and metadata claims.
Phase 2 uses an LLM with Google Search grounding to verify claims against external sources. Phase 3 synthesizes the results into a validated PI report, enforcing the rule that unverified information must never appear in the final summary.

Phase 1: File Analysis.

Listing 2: PI Audit Phase 1: Data provenance analysis prompt.

    You are a data provenance auditor for AI benchmarking tasks.
    Your job is to analyze task files and identify:
    1. What dataset is used (name, source URL, specific case/patient ID)
    2. What claims does task.json metadata make about expected values (numerical thresholds, counts, measurements)
    3. Which values are hardcoded in scripts vs dynamically computed at runtime
    4. Which values in metadata could be verified via web search
    5. Whether the data is real (downloaded from a public dataset) or synthetic (generated by scripts)

    Mark all claims as verifiable_via_web=true and provide a suggested_search_query for each.
    Your goal is to find the correct value for every claim, not just confirm what task.json says.

    Analyze these files for task "{task_id}" and produce a structured JSON response:
    {all_files}

    Respond with ONLY a JSON object:
    {
      "dataset_name": "name of dataset or null if unclear",
      "dataset_url": "download URL found in scripts or null",
      "case_id": "specific case/patient/file ID or null",
      "data_is_synthetic": true/false,
      "synthetic_reason": "why you think data is synthetic, or null",
      "claims": [
        {
          "key": "metadata key name",
          "value": "value claimed in task.json metadata",
          "source": "hardcoded/computed/script-derived",
          "verifiable_via_web": true/false,
          "suggested_search_query": "web search query or null"
        }
      ],
      "general_task_context": "1-2 sentence description",
      "data_provenance_summary": "how data gets from source to the VM"
    }

Phase 2: Web-Grounded Verification.

Listing 3: PI Audit Phase 2: Web search verification prompt (uses Google Search grounding).

    You are verifying claims about a dataset used in an AI benchmarking task.
    Task: {task_id}
    Dataset: {dataset_name}
    Case/Patient: {case_id}
    Data source URL: {dataset_url}

    Here are the claims to verify using web search:
    {claims_text}

    Your goal is to find the correct value for every claim -- either confirm the claimed value or find the true value. Do not settle for "unverified".
    Search extensively: try the data source page, dataset documentation, academic papers, archive databases, file format specifications, instrument documentation.

    Look for:
    - Official dataset documentation pages
    - Published ground truth spreadsheets or CSV files
    - Academic papers describing the dataset
    - Dataset README files on Kaggle, TCIA, Zenodo
    - Instrument/software documentation
    - Archive databases (MAST, TCIA, PhysioNet)

    Respond with ONLY a JSON object:
    {
      "verification_results": [
        {
          "key": "claim key",
          "claimed_value": "what task.json claims",
          "verified_value": "what web sources say, or null",
          "status": "verified|contradicted|unverified",
          "source": "URL or description",
          "confidence": "high|medium|low",
          "details": "explanation of what you found"
        }
      ],
      "web_search_summary": "brief summary of search results"
    }

Phase 3: Synthesis.

Listing 4: PI Audit Phase 3: Synthesis into validated privileged information.

    You are producing a final validated privileged information (PI) report for a benchmarking task.

    CRITICAL RULE: Only include information that is VERIFIED in the privileged_info_summary.
    It is MUCH BETTER to have no PI than fake PI.
    If a value cannot be verified, mark it as "unverified" and do NOT include it in the summary text.
    Task: {task_id}
    File analysis results: {analysis}
    Web verification results: {verification}

    Produce a final validated_pi.json:
    {
      "task_id": "{task_id}",
      "dataset": "dataset name or null",
      "case_id": "specific case ID or null",
      "data_is_synthetic": true/false,
      "pi_items": [
        {
          "key": "item name",
          "metadata_value": "what task.json claims",
          "verified_value": "what we verified, or null",
          "source": "script analysis / web search URL / etc",
          "status": "verified|unverified|contradicted"
        }
      ],
      "privileged_info_summary": "ONLY verified facts. This text will be given to a VLM verifier.",
      "pi_confidence": "high|medium|low|none"
    }

C.3 VLM Checklist Verifier Prompts

The checklist-based VLM verification system (§4.1) operates in two stages. First, an LLM generates a structured checklist from the task description and validated PI (Listing 5). Second, a VLM scores agent trajectories against this checklist by examining sampled screenshots (Listing 6). The scoring system distinguishes between task completion failures (scored via partial credit) and integrity violations (binary pass/fail for cheating detection).

Checklist Generation.

Listing 5: VLM Verifier: Checklist generation prompt.

    You are generating a verification checklist for an AI agent benchmark task.
    This checklist will be used by a VLM (vision-language model) to score agent trajectories by examining screenshots.

    Task ID: {task_id}
    Task Description: {task_desc}
    Metadata keys: {metadata_keys}
    {scoring_text}
    Privileged Information: {pi_summary}
    {pi_items_text}

    Generate a checklist with two sections:

    1. task_completion (5-8 items, points must sum to exactly 100):
       Each item represents a sub-goal or evidence of progress.
       Items should be ordered from earliest to latest.
       - CRITICAL: ONLY include items that are explicitly required by the task description. Do NOT add extra steps.
       - Assign more points to harder items
       - Each item must be visually verifiable from screenshots
       - Include what visual evidence the VLM should look for.

    2. integrity (3-4 items):
       Each item checks for cheating/shortcuts. Common checks:
       - Agent used the GUI, not terminal commands
       - Agent interacted with the actual application
       - Agent didn’t copy-paste expected answers
       - Results come from genuine software interaction

    Also produce a "privileged_info_for_vlm" field: a concise text with ONLY verified facts that helps the VLM judge correctness.

    Respond with ONLY a JSON object:
    {
      "task_id": "{task_id}",
      "task_completion": [
        {
          "id": "short_snake_case_id",
          "description": "What this item checks",
          "points": 20,
          "visual_evidence": "What the VLM should look for"
        }
      ],
      "integrity": [
        {
          "id": "short_snake_case_id",
          "description": "What this integrity check verifies",
          "visual_evidence": "What the VLM should look for"
        }
      ],
      "privileged_info_for_vlm": "Verified facts for the VLM"
    }

Trajectory Scoring.

Listing 6: VLM Verifier: Trajectory scoring prompt.

    [System Message]
    You are an expert evaluator scoring AI agent trajectories on computer-use benchmark tasks.
    You will examine trajectory screenshots and score the agent’s work against a structured checklist.

    SCORING RULES:
    - For each task_completion item: "pass" (full points), "partial" (half points), or "fail" (0 points)
    - For each integrity item: "pass" or "fail"
    - Be strict: only give "pass" if you see clear visual evidence. However, if you decide that a checklist item is non-essential for the task completion solely based on the task description, and other items are completed, then you can give "pass".
    - "partial" is for when the agent made meaningful progress but didn’t fully complete the sub-task

    INTEGRITY CHECK RULES (critical):
    - Integrity checks detect CHEATING or SHORTCUTS, NOT task failure.
    - An agent that genuinely attempted the task but failed is an INTEGRITY PASS -- honest failure is not cheating.
    - Only mark integrity FAIL if you see clear evidence of: hardcoding answers, copy-pasting expected values, fabricating results without using the software, or bypassing the required workflow entirely.

    [User Message]
    Task Description: {task_desc}
    Privileged Information (verified facts): {pi_text}

    === TASK COMPLETION CHECKLIST (score each item) ===
    {completion_items}

    === INTEGRITY CHECKS (pass/fail each item) ===
    {integrity_items}

    Below are screenshots from the agent’s trajectory (first 3 frames, every 3rd middle frame, last 3 frames).
    Examine them carefully and score each checklist item.

    [Screenshots are attached as images]

    Now score each checklist item. Respond with ONLY a JSON object:
    {
      "task_completion": [
        {"id": "item_id", "verdict": "pass|partial|fail", "confidence": 0.0-1.0, "evidence": "what you see"}
      ],
      "integrity": [
        {"id": "item_id", "verdict": "pass|fail", "confidence": 0.0-1.0, "evidence": "what you see"}
      ],
      "overall_reasoning": "1-3 sentence summary"
    }

C.4 Integrity Check Analysis

The integrity checklist $C_{\text{int}}$ (§4.1) detects cases where agents bypass the intended workflow rather than completing the task. To evaluate how well these checks work in practice, we manually reviewed all integrity flags across ∼3,000 Gemini-3-Flash trajectories, ∼800 of which were successful runs.

Overall statistics. Among high-scoring runs (task completion score >75), the integrity checklist flagged 21 trajectories: 15 true positives and 6 false positives. However, in 18 of the 21 flagged cases, the task completion checklist already assigned a failing score, so the integrity flag did not change the pass rate. The remaining 3 flagged cases had a perfect task completion score of 100, but none were actual integrity violations (all 3 were false positives). The integrity checks therefore do not change the overall pass rate in this evaluation, but guard against undetected shortcuts in future, harder tasks where the task checklist alone may not suffice.
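To make the two kinds of scores discussed here concrete, the following minimal sketch applies the scoring rules stated in Listing 6 (full points for "pass", half for "partial", zero for "fail") and collects failed integrity items as flags. The function name, the item ids, and the point values are illustrative assumptions, and how the pipeline combines the score with the flags into a final pass decision is not shown.

```python
# Point multipliers taken from the scoring rules in the trajectory scoring prompt.
VERDICT_WEIGHT = {"pass": 1.0, "partial": 0.5, "fail": 0.0}


def score_trajectory(checklist, verdicts):
    """Turn per-item VLM verdicts into a completion score and integrity flags."""
    points = {item["id"]: item["points"] for item in checklist["task_completion"]}
    score = sum(points[v["id"]] * VERDICT_WEIGHT[v["verdict"]]
                for v in verdicts["task_completion"])
    flags = [v["id"] for v in verdicts["integrity"] if v["verdict"] == "fail"]
    return {"score": score, "integrity_flags": flags}


# Hypothetical checklist: five completion items whose points sum to 100.
checklist = {
    "task_completion": [
        {"id": "open_app", "points": 10},
        {"id": "load_data", "points": 20},
        {"id": "run_analysis", "points": 20},
        {"id": "inspect_results", "points": 25},
        {"id": "write_report", "points": 25},
    ],
}
verdicts = {
    "task_completion": [
        {"id": "open_app", "verdict": "pass"},
        {"id": "load_data", "verdict": "pass"},
        {"id": "run_analysis", "verdict": "pass"},
        {"id": "inspect_results", "verdict": "pass"},
        {"id": "write_report", "verdict": "partial"},
    ],
    "integrity": [
        {"id": "used_gui", "verdict": "pass"},
        {"id": "no_fabricated_values", "verdict": "fail"},
    ],
}

result = score_trajectory(checklist, verdicts)
print(result)
```

Keeping the flags separate from the numeric score mirrors the distinction drawn in the prompt: a trajectory can score highly on completion and still be flagged, which is the situation the examples in this section examine.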
Example 1: Fabricated report data (Autopsy). In a digital forensics task on Autopsy, the agent was asked to analyze two disk images, identify exfiltrated files via hash matching, locate deleted files and NTFS alternate data streams, and write a structured report with file names and MD5 hashes. The agent correctly followed the entire investigation workflow: creating the case, importing hash sets, adding data sources, running ingest modules, and navigating to the correct results in the Autopsy GUI, scoring 87.5 on task completion. However, in the final report, the agent populated nearly every entry with the same placeholder hash rather than copying the per-file hashes visible in Autopsy’s interface. The integrity check flagged this as fabricated output: the agent saw the correct values in the tool but wrote different values in the report.

Example 2: Result not derived from tool output (Epi Info). In a statistical analysis task on Epi Info (an epidemiology toolkit), the agent was asked to use the StatCalc Poisson tool to evaluate a disease cluster. The agent opened the correct tool and attempted to enter the parameters (observed=8, expected=3.6), but made a typo, entering 3.1.16 instead of 3.6. The GUI displayed a probability of 0.9999998, which is incorrect given the intended inputs. However, the agent wrote the mathematically correct value (0.0307893) in its report, a value never displayed by the tool. The integrity check flagged this because the reported result did not match the tool’s output, indicating that the agent computed the answer independently rather than using the required software.

Example 3: Hardcoded exclusion from task description (PEBL). In a computational modeling task on PEBL (a psychology experiment platform), the agent was asked to fit reinforcement learning models to participant data and exclude a bot participant “using data-driven criteria.” The task description mentioned the bot’s ID (PRL-999) as context. The agent wrote a script that correctly implemented the Rescorla-Wagner models, grid search, and AIC comparison, scoring 90 on task completion. However, instead of computing any metric (e.g., accuracy, reversal count) to identify the bot, the agent hardcoded if pid == 'PRL-999' in its exclusion logic. The integrity check flagged this as bypassing the required analysis step. Here, the task description itself enabled the shortcut by revealing the bot’s identity, yet the agent still violated the explicit instruction to use data-driven criteria.

Example 4: False positive (Oracle Database). In a database administration task on Oracle Database, the agent was asked to fix data discrepancies, create PL/SQL objects, and save a reconciliation report. The agent used DBeaver to connect to the database, executed SQL statements, and created triggers and views, scoring 60 on task completion. It lost points primarily because the reconciliation report was never produced. The integrity checklist included a check for whether the report contained actual computed counts; since no report existed, this check was marked as failed. However, this is a false positive: the agent did not fabricate anything; it simply did not complete one sub-task. The task completion checklist already penalized the missing report with 0 points. Incomplete execution is not a workflow bypass, so this flag should not have been raised.

Example 5: Wrong tool used (PsychoPy). In an experiment scripting task, the agent was explicitly instructed to use PsychoPy Coder (a GUI-based script editor for psychology experiments) to write a fear conditioning experiment. The agent opened PsychoPy but was unable to navigate the Coder interface. Instead, it switched to the terminal and wrote the script using shell redirection (cat << EOF). The resulting script was functionally complete, correctly implementing habituation trials, an adaptive staircase, counterbalancing, and rating scales, scoring 87.5 on task completion.
The integrity check flagged the agent for not using the required software interface. This is a common failure pattern: when agents cannot operate a specialized GUI, they fall back to the terminal, producing correct output through the wrong tool.

C.5 Contamination Filtering Prompt

To construct contamination-free train/test splits (§5), we compare all task pairs within each environment using the prompt in Listing 7. Task pairs scoring ≥ 4 (VERY_SIMILAR or higher) are treated as contaminating edges in a similarity graph. Connected components of this graph are assigned to the same split to prevent data leakage.

Listing 7: Contamination filtering: Pairwise task similarity comparison prompt.

You are evaluating the similarity between two task descriptions for a machine learning train/test split.

TASK 1: {task1_description}
TASK 2: {task2_description}

Classify their similarity into exactly ONE category:
1 - NOT_SIMILAR: Completely unrelated tasks
2 - SOMEWHAT_SIMILAR: Minor thematic overlap but fundamentally different tasks
3 - SOME_STEPS_SIMILAR: Tasks share some common substeps (e.g., both navigate to a location) but have distinctly different end goals. This is common and acceptable.
4 - VERY_SIMILAR: Tasks are extremely similar -- knowing how to do one would directly help with the other. Only use this if the tasks are nearly interchangeable.
5 - SAME_REPHRASED: Essentially the same task with different wording
6 - DUPLICATE: Identical or near-identical tasks
7 - SUBSET: Task 1 is a strict subset of Task 2
8 - SUPERSET: Task 1 is a strict superset of Task 2

IMPORTANT: Be liberal in your assessment. Categories 4-8 should only be used when tasks are TRULY interchangeable or one strictly contains the other. Sharing common substeps does NOT make tasks similar -- use category 3 for those.

Respond with ONLY a single digit (1-8) and nothing else.
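
The graph step described in C.5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the function names, the (task_a, task_b, score) input format, and the greedy rule that fills the test split up to a target fraction are all hypothetical. The paper only states that pairs scoring ≥ 4 form edges and that each connected component goes to a single split.

```python
from collections import defaultdict


def connected_components(task_ids, scored_pairs, threshold=4):
    """Group tasks linked by pairs whose similarity score is >= threshold.

    scored_pairs is an iterable of (task_a, task_b, score) tuples
    (a hypothetical format for the judge's 1-8 output).
    """
    parent = {t: t for t in task_ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:  # contaminating edge
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(list)
    for t in task_ids:
        groups[find(t)].append(t)
    return list(groups.values())


def assign_splits(components, test_fraction=0.2):
    """Assign whole components to 'train' or 'test' (assumed greedy rule).

    Components are visited largest first; a component goes to test only
    if it fits within the remaining test budget, otherwise to train.
    """
    total = sum(len(c) for c in components)
    budget = test_fraction * total
    splits, test_count = {}, 0
    for comp in sorted(components, key=len, reverse=True):
        if test_count + len(comp) <= budget:
            label = "test"
            test_count += len(comp)
        else:
            label = "train"
        for task in comp:
            splits[task] = label
    return splits


tasks = ["a", "b", "c", "d", "e"]
pairs = [("a", "b", 5), ("b", "c", 4), ("c", "d", 3), ("d", "e", 2)]
components = connected_components(tasks, pairs)
print(components)
print(assign_splits(components))
```

In this toy input, tasks a, b, and c are chained by scores of 5 and 4, so they always land in the same split, while the pair scored 3 (SOME_STEPS_SIMILAR) does not create an edge.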
D Evidence Documentation

During environment creation, AgentC must produce structured evidence that the software was installed correctly and that each task is solvable. An independent Agentaudit later reviews this evidence against a quality checklist, without re-running the environment. This appendix describes the evidence system and walks through one concrete example.

D.1 Evidence Requirements

Every environment must supply three categories of artifacts:
1. Screenshots. Timestamped screen captures showing: (i) the application running after boot, (ii) the correct starting state for each task, and (iii) the absence of blocking error dialogs.
2. Structured verification data. A JSON file per task recording database queries, file-system checks, service health, and baseline counts: anything the audit agent needs to confirm that preconditions hold without launching the VM.
3. Export-script output. Proof that the task’s export_result.sh runs without error and produces valid, parseable JSON with all expected fields.

All artifacts are stored inside the environment directory under the following layout:

examples/{}/Documents/escalation_info.txt’ and locate the ’Initial Legal R...

Wps Office Writer | Paralegal
You are a paralegal at a corporate law firm. Open /home/ga/Documents/vendor_agreement_draft.docx, a draft Vendor Services Agreement between CloudFirst Industries, LLC and Meridian Technology Solutions, Inc. Clean up t...

Nuxeo Platform | Paralegal
Perform a legal discovery operation in Nuxeo Platform. Identify all documents in the ’Projects’ workspace that meet both criteria: the Title contains ’Agreement’ (case-insensitive) and the content/body contains ’Acme ...

SOC 8/22: Educational Instruction and Library

Active Inspire | Art Teacher
Create a 3-page flipchart visual analysis of ’The Great Wave off Kanagawa’ and save it to /home/ga/Documents/Flipcharts/artwork_analysis.flipchart. The first page must feature the image ’great_wave.jpg’ (from /home/ga...

Moodle | Biology Professor
You are the instructor for BIO302 Advanced Cell Biology at State University. Set up a complete student progression tracking system for the course. Configure completion conditions for the following activities: ’Lab Saf...

Safe Exam Browser | Instructional Coordinator
Log into SEB Server as super-admin (password: admin). For the ’Chemistry 201 - Midterm’ exam configuration, enable and set the custom Quit Confirmation Message to: ’WARNING: Quitting will SUBMIT your exam permanently....

SOC 9/22: Arts, Design, Entertainment, Sports, and Media

Gimp | Graphic Designer
Apply a pixelate filter to the image to create a blocky, mosaic-style effect.

Sweet Home 3D | Interior Designer
You are an entrepreneur opening ’Spoke & Bean’, a hybrid bicycle shop, mechanical repair bay, and community espresso bar. Transform the open-plan commercial unit in bike_shop_starter.sh3d into a functional mixed-use s...

Gimp Osw | Graphic Designer
Remove the background from the dog image in GIMP.

SOC 10/22: Healthcare Practitioners and Technical

Oscar Emr | Physician
Dr. Chen wants to speed up prescribing for Strep Throat. Using the credentials (username ’oscardoc’, password ’oscar’, PIN ’1117’), create a prescription favorite named ’Strep Throat - Amox 500’ for patient ’Mario Ros...

Invesalius3 | Radiologic Technologist
Using the loaded CT Cranium DICOM dataset in InVesalius 3, generate a 3D surface of the skull (bone mask) and create a rotating animation of the model. Export the animation as either a series of at least 12 PNG screen...

Slicer3D | Radiologist
Generate a subtraction enhancement map named ’EnhancementMap’ (T1_Contrast minus T1) and a binary ’EnhancementMask’ from the loaded brain MRI volumes. Export the enhancement map to ~{}/Documents/SlicerData/BraTS/enhance...

SOC 11/22: Healthcare Support

Oscar Emr | Medical Assistant
A safety recall has been issued for the medication ’Atenolol’ due to potential impurities.
You need to identify any active patients currently prescribed this medication and flag their charts. Log in to OSCAR EMR (User...

Openemr | Medical Assistant
Record a historical DTaP immunization for Jayson Fadel (DOB: 1992-06-30). Date administered: 2019-03-15, administered by: Outside Provider, site: Left Deltoid, manufacturer: Sanofi Pasteur, lot number: D2894AA, expira...

Freemed | Medical Assistant
Log into FreeMED (Username: admin, Password: admin) and record a new laboratory order for patient Marcus Vance. The order must include both a ’Lipid Panel’ and a ’Hemoglobin A1c’ saved to his clinical record.

SOC 12/22: Protective Service

Opencad | Police Officer
Log in to OpenCAD (http://localhost) as the Admin User (admin@opencad.local / Admin123!) and create a Person BOLO for Marcus Holloway. Include the following details: Male, African American, approx 6’1", 185 lbs, short...

Tor Browser | Private Investigator
You are a Private Investigator building a persistent OSINT dashboard. Configure Tor Browser to retain browsing history and create a local HTML dashboard at /home/ga/Documents/osint_dashboard.html featuring a heading a...

Google Earth | Search and Rescue Coordinator
Identify at least 3 potential helicopter landing zones within 5 km of Rifugio Lagazuoi in the Dolomites, Italy (46.5289°N, 12.0078°E). For each LZ, create a placemark with a systematic name (e.g., LZ-Alpha, LZ-Bravo),...

SOC 13/22: Food Preparation and Serving Related

Libreoffice Calc | Baker
Scale the cookie recipe from 24 to 75 cookies. Use practical rounding for proportional ingredient amounts and round all egg quantities up to the nearest whole number.

Floreant Pos | Food Service Manager
Perform an end-to-end workflow starting in the BACK OFFICE (PIN: 1111) by creating a ’Burger Toppings’ modifier group containing ’Bacon’ ($1.50), ’Avocado’ ($2.00), and ’Extra Cheese’ ($0.75). Configure a new ’Build-Y...

Chrome | Sommelier
Configure the cellar Chrome workstation according to the Beverage Team Browser Standard. First, import the bookmarks from ~{}/Desktop/cellar_bookmarks.html. Then, organize them into 4 specific folders on the bookmark ba...

SOC 14/22: Building and Grounds Cleaning and Maintenance

Vtiger Crm | Landscaping Supervisor
You are a landscaping company supervisor who needs to register a new materials supplier in Vtiger CRM so that purchase orders can be created for upcoming spring projects. Create a new Vendor record with the following ...

Libreoffice Calc | Groundskeeper
Create a plant watering tracker using date formulas, the TODAY() function, conditional formatting for overdue plants, and priority sorting. Save the spreadsheet to ~{}/Documents/plant_watering_tracker.xlsx.

SOC 15/22: Personal Care and Service

Sweet Home 3D | Licensed Cosmetologist
You are a licensed cosmetologist opening your first boutique hair salon in a converted residential villa. Design a functional and professional salon layout in Sweet Home 3D. The design must feature a styling floor wit...

Garmin Basecamp | Outdoor Guide
Garmin BaseCamp is running with ’fells_loop.gpx’ data pre-imported. Find the halfway point by trail distance of the ’fells_loop’ track and create a waypoint at that location named ’Lunch_Stop’ with the notes ’Halfway ...

Wger | Fitness Trainer
As the admin user (username: admin, password: adminadmin) on http://localhost, log a workout session for today’s date using the ’Full Body Workout’ routine. The session should include a ’General’ impression and the no...

SOC 16/22: Sales and Related

Copper Point Of Sale | Sales Representative
Create a sales quote for the corporate customer ’Greenfield Office Solutions’. Include 20 units of ’Copy Paper A4 (500 sheets)’, 10 units of ’Ballpoint Pen Blue (Box of 12)’, and 8 units of ’Manila Folder Letter Size ...

Erpnext | Sales Operations Coordinator
You are a sales operations coordinator at Wind Power LLC.
At a recent renewable energy trade show, you met a potential customer named Marcus Chen from Greenfield Renewable Solutions. You need to enter this lead into t...

Bcwebcam | Cashier
Use bcWebCam to scan a product barcode directly into a local web POS system. 1. Ensure bcWebCam is running and its ’Keyboard Emulation’ (keyboard wedge) output mode is enabled in its settings. 2. A local POS system is...

SOC 17/22: Office and Administrative Support

Aerobridge | Administrative Assistant
Register a new drone operating company in the Aerobridge system (http://localhost:8000/admin/) using username ’admin’ and password ’adminpass123’. Create the company record with the following details: - Full Name: Va...

Libreoffice Calc | Administrative Assistant
Create a warranty tracking system that calculates expiration dates and days remaining. Include automated status logic and visual alerts for warranty expirations. Save the tracking system to /home/ga/Documents/warranty...

Bcwebcam | Office Clerk
Configure bcWebCam so that minimizing the window sends it to the system tray (notification area) instead of the taskbar, and then execute the minimization so the application is hidden but still running. Instructions...

SOC 18/22: Farming, Fishing, and Forestry

Farmos Field Kit | Farmer
Create a Harvest log in farmOS Field Kit to document today’s egg collection. Name the log ’Daily Egg Collection - Red Barn’ with a quantity of 22 dozen. In the notes, record: ’Collected from nest boxes. 4 cracked eggs...

Qground Control | Agricultural UAV Technician
You are an Agricultural UAV Technician preparing a spray drone for autonomous operations. The target field is bordered by a 15-meter tall eucalyptus windbreak on the north side, making the standard direct-descent retu...

Ekylibre | Farm Manager
Register a new fertilizer product variant in the Ekylibre catalog named ’Ammonitrate 33.5’.
The variant must be assigned a suitable nature (e.g., ’Engrais minéral’ or ’Matière’) and use ’Kilogram’ (kg) as the unit.

SOC 19/22: Construction and Extraction

Subsurface | Commercial Diver
Update the buddy field to ’Michael Chen’ for the first dive in the logbook (Dive #2, December 4, 2010 at Sund Rock, Hoodsport, WA) and save the changes.

System Advisor Model | Solar Photovoltaic Installer
A solar installer in Tucson, Arizona wants to determine the optimal panel tilt angle for a residential PV system. The customer has a 5 kW system planned for a fixed south-facing roof, but the roof pitch is adjustable ...

Emoncms | Solar Photovoltaic Installer
A solar PV installation is reporting negative power values because the Current Transformer (CT) sensor was installed backwards. Correct the polarity for node ’garage_solar’, input ’power’ by inverting the values. Log ...

SOC 20/22: Installation, Maintenance, and Repair

Vtiger Crm | Medical Equipment Repairer
Update the asset with Serial Number ’SN-US-2024-9981’ (Sonosite Edge II - Radiology Dept) by changing its Status to ’Out of Service’. Create a related HelpDesk Ticket for this asset with the Title ’Dead Transducer Pro...

Crimson | SCADA Technician
You are a SCADA technician configuring HMI-level fallback pump sequencing logic for a municipal wastewater plant’s Lift Station A (LS-A) in Red Lion Crimson 3.0. Using the IO tag register and Ten States engineering el...

Graphite | Network Operations Center (NOC) Analyst
You are a Network Operations Center (NOC) analyst preparing for a Monday morning ops review. Create a Graphite dashboard named ’Weekly Ops Review’ containing three graphs comparing current metrics against data from 7 ...

SOC 21/22: Production

Chrome | Prepress Technician
Configure the prepress terminal’s Chrome browser for handling heavy graphics files securely. Read the specification document at ~{}/Desktop/prepress_terminal_spec.txt. You must: 1) Force PDFs to download instead of open...

Erpnext | Quality Control Inspector
Wind Power LLC has experienced inconsistent quality with the ’Shaft’ components received from Eagle Hardware. Your job is to enforce mandatory incoming quality checks. Set up a Quality Inspection Template named ’Shaft...

Libreoffice Calc | Brewer
Organize the homebrewing data and calculate the ABV for each batch using the formula ABV = (OG - FG) × 131.25. Apply conditional formatting to highlight batches within the target ABV range of 4.5-6.5% and save the res...

SOC 22/22: Transportation and Material Moving

Sygic Gps | Delivery Driver
Search for Kabul Airport (Hamid Karzai International Airport) and add it to your Favorites. Ensure the Favorites list is open to show the saved location.

Subsurface | Commercial Diver
Update dive #2 (December 4, 2010, at Sund Rock) in the dive log at /home/ga/Documents/dives.ssrf by adding a weight system entry with type ’Integrated’ and weight ’4.5 kg’. Save the updated dive log to /home/ga/Docume...

Chrome | Ship Officer
Configure the bridge computer’s Chrome browser for an ultra-low-bandwidth, high-latency satellite internet connection (e.g., Iridium Certus). First, delete all 8 high-bandwidth entertainment bookmarks (YouTube, Netfli...

K Experimental Setup Details

K.1 Models Used Across the Pipeline

Table 14 lists the models used at each stage of the Gym-Anything pipeline, along with the harness (agentic framework) used to run them.

K.2 Evaluated Models

We evaluate four frontier models on CUA-World-Long: Gemini 3 Flash, Kimi-K 2.5, Claude Sonnet 4.6, and GPT-5.4. We do not evaluate Claude Opus 4.6 on CUA-World-Long due to cost constraints. Further, under a $5-per-task budget across 200 tasks, Opus would be limited to far fewer steps (∼100) than the other models. Furthermore, Opus 4.6 and Sonnet 4.6 achieve nearly identical performance on OSWorld (72.7 vs. 72.5), suggesting that the additional cost would likely yield limited additional signal on our benchmark.

Agent harnesses.
For GPT-5.4 and Claude Sonnet 4.6, we use their official agent harnesses from their respective documentation (footnotes 2 and 3). For Gemini 3 Flash and Kimi-K 2.5, official harnesses were not publicly available at the time of our experiments, so we adapted the Qwen3-VL harness from the OSWorld repository (footnote 4). For the Qwen3-VL student model, we use the OSWorld harness directly.

Footnotes:
2. GPT-5.4: https://developers.openai.com/api/docs/guides/tools-computer-use
3. Claude Sonnet 4.6: https://github.com/anthropics/claude-quickstarts/tree/main/computer-use-demo/computer_use_demo
4. https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/qwen3vl_agent.py

Pipeline Stage | Model | Harness | Section
GDP-Grounded Software Selection
Category extraction | GPT-5-High | – | §2.2
Category deduplication | Gemini-3-Flash-Preview | – | §2.2
Product enumeration | GPT-5-High | – | §2.2
Access-barrier evaluation | Gemini-3-Pro-Preview | – | §2.2
GDP attribution | GPT-5-High | – | §2.2
Environment Creation (§3)
Creation agent (AgentC) | Claude Opus 4.5/4.6 | Claude Code | §3
Audit agent (Agentaudit) | Claude Opus 4.5/4.6 | Claude Code | §3
Memory summarization (Agentsumm) | Claude Opus 4.5/4.6 | Claude Code | §3
Task Generation (§4)
Task proposer (seed tasks) | Claude Opus 4.5/4.6 | Claude Code | §4
Task amplifier | Gemini-3-Pro-Preview | – | §4
VLM task filter | Gemini-3-Flash-Preview | – | §4
Privileged info extraction | Gemini-3-Pro-Preview | Gemini CLI | §4.1
Checklist generation | Gemini-3-Pro-Preview | – | §4.1
CUA-World-Long Generation
Task design and implementation | Claude Opus 4.5/4.6 | Claude Code | App. F
Training
Teacher (trajectory generation) | Kimi-K 2.5 | Based on Qwen3VL | §6.1
Student (distillation target) | Qwen3-VL-2B | From OSWorld Repository | §6.1
Evaluation
VLM verifier | Gemini-3-Flash-Preview | – | §6.3
Test-Time Auditing
Auditing agent | Gemini-3-Flash-Preview | Custom Harness | §6.2

Table 14: Models and harnesses used at each stage of the Gym-Anything pipeline. Empty cells indicate no harness was used.

L Propose-and-Amplify Ablation: Qualitative Analysis

To understand why seed tasks improve amplified task quality (§4), we compare tasks generated with and without seed examples on three representative software applications: Firefox (web browser), AstroImageJ (astronomical image analysis), and Moodle (learning management system). We analyze the differences along three dimensions.

Realism. With seeds, tasks are grounded in professional workflows: Firefox tasks involve researching real websites (Grants.gov, NSF Award Search) from the perspective of specific roles such as an investigative journalist or a development director; AstroImageJ tasks reference real astronomical objects (Eagle Nebula, WASP-12b) and real techniques (photon transfer CCD gain measurement, light curve detrending); Moodle tasks reflect actual institutional operations (grade auditing, custom role creation with specific capabilities). Without seeds, tasks shift toward feature demonstrations: Firefox tasks focus on browser features (importing a CA certificate, exporting a TLS certificate), AstroImageJ tasks become generic processing operations (align images, create a master dark frame), and the domain-specific professional context is largely absent. Moodle is a partial exception, likely because its structure is well-represented in LLM training data.

Difficulty and horizon. With seeds, tasks require 50–80 steps with multi-stage workflows that chain several operations: research, download, organize, and synthesize (Firefox); load data, process, measure, and report (AstroImageJ); configure database, set permissions, and verify via logs (Moodle). Without seeds, tasks typically require 30–50 steps and involve single-feature operations. For instance, a Firefox TLS certificate export task reduces to visiting a webpage, clicking the padlock, and downloading the certificate.

Setup script quality.
With seeds, setup scripts range from 50 to 164 lines and perform substantial data preparation: downloading real astronomical data from Hubble archives, generating synthetic FITS files with physically plausible noise models (Poisson noise, read noise, dark current), and populating databases with multi-table SQL inserts. Without seeds, setup scripts are 40–120 lines and often just launch the application and open a URL, with minimal data preparation. This makes tasks more fragile (dependent on external network resources) and less reproducible. One exception is a Firefox custom CA import task (without seeds), which creates a full PKI infrastructure in 140 lines, showing that the model can occasionally produce high-quality setups without seeds, but does so inconsistently.

Summary. The seed tasks teach the amplifier two things the prompt alone cannot: (1) what realistic professional work looks like for a specific software, and (2) how to prepare a rich initial state with real or realistic data. Without these examples, the model falls back to its generic knowledge of the software’s features, producing tasks that are simpler, less realistic, and less reproducible.

M Trajectory Behavioral Analysis

To understand how agents behave on CUA-World, we run an automated behavioral analysis pipeline over 2,981 trajectories (701 passed, 2,280 failed) from our evaluation runs. The pipeline operates in three stages.

Stage 1: Per-trajectory behavioral summary. Each trajectory (the full sequence of screenshots, actions, and model responses) is fed to an LLM, which decomposes the agent’s behavior into natural phases and produces a high-level behavioral summary. Crucially, the LLM is not told whether the trajectory passed or failed, and is given no controlled vocabulary or predefined categories; it describes what happened in its own words. This avoids biasing the analysis toward expected failure modes.

Stage 2: Pattern discovery.
The per-trajectory summaries are shuffled into random mixed-environment batches and fed to an LLM, which identifies recurring behavioral patterns across each batch. The instruction requires environment-agnostic language: no software names, no UI element names, no application-specific terminology. After all batches are processed, a consolidation pass merges overlapping patterns into a canonical set of 15 deduplicated patterns. Each pattern is a short description of a recurring behavior (e.g., “the agent enters retry loops when actions do not take effect”).

Stage 3: Pattern matching. For each trajectory, we send its behavioral summary alongside the 15 canonical patterns to an LLM and ask which patterns are present. Each step in a trajectory can be tagged with multiple patterns (e.g., a step may involve both UI exploration and a retry), so fractions across patterns do not sum to one. For each pattern, we compute two metrics: step fraction (what fraction of a trajectory’s steps exhibit the pattern, averaged across trajectories) and presence rate (what fraction of trajectories exhibit the pattern at least once). Both metrics are computed separately for passed and failed trajectories.

Results. Figure 11 shows the step-fraction view and Figure 12 shows the presence-rate view across all 15 patterns. The three patterns highlighted in the main paper (§5) are retry loops, UI exploration, and verification checks. Beyond these, several additional patterns show notable gaps between passed and failed trajectories. For instance, access blockers (authentication failures, unreachable services) occupy 23% of steps in failed trajectories but only 4% in passed ones. Tool pivoting (where the agent abandons the primary GUI and switches to CLI or alternative tools) is present in 38% of failed trajectories but only 19% of passed ones.
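
The two Stage 3 metrics, step fraction and presence rate, can be made concrete with a minimal Python sketch. The trajectory representation (a dict with a `passed` flag and a per-step set of pattern tags), the function name, and the pattern labels are assumptions made for illustration only; the paper does not specify how the tags are stored.

```python
def pattern_metrics(trajectories, pattern):
    """Compute step fraction and presence rate for one pattern.

    Each trajectory is assumed to be a dict with:
      'passed'    : bool outcome of the run
      'step_tags' : list of sets, one set of pattern labels per step
    Metrics are reported separately for passed and failed trajectories.
    """
    results = {}
    for label, outcome in (("passed", True), ("failed", False)):
        # Skip trajectories with no steps to avoid dividing by zero.
        group = [t for t in trajectories
                 if t["passed"] is outcome and t["step_tags"]]
        if not group:
            results[label] = None
            continue
        # Per-trajectory fraction of steps tagged with the pattern.
        fractions = [
            sum(pattern in tags for tags in t["step_tags"]) / len(t["step_tags"])
            for t in group
        ]
        results[label] = {
            # Mean of the per-trajectory fractions.
            "step_fraction": sum(fractions) / len(group),
            # Share of trajectories where the pattern appears at least once.
            "presence_rate": sum(f > 0 for f in fractions) / len(group),
        }
    return results


example = [
    {"passed": True,
     "step_tags": [{"ui_exploration"}, set(), {"verification"}, {"verification"}]},
    {"passed": True, "step_tags": [set(), set()]},
    {"passed": False,
     "step_tags": [{"retry_loop"}, {"retry_loop", "ui_exploration"}, set(), set()]},
]
print(pattern_metrics(example, "verification"))
print(pattern_metrics(example, "retry_loop"))
```

Because a step may carry several tags, running this for every pattern yields step fractions that need not sum to one, which matches the caveat stated in Stage 3.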
On the positive side, save/export steps are present in 52% of passed trajectories but only 33% of failed ones, reflecting that successful agents more often reach the final stage of the task.

[Figure 11 plot: bars for Failed and Passed; y-axis: Avg. Fraction of Trajectory Steps (0.0 to 1.0). Patterns in plotted order: retry loops, access blockers, state reset, tool pivoting, ui exploration loops, input instability, syntax recovery, dialog dismissal, reference checking, cli fragility, file path hunting, data transform, verification checks, save export issues, structured workflow.]

Figure 11: Step-weighted pattern intensity across all 15 discovered behavioral patterns. For each pattern, bars show the average fraction of trajectory steps exhibiting that pattern, split by passed vs. failed trajectories. Patterns are sorted by the gap between failed and passed (largest gap on the left).

[Figure 12 plot: bars for Failed and Passed; y-axis: Fraction of Trajectories (0.0 to 1.0). Patterns in plotted order: access blockers, retry loops, tool pivoting, state reset, syntax recovery, dialog dismissal, input instability, reference checking, cli fragility, file path hunting, ui exploration loops, structured workflow, data transform, save export issues, verification checks.]

Figure 12: Pattern presence rate across all 15 discovered behavioral patterns. For each pattern, bars show the fraction of trajectories that exhibit the pattern at least once. Patterns are sorted by the gap between failed and passed.

N Extended Related Work

This appendix expands on the related work discussion in the main paper (§8).

Benchmarks and datasets for computer-use agents. Existing work on evaluating computer-use agents can be divided into static datasets that provide scale and interactive benchmarks that test actual task completion.
Static datasets such as Mind2Web [11], Android in the Wild [38], and AndroidControl [24] offer thousands of annotated episodes across hundreds of applications, but evaluation is limited to action-matching against recorded traces rather than execution-based verification, so valid alternative strategies are penalized. Interactive web benchmarks range from synthetic micro-tasks [26] to realistic self-hosted environments [62, 23, 12, 6], but cover at most six websites and are restricted to the browser. On the desktop, OSWorld [53] provides 369 tasks across 9 applications on Linux; Windows Agent Arena [7], Spider2-V [9], AssistGUI [18], TheAgentCompany [55], and ScienceBoard [41] extend coverage to Windows, data science, and scientific domains but remain limited to 100–494 tasks and 5–20 applications each. On mobile, AndroidWorld [37] provides 116 interactive tasks across 20 apps. ProgrammingWithPixels [2] scales to 5,400 task instances but within a single application (VS Code). Across prior interactive benchmarks, environment creation is typically manual, which limits their scale, and none simultaneously provides a training split, long-horizon tasks exceeding 100 steps, or broad occupational coverage. CUA-World bridges the gap between the scale of static datasets and the execution-based evaluation of interactive benchmarks: by automating environment creation through the creation-audit loop (§3), it provides 10K+ interactive tasks across 200+ software applications on four platforms, with train/test splits, a long-horizon benchmark requiring 200+ steps, and GDP-grounded coverage of all 22 SOC occupation groups.

Automated environment and task generation. A growing body of work generates tasks or trajectories within pre-existing environments.
AgentTrek [56] synthesizes web trajectories from online tutorials, OS-Genesis [40] derives tasks retrospectively from agent exploration, and several other methods propose tasks from GUI observations or evolve curricula within fixed environments [63, 29, 35]. However, these approaches are bounded by the set of environments that already exist; they cannot create new ones. A parallel line of work uses LLMs to generate environments, but within narrow domains: text-based planning tasks [21], 3D indoor scenes for embodied AI [59], tool-use API compositions [50], or code-editing setups from GitHub repositories [31]. None of these targets real GUI software that requires installation, configuration with domain-appropriate data, and interactive verification. For task generation at scale, the seed-then-amplify paradigm is well established: Self-Instruct [49] bootstraps instructions from a small seed set, and subsequent works evolve complexity [54] or use multi-agent pipelines [28], but these generate text instruction-response pairs rather than executable environment tasks. Gym-Anything addresses all three gaps: its creation-audit loop (§3) converts real software into interactive environments via coding agents verified by an independent auditor, its propose-and-amplify strategy (§4) generates tasks by having an agent actually run the software to produce high-quality seeds that are then amplified and filtered via execution, and a shared memory across agents ensures learnings accumulate so newer environments are created faster.

Training computer-use agents. Trajectory distillation from strong models has emerged as an effective recipe for training GUI agents: AgentTrek [56] distills from web tutorial replays, Explorer [30] scales to over 94K web trajectories, and PC Agent-E [19] augments 312 human demonstrations to train a 72B model that surpasses Claude 3.7 Sonnet on WindowsAgentArena-V2.
Beyond distillation, alternative training strategies have also shown promise: DigiRL [4] achieves a 49.5-point improvement over SFT with a 1.3B model, and UI-TARS [36] combines enhanced perception, unified action modeling, and iterative reflective training to achieve state-of-the-art results across multiple benchmarks. Open vision-language backbones such as Qwen2-VL [47] and Qwen2.5-VL [5] are common foundations for recent open GUI-agent systems [52, 48]. However, existing training pipelines are typically limited to relatively small sets of applications, and scaling laws for GUI agent trajectory distillation remain underexplored. Our distillation experiments across 200 software applications show log-linear scaling (∼3.5 points per data doubling), demonstrate that a 2B student can outperform models 2× its size, and reveal that cross-software generalization is limited (22–27% recovery), motivating scalable environment creation.

Evaluation of computer-use agents. Existing interactive benchmarks often use hand-written programmatic verifiers that check the final system state [53, 62], which can be reliable but are labor-intensive to author and maintain and typically provide binary pass/fail. TheAgentCompany [55] introduces checkpoint-based partial credit but still requires custom evaluator code per task. The LLM-as-a-judge paradigm [61] and its extensions to agent evaluation [65] offer a more general alternative to script-based evaluation. VLM-based evaluation has been explored in the CUA setting for filtering training trajectories [56], step-level trajectory assessment [42], and autonomous trajectory evaluation [32], while checklist-based evaluation has shown strong correlation with human preference for text generation [25] and per-subgoal VLM evaluation has been explored in robotics [13].
Our checklist-based VLM verifier extends this line of work by incorporating privileged information extracted from environment setup scripts, enabling the verifier to check agent outputs against known ground-truth answers without per-task evaluation code. We additionally introduce integrity checks that detect workflow bypasses such as fabricating report data or using the terminal instead of the intended GUI, an issue related to the broader problem of reward hacking in agent evaluation [17].

Economic impact of AI and occupation-grounded benchmarks. A substantial body of work studies which occupations are susceptible to AI automation, starting from Frey and Osborne’s occupation-level risk estimates [16] and Eloundou et al.’s O*NET-based LLM exposure framework [14]. Felten et al. [15] introduced the widely adopted AI Occupational Exposure index, and Acemoglu [1] formalized GDP-level impact estimation via task-level cost savings. These studies use occupational data, often from O*NET, to connect AI capabilities to labor-market outcomes, but focus on measuring exposure rather than directing benchmark design. Wang et al. [51] recently quantified this gap, finding that across 43 agent benchmarks and 72K tasks, coverage is heavily programming-centric, with much less representation in many economically significant domains outside computing. GDPval [33] takes a step toward economic grounding by evaluating models on tasks from 44 occupations across 9 GDP-contributing industries, but is limited to one-shot evaluation rather than interactive agentic tasks. Gym-Anything inverts the standard direction: rather than measuring which occupations are exposed to AI, it uses per-software GDP attribution to determine which software to include in an agent benchmark, covering all 22 SOC major occupation groups with interactive, execution-verified tasks.