Multi-Agent Environment Creation

Summary: A methodology that uses specialized creation and audit agents to automatically build and verify software environments at scale. This approach enables systematic generation of complex interactive environments while maintaining quality through independent validation loops, as demonstrated by the Gym-Anything framework's creation of 10,000+ verified tasks across 200+ software applications.

Overview

Multi-agent environment creation employs a division of labor between specialized agents to address the challenges of automated environment generation at scale. The methodology centers on a creation-audit loop where:

A creation agent generates interactive software environments with tasks, setup scripts, and evaluation criteria
An audit agent independently verifies environment quality, task feasibility, and correctness of success conditions
Memory summarization agents distill successful patterns and failures to improve future iterations

This separation of concerns reduces single-agent bias, because the agent that writes a task is not the agent that judges it. The Gym-Anything framework demonstrates the approach with CUA-World, which contains 10,103 tasks across 200+ software applications covering all 22 Standard Occupational Classification (SOC) occupation groups.
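The creation-audit loop described above can be sketched in a few lines. This is a minimal illustration, not the Gym-Anything implementation: the `Environment` and `AuditReport` types, the `creation_audit_loop` function, the `max_rounds` parameter, and the toy agents are all hypothetical names introduced here, and the memory summarization agents are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Environment:
    """Hypothetical container for one generated task."""
    task: str
    setup_script: str
    success_criteria: List[str]


@dataclass
class AuditReport:
    """Hypothetical audit result; an empty issue list means the audit passed."""
    issues: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.issues


def creation_audit_loop(
    create: Callable[[List[str]], Environment],
    audit: Callable[[Environment], AuditReport],
    max_rounds: int = 3,
) -> Optional[Environment]:
    """Alternate creation and audit until the audit passes or rounds run out."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        env = create(feedback)   # the creator sees the previous audit findings
        report = audit(env)      # the auditor judges the result independently
        if report.passed:
            return env
        feedback = report.issues
    return None                  # no round passed the audit


# Toy stand-ins for the two agents (assumption: real agents would be model calls).
def toy_create(feedback: List[str]) -> Environment:
    criteria = ["report.pdf exists"] if feedback else []
    return Environment("Export the report as PDF", "touch data.csv", criteria)


def toy_audit(env: Environment) -> AuditReport:
    issues = [] if env.success_criteria else ["no success criteria defined"]
    return AuditReport(issues)


verified = creation_audit_loop(toy_create, toy_audit)
```

In this toy run the first draft has no success criteria, the auditor flags that, and the second draft passes. Keeping `create` and `audit` as separate callables mirrors the point of the methodology: the creator never grades its own output.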

The methodology addresses critical challenges in automated environment creation:

Quality assurance: Independent verification catches errors creators might miss
Scalability: Automated processes enable generation of thousands of environments
Economic relevance: GDP-Grounded Software Selection ensures focus on high-impact applications
Reliability: Privileged Information Verification using setup script data provides ground-truth validation
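To make the idea behind Privileged Information Verification concrete, the sketch below plants a known value during setup and later checks the agent's final state against it. Everything here is an assumption made for illustration: the `run_setup` and `verify` functions, the invoice example, and the JSON state format are hypothetical and do not come from the framework.

```python
import json
import random
from typing import Dict


def run_setup(seed: int) -> Dict[str, object]:
    """Hypothetical setup step: seed the environment and record what was planted."""
    rng = random.Random(seed)
    invoice_total = rng.randint(100, 999)
    # The agent only sees the application; this record stays with the verifier.
    return {"invoice_id": "INV-001", "expected_total": invoice_total}


def verify(privileged: Dict[str, object], final_state_json: str) -> bool:
    """Compare the final application state against the planted ground truth."""
    state = json.loads(final_state_json)
    entry = state.get(privileged["invoice_id"])
    return entry is not None and entry.get("total") == privileged["expected_total"]


privileged = run_setup(seed=7)
good_state = json.dumps({"INV-001": {"total": privileged["expected_total"]}})
bad_state = json.dumps({"INV-001": {"total": -1}})
```

Because the expected value is generated by the setup step rather than inferred from the screen, the check does not depend on the agent's own report of success.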

Performance Impact:

Test-Time Auditing improved Gemini-3-Flash performance from 11.5% to 14.0% (a gain of 2.5 percentage points) on Long-Horizon Task Planning
Independent audit agents reduced false positive completion rates by catching premature task termination
Multi-round creation-audit cycles achieved consistent quality across diverse software domains

Key Details

Architecture Components:

Creator agents generate complete task specifications including setup procedures, interaction requirements, and success criteria
Auditor agents apply Checklist-Based VLM Verification, checking each checklist item against privileged information from setup scripts
Feedback loops enable creators to refine outputs based on audit findings across multiple iterations
Memory agents summarize behavioral patterns to improve future environment generation
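A checklist-based audit of the kind listed above might look like the following sketch. The `ChecklistItem` type, the `Judge` signature, and the helper names are hypothetical. In practice the judge would be a call to a vision-language model, but a plain callable stands in for it here so the example stays runnable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ChecklistItem:
    question: str   # what the judge is asked about the final screenshot
    expected: str   # privileged answer taken from the setup script


# Assumption: a judge maps (question, screenshot bytes) to a short text answer.
Judge = Callable[[str, bytes], str]


def audit_with_checklist(
    items: List[ChecklistItem], screenshot: bytes, judge: Judge
) -> Dict[str, bool]:
    """Return a per-item verdict by comparing judge answers to privileged answers."""
    results: Dict[str, bool] = {}
    for item in items:
        answer = judge(item.question, screenshot)
        results[item.question] = answer.strip().lower() == item.expected.strip().lower()
    return results


def failed_items(results: Dict[str, bool]) -> List[str]:
    """Collect the questions that failed, to send back to the creator as feedback."""
    return [question for question, ok in results.items() if not ok]


# Toy judge that looks answers up in a table instead of reading the screenshot.
def toy_judge(question: str, screenshot: bytes) -> str:
    canned = {"Which sheet is active?": "Q3", "Is the file saved?": "no"}
    return canned.get(question, "unknown")


checklist = [
    ChecklistItem("Which sheet is active?", "q3"),
    ChecklistItem("Is the file saved?", "yes"),
]
results = audit_with_checklist(checklist, b"", toy_judge)
```

Returning per-item verdicts rather than a single pass/fail flag gives the creator specific findings to act on in the next iteration.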

Implementation in CUA-World:

Generated 10,103 verified tasks across healthcare, engineering, finance, and scientific software applications
Created CUA-World-Long with 200 challenging tasks requiring 500+ steps, where even GPT-5.4 achieves only a 27.5% pass rate
Applied Contamination Filtering to prevent data leakage between training and evaluation sets
Used containerized execution supporting Linux, Windows, and Android environments

Scalability Benefits:

Performance scales log-linearly with both software count and task count
Parallel creation-audit workflows enable rapid environment expansion
Standardized processes ensure consistency across diverse software domains
Trajectory Distillation allows 2B-parameter models trained on generated environments to outperform models twice their size
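One way to read "log-linear" scaling is shown below. The functional form and the symbols are assumptions made for illustration; the source does not give fitted coefficients. Here P is the performance metric, N is either the software count or the task count with the other held fixed, and alpha and beta are constants that would have to be fitted.

```latex
\[
P(N) \approx \alpha + \beta \log N
\qquad\Longrightarrow\qquad
P(2N) - P(N) \approx \beta \log 2
\]
```

Under this reading, each doubling of the count adds roughly the same absolute gain, so returns diminish per added environment but do not plateau within the fitted range.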

Quality Metrics:

Independent verification through audit agents prevents self-validation issues
Ground-truth validation using privileged information embedded in setup scripts
Systematic quality control enables reliable cross-software evaluation benchmarks
Test-time auditing catches incomplete work and improves task completion accuracy
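Test-time auditing, as described above, can be sketched as a gate on the agent's own "done" signal. The function name, the `act` and `audit` signatures, and the `max_audits` limit are hypothetical choices for this example, not details taken from the source.

```python
from typing import Callable, List, Tuple

Step = str


def run_with_test_time_audit(
    act: Callable[[List[Step], List[str]], Tuple[Step, bool]],
    audit: Callable[[List[Step]], List[str]],
    max_steps: int = 20,
    max_audits: int = 2,
) -> Tuple[List[Step], bool]:
    """Run the agent, auditing the trajectory each time it claims to be finished."""
    trajectory: List[Step] = []
    notes: List[str] = []
    audits_failed = 0
    for _ in range(max_steps):
        step, claims_done = act(trajectory, notes)
        trajectory.append(step)
        if not claims_done:
            continue
        notes = audit(trajectory)        # independent review of the work so far
        if not notes:
            return trajectory, True      # the auditor found nothing missing
        audits_failed += 1
        if audits_failed >= max_audits:
            break                        # stop after repeated failed audits
    return trajectory, False


# Toy agent that stops too early unless it has already edited the cell.
def toy_act(trajectory: List[Step], notes: List[str]) -> Tuple[Step, bool]:
    if not trajectory:
        return "open file", False
    if "edit cell" not in trajectory:
        return "edit cell", True         # premature "done": nothing is saved yet
    return "save file", True


def toy_trajectory_audit(trajectory: List[Step]) -> List[str]:
    return [] if "save file" in trajectory else ["changes were never saved"]


trajectory, accepted = run_with_test_time_audit(toy_act, toy_trajectory_audit)
```

In the toy run the agent declares completion after editing, the audit rejects it because nothing was saved, and the agent then finishes the task. This is the premature-termination case that the audit step is meant to catch.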

Relationships

Computer-Use Agents — primary consumers of multi-agent created environments for training and evaluation
GDP-Grounded Software Selection — methodology for selecting economically valuable software applications
Privileged Information Verification — evaluation technique using ground-truth data from setup scripts
Test-Time Auditing — independent agent reviews of completed trajectories to catch errors
Long-Horizon Task Planning — particularly benefits from verified complex environment setups requiring hundreds of steps
Creation-Audit Loop — iterative process enabling quality improvement through feedback
Behavioral Pattern Analysis — automated analysis of trajectories to identify success and failure patterns
Checklist-Based VLM Verification — specific implementation technique used by audit agents
Contamination Filtering — systematic approach to prevent data leakage in generated environments
Trajectory Distillation — training approach using successful trajectories from multi-agent created environments

Sources

sources/arxiv-260406126 — introduced Gym-Anything framework demonstrating multi-agent environment creation at scale, including creation-audit loops, test-time auditing, and CUA-World benchmark generation