Multi-Agent Environment Creation

Summary: A methodology that uses specialized creation and audit agents to automatically build and verify software environments at scale. This approach enables systematic generation of complex interactive environments while maintaining quality through independent validation loops, as demonstrated by the Gym-Anything framework's creation of 10,000+ verified tasks across 200+ software applications.

Overview

Multi-agent environment creation employs a division of labor between specialized agents to address the challenges of automated environment generation at scale. The methodology centers on a creation-audit loop where:

  • A creation agent generates interactive software environments with tasks, setup scripts, and evaluation criteria
  • An audit agent independently verifies environment quality, task feasibility, and correctness of success conditions
  • Memory summarization agents distill successful patterns and failures to improve future iterations

This separation of concerns reduces single-agent bias and makes environment generation more robust, because the agent that writes a task is never the one that approves it. The Gym-Anything framework demonstrates this approach by creating CUA-World, a benchmark of 10,103 tasks across 200+ software applications covering all 22 SOC occupation groups.
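The creation-audit loop described above can be sketched roughly as follows. This is a minimal illustration, not the Gym-Anything implementation: the names `Environment`, `AuditReport`, `creation_audit_loop`, and the stub creator and auditor are all hypothetical, and real agents would be model-backed rather than plain functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Environment:
    """Hypothetical container for one generated environment."""
    task: str
    setup_script: str
    success_criteria: str


@dataclass
class AuditReport:
    """Hypothetical audit result; an empty issue list means the audit passed."""
    issues: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.issues


# A creator receives the spec plus feedback from the previous audit round.
Creator = Callable[[str, List[str]], Environment]
# An auditor inspects a draft independently of the creator.
Auditor = Callable[[Environment], AuditReport]


def creation_audit_loop(spec: str, create: Creator, audit: Auditor,
                        max_rounds: int = 3) -> Optional[Environment]:
    """Repeat create -> audit until the audit passes or rounds run out."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        env = create(spec, feedback)
        report = audit(env)
        if report.passed:
            return env
        feedback = report.issues  # findings go back to the creator
    return None  # no draft passed within max_rounds


# Stub agents for demonstration only (assumptions, not real agents).
def stub_create(spec: str, feedback: List[str]) -> Environment:
    # The first draft omits success criteria; the revision adds them.
    criteria = "file report.csv exists" if feedback else ""
    return Environment(spec, "touch input.csv", criteria)


def stub_audit(env: Environment) -> AuditReport:
    issues = [] if env.success_criteria else ["missing success criteria"]
    return AuditReport(issues)


accepted = creation_audit_loop("Export a report", stub_create, stub_audit)
```

With these stubs the first draft is rejected and the second is accepted, which mirrors the refinement step in which a creator revises its output based on audit findings.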

The methodology addresses critical challenges in automated environment creation:

  • Quality assurance: Independent verification catches errors creators might miss
  • Scalability: Automated processes enable generation of thousands of environments
  • Economic relevance: GDP-Grounded Software Selection ensures focus on high-impact applications
  • Reliability: Privileged Information Verification using setup script data provides ground-truth validation

Performance Impact:

  • Test-Time Auditing improved Gemini-3-Flash performance from 11.5% to 14.0% on Long-Horizon Task Planning, a gain of 2.5 percentage points
  • Independent audit agents reduced false positive completion rates by catching premature task termination
  • Multi-round creation-audit cycles achieved consistent quality across diverse software domains

Key Details

Architecture Components:

  • Creator agents generate complete task specifications including setup procedures, interaction requirements, and success criteria
  • Auditor agents perform checklist-based verification using Checklist-Based VLM Verification with privileged information from setup scripts
  • Feedback loops enable creators to refine outputs based on audit findings across multiple iterations
  • Memory agents summarize behavioral patterns to improve future environment generation
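One way to picture checklist-based verification with privileged information is the sketch below. It assumes the setup script exposes ground-truth values as a dictionary and replaces the VLM with a plain callable, so `Judge`, `verify_checklist`, and `substring_judge` are hypothetical names and the substring check is a stand-in for a real model call.

```python
from typing import Callable, Dict, List

# Hypothetical judge signature: (checklist item, privileged facts, observation).
# In a real pipeline this would wrap a VLM call; here it is any callable.
Judge = Callable[[str, Dict[str, str], str], bool]


def verify_checklist(checklist: List[str], privileged: Dict[str, str],
                     observation: str, judge: Judge) -> List[str]:
    """Return the checklist items the judge could not confirm."""
    return [item for item in checklist
            if not judge(item, privileged, observation)]


def substring_judge(item: str, privileged: Dict[str, str],
                    observation: str) -> bool:
    # Stand-in for a VLM: the item passes if the ground-truth value that the
    # setup script recorded for it appears in the observed final state.
    return privileged[item] in observation


# Illustrative values only; these are not taken from any real task.
privileged = {"patient_name": "Jane Roe", "dose_mg": "250"}
checklist = ["patient_name", "dose_mg"]

ok = verify_checklist(checklist, privileged,
                      "Record saved: Jane Roe, 250 mg", substring_judge)
bad = verify_checklist(checklist, privileged,
                       "Record saved: Jane Roe, 500 mg", substring_judge)
```

The point of the design is that the auditor compares the final state against values the acting agent never saw, so a plausible-looking but wrong result still fails the check.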

Implementation in CUA-World:

  • Generated 10,103 verified tasks across healthcare, engineering, finance, and scientific software applications
  • Created CUA-World-Long with 200 challenging tasks requiring 500+ steps, on which even GPT-5.4 achieves only a 27.5% pass rate
  • Applied Contamination Filtering to prevent data leakage between training and evaluation sets
  • Used containerized execution supporting Linux, Windows, and Android environments

Scalability Benefits:

  • Performance scales log-linearly with both software count and task count
  • Parallel creation-audit workflows enable rapid environment expansion
  • Standardized processes ensure consistency across diverse software domains
  • Trajectory Distillation allows smaller 2B models trained on generated environments to outperform models 2× their size
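To make the log-linear scaling claim concrete, the sketch below assumes it means that performance grows linearly in the logarithm of the count, that is, score = a + b · ln(count). The function `fit_log_linear` is a hypothetical helper, and the data points are synthetic values generated from that formula, not results from the source.

```python
import math
from typing import List, Tuple


def fit_log_linear(counts: List[int], scores: List[float]) -> Tuple[float, float]:
    """Least-squares fit of score = a + b * ln(count); returns (a, b)."""
    xs = [math.log(n) for n in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic points generated from a = 5.0, b = 2.0 (NOT measured results).
counts = [10, 100, 1000]
scores = [5.0 + 2.0 * math.log(n) for n in counts]

a, b = fit_log_linear(counts, scores)
gain_per_doubling = b * math.log(2)  # constant gain each time the count doubles
```

Under this reading, every doubling of the software or task count adds the same fixed increment, which is why returns diminish in absolute terms as the benchmark grows.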

Quality Metrics:

  • Independent verification through audit agents prevents self-validation issues
  • Ground-truth validation using privileged information embedded in setup scripts
  • Systematic quality control enables reliable cross-software evaluation benchmarks
  • Test-time auditing catches incomplete work and improves task completion accuracy
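Test-time auditing can be sketched as a wrapper that refuses to accept an agent's "done" signal until an auditor agrees. The example below is an assumption about how such a wrapper might look, not the framework's API: `run_with_test_time_audit`, `stub_agent_step`, and `stub_audit` are hypothetical names.

```python
from typing import Callable, List, Optional

# Returns True when the agent claims the task is finished.
AgentStep = Callable[[Optional[List[str]]], bool]
# Returns outstanding issues; an empty list means the work is complete.
Audit = Callable[[], List[str]]


def run_with_test_time_audit(agent_step: AgentStep, audit: Audit,
                             max_steps: int = 20) -> bool:
    """Accept a completion claim only after the audit finds no issues."""
    feedback: Optional[List[str]] = None
    for _ in range(max_steps):
        claims_done = agent_step(feedback)
        feedback = None
        if not claims_done:
            continue
        issues = audit()
        if not issues:
            return True      # claim confirmed
        feedback = issues    # premature claim: send the agent back to work
    return False             # step budget exhausted without a confirmed finish


# Stub agent that stops too early, then fixes the problem when told about it.
state = {"saved": False}


def stub_agent_step(feedback: Optional[List[str]]) -> bool:
    if feedback:
        state["saved"] = True
    return True  # always claims to be done


def stub_audit() -> List[str]:
    return [] if state["saved"] else ["file was not saved"]


result = run_with_test_time_audit(stub_agent_step, stub_audit)
```

Here the first completion claim is rejected because the file was never saved, and the second is accepted after the agent acts on the audit feedback.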

Relationships

Sources

  • sources/arxiv-260406126 — introduced Gym-Anything framework demonstrating multi-agent environment creation at scale, including creation-audit loops, test-time auditing, and CUA-World benchmark generation