GDP-Based Evaluation

Summary: A benchmark design approach that selects software applications and tasks based on their economic importance as measured by U.S. GDP data and occupational impact. This methodology ensures evaluation environments reflect real-world economic significance rather than researcher convenience.

Overview

GDP-Based Evaluation changes how benchmarks for Computer-Use Agents are constructed. Instead of selecting software applications based on availability, ease of implementation, or researcher familiarity, this approach uses economic data to identify the software tools with the greatest real-world impact.

The methodology uses U.S. occupational data tied to GDP contributions to select software applications systematically. Grounding benchmark construction in this data means agent evaluation reflects the software tools that actually drive economic productivity, which addresses a gap in traditional benchmarking, where convenience often outweighs real-world relevance.

The approach was pioneered in the creation of CUA-World, which used GDP data to select 200 software applications across all 22 Standard Occupational Classification (SOC) major occupation groups, yielding over 10,000 tasks intended to reflect the economic importance of the software they cover.
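The selection step described above can be illustrated with a small sketch. The source does not specify how CUA-World distributes applications across occupation groups, so everything here is an assumption for illustration: the `OccupationGroup` type, the `allocate_slots` helper, the proportional largest-remainder rule, the one-slot minimum per group, and the demo weights are all hypothetical. Only the SOC codes and group names are real.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OccupationGroup:
    soc_code: str      # SOC major group code, e.g. "13-0000"
    name: str
    gdp_weight: float  # hypothetical economic weight, any non-negative unit


def allocate_slots(groups, total_slots, min_per_group=1):
    """Split total_slots across groups in proportion to gdp_weight.

    Assumed rule (not taken from the source): every group first receives
    min_per_group slots so no group is left out, then the remaining slots
    are shared proportionally using largest-remainder rounding.
    """
    if total_slots < min_per_group * len(groups):
        raise ValueError("not enough slots to cover every group")
    total_weight = sum(g.gdp_weight for g in groups)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    remaining = total_slots - min_per_group * len(groups)
    # Exact (fractional) proportional share of the remaining slots.
    exact = {g.soc_code: remaining * g.gdp_weight / total_weight for g in groups}
    # Guaranteed minimum plus the whole-number part of each share.
    slots = {code: min_per_group + int(share) for code, share in exact.items()}

    # Hand out slots lost to rounding, largest fractional part first.
    leftover = total_slots - sum(slots.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - int(exact[c]), reverse=True)
    for code in by_remainder[:leftover]:
        slots[code] += 1
    return slots


# Illustrative weights only; these are not real GDP figures.
demo_groups = [
    OccupationGroup("11-0000", "Management", 50.0),
    OccupationGroup("13-0000", "Business and Financial Operations", 30.0),
    OccupationGroup("15-0000", "Computer and Mathematical", 20.0),
]
demo_slots = allocate_slots(demo_groups, total_slots=10)
print(demo_slots)  # {'11-0000': 5, '13-0000': 3, '15-0000': 2}
```

Under these assumptions, the minimum-per-group floor mirrors the goal of covering every SOC major group, while the proportional share mirrors the economic weighting. A real pipeline would also need a step that maps each occupation group to the specific applications its workers use, which this sketch leaves out.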

Key Details

  • Economic Grounding: Software selection based on occupational GDP contribution data rather than technical convenience or availability
  • Comprehensive Coverage: Spans all 22 SOC major occupation groups to ensure no significant economic sectors are overlooked
  • Scale: Reaches 200+ software applications and 10,000+ tasks while maintaining economic relevance
  • Real-World Alignment: Tasks reflect actual software usage patterns in economically significant occupations
  • Benchmark Integrity: Prevents researcher bias toward easily accessible or well-documented software
  • Cross-Industry Representation: Ensures evaluation environments span diverse economic sectors from finance to manufacturing to healthcare

The GDP-based approach revealed significant performance gaps in current Computer-Use Agents: even frontier models achieved a success rate of only 22.6%. This suggests that economically important software poses more realistic challenges than convenience-selected alternatives.

Relationships

Sources

  • sources/arxiv-260406126 — introduced GDP-based evaluation methodology and demonstrated its implementation in CUA-World benchmark creation