Agent Evaluation

Summary: Methods and frameworks for assessing AI agent performance across different tasks and environments. Includes both process-based evaluation (how well agents execute) and outcome-based evaluation (whether goals are achieved), with particular focus on trajectory verification for computer use agents.

Overview

Agent evaluation is the systematic assessment of AI agent performance using standardized metrics, benchmarks, and verification systems. Modern agent evaluation has moved beyond simple success/failure metrics to rubric-based systems that distinguish execution quality from goal achievement, detect hallucinations, and agree with human judges about as closely as human annotators agree with one another.

The field addresses key challenges in evaluating autonomous systems that operate in complex, multi-step environments where traditional metrics may not capture nuanced performance differences. This is particularly important for Computer Use Agents that interact with visual interfaces and must be evaluated across diverse task contexts.

Key Details

Evaluation Dimensions:

  • Process rewards — measure execution quality and adherence to best practices
  • Outcome rewards — assess whether the agent achieved its stated goals
  • The two can diverge: environmental factors outside the agent's control can prevent success despite good execution
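
The split between the two reward types can be illustrated with a minimal sketch. The TrajectoryScore record, the diverges helper, and the 0.8 threshold are hypothetical names and values chosen for illustration; they are not taken from any particular verifier.

```python
from dataclasses import dataclass


@dataclass
class TrajectoryScore:
    """Hypothetical record holding both reward signals for one trajectory."""
    process: float  # execution quality in [0, 1] (assumed scale)
    outcome: bool   # True if the stated goal was achieved


def diverges(score: TrajectoryScore, process_threshold: float = 0.8) -> bool:
    """Flag trajectories where good execution did not lead to success.

    The threshold is an illustrative assumption, not a published value.
    """
    return score.process >= process_threshold and not score.outcome


# The agent executed well, but (say) the target site was unavailable.
blocked = TrajectoryScore(process=0.9, outcome=False)
# The agent executed well and reached the goal.
clean = TrajectoryScore(process=0.9, outcome=True)

print(diverges(blocked), diverges(clean))
```

Keeping the two signals in separate fields, rather than collapsing them into one score, is what lets a report tell "the agent made a mistake" apart from "the environment blocked a well-executed attempt".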

Universal Verifier Performance:

  • Reaches Cohen's κ ≈ 0.7 against human labels, comparable to human inter-annotator agreement
  • Reduces false positive rates from 45%+ (WebVoyager) and 22%+ (WebJudge) to 1-8%
  • Uses structured rubrics with specific, non-overlapping criteria
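
For reference, both metrics above can be computed from paired verifier and human labels. Cohen's κ is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below uses made-up toy labels, and the function names are illustrative rather than part of any verifier's API.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def false_positive_rate(verifier, human):
    """Share of human-labelled failures that the verifier marked as success."""
    on_negatives = [v for v, h in zip(verifier, human) if not h]
    if not on_negatives:
        return 0.0
    return sum(on_negatives) / len(on_negatives)


# Toy data (1 = success, 0 = failure); not real benchmark labels.
human = [1, 1, 0, 0, 1, 0, 1, 0]
verifier = [1, 1, 0, 0, 1, 0, 0, 1]

print(cohens_kappa(human, verifier))         # 0.5
print(false_positive_rate(verifier, human))  # 0.25
```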

Design Principles:

  • Separate controllable from uncontrollable failures when scoring agent performance
  • Effective Screenshot Context Management using relevance matrices
  • Two-pass Hallucination Detection (with/without visual evidence)
  • Conditional criteria handling for adaptive task requirements
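
Two of these principles can be sketched in a few lines. The details are assumptions: the relevance matrix is taken to be a criteria-by-screenshots table of scores, from which the top k screenshots per criterion are kept, and a conditional criterion is modelled as one with an optional applicability predicate that removes it from scoring when it does not apply. The names select_screenshots, Criterion, and rubric_score are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


def select_screenshots(relevance: list[list[float]], k: int) -> list[list[int]]:
    """For each criterion (row), keep the indices of the k most relevant
    screenshots (columns), returned in chronological order."""
    selected = []
    for row in relevance:
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        selected.append(sorted(top))
    return selected


@dataclass
class Criterion:
    """Hypothetical rubric item; `applies` makes it conditional on the task."""
    name: str
    passed: bool
    applies: Optional[Callable[[dict], bool]] = None


def rubric_score(criteria: list[Criterion], task: dict) -> float:
    """Fraction of applicable criteria that passed; inapplicable ones are skipped."""
    applicable = [c for c in criteria if c.applies is None or c.applies(task)]
    if not applicable:
        return 0.0
    return sum(c.passed for c in applicable) / len(applicable)


# Toy relevance matrix: 2 criteria x 4 screenshots (illustrative scores).
relevance = [
    [0.1, 0.9, 0.3, 0.8],
    [0.7, 0.2, 0.6, 0.1],
]
print(select_screenshots(relevance, k=2))  # [[1, 3], [0, 2]]

criteria = [
    Criterion("opened the correct site", True),
    Criterion("applied the coupon", False, applies=lambda t: t.get("has_coupon", False)),
    Criterion("completed checkout", True),
]
print(rubric_score(criteria, {"has_coupon": False}))  # 1.0
```

Dropping inapplicable criteria from the denominator, instead of counting them as failures, keeps the agent from being penalized for steps the task never required.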

Benchmarking:

  • CUAVerifierBench provides the first specialized benchmark for verifier quality
  • Includes both process and outcome human labels for comprehensive evaluation
  • Auto-research agents can reach 70% of expert quality in 5% of expert time

Relationships

Sources