Trajectory Verification

Summary: Trajectory verification is the process of evaluating whether agent execution sequences succeeded or failed by analyzing their actions, outcomes, and adherence to task requirements. The evaluation distinguishes execution quality (process) from goal achievement (outcome), so that a well-executed run blocked by its environment is not scored the same as a poorly executed one.

Overview

Trajectory verification addresses the challenge of automatically assessing the performance of Computer Use Agents without requiring expensive human evaluation for every execution. The process analyzes complete execution sequences (trajectories) to determine success or failure across multiple dimensions.

Microsoft Research's Universal Verifier represents the state-of-the-art approach, achieving human-level agreement (Cohen's κ≈0.7) by implementing four core design principles: structured non-overlapping rubric criteria, separation of Process vs Outcome Rewards, distinction between controllable and uncontrollable failure factors, and comprehensive Screenshot Context Management.

The verification process relies heavily on Multimodal LLMs, combining visual evidence from screenshots with textual action logs to detect discrepancies and validate agent claims. This approach is essential for identifying hallucinations where agents report actions or results contradicted by visual evidence.

A key innovation is the two-pass scoring system that evaluates trajectories both with and without screenshots to catch agent fabrications. The Universal Verifier also employs a screenshot relevance matrix that selects the top-k most relevant screenshots per rubric criterion rather than truncating sequences or processing all visual data indiscriminately.
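
The top-k selection step can be illustrated with a minimal Python sketch. It assumes the relevance matrix is already available as a list of rows (one per screenshot) holding one score per rubric criterion. The function name `select_top_k`, the score values, and the choice to return the chosen indices in chronological order are illustrative assumptions, not details of the Universal Verifier.

```python
from typing import Dict, List


def select_top_k(relevance: List[List[float]], k: int) -> Dict[int, List[int]]:
    """Pick the k most relevant screenshots for each rubric criterion.

    relevance[i][j] is a hypothetical relevance score of screenshot i
    for criterion j. Returns {criterion index: [screenshot indices]}.
    """
    if not relevance:
        return {}
    num_criteria = len(relevance[0])
    selected: Dict[int, List[int]] = {}
    for j in range(num_criteria):
        # Rank screenshot indices by their score for criterion j, best first.
        ranked = sorted(
            range(len(relevance)),
            key=lambda i: relevance[i][j],
            reverse=True,
        )
        # Keep the k best, then restore chronological order for the verifier.
        selected[j] = sorted(ranked[:k])
    return selected


# Four screenshots scored against two criteria (made-up values).
relevance = [
    [0.1, 0.9],
    [0.8, 0.2],
    [0.7, 0.1],
    [0.2, 0.6],
]
selection = select_top_k(relevance, k=2)
print(selection)  # {0: [1, 2], 1: [0, 3]}
```

Selecting per criterion, rather than keeping only the last few screenshots, means evidence from early in a long trajectory can still reach the criterion it supports.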

Key Details

Performance Metrics:

  • Human-AI Agreement: Universal Verifier reaches Cohen's κ≈0.7 against human labels, comparable to the agreement between human annotators
  • False Positive Rate: 1-8% for Universal Verifier, versus 45%+ for WebVoyager and 22%+ for WebJudge
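
For reference, the two headline metrics can be computed from paired binary labels as in the sketch below. It uses the standard definitions of Cohen's kappa and of false positive rate (cases the verifier marks as success among all cases a human marks as failure). The label data is made up for illustration, and the function names are assumptions rather than part of any published tooling.

```python
from typing import Sequence


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters assigning binary labels."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("label sequences must be non-empty and equal length")
    # Observed agreement: fraction of items where both raters agree.
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal rate of positive labels.
    rate_a = sum(a) / n
    rate_b = sum(b) / n
    p_expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


def false_positive_rate(human: Sequence[bool], verifier: Sequence[bool]) -> float:
    """Share of human-labeled failures that the verifier marked as success."""
    on_failures = [v for h, v in zip(human, verifier) if not h]
    if not on_failures:
        raise ValueError("no human-labeled failures to evaluate")
    return sum(on_failures) / len(on_failures)


# True = trajectory judged successful (illustrative labels only).
human = [True, True, True, False, False, False, False, True, False, False]
verifier = [True, True, False, False, False, False, True, True, False, False]

kappa = cohens_kappa(human, verifier)        # about 0.58
fpr = false_positive_rate(human, verifier)   # about 0.17
print(round(kappa, 3), round(fpr, 3))
```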

Core Components:

  • Rubric Design: Creates specific, non-overlapping evaluation criteria from task descriptions with conditional criteria handling
  • Screenshot relevance matrix: Scores each screenshot against rubric criteria to select top-k most relevant per criterion
  • Error Taxonomy: Systematic classification covering selection errors, execution failures, critical point misses, and side effects
  • Conditional criteria handling: Adapts evaluation when task conditions aren't met (e.g., "buy organic if available, else non-organic")
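
Conditional criteria handling can be sketched as a rubric whose entries carry an optional guard. The `Criterion` class, the `applies_when` field, and the `organic_available` observation key below are hypothetical names chosen for illustration; the source does not specify how the Universal Verifier represents its rubric.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Criterion:
    name: str
    description: str
    # Optional guard: the criterion is scored only when this returns True.
    applies_when: Optional[Callable[[Dict[str, bool]], bool]] = None


def active_criteria(
    rubric: List[Criterion], observations: Dict[str, bool]
) -> List[Criterion]:
    """Return the criteria that apply given what was observed in the environment."""
    return [
        c for c in rubric
        if c.applies_when is None or c.applies_when(observations)
    ]


# Rubric for the task "buy organic if available, else non-organic".
rubric = [
    Criterion("added_to_cart", "The requested product is in the cart"),
    Criterion(
        "organic_chosen",
        "The organic variant was chosen",
        applies_when=lambda obs: obs["organic_available"],
    ),
    Criterion(
        "non_organic_chosen",
        "The non-organic variant was chosen",
        applies_when=lambda obs: not obs["organic_available"],
    ),
]

names = [c.name for c in active_criteria(rubric, {"organic_available": False})]
print(names)  # ['added_to_cart', 'non_organic_chosen']
```

Because the two variant criteria have mutually exclusive guards, only one of them is ever scored, which keeps the rubric non-overlapping for either branch of the task.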

Evaluation Dimensions:

  • Process vs Outcome Rewards: Process measures execution quality; outcome measures goal achievement. The two can diverge when the environment blocks success
  • Controllable vs uncontrollable factors: Distinguishes agent errors from environment limitations
  • Hallucination Detection: Two-pass scoring identifies agent fabrications and contradictions
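
The three dimensions above can be combined in a small sketch. It assumes each pass yields a pass/fail score per process criterion, and it treats a criterion that passes on the text-only pass but fails once screenshots are included as a suspected fabrication. That rule, the `Verdict` fields, and the `environment_blocked` flag are assumptions made for illustration, not the Universal Verifier's actual logic.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Verdict:
    process_success: bool
    outcome_success: bool
    failure_attribution: Optional[str]  # None, "controllable", or "uncontrollable"
    suspected_fabrications: List[str]


def flag_fabrications(
    text_only: Dict[str, bool], with_screenshots: Dict[str, bool]
) -> List[str]:
    """Criteria the agent's log claims to satisfy but the screenshots contradict."""
    return [
        name for name, passed in text_only.items()
        if passed and not with_screenshots.get(name, False)
    ]


def verify(
    text_only: Dict[str, bool],
    with_screenshots: Dict[str, bool],
    goal_reached: bool,
    environment_blocked: bool,
) -> Verdict:
    fabrications = flag_fabrications(text_only, with_screenshots)
    # Process is judged on the screenshot-grounded pass only.
    process_ok = all(with_screenshots.values()) and not fabrications
    if goal_reached:
        attribution = None
    elif environment_blocked:
        attribution = "uncontrollable"  # e.g. the item was out of stock
    else:
        attribution = "controllable"    # the agent could have done better
    return Verdict(process_ok, goal_reached, attribution, fabrications)


# Good process, but the environment blocked the outcome.
scores = {"searched_product": True, "opened_product_page": True}
blocked = verify(scores, scores, goal_reached=False, environment_blocked=True)

# The log claims checkout succeeded; the screenshots disagree.
fabricated = verify(
    {"searched_product": True, "completed_checkout": True},
    {"searched_product": True, "completed_checkout": False},
    goal_reached=False,
    environment_blocked=False,
)
print(blocked)
print(fabricated.suspected_fabrications)  # ['completed_checkout']
```

The first case shows process and outcome diverging: the agent is credited for execution quality even though the goal was not reached.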

Benchmarks:

  • CUAVerifierBench: First benchmark built specifically to measure verifier quality; its 246 trajectories carry human labels for both process and outcome
  • Auto-research Agents: Using trajectory verification, reached 70% of expert quality in 5% of expert time

Relationships

  • Computer Use Agents — the autonomous systems whose execution sequences require verification
  • Process vs Outcome Rewards — the dual evaluation framework separating execution quality from goal achievement
  • Screenshot Context Management — efficient processing of visual evidence across long interaction sequences
  • Hallucination Detection — key capability for identifying false agent claims through visual contradiction
  • Rubric Design — structured methodology for creating evaluation criteria with conditional handling
  • Multimodal LLMs — the assessment approach combining visual and textual evidence analysis
  • Inter-annotator Agreement — metric for measuring verifier reliability using Cohen's kappa
  • False Positive Rate — critical metric measuring incorrect success classifications
  • WebVoyager — previous verifier system with higher false positive rates
  • WebJudge — another baseline verifier system improved upon by Universal Verifier
  • Agent Evaluation — broader field encompassing trajectory verification methods
  • Visual Grounding — underlying capability needed for screenshot-based verification

Sources