Rubric Design

Summary: Rubric design builds structured evaluation frameworks that define specific, measurable criteria for assessing multi-step agent performance. Effective rubrics separate process execution from outcome achievement and use non-overlapping criteria to reduce evaluation ambiguity.

Overview

Rubric design for evaluating Computer Use Agents means building systematic assessment frameworks that break complex tasks into measurable components. The central idea is to distinguish process rewards (how well the agent executes) from outcome rewards (whether the goal was achieved). The separation matters because an agent can execute every step correctly yet fail because of environmental factors outside its control, or, conversely, reach the goal through a suboptimal process.
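The process/outcome split can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scoring used by any particular verifier: the names Kind, CriterionResult, and split_scores are hypothetical, and excluding uncontrollable failures from the process score is a choice made for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Kind(Enum):
    PROCESS = "process"  # how well the agent executed a step
    OUTCOME = "outcome"  # whether the goal state was reached


@dataclass
class CriterionResult:
    name: str
    kind: Kind
    passed: bool
    controllable: bool = True  # False if a failure was outside the agent's control


def _pass_rate(items: List[CriterionResult]) -> Optional[float]:
    # Fraction of passed criteria, or None when there is nothing to score.
    return sum(r.passed for r in items) / len(items) if items else None


def split_scores(results: List[CriterionResult]) -> Dict[str, object]:
    # Assumption for this sketch: a failure the agent could not control
    # is excluded from the process score but still reported separately.
    process = [r for r in results
               if r.kind is Kind.PROCESS and (r.passed or r.controllable)]
    outcome = [r for r in results if r.kind is Kind.OUTCOME]
    return {
        "process": _pass_rate(process),
        "outcome": _pass_rate(outcome),
        "uncontrollable_failures": [
            r.name for r in results if not r.passed and not r.controllable
        ],
    }


# Hypothetical trajectory: every step was executed correctly, but the
# order could not be placed for a reason outside the agent's control.
results = [
    CriterionResult("searched for the product", Kind.PROCESS, True),
    CriterionResult("added product to cart", Kind.PROCESS, True),
    CriterionResult("order placed", Kind.OUTCOME, False, controllable=False),
]
scores = split_scores(results)
```

In this example the process score is 1.0 and the outcome score is 0.0, so the report shows that the agent executed well even though the goal was not reached.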

Effective rubrics use specific, non-overlapping criteria that prevent evaluator confusion and reduce false positives. Rather than vague assessments like "task completed successfully," well-designed rubrics specify granular conditions: "product added to cart," "correct shipping address entered," "payment processed without errors."
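As a rough sketch, the granular conditions quoted above could be written as independent checks over a recorded final state. The state fields (target_product, cart, shipping_address, expected_address, payment_status) and the evaluate function are hypothetical names chosen for illustration; a real verifier would derive these facts from the agent's trajectory.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]

# Each criterion inspects a different (hypothetical) field of the final
# state, which keeps the checks from overlapping with one another.
RUBRIC: List[Tuple[str, Callable[[State], bool]]] = [
    ("product added to cart",
     lambda s: s["target_product"] in s["cart"]),
    ("correct shipping address entered",
     lambda s: s["shipping_address"] == s["expected_address"]),
    ("payment processed without errors",
     lambda s: s["payment_status"] == "ok"),
]


def evaluate(state: State) -> Dict[str, bool]:
    # Return one verdict per criterion instead of a single pass/fail flag.
    return {name: check(state) for name, check in RUBRIC}


# Made-up final state for illustration.
state = {
    "target_product": "sku-123",
    "cart": ["sku-123"],
    "shipping_address": "1 Main St",
    "expected_address": "1 Main St",
    "payment_status": "declined",
}
report = evaluate(state)
```

Here the report marks the first two criteria as met and the third as failed, which says more than a single "task completed successfully" verdict would.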

Key Details

  • Four core design principles: (1) specific, non-overlapping rubrics; (2) separate process vs outcome rewards; (3) distinguish controllable vs uncontrollable failures; (4) effective context management of all screenshots
  • Cohen's κ ≈ 0.7 agreement: Well-designed rubrics reach human-level Inter-annotator Agreement (Cohen's κ ≈ 0.7)
  • Dramatic error reduction: Proper rubric design lowers the False Positive Rate to 1-8%, versus 45% or more for existing evaluation methods
  • Conditional criteria handling: Rubrics adapt when task conditions aren't met (e.g., "buy organic if available, else non-organic")
  • Screenshot relevance matrix: Selects top-k most relevant screenshots per rubric criterion rather than processing all visual evidence
  • Two-pass scoring structure: Evaluates with and without screenshots to detect agent fabrications
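
The screenshot relevance matrix can be illustrated with a small top-k selection routine. This is a minimal sketch: how relevance scores are produced is not specified here, so the matrix below is made-up data, and the function name top_k_screenshots is hypothetical.

```python
from typing import List


def top_k_screenshots(relevance: List[List[float]], k: int) -> List[List[int]]:
    """Pick the k most relevant screenshot indices for each criterion.

    relevance[i][j] is the relevance of screenshot j to criterion i.
    Selected indices are returned in chronological (ascending) order.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    selected = []
    for row in relevance:
        # Rank screenshot indices by descending relevance for this criterion.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        # Keep the best k and restore their original order in the trajectory.
        selected.append(sorted(ranked[:k]))
    return selected


# Made-up scores: 2 criteria (rows) x 4 screenshots (columns).
relevance = [
    [0.1, 0.9, 0.3, 0.8],
    [0.7, 0.2, 0.6, 0.1],
]
picks = top_k_screenshots(relevance, k=2)
```

With this data the first criterion is checked against screenshots 1 and 3 and the second against screenshots 0 and 2, so each criterion sees only the evidence most relevant to it.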

The CUAVerifierBench benchmark demonstrates that structured rubrics enable consistent evaluation across different human annotators and automated verifiers, establishing a foundation for reliable Agent Evaluation at scale.
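
Agreement between two annotators, or between an annotator and an automated verifier, is commonly summarized with Cohen's κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below computes it in plain Python; the function name cohens_kappa and the labels are illustrative, not taken from the benchmark.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    p_e = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, so this sketch treats it as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up pass/fail verdicts (1 = criterion met) from two raters.
rater_1 = [1, 1, 0, 0]
rater_2 = [1, 0, 0, 0]
kappa = cohens_kappa(rater_1, rater_2)
```

For these four items p_o is 0.75 and p_e is 0.5, giving κ = 0.5.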

Relationships

Sources