Rubric Generation

Summary: Rubric generation is the process of transforming a task description into structured evaluation criteria that are specific and non-overlapping. Microsoft Research identifies it as a critical design principle for building reliable verifiers that reach human-level agreement when evaluating computer use agents.

Overview

Rubric generation builds a systematic evaluation framework from a high-level task description. The key insight is that effective rubrics must contain specific, non-overlapping criteria rather than vague or redundant evaluation points. With rubrics built this way, automated verifiers reach an agreement of Cohen's κ ≈ 0.7 with human evaluators, which matches the level at which human evaluators agree with one another.
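
For readers unfamiliar with the agreement statistic, the following is a minimal sketch of how Cohen's κ can be computed for two raters. The function name and the verdict lists are hypothetical and made up for illustration; they are not data or code from the source.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts (1 = task succeeded, 0 = task failed); illustration only.
human = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
verifier = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
kappa = cohens_kappa(human, verifier)
print(round(kappa, 3))  # 0.583
```

In this toy example the raters agree on 8 of 10 items, but because chance agreement is 0.52, κ is only about 0.58. This is why κ is a stricter measure than raw percentage agreement.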

The process requires careful decomposition of tasks into measurable components that can be independently verified. Rather than using broad criteria like "task completion," effective rubrics break down evaluation into granular elements such as specific actions taken, intermediate states achieved, and observable outcomes.
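
One way to picture this decomposition is as a small data structure, sketched below under stated assumptions. The class names, the three `kind` values, the example task, and the token-overlap check are all hypothetical. In particular, the overlap check is a crude lexical heuristic chosen for illustration, not the method described in the source.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass(frozen=True)
class Criterion:
    id: str
    description: str  # one concrete, independently checkable statement
    kind: str         # assumed categories: "action", "state", or "outcome"


@dataclass
class Rubric:
    task: str
    criteria: list[Criterion] = field(default_factory=list)


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def overlapping_pairs(rubric: Rubric, threshold: float = 0.6):
    """Flag criterion pairs whose wording is similar enough to risk double-counting."""
    return [
        (x.id, y.id)
        for x, y in combinations(rubric.criteria, 2)
        if token_overlap(x.description, y.description) >= threshold
    ]


# Hypothetical task, decomposed into an action, an intermediate state, and an outcome
# instead of a single broad "task completion" criterion.
rubric = Rubric(
    task="Book a one-way flight from Seattle to Boston on March 3",
    criteria=[
        Criterion("c1", "The agent entered Seattle as the origin and Boston as the destination", "action"),
        Criterion("c2", "The search results page lists flights dated March 3", "state"),
        Criterion("c3", "A confirmation page shows a booked one-way Seattle to Boston flight", "outcome"),
    ],
)
print(overlapping_pairs(rubric))  # [] (no pair exceeds the overlap threshold)
```

A lexical check like this only catches near-duplicate wording. Two criteria can be phrased differently and still measure the same thing, so it is at best a first filter before a semantic review.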

Key Details

Human-level agreement: Properly generated rubrics enable verifiers to achieve Cohen's κ≈0.7 with human evaluators
Specificity requirement: Criteria must be concrete and measurable rather than subjective or vague
Non-overlapping principle: Each rubric component should evaluate distinct aspects to avoid double-counting errors
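Screenshot integration: Each rubric criterion is scored against the screenshots relevant to it, identified through a relevance matrix of criteria against screenshots
Top-k selection: The system selects the k most relevant screenshots per criterion to avoid "needle-in-haystack" problems, in which the relevant evidence is buried among many irrelevant screenshots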
Failure mode mitigation: Well-structured rubrics reduce phantom criteria, cascading errors, and confirmation bias
Dramatic improvement: Structured rubrics reduce false positive rates from over 30% to between 1% and 8%, compared with less structured approaches
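
The relevance matrix and top-k selection described above can be sketched as follows. This is a minimal illustration that assumes relevance scores are already available as a criteria-by-screenshots matrix. The function name and the scores are hypothetical, and how the source system computes relevance is not specified here.

```python
def top_k_screenshots(relevance, k):
    """Pick the k most relevant screenshot indices for each criterion.

    relevance[i][j] is the (assumed) relevance of screenshot j to criterion i.
    Returns one list of screenshot indices per criterion, most relevant first.
    """
    selected = []
    for row in relevance:
        # Sort by descending score; break ties by the earlier screenshot index.
        ranked = sorted(range(len(row)), key=lambda j: (-row[j], j))
        selected.append(ranked[:k])
    return selected


# Hypothetical scores: 3 criteria (rows) x 5 screenshots (columns).
relevance = [
    [0.10, 0.90, 0.30, 0.20, 0.00],
    [0.00, 0.20, 0.80, 0.70, 0.10],
    [0.00, 0.10, 0.10, 0.40, 0.95],
]
print(top_k_screenshots(relevance, k=2))  # [[1, 2], [2, 3], [4, 3]]
```

Each criterion is then judged only against its own short list of screenshots, so the verifier does not have to locate the relevant evidence within the full trajectory.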

Relationships

Trajectory Verification — rubrics provide the structured criteria for evaluating agent execution sequences
Process vs Outcome Rewards — rubrics must separately define criteria for execution quality and goal achievement
Screenshot Analysis — rubric criteria are matched against visual evidence through relevance scoring
Human-AI Agreement — properly generated rubrics are essential for achieving human-level evaluator alignment
Hallucination Detection — specific rubric criteria enable systematic identification of agent claims contradicted by evidence
Computer Use Agents — rubrics provide the evaluation framework for assessing autonomous computer operation
Error Taxonomy — rubric generation must account for systematic classification of failure modes

Sources

sources/the-art-of-building-verifiers-for-computer-use-agents — identified rubric generation as core design principle, demonstrated dramatic FPR reduction and human-level agreement through structured criteria