Rubric Generation

Summary: The process of transforming task descriptions into structured evaluation criteria with specific, non-overlapping components. Microsoft Research identifies this as a critical design principle for building reliable verifiers that achieve human-level agreement in evaluating computer use agents.

Overview

Rubric generation involves creating systematic evaluation frameworks from high-level task descriptions. The key insight is that effective rubrics must contain specific, non-overlapping criteria rather than vague or redundant evaluation points. This structured approach enables automated verifiers to achieve Cohen's κ≈0.7 agreement with human evaluators, matching the level at which humans agree with each other.
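
To make the agreement figure concrete, here is a minimal sketch of how Cohen's κ could be computed between a verifier's pass/fail verdicts and a human's. The function name and the label lists are illustrative assumptions, not data or code from the source.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up verdicts (1 = task judged successful, 0 = judged failed).
human = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
verifier = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
print(round(cohens_kappa(human, verifier), 2))  # about 0.6 for these labels
```

Unlike raw accuracy, κ discounts the agreement two raters would reach by chance, which is why it is the usual yardstick for comparing verifier-human agreement with human-human agreement.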

The process requires careful decomposition of tasks into measurable components that can be independently verified. Rather than using broad criteria like "task completion," effective rubrics break down evaluation into granular elements such as specific actions taken, intermediate states achieved, and observable outcomes.
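
The decomposition described above might be represented as follows. This is a sketch under stated assumptions: the `Criterion` type, the example task (renaming a file), and the word-overlap heuristic for flagging redundant criteria are all hypothetical illustrations, not the method used in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    id: str
    kind: str          # "action", "state", or "outcome"
    description: str   # one concrete, independently checkable statement


def word_overlap(a, b):
    """Jaccard similarity of the two descriptions' word sets (crude proxy)."""
    wa = set(a.description.lower().split())
    wb = set(b.description.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def find_overlaps(rubric, threshold=0.6):
    """Return id pairs whose descriptions look redundant."""
    pairs = []
    for i, a in enumerate(rubric):
        for b in rubric[i + 1:]:
            if word_overlap(a, b) >= threshold:
                pairs.append((a.id, b.id))
    return pairs


# Hypothetical rubric for the task "rename report.docx to final.docx".
RUBRIC = [
    Criterion("c1", "action", "agent opened the rename dialog for report.docx"),
    Criterion("c2", "state", "rename field contained the text final.docx"),
    Criterion("c3", "outcome", "documents folder lists a file named final.docx"),
    Criterion("c4", "outcome", "documents folder lists a file called final.docx"),
]

print(find_overlaps(RUBRIC))  # flags c3 and c4 as near-duplicates
```

Criteria c3 and c4 check the same outcome in different words, so scoring both would double-count it. A real pipeline would likely need a semantic comparison rather than word overlap, but the check illustrates why each component should cover a distinct aspect of the task.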

Key Details

  • Human-level agreement: Properly generated rubrics enable verifiers to achieve Cohen's κ≈0.7 with human evaluators
  • Specificity requirement: Criteria must be concrete and measurable rather than subjective or vague
  • Non-overlapping principle: Each rubric component should evaluate distinct aspects to avoid double-counting errors
  • Screenshot integration: Each rubric criterion is scored against relevant screenshots using a relevance matrix approach
  • Top-k selection: The system selects the k most relevant screenshots for each criterion to avoid "needle-in-haystack" problems
  • Failure mode mitigation: Well-structured rubrics reduce phantom criteria, cascading errors, and confirmation bias
  • False positive reduction: False positive rates fall from 30%+ with less structured approaches to 1-8%
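
The screenshot integration and top-k selection steps listed above can be sketched as below. The matrix layout (one row per criterion, one column per screenshot), the function name, and the scores are assumptions made for illustration. The source does not specify how relevance scores are produced.

```python
def top_k_screenshots(relevance, k):
    """Pick the k highest-scoring screenshots for each criterion.

    relevance[i][j] is the relevance of screenshot j to criterion i.
    Returns, per criterion, the chosen screenshot indices in trajectory order.
    """
    selected = []
    for row in relevance:
        # Rank screenshot indices by descending relevance score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        # Keep the top k, then restore chronological order for the scorer.
        selected.append(sorted(ranked[:k]))
    return selected


# Made-up scores: 2 criteria (rows) against 4 screenshots (columns).
relevance = [
    [0.1, 0.9, 0.3, 0.2],
    [0.0, 0.2, 0.8, 0.7],
]
print(top_k_screenshots(relevance, k=2))  # [[1, 2], [2, 3]]
```

Scoring each criterion against only its few most relevant screenshots keeps the decisive evidence from being buried in a long trajectory.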

Relationships

  • Trajectory Verification — rubrics provide the structured criteria for evaluating agent execution sequences
  • Process vs Outcome Rewards — rubrics must separately define criteria for execution quality and goal achievement
  • Screenshot Analysis — rubric criteria are matched against visual evidence through relevance scoring
  • Human-AI Agreement — properly generated rubrics are essential for achieving human-level evaluator alignment
  • Hallucination Detection — specific rubric criteria enable systematic identification of agent claims contradicted by evidence
  • Computer Use Agents — rubrics provide the evaluation framework for assessing autonomous computer operation
  • Error Taxonomy — rubric generation must account for systematic classification of failure modes

Sources