Cohen's Kappa

Summary: Cohen's Kappa (κ) is a statistical measure of inter-rater agreement that corrects for chance agreement. It is commonly used to evaluate the reliability of classification systems and human annotators. For automated verifiers, κ ≈ 0.7 against human labels is comparable to the agreement typically observed between two human annotators.

Overview

Cohen's Kappa measures the degree of agreement between two raters while correcting for the agreement that would occur by chance alone. The metric ranges from -1 to 1, where:

  • κ = 1 indicates perfect agreement
  • κ = 0 indicates agreement no better than chance
  • κ < 0 indicates agreement worse than chance

The formula combines observed and expected agreement: κ = (Pₒ - Pₑ) / (1 - Pₑ), where Pₒ is the observed proportion of items on which the two raters agree and Pₑ is the proportion of agreement expected by chance, computed from each rater's marginal label frequencies.
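The formula can be sketched in a few lines of Python. This is a minimal illustration, assuming each rater's labels arrive as equal-length lists; the function name cohen_kappa and the sample data are hypothetical and not taken from any library or study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)

    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement: for each label, multiply the two raters' marginal
    # rates for that label, then sum over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label for everything.
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: 50 items, 20 both "yes", 15 both "no", 15 disagreements.
rater_a = ["yes"] * 20 + ["yes"] * 5 + ["no"] * 10 + ["no"] * 15
rater_b = ["yes"] * 20 + ["no"] * 5 + ["yes"] * 10 + ["no"] * 15

kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 3))  # observed 0.7, expected 0.5, so kappa is about 0.4
```

Here the raters agree on 70% of items, but half of that agreement would be expected by chance given their label frequencies, so κ lands at 0.4 rather than 0.7.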

In the context of Computer Use Agents and Trajectory Verification, Cohen's Kappa serves as a benchmark for measuring how well automated verifiers align with human judgment. The Universal Verifier system demonstrates this by achieving κ ≈ 0.7 with human annotators, matching typical Inter-annotator Agreement levels between human evaluators.

Key Details

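  • Interpretation thresholds: following the commonly cited Landis and Koch scale, κ > 0.8 (almost perfect), 0.6-0.8 (substantial), 0.4-0.6 (moderate), 0.2-0.4 (fair), 0-0.2 (slight), < 0 (poor); these bands are conventions rather than strict cutoffs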
  • Chance correction: Unlike simple accuracy, Kappa accounts for the probability that raters agree by random chance
  • Universal Verifier performance: Achieved κ ≈ 0.7 against human annotators while reducing False Positive Rate to 1-8%
  • Benchmark application: Used in CUAVerifierBench to measure verifier quality against human gold standards
  • Multi-class capability: Can handle multiple categories beyond binary agreement/disagreement
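
The chance-correction and multi-class points above can be illustrated with a small Python sketch that computes κ from a square confusion matrix. The helper name kappa_from_confusion and both matrices are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def kappa_from_confusion(matrix):
    """Cohen's kappa from a square confusion matrix.

    Rows are rater A's labels, columns are rater B's labels, and
    matrix[i][j] counts items that A put in class i and B put in class j.
    """
    k = len(matrix)
    if k == 0 or any(len(row) != k for row in matrix):
        raise ValueError("matrix must be square and non-empty")
    n = sum(sum(row) for row in matrix)
    if n == 0:
        raise ValueError("matrix must contain at least one item")

    # Observed agreement is the share of items on the diagonal.
    p_o = sum(matrix[i][i] for i in range(k)) / n

    # Expected agreement comes from the row and column marginals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Imbalanced binary case (hypothetical): raw agreement is 90%, yet both
# raters say "pass" 95% of the time, so chance alone predicts 90.5%.
imbalanced = [[90, 5],
              [5, 0]]
imbalanced_kappa = kappa_from_confusion(imbalanced)  # slightly below zero

# Three-class case (hypothetical): the same function handles any k.
three_class = [[30, 5, 5],
               [5, 20, 5],
               [5, 5, 20]]
three_class_kappa = kappa_from_confusion(three_class)  # about 0.545

print(round(imbalanced_kappa, 3), round(three_class_kappa, 3))
```

The imbalanced case shows why raw agreement can mislead: 90% agreement yields a κ just below zero because nearly all of it is explained by both raters favoring the same label.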

Relationships

  • Inter-annotator Agreement — Cohen's Kappa is the primary metric for measuring consistency between human evaluators
  • Trajectory Verification — Kappa validates how well automated verifiers match human judgment on agent performance
  • Computer Use Agents — Essential for evaluating whether verifier systems can reliably assess agent trajectories
  • False Positive Rate — While FPR measures one type of error, Kappa provides overall agreement quality
  • Human-AI Agreement — Quantifies the alignment between human annotators and AI verification systems
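
To make the distinction between False Positive Rate and Kappa concrete, the following Python sketch computes both from the same 2x2 table of verifier labels against human labels. The function name fpr_and_kappa and all counts are hypothetical; they are not results from the Universal Verifier or CUAVerifierBench.

```python
def fpr_and_kappa(tp, fn, fp, tn):
    """Return (false positive rate, Cohen's kappa) for a binary verifier.

    Human labels are treated as the reference:
      tp: human success, verifier success
      fn: human success, verifier failure
      fp: human failure, verifier success
      tn: human failure, verifier failure
    """
    n = tp + fn + fp + tn
    if n == 0 or (fp + tn) == 0:
        raise ValueError("need at least one human-labeled failure")

    # FPR looks only at items the human marked as failures.
    fpr = fp / (fp + tn)

    # Kappa uses all four cells, corrected for chance agreement.
    p_o = (tp + tn) / n
    p_e = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    kappa = (p_o - p_e) / (1 - p_e)
    return fpr, kappa


# Two hypothetical verifiers scored on the same 100 trajectories
# (40 human-labeled successes, 60 human-labeled failures).
fpr_a, kappa_a = fpr_and_kappa(tp=36, fn=4, fp=3, tn=57)
fpr_b, kappa_b = fpr_and_kappa(tp=20, fn=20, fp=3, tn=57)

print(fpr_a, round(kappa_a, 3))
print(fpr_b, round(kappa_b, 3))
```

Both hypothetical verifiers have the same 5% False Positive Rate, but the second misses half of the true successes, which shows up only in its lower κ.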

Sources