Cohen's Kappa

Summary: Cohen's Kappa (κ) is a statistical measure of inter-rater agreement that corrects for chance agreement, commonly used to evaluate the reliability of classification systems and human annotators. In verifier systems, κ ≈ 0.7 against human labels is comparable to the agreement typically seen between human annotators.

Overview

Cohen's Kappa measures the degree of agreement between two raters while correcting for the agreement that would occur by chance alone. The metric ranges from -1 to 1, where:

κ = 1 indicates perfect agreement
κ = 0 indicates agreement no better than chance
κ < 0 indicates agreement worse than chance

The formula combines observed and expected agreement: κ = (Pₒ - Pₑ) / (1 - Pₑ), where Pₒ is the proportion of items on which the two raters agree and Pₑ is the proportion of agreement expected by chance, computed from each rater's marginal label frequencies.
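The formula can be sketched in a few lines of Python. This is a minimal illustration, not code from the source: the function name `cohen_kappa` and the example labels are made up, and the chance term is computed from each rater's label counts.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, multiply the two raters' marginal
    # frequencies, then sum over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label for every item; kappa is 0/0.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels: 1 = "task succeeded", 0 = "task failed".
human = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
verifier = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

kappa = cohen_kappa(human, verifier)
print(f"kappa = {kappa:.2f}")  # observed 0.80, chance 0.50 -> kappa = 0.60
```

Here the raters agree on 8 of 10 items, but because each rater labels half the items as successes, half of that agreement is expected by chance, which pulls κ down to 0.6.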

In the context of Computer Use Agents and Trajectory Verification, Cohen's Kappa serves as a benchmark for measuring how well automated verifiers align with human judgment. The Universal Verifier system demonstrates this by achieving κ ≈ 0.7 with human annotators, matching typical Inter-annotator Agreement levels between human evaluators.

Key Details

Interpretation thresholds (rule-of-thumb bands, not strict cutoffs): κ > 0.8 (excellent), 0.6-0.8 (substantial), 0.4-0.6 (moderate), 0.2-0.4 (fair), < 0.2 (poor)
Chance correction: Unlike simple accuracy, Kappa accounts for the probability that raters agree by random chance
Universal Verifier performance: Achieved κ ≈ 0.7 against human annotators while reducing the False Positive Rate to 1-8%
Benchmark application: Used in CUAVerifierBench to measure verifier quality against human gold standards
Multi-class capability: Can handle multiple categories beyond binary agreement/disagreement
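The chance-correction and multi-class points above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the helper names, the "pass"/"fail" labels, and the three-category scheme are hypothetical and do not come from the source.

```python
from collections import Counter


def accuracy(rater_a, rater_b):
    """Raw agreement: fraction of items with identical labels."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa; works for any number of categories."""
    n = len(rater_a)
    p_o = accuracy(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Imbalanced binary case: a rater that always answers "fail" still matches
# the human on 90% of items, yet carries no information about the labels.
human = ["fail"] * 90 + ["pass"] * 10
always_fail = ["fail"] * 100
acc_binary = accuracy(human, always_fail)
kappa_binary = cohen_kappa(human, always_fail)
print(f"accuracy = {acc_binary:.2f}, kappa = {kappa_binary:.2f}")

# Three-category case (hypothetical labels): the same formula applies.
rater_1 = ["success", "partial", "failure", "success", "partial", "failure"]
rater_2 = ["success", "partial", "failure", "success", "failure", "partial"]
kappa_multi = cohen_kappa(rater_1, rater_2)
print(f"three-class kappa = {kappa_multi:.2f}")
```

In the binary case accuracy is 0.90 while κ is 0, because the expected chance agreement is also 0.90. In the three-class case the raters agree on 4 of 6 items against a chance level of 1/3, giving κ = 0.5.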

Relationships

Inter-annotator Agreement — Cohen's Kappa is a standard metric for measuring consistency between two human evaluators
Trajectory Verification — Kappa validates how well automated verifiers match human judgment on agent performance
Computer Use Agents — Essential for evaluating whether verifier systems can reliably assess agent trajectories
False Positive Rate — While FPR measures one type of error, Kappa provides overall agreement quality
Human-AI Agreement — Quantifies the alignment between human annotators and AI verification systems
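The difference between False Positive Rate and Kappa can be made concrete with a 2x2 confusion matrix. The sketch below is an illustration under stated assumptions: the function name `fpr_and_kappa` is hypothetical, human labels are treated as the gold standard, "positive" means the verifier marked a trajectory as successful, and the counts are invented rather than taken from the source.

```python
def fpr_and_kappa(tp, fp, fn, tn):
    """Return (false positive rate, Cohen's kappa) from 2x2 confusion counts.

    tp: verifier success, human success    fp: verifier success, human failure
    fn: verifier failure, human success    tn: verifier failure, human failure
    """
    n = tp + fp + fn + tn

    # FPR only looks at items the human marked as failures.
    fpr = fp / (fp + tn)

    # Kappa uses all four cells: observed agreement vs. chance agreement
    # derived from the row and column totals.
    p_o = (tp + tn) / n
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (p_o - p_e) / (1 - p_e)
    return fpr, kappa


# Two invented verifiers with the same FPR but different overall agreement.
fpr_a, kappa_a = fpr_and_kappa(tp=40, fp=5, fn=10, tn=45)
fpr_b, kappa_b = fpr_and_kappa(tp=10, fp=5, fn=40, tn=45)
print(f"A: FPR = {fpr_a:.2f}, kappa = {kappa_a:.2f}")
print(f"B: FPR = {fpr_b:.2f}, kappa = {kappa_b:.2f}")
```

Both invented verifiers have an FPR of 0.10, but verifier B misses most true successes, so its κ is about 0.1 while verifier A's is about 0.7. FPR alone would not distinguish them.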

Sources

sources/the-art-of-building-verifiers-for-computer-use-agents — Demonstrates Cohen's Kappa as a validation metric for the Universal Verifier, which achieves human-level agreement in agent trajectory evaluation