Cohen's Kappa
Summary: Cohen's Kappa (κ) is a statistical measure of inter-rater agreement that corrects for the agreement expected by chance. It is commonly used to evaluate the reliability of classification systems and human annotators. In verifier systems, κ ≈ 0.7 against human labels is on par with the agreement typically seen between human annotators.
Overview
Cohen's Kappa measures the degree of agreement between two raters while correcting for the agreement that would occur by chance alone. The metric ranges from -1 to 1, where:
- κ = 1 indicates perfect agreement
- κ = 0 indicates agreement no better than chance
- κ < 0 indicates agreement worse than chance
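The formula combines observed and expected agreement: κ = (P₀ - Pₑ) / (1 - Pₑ), where P₀ is the observed agreement (the fraction of items on which both raters chose the same label) and Pₑ is the agreement expected by chance. Pₑ is computed from each rater's marginal label frequencies: for each category, multiply the proportions of items that the two raters assigned to it, then sum over all categories.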
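The formula can be checked by hand on a small example. The sketch below is a minimal pure-Python illustration and is not taken from the source: the function name `cohen_kappa` and the ten pass/fail labels are invented for demonstration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)

    # Observed agreement: fraction of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement: per label, the product of the two raters' marginal
    # frequencies, summed over labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is 0/0.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for 10 items (illustrative only).
rater_1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
rater_2 = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]
kappa = cohen_kappa(rater_1, rater_2)
print(round(kappa, 3))
```

In this example the raters agree on 8 of 10 items (P₀ = 0.8). Each rater labels 6 of 10 items as pass, so Pₑ = 0.6 × 0.6 + 0.4 × 0.4 = 0.52, and κ = 0.28 / 0.48 ≈ 0.58.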
In the context of Computer Use Agents and Trajectory Verification, Cohen's Kappa measures how well automated verifiers align with human judgment. The Universal Verifier reaches κ ≈ 0.7 against human annotators, which matches typical Inter-annotator Agreement levels between human evaluators.
Key Details
- Interpretation thresholds (a common rule of thumb rather than strict cutoffs): κ > 0.8 (excellent), 0.6-0.8 (substantial), 0.4-0.6 (moderate), 0.2-0.4 (fair), < 0.2 (poor)
- Chance correction: Unlike simple accuracy, Kappa discounts the agreement that raters would reach by chance, so a rater who always picks the majority class can score high accuracy yet have κ near 0
- Universal Verifier performance: Achieved κ ≈ 0.7 against human annotators while reducing the False Positive Rate to 1-8%
- Benchmark application: Used in CUAVerifierBench to measure verifier quality against human gold standards
- Multi-class capability: Handles any number of nominal categories, not only binary agreement/disagreement; the statistic itself is defined for exactly two raters
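The chance-correction and multi-class points above can be illustrated with a confusion matrix. The following is a sketch under stated assumptions: the helper `kappa_from_confusion` and both matrices are hypothetical, chosen only to show the behavior, and do not come from the source.

```python
def kappa_from_confusion(matrix):
    """Cohen's kappa from a square confusion matrix.

    Rows are rater A's labels, columns are rater B's labels,
    and matrix[i][j] counts items labelled i by A and j by B.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    # Observed agreement: the diagonal holds items both raters labelled alike.
    p_o = sum(matrix[i][i] for i in range(k)) / n
    # Expected agreement from the row (rater A) and column (rater B) totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Imbalanced binary case (made up): rater A marks 95 of 100 items "pass",
# rater B marks every item "pass". Raw agreement is 95%, kappa is 0.
always_pass = [[95, 0],
               [5, 0]]
accuracy = (always_pass[0][0] + always_pass[1][1]) / 100
kappa_imbalanced = kappa_from_confusion(always_pass)

# Three-category case (made up): the same function applies unchanged.
three_class = [[20, 5, 0],
               [3, 30, 2],
               [1, 4, 35]]
kappa_three = kappa_from_confusion(three_class)

print(accuracy, kappa_imbalanced, round(kappa_three, 3))
```

The first matrix shows why accuracy alone can mislead: rater B carries no information about the items, and Kappa reflects that even though raw agreement is 95%. The second matrix gives κ of roughly 0.77 with 85% raw agreement.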
Relationships
- Inter-annotator Agreement — Cohen's Kappa is the primary metric for measuring consistency between human evaluators
- Trajectory Verification — Kappa validates how well automated verifiers match human judgment on agent performance
- Computer Use Agents — Essential for evaluating whether verifier systems can reliably assess agent trajectories
- False Positive Rate — While FPR measures one type of error, Kappa provides overall agreement quality
- Human-AI Agreement — Quantifies the alignment between human annotators and AI verification systems
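To make the distinction between False Positive Rate and Kappa concrete, the sketch below computes both from the same 2x2 table of verifier verdicts against human labels. It assumes the usual definition of FPR (the share of human-judged failures that the verifier marks as successes). The function name `fpr_and_kappa` and the counts are hypothetical and are not results from the source.

```python
def fpr_and_kappa(tp, fp, fn, tn):
    """Return (false positive rate, Cohen's kappa) for verifier vs. human labels.

    tp: both say success          fp: verifier success, human failure
    fn: verifier failure, human success          tn: both say failure
    """
    n = tp + fp + fn + tn
    # FPR looks only at the items the human marked as failures.
    fpr = fp / (fp + tn)
    # Kappa uses every cell: observed agreement against the agreement
    # expected from the human and verifier marginal totals.
    p_o = (tp + tn) / n
    p_e = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    kappa = (p_o - p_e) / (1 - p_e)
    return fpr, kappa


# Hypothetical counts, invented for illustration only.
fpr, kappa = fpr_and_kappa(tp=40, fp=2, fn=13, tn=45)
print(round(fpr, 3), round(kappa, 3))
```

With these made-up counts the FPR is 2/47, about 4%, and κ is about 0.70. The 13 false negatives lower κ without affecting FPR, so the two metrics answer different questions.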
Sources
- sources/the-art-of-building-verifiers-for-computer-use-agents — Demonstrates Cohen's Kappa as validation metric for Universal Verifier achieving human-level agreement in agent trajectory evaluation