Inter-annotator Agreement

Summary: Inter-annotator agreement measures how consistently human evaluators label the same data, typically quantified with Cohen's kappa (κ). It serves as a benchmark for the quality of AI system evaluation: κ ≈ 0.7 is the level of agreement reported between human annotators on complex tasks, so an AI evaluator that reaches it against human labels agrees with humans about as well as they agree with each other.

Overview

Inter-annotator agreement quantifies how consistently multiple human evaluators assess the same data. It is essential for establishing ground truth in machine learning evaluation, particularly for subjective tasks where perfect agreement is unlikely. Cohen's kappa coefficient, which is computed for a pair of annotators, discounts the agreement that would be expected by chance, making it a more reliable measure than raw percentage agreement.
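
The chance correction can be written out explicitly. The formula below is the standard two-annotator definition rather than something quoted from the source, and the symbols are introduced here for illustration: p_o is the observed proportion of items on which the annotators agree, and p_e is the agreement expected if each annotator labeled independently according to their own label frequencies, with p_{1k} and p_{2k} the fractions of items that annotators 1 and 2 assigned to category k.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_e = \sum_{k} p_{1k}\, p_{2k}
```

As a worked example with made-up numbers: suppose two annotators agree on 85 of 100 pass/fail items and mark 60% and 55% of the items as pass, respectively. Then p_e = 0.60 × 0.55 + 0.40 × 0.45 = 0.51, and κ = (0.85 - 0.51) / (1 - 0.51) ≈ 0.69.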

The metric ranges from -1 to 1, where:

κ = 1 indicates perfect agreement
κ = 0 indicates agreement at chance level
κ < 0 indicates agreement worse than chance

In practice, κ ≈ 0.7 falls in the substantial-agreement band and is roughly the level at which human annotators agree with each other on complex evaluation tasks.

Key Details

Cohen's Kappa Calculation:

Adjusts for chance agreement between annotators
More robust than simple percentage agreement
Standard metric for validating annotation quality
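
The chance adjustment listed above can be sketched in a few lines of Python. This is a minimal sketch of the standard two-annotator formula, κ = (p_o - p_e) / (1 - p_e), not code from the source; the function name `cohens_kappa` and the pass/fail labels are hypothetical. The example data is deliberately skewed to show why raw percentage agreement can mislead.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout: kappa is 0/0.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Made-up labels: 90 pass/pass, 5 pass/fail, 5 fail/pass, 0 fail/fail.
annotator_1 = ["pass"] * 95 + ["fail"] * 5
annotator_2 = ["pass"] * 90 + ["fail"] * 5 + ["pass"] * 5

raw_agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / 100
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.3f}")
```

With these labels the annotators agree on 90% of items, yet κ is slightly negative (about -0.05): when both annotators mark 95% of items as pass, the agreement expected by chance alone is 90.5%.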

Benchmark Significance:

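AI systems achieving κ ≈ 0.7 against human labels agree with humans about as well as humans agree with each other, which is the sense in which they show human-level evaluation capability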
Critical for validating Trajectory Verification systems
Used to establish reliability of Rubric Design frameworks
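
One way to read this benchmark is as a comparison of two kappa values: the human-human baseline and the AI-human score. The following is a minimal sketch under stated assumptions, not the procedure used in the source: the labels are binary success/failure judgments, and the function names, the data, and the 0.05 tolerance are all hypothetical.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences.

    Assumes the annotators do not both use a single identical label
    throughout, since chance agreement of 1 would divide by zero.
    """
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def compare_to_human_baseline(human_1, human_2, ai, tolerance=0.05):
    """Return (human-human kappa, mean AI-human kappa, within-baseline flag).

    `tolerance` is an illustrative margin, not a value from the source.
    """
    baseline = cohens_kappa(human_1, human_2)
    # Average the AI's agreement with each human annotator.
    ai_human = (cohens_kappa(ai, human_1) + cohens_kappa(ai, human_2)) / 2
    return baseline, ai_human, ai_human >= baseline - tolerance


# Made-up judgments for ten trajectories (1 = success, 0 = failure).
human_1 = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
human_2 = [1, 1, 0, 0, 0, 1, 0, 1, 1, 0]
ai_verifier = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]

baseline, ai_human, matches = compare_to_human_baseline(human_1, human_2, ai_verifier)
print(f"human-human = {baseline:.2f}, AI-human = {ai_human:.2f}, matches = {matches}")
```

With these made-up labels the human-human baseline is κ = 0.6 and the AI's mean agreement with the two humans is κ = 0.8, so the check passes. Ten items is far too few for a real comparison; the sample only illustrates the mechanics.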

Practical Applications:

Quality control for training data annotation
Validation of AI system evaluation metrics
Establishing ground truth for complex tasks like Computer Use Agents assessment

Agreement Thresholds (conventional Landis and Koch bands):

κ < 0: Poor agreement
κ = 0.00-0.20: Slight agreement
κ = 0.21-0.40: Fair agreement
κ = 0.41-0.60: Moderate agreement
κ = 0.61-0.80: Substantial agreement
κ = 0.81-1.00: Almost perfect agreement

Relationships

Trajectory Verification — inter-annotator agreement validates the reliability of human judgments used as ground truth for agent evaluation
Rubric Design — structured evaluation criteria help improve inter-annotator agreement by reducing subjective interpretation
Computer Use Agents — an AI verifier that reaches κ ≈ 0.7 against human labels matches the level at which human annotators agree with each other when evaluating agents
False Positive Rate — consistent human annotation is required to accurately measure AI system error rates
Human-AI Agreement — inter-annotator agreement serves as the benchmark for measuring how well AI systems match human judgment

Sources

sources/the-art-of-building-verifiers-for-computer-use-agents — demonstrated Universal Verifier achieving κ ≈ 0.7 with humans, matching inter-annotator agreement levels for complex agent evaluation tasks