Inter-annotator Agreement

Summary: Inter-annotator agreement measures how consistently human evaluators label the same data, typically quantified with Cohen's kappa (κ). It serves as a benchmark for AI evaluation quality: κ ≈ 0.7 is roughly the level of agreement that human annotators reach with one another on complex tasks.

Overview

Inter-annotator agreement quantifies how consistently multiple human evaluators assess the same data. This metric is essential for establishing ground truth in machine learning evaluation, particularly for subjective tasks where perfect agreement is unlikely. Cohen's kappa coefficient, which compares a pair of annotators, discounts the agreement the two would reach by chance given how often each of them uses every label. This makes it more reliable than raw percentage agreement, which can look high simply because one label dominates the data.
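The chance correction can be written out explicitly for two annotators. The symbols below are notation introduced here for illustration: p_o is the observed fraction of items on which the annotators agree, p_e is the agreement expected by chance, and p_{a,k} is the fraction of items that annotator a assigns to category k.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_e = \sum_{k} p_{1,k}\, p_{2,k}
```

As a worked example with made-up numbers: two annotators label 100 items, agree on 85 of them, and each assigns "positive" to half of the items. Then p_o = 0.85 and p_e = 0.5 · 0.5 + 0.5 · 0.5 = 0.5, so κ = (0.85 − 0.5) / (1 − 0.5) = 0.70. Raw agreement of 85% corresponds to κ = 0.70 once chance is removed.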

The metric ranges from -1 to 1, where:

  • κ = 1 indicates perfect agreement
  • κ = 0 indicates agreement at chance level
  • κ < 0 indicates agreement worse than chance

In practice, κ ≈ 0.7 is considered substantial agreement and represents human-level consistency for complex evaluation tasks.

Key Details

Cohen's Kappa Calculation:

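  • Compares the observed agreement with the agreement expected by chance, estimated from how often each annotator uses each label
  • More robust than simple percentage agreement, which is inflated when one label is much more common than the others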
  • Standard metric for validating annotation quality
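The calculation summarized above can be sketched in a few lines of Python. This is a minimal illustration for two annotators and nominal labels, not a reference implementation; the function name `cohen_kappa` and the sample data are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: for each category, multiply the two annotators'
    # marginal frequencies, then sum over categories.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one and the same category for every item.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical, imbalanced data: both annotators say "pass" for 95 of 100
# items, but they never agree on which items should be "fail".
annotator_1 = ["pass"] * 90 + ["fail"] * 5 + ["pass"] * 5
annotator_2 = ["pass"] * 90 + ["pass"] * 5 + ["fail"] * 5

raw_agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / 100
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.3f}")
```

In this made-up example the annotators agree on 90% of the items, yet kappa comes out slightly below zero, because two annotators who each say "pass" 95% of the time would agree about that often by chance alone.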

Benchmark Significance:

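  • An AI system that reaches κ ≈ 0.7 against human annotators agrees with them about as well as they agree with one another, which is the sense in which it demonstrates human-level evaluation capability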
  • Critical for validating Trajectory Verification systems
  • Used to establish reliability of Rubric Design frameworks
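One way to check an AI verifier against this benchmark is to compare its agreement with humans to the agreement among the humans themselves. The sketch below rests on several assumptions: the verdicts are invented, the names `humans` and `ai_verifier` are hypothetical, and averaging pairwise Cohen's kappa is just one possible convention for handling more than two annotators.

```python
from collections import Counter
from itertools import combinations
from statistics import mean


def kappa(x, y):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(x)
    p_o = sum(a == b for a, b in zip(x, y)) / n
    cx, cy = Counter(x), Counter(y)
    p_e = sum(cx[k] * cy[k] for k in cx) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts on the same 10 items (1 = success, 0 = failure).
humans = {
    "h1": [1, 1, 0, 1, 0, 0, 1, 1, 0, 1],
    "h2": [1, 1, 0, 1, 0, 1, 1, 1, 0, 1],
    "h3": [1, 0, 0, 1, 0, 0, 1, 1, 0, 1],
}
ai_verifier = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]

# Baseline: mean kappa over every pair of human annotators.
human_human = mean(kappa(humans[a], humans[b]) for a, b in combinations(humans, 2))
# Candidate: mean kappa between the AI verifier and each human.
ai_human = mean(kappa(ai_verifier, labels) for labels in humans.values())

print(f"human-human kappa = {human_human:.2f}")
print(f"AI-human kappa    = {ai_human:.2f}")
```

If the AI-human figure falls well below the human-human baseline, as it does for these invented verdicts, the verifier has not yet reached the consistency that the humans show among themselves. With only ten items such a gap would be far too noisy to act on; a real comparison needs a much larger sample.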

Practical Applications:

  • Quality control for training data annotation
  • Validation of AI system evaluation metrics
  • Establishing ground truth for complex tasks like Computer Use Agents assessment

Agreement Thresholds:

  • κ < 0.2: Poor agreement
  • 0.2 ≤ κ < 0.4: Fair agreement
  • 0.4 ≤ κ < 0.6: Moderate agreement
  • 0.6 ≤ κ ≤ 0.8: Substantial agreement
  • κ > 0.8: Almost perfect agreement
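When reporting results it can be handy to turn a kappa value into one of these labels automatically. The helper below is a hypothetical sketch. Its edge handling is an assumption: values of exactly 0.2, 0.4, and 0.6 go to the higher band, and 0.8 counts as substantial, in line with the κ < 0.2 and κ > 0.8 entries.

```python
def agreement_band(kappa: float) -> str:
    """Return the descriptive band for a kappa value."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [-1, 1]")
    if kappa > 0.8:
        return "almost perfect"
    # Lower edges in descending order; the first one kappa reaches wins.
    for lower, label in [(0.6, "substantial"), (0.4, "moderate"), (0.2, "fair")]:
        if kappa >= lower:
            return label
    return "poor"


for value in (0.1, 0.35, 0.7, 0.85):
    print(value, agreement_band(value))
```

The bands are conventions for describing results rather than strict cutoffs, so a value near an edge is best reported together with the number itself.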

Relationships

  • Trajectory Verification — inter-annotator agreement validates the reliability of human judgments used as ground truth for agent evaluation
  • Rubric Design — structured evaluation criteria help improve inter-annotator agreement by reducing subjective interpretation
  • Computer Use Agents — achieving κ ≈ 0.7 between AI verifiers and humans demonstrates human-level evaluation capability
  • False Positive Rate — consistent human annotation is required to accurately measure AI system error rates
  • Human-AI Agreement — inter-annotator agreement serves as the benchmark for measuring how well AI systems match human judgment

Sources