Error Taxonomy

Summary: Systematic classification of agent failure modes that separates process errors from outcome errors and controllable failures from uncontrollable ones, to improve evaluation and debugging. Essential for building robust verifiers that assess agent performance accurately across diverse failure scenarios and reduce false positive rates in trajectory evaluation.

Overview

Error taxonomy provides a structured framework for categorizing the ways Computer Use Agents can fail during task execution. It enables more precise evaluation along two distinctions: failures that reflect agent capability versus failures caused by environmental constraints, and problems with execution quality versus failure to achieve the goal.

The taxonomy serves as a foundation for Trajectory Verification systems, allowing verifiers to apply evaluation criteria suited to the type of failure encountered. Categorizing errors systematically also lets developers spot recurring patterns in agent behavior and target improvements. Microsoft Research's Universal Verifier illustrates the payoff: it reaches Cohen's κ ≈ 0.7 agreement with human annotators and a False Positive Rate of 1-8%, compared with 45%+ for WebVoyager and 22%+ for WebJudge.
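
Both headline metrics can be computed from paired verifier and human verdicts. The sketch below is illustrative only: the function names and the toy labels are assumptions rather than data from the source, and a "positive" is taken to mean a trajectory judged successful.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def false_positive_rate(verifier, human):
    """Share of human-judged failures that the verifier marked as success."""
    negatives = [v for v, h in zip(verifier, human) if not h]
    if not negatives:
        return 0.0
    return sum(negatives) / len(negatives)


# Toy labels (True = success), made up for illustration.
human = [True, True, False, False, False, True, False, False]
verifier = [True, True, True, False, False, True, False, False]

print(cohens_kappa(verifier, human))         # 0.75
print(false_positive_rate(verifier, human))  # 0.2
```

Here the verifier disagrees with the human on one of five failed trajectories, which gives a false positive rate of 0.2 even though raw agreement is 7 out of 8.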

Key Details

Primary Error Categories:

Process errors — failures in execution quality regardless of outcome (e.g., inefficient navigation, incorrect intermediate steps, poor interaction patterns)
Outcome errors — failures to achieve the stated goal despite proper execution methodology
Controllable failures — errors within the agent's control (e.g., misreading instructions, logical mistakes, poor decision-making)
Uncontrollable failures — errors due to environmental factors (e.g., website timeouts, unavailable products, external service disruptions)
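
One way to read the four primary categories is as two independent axes, process versus outcome and controllable versus uncontrollable, so that each observed failure carries one value from each. The sketch below encodes that reading; the class names and the rule that only controllable failures are charged to the agent are assumptions made for illustration, not definitions from the source.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    PROCESS = "process"    # execution quality
    OUTCOME = "outcome"    # goal achievement


class Controllability(Enum):
    CONTROLLABLE = "controllable"      # within the agent's control
    UNCONTROLLABLE = "uncontrollable"  # caused by the environment


@dataclass(frozen=True)
class ErrorLabel:
    dimension: Dimension
    controllability: Controllability
    note: str = ""

    def counts_against_agent(self) -> bool:
        # Assumption: environmental failures are not charged to the agent.
        return self.controllability is Controllability.CONTROLLABLE


labels = [
    ErrorLabel(Dimension.PROCESS, Controllability.CONTROLLABLE, "inefficient navigation"),
    ErrorLabel(Dimension.OUTCOME, Controllability.UNCONTROLLABLE, "product unavailable"),
    ErrorLabel(Dimension.OUTCOME, Controllability.CONTROLLABLE, "misread instructions"),
]

charged = [label.note for label in labels if label.counts_against_agent()]
print(charged)  # ['inefficient navigation', 'misread instructions']
```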

Hallucination Categories:

Action hallucinations — claiming to perform actions that weren't actually taken in the trajectory
State hallucinations — misreporting the current state of the interface or environment
Result fabrications — inventing outcomes or data not supported by screenshot evidence
Contradiction errors — providing conflicting information across different time steps
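
Two of these categories lend themselves to simple mechanical checks once the trajectory is available as structured data. The following is a minimal sketch under assumed representations: actions are plain strings, and agent reports are (step, key, value) tuples about facts that should not change during the task. Neither format comes from the source.

```python
def find_action_hallucinations(claimed, executed):
    """Return claimed actions with no matching entry in the executed log."""
    executed_set = set(executed)
    return [action for action in claimed if action not in executed_set]


def find_contradictions(reports):
    """Return keys whose reported value differs from the first value reported.

    `reports` is a list of (step, key, value) tuples in chronological order.
    """
    first_seen = {}
    conflicts = {}
    for step, key, value in reports:
        if key not in first_seen:
            first_seen[key] = (step, value)
        elif first_seen[key][1] != value:
            # Record the original report once, then every conflicting one.
            conflicts.setdefault(key, [first_seen[key]]).append((step, value))
    return conflicts


claimed = ["click:add_to_cart", "type:search=milk", "click:checkout"]
executed = ["type:search=milk", "click:add_to_cart"]
print(find_action_hallucinations(claimed, executed))
# ['click:checkout']

reports = [(1, "price", "$4.99"), (3, "price", "$4.99"), (5, "price", "$3.49")]
print(find_contradictions(reports))
# {'price': [(1, '$4.99'), (5, '$3.49')]}
```

State hallucinations and result fabrications need the screenshots themselves, so they are not covered by these string-level checks.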

Context Management Errors:

Screenshot misinterpretation — incorrectly parsing visual information or interface elements
Temporal confusion — mixing up information from different time steps in the trajectory
Relevance filtering failures — focusing on irrelevant visual elements while missing important ones
Truncation artifacts — errors caused by incomplete context due to screenshot limits
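
Truncation and relevance errors both stem from which screenshots reach the evaluator. As a hedged illustration, the sketch below selects screenshots under a fixed limit using a relevance matrix, where `relevance[i][j]` is an assumed score for how relevant screenshot `i` is to rubric criterion `j`. The scoring source, the function name, and the selection rule are all hypothetical.

```python
def select_screenshots(relevance, limit):
    """Pick up to `limit` screenshot indices, returned in chronological order."""
    n = len(relevance)
    if limit <= 0 or n == 0 or not relevance[0]:
        return []
    chosen = []
    # First pass: keep the most relevant screenshot for each criterion,
    # so no criterion is left without evidence while budget remains.
    for j in range(len(relevance[0])):
        best_i = max(range(n), key=lambda i: relevance[i][j])
        if best_i not in chosen and len(chosen) < limit:
            chosen.append(best_i)
    # Second pass: fill any remaining budget by peak relevance.
    remaining = sorted(
        (i for i in range(n) if i not in chosen),
        key=lambda i: -max(relevance[i]),
    )
    chosen.extend(remaining[: limit - len(chosen)])
    # Chronological order helps avoid mixing up time steps.
    return sorted(chosen)


relevance = [
    [0.1, 0.0],  # screenshot 0
    [0.9, 0.2],  # screenshot 1
    [0.3, 0.1],  # screenshot 2
    [0.2, 0.8],  # screenshot 3
    [0.6, 0.7],  # screenshot 4
]
print(select_screenshots(relevance, 3))  # [1, 3, 4]
```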

Conditional Logic Errors:

Prerequisite failures — not checking if task conditions are met before proceeding
Fallback handling errors — improper response when the primary approach isn't viable (e.g., "buy organic if available, else non-organic")
Adaptation errors — failing to modify strategy when environmental conditions change
Conditional criteria misapplication — incorrectly applying rubric criteria when task conditions aren't satisfied
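
Conditional criteria can be modeled by attaching an applicability test to each rubric item, so a criterion is scored only when its condition holds. The sketch below uses the "buy organic if available, else non-organic" example; the `Criterion` class, the `facts` dictionary, and the "n/a" verdict are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Criterion:
    name: str
    check: Callable[[Dict], bool]
    # None means the criterion always applies.
    applies: Optional[Callable[[Dict], bool]] = None


def score_rubric(criteria: List[Criterion], facts: Dict) -> Dict[str, str]:
    """Score each criterion, marking it 'n/a' when its condition is not met."""
    results = {}
    for c in criteria:
        if c.applies is not None and not c.applies(facts):
            results[c.name] = "n/a"
        else:
            results[c.name] = "pass" if c.check(facts) else "fail"
    return results


criteria = [
    Criterion(
        "bought_organic",
        check=lambda f: f["purchased"] == "organic",
        applies=lambda f: f["organic_available"],
    ),
    Criterion(
        "bought_fallback",
        check=lambda f: f["purchased"] == "non-organic",
        applies=lambda f: not f["organic_available"],
    ),
    Criterion("order_placed", check=lambda f: f["purchased"] is not None),
]

facts = {"organic_available": False, "purchased": "non-organic"}
print(score_rubric(criteria, facts))
# {'bought_organic': 'n/a', 'bought_fallback': 'pass', 'order_placed': 'pass'}
```

Scoring "bought_organic" as a failure here would be a conditional criteria misapplication, since organic was never available.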

Structured Evaluation Benefits:

Non-overlapping categories — specific, mutually exclusive error types prevent double-counting
Rubric adaptation — evaluation criteria adjust based on error type and environmental factors
Two-pass verification — comparing agent responses with and without visual context to detect fabrications
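
The two-pass idea can be sketched as judging each agent claim twice, once on text alone and once with screenshots, and flagging claims that pass only without visual evidence. This is an interpretation of the description above, not the source's implementation. The `judge` callable stands in for a multimodal LLM call, and `stub_judge` is a toy substitute that does substring matching on assumed screenshot text.

```python
def two_pass_flags(claims, judge, screenshots):
    """Flag claims accepted on text alone but rejected once screenshots are shown."""
    flagged = []
    for claim in claims:
        text_only = judge(claim, None)            # pass 1: no visual context
        with_visuals = judge(claim, screenshots)  # pass 2: with visual context
        if text_only and not with_visuals:
            flagged.append(claim)
    return flagged


def stub_judge(claim, screenshots):
    # Hypothetical stand-in for a multimodal model.
    if screenshots is None:
        return True  # with no evidence to contradict it, the claim reads as plausible
    return any(claim in visible_text for visible_text in screenshots)


screenshots = ["Cart: 1 item Order total $12.40"]
claims = ["Order total $12.40", "Order confirmed"]
print(two_pass_flags(claims, stub_judge, screenshots))
# ['Order confirmed']
```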

Relationships

Trajectory Verification — uses error taxonomy to apply appropriate evaluation criteria and achieve human-level agreement
Process vs Outcome Rewards — maps directly to process/outcome error classification, enabling separate evaluation streams
Hallucination Detection — targets specific categories of fabrication errors through two-pass scoring methodology
Rubric Design — incorporates error taxonomy to create comprehensive, non-overlapping evaluation frameworks
Computer Use Agents — subject to systematic categorization of their failure modes for improved debugging
False Positive Rate — dramatically reduced through precise error classification and appropriate handling of different failure types
Screenshot Context Management — addresses visual interpretation error categories through relevance matrices and context selection
Inter-annotator Agreement — improved when evaluators use consistent error classification (Cohen's κ ≈ 0.7)
WebVoyager — previous system with high false positive rates (45%+) due to inadequate error taxonomy
WebJudge — previous system with moderate false positive rates (22%+) lacking systematic error classification
Multimodal LLMs — benefit from structured error taxonomy for better trajectory understanding and evaluation

Sources

sources/the-art-of-building-verifiers-for-computer-use-agents — provides a comprehensive framework for categorizing agent failures, the distinction between process and outcome errors, and a hallucination detection taxonomy; demonstrates a significant reduction in false positive rate through systematic error classification