Error Taxonomy
Summary: Systematic classification of agent failure modes that distinguishes between different types of errors to improve evaluation and debugging. Essential for building robust verifiers that can accurately assess agent performance across diverse failure scenarios and reduce false positive rates in trajectory evaluation.
Overview
Error taxonomy provides a structured framework for categorizing the various ways Computer Use Agents can fail during task execution. This classification system enables more precise evaluation by distinguishing between failures that reflect agent capability versus environmental constraints, and between execution quality versus goal achievement.
The taxonomy serves as a foundation for Trajectory Verification systems, allowing verifiers to apply evaluation criteria that match the type of failure encountered. By categorizing errors systematically, developers can identify patterns in agent behavior and target improvements more effectively. Microsoft Research's Universal Verifier illustrates the payoff: it reaches Cohen's κ ≈ 0.7 agreement with human annotators and a False Positive Rate of 1-8%, compared with 45%+ for WebVoyager and 22%+ for WebJudge.
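The two headline metrics can be computed from paired success/failure labels. The sketch below is a minimal illustration, assuming binary labels where True means "task succeeded"; the function names and the sample labels are hypothetical and do not come from the source.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

def false_positive_rate(human, verifier):
    """Share of human-labelled failures that the verifier marked as success."""
    verdicts_on_failures = [v for h, v in zip(human, verifier) if not h]
    if not verdicts_on_failures:
        raise ValueError("no human-labelled failures to evaluate")
    return sum(verdicts_on_failures) / len(verdicts_on_failures)

# Hypothetical labels for eight trajectories (True = success).
human = [True, True, False, False, False, True, False, False]
verifier = [True, True, False, True, False, True, False, False]

print(cohens_kappa(human, verifier))         # 0.75
print(false_positive_rate(human, verifier))  # 0.2
```

Here the verifier accepts one of the five trajectories that humans marked as failures, so the false positive rate is 1/5, while kappa corrects the 7/8 raw agreement for the 0.5 agreement expected by chance.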
Key Details
Primary Error Categories (two independent axes: process vs. outcome, and controllable vs. uncontrollable):
- Process errors — failures in execution quality regardless of outcome (e.g., inefficient navigation, incorrect intermediate steps, poor interaction patterns)
- Outcome errors — failures to achieve the stated goal despite proper execution methodology
- Controllable failures — errors within the agent's control (e.g., misreading instructions, logical mistakes, poor decision-making)
- Uncontrollable failures — errors due to environmental factors (e.g., website timeouts, unavailable products, external service disruptions)
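One way to read the four categories above is as two independent axes, so that a single failure carries one label from each. The sketch below encodes that reading. The class names, the `counts_against_agent` policy, and the example failures are assumptions made for illustration, not definitions from the source.

```python
from dataclasses import dataclass
from enum import Enum

class Stage(Enum):
    PROCESS = "process"    # execution quality
    OUTCOME = "outcome"    # goal achievement

class Control(Enum):
    CONTROLLABLE = "controllable"      # within the agent's control
    UNCONTROLLABLE = "uncontrollable"  # caused by the environment

@dataclass(frozen=True)
class ErrorLabel:
    stage: Stage
    control: Control
    note: str = ""

def counts_against_agent(label: ErrorLabel) -> bool:
    """Assumed policy: only controllable errors lower the agent's score."""
    return label.control is Control.CONTROLLABLE

# Hypothetical failures, one per axis combination of interest.
timeout = ErrorLabel(Stage.OUTCOME, Control.UNCONTROLLABLE, "checkout page timed out")
misread = ErrorLabel(Stage.PROCESS, Control.CONTROLLABLE, "misread the requested quantity")

print(counts_against_agent(timeout))  # False
print(counts_against_agent(misread))  # True
```

Keeping the axes separate lets a verifier report process and outcome scores independently while still excusing failures the agent could not have prevented.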
Hallucination Categories:
- Action hallucinations — claiming to perform actions that weren't actually taken in the trajectory
- State hallucinations — misreporting the current state of the interface or environment
- Result fabrications — inventing outcomes or data not supported by screenshot evidence
- Contradiction errors — providing conflicting information across different time steps
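Action hallucinations are the easiest of these categories to check mechanically, because the claim can be compared against the logged trajectory. The following is a minimal sketch, assuming actions are recorded as normalized strings such as "click:checkout"; that format and the function name are hypothetical. A real verifier would need fuzzier matching than exact string equality.

```python
def find_action_hallucinations(claimed_actions, logged_actions):
    """Return claimed actions that never appear in the logged trajectory."""
    logged = {action.strip().lower() for action in logged_actions}
    return [
        claim for claim in claimed_actions
        if claim.strip().lower() not in logged
    ]

# Hypothetical trajectory log and the actions the agent says it took.
logged = ["click:search_box", "type:organic milk", "click:add_to_cart"]
claimed = ["click:add_to_cart", "click:checkout"]

print(find_action_hallucinations(claimed, logged))  # ['click:checkout']
```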
Context Management Errors:
- Screenshot misinterpretation — incorrectly parsing visual information or interface elements
- Temporal confusion — mixing up information from different time steps in the trajectory
- Relevance filtering failures — focusing on irrelevant visual elements while missing important ones
- Truncation artifacts — errors caused by incomplete context due to screenshot limits
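Truncation artifacts are easier to diagnose when the context selector records what it left out. The sketch below keeps the highest-scoring screenshots under a fixed budget and returns the dropped steps alongside them. The per-step relevance scores and the budget are hypothetical inputs; how such scores are produced is outside this example.

```python
def select_screenshots(relevance, budget):
    """Keep the `budget` highest-scoring steps and report the rest as dropped.

    `relevance` maps a step index to an assumed relevance score.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    ranked = sorted(relevance, key=lambda step: relevance[step], reverse=True)
    kept = sorted(ranked[:budget])      # restore chronological order
    dropped = sorted(ranked[budget:])
    return kept, dropped

# Hypothetical relevance scores for a five-step trajectory.
relevance = {0: 0.1, 1: 0.9, 2: 0.4, 3: 0.8, 4: 0.2}
kept, dropped = select_screenshots(relevance, budget=2)

print(kept)     # [1, 3]
print(dropped)  # [0, 2, 4]
```

If a verifier later misjudges something that happened at step 2, the `dropped` list shows that the step was never in context, which points to truncation and not to misinterpretation.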
Conditional Logic Errors:
- Prerequisite failures — not checking if task conditions are met before proceeding
- Fallback handling errors — responding improperly when the primary approach isn't viable (e.g., mishandling a task such as "buy organic if available, else non-organic")
- Adaptation errors — failing to modify strategy when environmental conditions change
- Conditional criteria misapplication — incorrectly applying rubric criteria when task conditions aren't satisfied
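Conditional criteria misapplication can be avoided by giving each rubric criterion an explicit applicability check, so that a criterion whose condition is not met is skipped and not scored as a failure. The sketch below uses the organic/non-organic example. The `Criterion` type, the `facts` dictionary, and its keys are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Criterion:
    name: str
    applies: Callable[[dict], bool]  # is the task condition met?
    passed: Callable[[dict], bool]   # did the agent satisfy the criterion?

def score(criterion: Criterion, facts: dict) -> Optional[bool]:
    """Return None when the criterion does not apply, else pass/fail."""
    if not criterion.applies(facts):
        return None
    return criterion.passed(facts)

buy_organic = Criterion(
    name="bought organic",
    applies=lambda f: f["organic_available"],
    passed=lambda f: f["purchased"] == "organic",
)
buy_fallback = Criterion(
    name="bought non-organic fallback",
    applies=lambda f: not f["organic_available"],
    passed=lambda f: f["purchased"] == "non-organic",
)

# Hypothetical environment facts: organic was out of stock.
facts = {"organic_available": False, "purchased": "non-organic"}
results = {c.name: score(c, facts) for c in (buy_organic, buy_fallback)}

print(results)  # {'bought organic': None, 'bought non-organic fallback': True}
```

Returning None for an inapplicable criterion keeps it out of the pass/fail tally, so the agent is not penalized for skipping an option the environment never offered.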
Structured Evaluation Benefits:
- Non-overlapping categories — specific, mutually exclusive error types prevent double-counting
- Rubric adaptation — evaluation criteria adjust based on error type and environmental factors
- Two-pass verification — comparing agent responses with and without visual context to detect fabrications
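The two-pass idea can be sketched as a comparison of per-claim verdicts from two runs of a verifier. This is one possible reading, assuming the first pass judges each claim from the agent's text alone and the second pass judges it with screenshots; the function name, the verdict dictionaries, and the claims are all hypothetical.

```python
def flag_fabrications(text_only, with_screenshots):
    """Return claims accepted on text alone but not supported once screenshots are shown.

    Both arguments map a claim string to a boolean verdict. A claim missing
    from the visual pass is treated as unsupported.
    """
    return sorted(
        claim
        for claim, accepted in text_only.items()
        if accepted and not with_screenshots.get(claim, False)
    )

# Hypothetical verdicts from the two passes.
text_only = {"order placed": True, "price was $4.99": True, "item in cart": True}
with_screenshots = {"order placed": False, "price was $4.99": True, "item in cart": True}

print(flag_fabrications(text_only, with_screenshots))  # ['order placed']
```

A claim that survives only when the screenshots are withheld has no visual support, which makes it a candidate result fabrication for closer review.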
Relationships
- Trajectory Verification — uses error taxonomy to apply appropriate evaluation criteria and achieve human-level agreement
- Process vs Outcome Rewards — maps directly to process/outcome error classification, enabling separate evaluation streams
- Hallucination Detection — targets specific categories of fabrication errors through two-pass scoring methodology
- Rubric Design — incorporates error taxonomy to create comprehensive, non-overlapping evaluation frameworks
- Computer Use Agents — subject to systematic categorization of their failure modes for improved debugging
- False Positive Rate — dramatically reduced through precise error classification and appropriate handling of different failure types
- Screenshot Context Management — addresses visual interpretation error categories through relevance matrices and context selection
- Inter-annotator Agreement — improved when evaluators use consistent error classification (Cohen's κ ≈ 0.7)
- WebVoyager — previous system with high false positive rates (45%+) due to inadequate error taxonomy
- WebJudge — previous system with moderate false positive rates (22%+) lacking systematic error classification
- Multimodal LLMs — benefit from structured error taxonomy for better trajectory understanding and evaluation
Sources
- sources/the-art-of-building-verifiers-for-computer-use-agents — provided a comprehensive framework for categorizing agent failures, the distinction between process and outcome errors, a hallucination detection taxonomy, and evidence of a significant false positive rate reduction through systematic error classification