False Positive Rate
Summary: False Positive Rate (FPR) is the fraction of failed trajectories that a verification system incorrectly labels as successful. In computer use agent evaluation, a high FPR undermines trust by rewarding agents for poor performance, which makes it a critical metric of verifier quality.
Overview
A high False Positive Rate is a core failure mode for verifiers in computer use agent systems. A verifier with high FPR rewards agents for unsuccessful behavior, and when those rewards feed back into training, agent performance degrades over time. The metric is especially important in Trajectory Verification, where complex multi-step sequences must be assessed accurately.
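The metric itself can be sketched in a few lines, assuming the standard definition FPR = FP / (FP + TN), where a "positive" is a verifier verdict of success and ground truth comes from human labels. The function name and the sample labels are illustrative and do not come from the source.

```python
def false_positive_rate(verifier_says_success, human_says_success):
    """FPR = FP / (FP + TN), computed over trajectories humans labeled as failed."""
    if len(verifier_says_success) != len(human_says_success):
        raise ValueError("label lists must have the same length")
    fp = tn = 0
    for predicted, actual in zip(verifier_says_success, human_says_success):
        if actual:
            continue  # truly successful trajectories do not affect FPR
        if predicted:
            fp += 1   # verifier said success, humans said failure
        else:
            tn += 1   # verifier and humans both said failure
    if fp + tn == 0:
        raise ValueError("no failed trajectories; FPR is undefined")
    return fp / (fp + tn)


# Hypothetical labels for six trajectories (not real benchmark data).
verifier = [True, True, False, True, False, False]
human = [True, False, False, False, True, False]
print(false_positive_rate(verifier, human))
```

Only trajectories that humans marked as failed enter the denominator, so a verifier cannot lower its FPR simply by being evaluated on an easier mix of tasks.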
Earlier verifiers such as WebVoyager and WebJudge had false positive rates of 45%+ and 22%+ respectively, too high for reliable agent training or evaluation. These rates stemmed from weak Hallucination Detection, inconsistent Rubric Design, and poor Screenshot Context Management.
Microsoft Research's Universal Verifier shows that a large FPR reduction is possible through systematic design improvements. Their approach achieves 1-8% FPR while maintaining Inter-annotator Agreement of Cohen's κ ≈ 0.7, which suggests that a low false positive rate does not have to come at the cost of accuracy standards.
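For readers unfamiliar with the agreement statistic quoted above, here is a minimal sketch of Cohen's κ for two raters, using the standard formula κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is agreement expected by chance. The sample labels are made up for illustration and are unrelated to the reported κ ≈ 0.7.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical success (1) / failure (0) labels from two annotators.
annotator_1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
annotator_2 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Because κ corrects for chance agreement, it is a stricter measure than raw percent agreement: the two annotators here agree on 8 of 10 items, yet κ comes out noticeably lower than 0.8.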
Key Details
- Baseline performance: WebVoyager shows 45%+ FPR and WebJudge shows 22%+ FPR in trajectory evaluation
- Improved performance: Microsoft's Universal Verifier achieves 1-8% FPR through structured design principles
- Impact on training: High FPR creates corrupted reward signals that degrade agent learning over time
- Measurement context: FPR is measured alongside Inter-annotator Agreement (Cohen's κ ≈ 0.7) to validate verifier quality
- Detection methods: Two-pass scoring (with/without screenshots) helps reduce false positives by catching agent fabrications
- Design factors: Specific, non-overlapping rubrics and separation of Process vs Outcome Rewards significantly reduce FPR
- Four core principles: (1) specific, non-overlapping rubrics; (2) separate process vs outcome rewards; (3) distinguish controllable vs uncontrollable failures; (4) effective context management of all screenshots
- Benchmark validation: CUAVerifierBench provides the first standardized benchmark for measuring verifier FPR with both process and outcome labels
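The two-pass scoring idea from the list above can be sketched as follows. The source only states that trajectories are scored with and without screenshots; how the two passes are combined here (trust the screenshot-grounded pass, and flag a fabrication when only the text-only pass succeeds) is an assumption, as are the `Trajectory` type, the `Scorer` interface, and the toy scorer.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Trajectory:
    task: str
    agent_report: str            # what the agent claims it did
    screenshots: Sequence[str]   # stand-in for images; captions here for simplicity


# Hypothetical scorer interface: returns True when the judge says "success".
# In practice this would wrap a model call; the bool toggles screenshot access.
Scorer = Callable[[Trajectory, bool], bool]


def two_pass_verdict(trajectory: Trajectory, scorer: Scorer) -> dict:
    """Score once on the agent's text alone and once with visual evidence."""
    text_only = scorer(trajectory, False)
    grounded = scorer(trajectory, True)
    return {
        # Assumption: only the screenshot-grounded pass can grant success.
        "success": grounded,
        # A claim that passes on text but fails on evidence looks fabricated.
        "possible_fabrication": text_only and not grounded,
    }


def toy_scorer(trajectory: Trajectory, use_screenshots: bool) -> bool:
    """Keyword-matching placeholder for a real judge, for demonstration only."""
    claimed = "order confirmed" in trajectory.agent_report.lower()
    if not use_screenshots:
        return claimed
    seen = any("confirmation" in s.lower() for s in trajectory.screenshots)
    return claimed and seen


fabricated = Trajectory(
    task="Buy a USB cable",
    agent_report="Order confirmed.",
    screenshots=["cart page", "payment error banner"],
)
print(two_pass_verdict(fabricated, toy_scorer))
```

The disagreement between the two passes is the useful signal: a verifier that reads only the agent's report would have counted this trajectory as a success.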
Relationships
- Trajectory Verification — FPR is a key quality metric for trajectory evaluation systems
- Hallucination Detection — reducing agent fabrications directly lowers false positive rates through two-pass scoring methods
- Process vs Outcome Rewards — separating these reward types helps prevent false positives when the environment blocks success for reasons beyond the agent's control
- Inter-annotator Agreement — high human agreement validates that low FPR reflects genuine improvement, not just different standards
- Computer Use Agents — FPR directly impacts agent training quality and deployment reliability
- Rubric Design — structured, specific criteria reduce ambiguity that leads to false positives
- Screenshot Context Management — effective selection of relevant visual evidence prevents misinterpretation that causes false positives
- WebVoyager — baseline system with 45%+ FPR demonstrating the severity of false positive problems
- WebJudge — improved but still problematic system with 22%+ FPR
- CUAVerifierBench — benchmark specifically designed to measure and improve verifier FPR
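The link between Process vs Outcome Rewards and false positives can be illustrated with a small sketch. It assumes a verdict that records the two rewards separately along with a failure cause, so that a well-executed but environment-blocked attempt earns process credit without being marked as an outcome success. The type names, field names, and score ranges are hypothetical and do not describe the Universal Verifier's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCause(Enum):
    NONE = "none"
    CONTROLLABLE = "controllable"      # the agent made an avoidable mistake
    UNCONTROLLABLE = "uncontrollable"  # e.g. the site was down or blocked access


@dataclass(frozen=True)
class Verdict:
    process_reward: float   # quality of the agent's steps, in [0, 1]
    outcome_reward: float   # 1.0 only if the task was actually completed
    failure_cause: FailureCause


def score(process_quality: float, task_completed: bool,
          blocked_by_environment: bool) -> Verdict:
    """Keep process and outcome separate so one cannot inflate the other."""
    if not 0.0 <= process_quality <= 1.0:
        raise ValueError("process_quality must be in [0, 1]")
    outcome = 1.0 if task_completed else 0.0
    if task_completed:
        cause = FailureCause.NONE
    elif blocked_by_environment:
        cause = FailureCause.UNCONTROLLABLE
    else:
        cause = FailureCause.CONTROLLABLE
    return Verdict(process_quality, outcome, cause)


# The agent did everything right, but the checkout page never loaded.
print(score(process_quality=0.9, task_completed=False, blocked_by_environment=True))
```

With a single combined score, a verifier is tempted to round this case up to "success" because the agent behaved well. Recording the outcome as a failure with an uncontrollable cause avoids that false positive while still preserving the process signal.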
Sources
- sources/the-art-of-building-verifiers-for-computer-use-agents — provided FPR benchmarks, improvement methods, design principles, and Universal Verifier system achieving dramatic FPR reduction