False Positive Rate

Summary: False Positive Rate (FPR) is the fraction of genuinely failed trajectories that a verification system incorrectly labels as successful. In computer use agent evaluation, a high FPR rewards agents for poor performance and undermines trust in reported results, which makes FPR a critical measure of verifier quality.

Overview

False positives are a fundamental evaluation problem in computer use agent systems. A verifier with a high FPR creates a harmful feedback loop: agents are rewarded for unsuccessful behavior, and training on those rewards degrades performance over time. The metric matters most in Trajectory Verification, where complex multi-step sequences must be assessed accurately.
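To make the definition and the feedback-loop concern concrete, here is a minimal Python sketch. It assumes each trajectory has a boolean verifier verdict and a boolean human ground-truth label. The function names, the true positive rate of 1.0, and the 30% agent success rate are illustrative assumptions, not figures from the source.

```python
from typing import Sequence


def false_positive_rate(verdicts: Sequence[bool], labels: Sequence[bool]) -> float:
    """FPR = FP / (FP + TN): share of truly failed trajectories the verifier passed."""
    if len(verdicts) != len(labels):
        raise ValueError("verdicts and labels must have the same length")
    fp = sum(1 for v, y in zip(verdicts, labels) if v and not y)
    tn = sum(1 for v, y in zip(verdicts, labels) if not v and not y)
    if fp + tn == 0:
        raise ValueError("no failed trajectories, so FPR is undefined")
    return fp / (fp + tn)


def rewarded_failure_share(fpr: float, tpr: float, success_rate: float) -> float:
    """Share of rewarded trajectories that actually failed (Bayes' rule)."""
    false_rewards = fpr * (1.0 - success_rate)  # failures the verifier passed
    true_rewards = tpr * success_rate           # successes the verifier passed
    if false_rewards + true_rewards == 0:
        raise ValueError("the verifier rewards nothing, so the share is undefined")
    return false_rewards / (false_rewards + true_rewards)


# Toy data: one real success, four real failures, one of which the verifier passed.
labels = [True, False, False, False, False]
verdicts = [True, True, False, False, False]
fpr = false_positive_rate(verdicts, labels)  # 1 FP and 3 TN give 0.25

# Assumed scenario: the agent truly succeeds 30% of the time and the verifier
# never misses a real success (tpr = 1.0).
share_high = rewarded_failure_share(fpr=0.45, tpr=1.0, success_rate=0.3)  # about 0.51
share_low = rewarded_failure_share(fpr=0.08, tpr=1.0, success_rate=0.3)   # about 0.16
```

Under these assumed numbers, a verifier with 45% FPR hands out roughly half of its rewards to failed trajectories, while one with 8% FPR does so for about one reward in six. This is one way to see why a high FPR corrupts the training signal.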

Earlier verifiers such as WebVoyager and WebJudge show high false positive rates (45%+ and 22%+ respectively), which makes them unreliable for agent training and evaluation. These high rates occurred because early verification systems struggled with Hallucination Detection, inconsistent Rubric Design, and poor Screenshot Context Management.

Microsoft Research's Universal Verifier demonstrates that a dramatic FPR reduction is possible through systematic design improvements. Their approach achieves 1-8% FPR while maintaining Inter-annotator Agreement of Cohen's κ ≈ 0.7, which indicates that the low false positive rate reflects better judgments rather than a shift in standards.
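Cohen's κ corrects raw agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the chance agreement implied by each rater's label frequencies. The sketch below shows one way to compute it for two sets of labels over the same trajectories. The function name and the toy labels are assumptions for illustration and are not taken from the source.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability both raters pick the same label if each
    # labelled independently according to their own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1.0 - expected)


# Toy labels (1 = success, 0 = failure) for ten trajectories.
human = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
verifier = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(human, verifier)  # observed 0.8, chance 0.5, kappa about 0.6
```

Reporting κ next to FPR guards against a verifier that lowers its FPR simply by rejecting almost everything, since that behavior would also lower agreement with human labels.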

Key Details

Baseline performance: WebVoyager shows 45%+ FPR, WebJudge shows 22%+ FPR in trajectory evaluation
Improved performance: Microsoft's Universal Verifier achieves 1-8% FPR through structured design principles
Impact on training: High FPR creates corrupted reward signals that degrade agent learning over time
Measurement context: FPR is measured alongside Inter-annotator Agreement (Cohen's κ ≈ 0.7) to validate verifier quality
Detection methods: Two-pass scoring (with/without screenshots) helps reduce false positives by catching agent fabrications
Design principles: four core principles significantly reduce FPR: (1) specific, non-overlapping rubrics; (2) separating Process vs Outcome Rewards; (3) distinguishing controllable from uncontrollable failures; (4) effective context management across all screenshots
Benchmark validation: CUAVerifierBench provides the first standardized benchmark for measuring verifier FPR with both process and outcome labels
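The two-pass scoring idea listed above can be sketched as follows. The source states only that trajectories are scored with and without screenshots, so everything else here is an assumption: the `Scorer` signature, the `TwoPassResult` type, the rule that the final verdict follows the screenshot-grounded pass, and the toy scorer that stands in for a real model-based judge (with text captions standing in for screenshots).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical scorer interface: (task, agent transcript, screenshots or None) -> pass/fail.
Scorer = Callable[[str, str, Optional[Sequence[str]]], bool]


@dataclass(frozen=True)
class TwoPassResult:
    text_only_pass: bool  # verdict from the agent's own account alone
    grounded_pass: bool   # verdict when screenshot evidence is included

    @property
    def success(self) -> bool:
        # Assumed rule: only the screenshot-grounded verdict counts as success.
        return self.grounded_pass

    @property
    def suspected_fabrication(self) -> bool:
        # The agent's account claims success, but the visual evidence does not support it.
        return self.text_only_pass and not self.grounded_pass


def two_pass_score(task: str, transcript: str,
                   screenshots: Sequence[str], scorer: Scorer) -> TwoPassResult:
    """Score once without screenshots and once with them, then compare."""
    text_only = scorer(task, transcript, None)
    grounded = scorer(task, transcript, screenshots)
    return TwoPassResult(text_only_pass=text_only, grounded_pass=grounded)


def toy_scorer(task: str, transcript: str,
               screenshots: Optional[Sequence[str]]) -> bool:
    """Stand-in judge: trusts the transcript unless screenshots are provided."""
    claimed = "order placed" in transcript.lower()
    if screenshots is None:
        return claimed
    return claimed and any("confirmation" in s.lower() for s in screenshots)


fabricated = two_pass_score(
    "Buy a pen", "Agent: Order placed successfully.",
    ["cart page with one item"], toy_scorer)
genuine = two_pass_score(
    "Buy a pen", "Agent: Order placed successfully.",
    ["order confirmation page"], toy_scorer)
```

In this toy run, `fabricated` is flagged because the text-only pass succeeds while the grounded pass fails, which is the pattern a fabricated success claim would produce. `genuine` passes both checks.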

Relationships

Trajectory Verification — FPR is a key quality metric for trajectory evaluation systems
Hallucination Detection — reducing agent fabrications directly lowers false positive rates through two-pass scoring methods
Process vs Outcome Rewards — separating these reward types helps prevent false positives when the environment blocks success for reasons beyond the agent's control
Inter-annotator Agreement — high human agreement validates that low FPR reflects genuine improvement, not just different standards
Computer Use Agents — FPR directly impacts agent training quality and deployment reliability
Rubric Design — structured, specific criteria reduce ambiguity that leads to false positives
Screenshot Context Management — effective selection of relevant visual evidence prevents misinterpretation that causes false positives
WebVoyager — baseline system with 45%+ FPR demonstrating the severity of false positive problems
WebJudge — improved but still problematic system with 22%+ FPR
CUAVerifierBench — benchmark specifically designed to measure and improve verifier FPR
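The relationship between Process vs Outcome Rewards and false positives can be illustrated with a small data structure. This is a sketch under stated assumptions: the `Verdict` fields, the `FailureCause` categories, and the `judge` helper are hypothetical names chosen for illustration, and the source does not describe how its verifier represents these labels.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCause(Enum):
    NONE = "none"                      # the task succeeded
    CONTROLLABLE = "controllable"      # the agent made an avoidable mistake
    UNCONTROLLABLE = "uncontrollable"  # the environment blocked the task


@dataclass(frozen=True)
class Verdict:
    process_score: float   # quality of the agent's steps, from 0.0 to 1.0
    outcome_success: bool  # whether the task goal was actually achieved
    failure_cause: FailureCause


def judge(process_score: float, task_completed: bool,
          blocked_by_environment: bool) -> Verdict:
    """Record process and outcome separately instead of as one blended score."""
    if not 0.0 <= process_score <= 1.0:
        raise ValueError("process_score must be between 0.0 and 1.0")
    if task_completed:
        cause = FailureCause.NONE
    elif blocked_by_environment:
        cause = FailureCause.UNCONTROLLABLE
    else:
        cause = FailureCause.CONTROLLABLE
    return Verdict(process_score, task_completed, cause)


# The agent followed a sound plan, but the site went down before checkout.
blocked = judge(process_score=0.9, task_completed=False, blocked_by_environment=True)
```

Because the process score has its own field, the verifier can credit good behavior in `blocked` without marking the outcome as a success. A single blended score would tempt a judge to pass such a trajectory, which would count as a false positive.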

Sources

sources/the-art-of-building-verifiers-for-computer-use-agents — provided FPR benchmarks, improvement methods, design principles, and Universal Verifier system achieving dramatic FPR reduction