WebJudge

Summary: WebJudge is an agent evaluation system for computer use agents that judges whether a trajectory succeeded, but it suffers from a moderate false positive rate of 22%+. It belongs to an earlier generation of verifiers that has been superseded by more accurate systems such as the Universal Verifier.

Overview

WebJudge is an evaluation framework that uses Trajectory Verification to assess whether Computer Use Agents complete their intended tasks. It analyzes an agent's execution sequence to decide whether the goal was achieved, but it frequently marks failed trajectories as successful.

The evaluation approach predates more sophisticated verification methodologies that separate Process vs Outcome Rewards and implement robust Hallucination Detection. WebJudge's limitations highlight the challenges in building reliable evaluation systems for complex multi-step agent interactions.

Key Details

False Positive Rate: 22%+ - significantly higher than state-of-the-art verifiers
Evaluation Focus: Primarily outcome-based assessment without process quality separation
Accuracy Limitations: Lower Inter-annotator Agreement with human labels than human evaluators reach with one another
Historical Context: Represents earlier generation of agent evaluation systems before Universal Verifier improvements

A false positive rate of 22%+ means the system labels more than 1 in 5 failed trajectories as successful. This inflates measured agent performance and feeds incorrect success signals into training.
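The effect of the false positive rate can be illustrated with a short calculation. The following is a minimal sketch, not WebJudge's actual code: the function names, the toy labels, and the 40% true success rate are hypothetical, and the false negative rate is assumed to be zero for simplicity. Only the 22% figure comes from the source.

```python
from typing import Sequence


def false_positive_rate(truth: Sequence[bool], verdict: Sequence[bool]) -> float:
    """Fraction of truly failed trajectories that the verifier marked as successful."""
    if len(truth) != len(verdict):
        raise ValueError("truth and verdict must have the same length")
    # Keep only the verifier's verdicts on trajectories that actually failed.
    failed = [v for t, v in zip(truth, verdict) if not t]
    if not failed:
        raise ValueError("no failed trajectories, so the rate is undefined")
    return sum(failed) / len(failed)


def reported_success_rate(true_success: float, fpr: float, fnr: float = 0.0) -> float:
    """Success rate a verifier would report, given its error rates.

    fnr (false negative rate) defaults to 0.0, which is an assumption
    made for this illustration only.
    """
    return true_success * (1.0 - fnr) + (1.0 - true_success) * fpr


# Hypothetical labels: 50 truly failed trajectories, 11 of them passed by the verifier.
truth = [False] * 50 + [True] * 50
verdict = [True] * 11 + [False] * 39 + [True] * 50

fpr = false_positive_rate(truth, verdict)  # 11 / 50
print(f"false positive rate: {fpr:.2f}")
print(f"reported success at 40% true success: {reported_success_rate(0.40, fpr):.3f}")
```

Under these assumptions, an agent that truly succeeds on 40% of tasks would be reported at roughly 53%, because 22% of its failures are counted as successes.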

Relationships

Universal Verifier — successor system that reduces WebJudge's false positive rate from 22%+ to 1-8%
WebVoyager — another evaluation system with even higher false positive rate (45%+)
Computer Use Agents — the AI systems that WebJudge evaluates
Trajectory Verification — the general evaluation approach WebJudge implements
False Positive Rate — key metric where WebJudge shows significant limitations
CUAVerifierBench — benchmark that measures verifier quality and reveals WebJudge's limitations
Agent Evaluation — broader field of assessing AI agent performance

Sources

sources/the-art-of-building-verifiers-for-computer-use-agents — provided false positive rate data and comparison with Universal Verifier