WebJudge

Summary: WebJudge is an evaluation system for computer use agents that judges whether a trajectory succeeded, but it has a false positive rate of 22% or higher. It belongs to an earlier generation of verifiers that has since been superseded by more accurate systems such as the Universal Verifier.

Overview

WebJudge is an evaluation framework for assessing whether Computer Use Agents complete their intended tasks, using Trajectory Verification. The system analyzes an agent's execution sequence to decide whether the goal was achieved, but it often fails to distinguish successful trajectories from failed ones.

The evaluation approach predates more sophisticated verification methodologies that separate Process vs Outcome Rewards and implement robust Hallucination Detection. WebJudge's limitations highlight the challenges in building reliable evaluation systems for complex multi-step agent interactions.

Key Details

  • False Positive Rate: 22% or higher, significantly above that of state-of-the-art verifiers
  • Evaluation Focus: Primarily outcome-based assessment, with no separate scoring of process quality
  • Accuracy Limitations: Its agreement with human annotators is lower than the Inter-annotator Agreement among human evaluators
  • Historical Context: Represents an earlier generation of agent evaluation systems, before the Universal Verifier improvements

A false positive rate of 22% or higher means the system incorrectly labels more than 1 in 5 failed trajectories as successful. This inflates measured agent success rates and, when the verdicts are used as training feedback, rewards trajectories that actually failed.
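
To make the arithmetic concrete, the following minimal Python sketch computes a false positive rate from human-labeled trajectories and shows how it distorts a measured success rate. Everything in it is an illustrative assumption: the `Judged` record, the function names, the sample counts, and the 40% true success rate are hypothetical and are not taken from WebJudge or its reported evaluation data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judged:
    """One trajectory with two labels (hypothetical record type)."""
    human_success: bool  # reference label from a human annotator
    judge_success: bool  # verdict from the automated verifier


def false_positive_rate(items):
    """Fraction of truly failed trajectories that the verifier marked successful."""
    failed = [x for x in items if not x.human_success]
    if not failed:
        raise ValueError("need at least one failed trajectory")
    false_positives = sum(1 for x in failed if x.judge_success)
    return false_positives / len(failed)


def measured_success_rate(true_rate, fpr, fnr=0.0):
    """Success rate the verifier would report, given its error rates.

    true_rate: real fraction of successful trajectories
    fpr: false positive rate (failed trajectories judged successful)
    fnr: false negative rate (successful trajectories judged failed)
    """
    return true_rate * (1.0 - fnr) + (1.0 - true_rate) * fpr


# Made-up sample: 50 failed trajectories, 11 of them wrongly accepted,
# plus 30 successful trajectories that were judged correctly.
sample = (
    [Judged(False, True)] * 11
    + [Judged(False, False)] * 39
    + [Judged(True, True)] * 30
)

fpr = false_positive_rate(sample)
reported = measured_success_rate(true_rate=0.40, fpr=fpr)
print(f"false positive rate: {fpr:.2f}")
print(f"reported success rate for a 40% agent: {reported:.3f}")
```

Under these assumed numbers, an agent that truly succeeds on 40% of tasks would be reported at roughly 53%, because 22% of the remaining 60% of trajectories are counted as successes. The gap widens as the true success rate falls, since more failed trajectories are available to be misjudged.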

Relationships

Sources