AgentRewardBench

Summary: AgentRewardBench is a benchmark for evaluating agent reward systems, designed to measure how well verifiers assess computer use agent trajectories. It is presented as the first benchmark to systematically evaluate verifier quality using both process and outcome human labels.

Overview

AgentRewardBench addresses a gap in agent evaluation by providing standardized metrics for assessing Trajectory Verification systems. Traditional agent benchmarks focus on task completion rates, whereas AgentRewardBench targets the verifiers that decide whether an agent succeeded or failed. This matters because a poor verifier can have a high false positive rate, that is, it labels failed trajectories as successful.
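To make the false positive rate concrete, here is a minimal sketch of how it could be computed from paired human and verifier verdicts. The function name, the boolean label encoding, and the toy labels are assumptions made for illustration; they are not taken from the benchmark's actual data format or tooling.

```python
from typing import Sequence


def false_positive_rate(
    human_success: Sequence[bool], verifier_success: Sequence[bool]
) -> float:
    """Fraction of human-labeled failures that the verifier marked as successes."""
    if len(human_success) != len(verifier_success):
        raise ValueError("label sequences must have the same length")
    # Keep only the verifier verdicts for trajectories that humans judged as failures.
    verdicts_on_failures = [
        v for h, v in zip(human_success, verifier_success) if not h
    ]
    if not verdicts_on_failures:
        raise ValueError("no human-labeled failures; false positive rate is undefined")
    # Each True here is a failed trajectory wrongly accepted by the verifier.
    return sum(verdicts_on_failures) / len(verdicts_on_failures)


# Toy labels invented for illustration (hypothetical, not benchmark data).
human = [True, False, False, True, False, False]
verifier = [True, True, False, True, False, True]

fpr = false_positive_rate(human, verifier)
print(f"false positive rate: {fpr:.2f}")
```

In this toy example, four trajectories are human-labeled failures and the verifier accepts two of them, so the rate is 0.50. The denominator counts only human-labeled failures, so a verifier that approves every trajectory scores 1.0 no matter how many true successes the set contains.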

The benchmark builds on principles from Microsoft Research's Universal Verifier system, which reached agreement with human annotators close to the level at which annotators agree with one another (Cohen's κ ≈ 0.7) by using structured evaluation approaches. AgentRewardBench extends these concepts to create reproducible standards for verifier performance across different agent types and tasks.
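Cohen's κ corrects raw agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement rate and p_e is the chance agreement implied by each rater's label frequencies. The sketch below applies that standard formula to a verifier and a human annotator. The function name and the toy labels are illustrative assumptions, not the benchmark's evaluation code or data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies,
    # summed over all labels (Counter returns 0 for labels a rater never used).
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (p_o - p_e) / (1 - p_e)


# Toy outcome labels invented for illustration (hypothetical, not benchmark data).
human = ["s", "s", "s", "s", "s", "f", "f", "f", "f", "f"]
verifier = ["s", "s", "s", "s", "f", "f", "f", "f", "f", "s"]

kappa = cohens_kappa(human, verifier)
print(f"kappa: {kappa:.2f}")
```

Here the raters agree on 8 of 10 items (p_o = 0.8), and both use each label half the time (p_e = 0.5), giving κ = 0.60. Comparing a verifier's κ against the κ between two human annotators shows whether the verifier is approaching human-level agreement.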

Key Details

First verifier-focused benchmark: Unlike task completion benchmarks, specifically measures verifier accuracy and agreement
Dual labeling system: Provides both process and outcome human labels for comprehensive evaluation
Human agreement baseline: Uses Inter-annotator Agreement as the gold standard for verifier performance
False positive emphasis: Particularly targets reducing incorrect success classifications that can mislead agent training
Multi-modal evaluation: Incorporates Screenshot Context Management to test visual evidence processing
Rubric-based framework: Uses Rubric Design principles with specific, non-overlapping criteria
Hallucination Detection: Tests verifier ability to identify agent fabrications and contradictions

Relationships

Universal Verifier — the system that demonstrated principles now codified in AgentRewardBench
Computer Use Agents — the type of agents whose trajectories this benchmark evaluates
CUAVerifierBench — closely related benchmark specifically for computer use agent verifiers
WebVoyager — earlier system with 45%+ false positive rate that AgentRewardBench aims to improve upon
WebJudge — another system with 22%+ false positive rate addressed by this benchmarking approach
Agent Evaluation — broader category that AgentRewardBench contributes specialized verifier metrics to
Multimodal LLMs — the underlying technology powering many verifiers evaluated by this benchmark

Sources

sources/the-art-of-building-verifiers-for-computer-use-agents — introduced AgentRewardBench as part of comprehensive verifier evaluation framework