AgentRewardBench
Summary: AgentRewardBench is a benchmark for agent reward systems. It measures how well verifiers assess computer use agent trajectories, and is described as the first benchmark to evaluate verifier quality systematically against both process and outcome human labels.
Overview
AgentRewardBench addresses a gap in agent evaluation by providing standardized metrics for Trajectory Verification systems. Traditional agent benchmarks report task completion rates, whereas AgentRewardBench targets the verifiers that decide whether an agent succeeded or failed. This matters because a poor verifier produces false positives, that is, it labels failed trajectories as successful.
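To make the false positive concern concrete, the following minimal sketch computes a verifier's false positive rate against human outcome labels. The function name and the boolean label encoding are assumptions made for illustration; they are not taken from the benchmark's own tooling.

```python
def false_positive_rate(human_success, verifier_success):
    """Fraction of human-labeled failures that the verifier marked as success.

    Both arguments are equal-length lists of booleans (hypothetical encoding:
    True means "trajectory judged successful").
    """
    if len(human_success) != len(verifier_success):
        raise ValueError("label lists must have the same length")
    # Only trajectories that humans judged as failures can yield false positives.
    verdicts_on_failures = [
        v for h, v in zip(human_success, verifier_success) if not h
    ]
    if not verdicts_on_failures:
        return 0.0  # no human-labeled failures, so no false positives are possible
    return sum(verdicts_on_failures) / len(verdicts_on_failures)


# Toy labels, invented for this example.
human = [True, False, False, True, False, False]
verifier = [True, True, False, True, True, False]
print(false_positive_rate(human, verifier))  # 0.5
```

In this toy data the verifier accepts two of the four trajectories that humans marked as failed, so half of the failures would be counted as successes.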
The benchmark builds on principles from Microsoft Research's Universal Verifier system, which reached near-human agreement levels (Cohen's κ ≈ 0.7) through structured evaluation. AgentRewardBench extends these ideas into reproducible standards for verifier performance across agent types and tasks.
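Cohen's κ corrects raw agreement for the agreement two raters would reach by chance: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the chance agreement implied by each rater's label frequencies. The sketch below is a plain-Python illustration of that formula on invented labels; it is not the scoring code of any system named here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal frequencies, summed over labels.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined when both raters use one identical label throughout;
        # returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy labels, invented for this example: S = success, F = failure.
human = list("SSSSFFFFFF")
verifier = list("SSSFFFFFFS")
print(round(cohens_kappa(human, verifier), 3))  # 0.583
```

Here the raters agree on 8 of 10 items (p_o = 0.8), but with these label frequencies chance alone would give p_e = 0.52, so κ comes out well below the raw agreement.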
Key Details
- First verifier-focused benchmark: Unlike task completion benchmarks, specifically measures verifier accuracy and agreement
- Dual labeling system: Provides both process and outcome human labels for comprehensive evaluation
- Human agreement baseline: Uses Inter-annotator Agreement as the gold standard for verifier performance
- False positive emphasis: Targets incorrect success classifications, which can mislead agent training
- Multi-modal evaluation: Incorporates Screenshot Context Management to test visual evidence processing
- Rubric-based framework: Uses Rubric Design principles with specific, non-overlapping criteria
- Hallucination Detection: Tests verifier ability to identify agent fabrications and contradictions
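As an illustration of the dual labeling idea in the list above, the sketch below stores process and outcome labels side by side and reports verifier-human agreement for each label type. The record layout, the field names, and the reading of "outcome" as goal achieved and "process" as acceptable steps are all assumptions for this example, not the benchmark's actual schema.

```python
from dataclasses import dataclass


@dataclass
class LabeledTrajectory:
    """Hypothetical record pairing human and verifier labels for one trajectory."""
    task_id: str
    human_outcome: bool     # assumed meaning: humans judged the goal achieved
    human_process: bool     # assumed meaning: humans judged the steps acceptable
    verifier_outcome: bool
    verifier_process: bool


def agreement_rates(records):
    """Raw agreement between verifier and human labels, per label type."""
    if not records:
        raise ValueError("records must be non-empty")
    n = len(records)
    return {
        "outcome": sum(r.human_outcome == r.verifier_outcome for r in records) / n,
        "process": sum(r.human_process == r.verifier_process for r in records) / n,
    }


# Toy records, invented for this example.
records = [
    LabeledTrajectory("t1", True, True, True, True),
    LabeledTrajectory("t2", False, True, True, True),
    LabeledTrajectory("t3", False, False, False, True),
    LabeledTrajectory("t4", True, False, True, False),
]
print(agreement_rates(records))  # {'outcome': 0.75, 'process': 0.75}
```

Keeping the two label types separate shows where a verifier goes wrong: in the toy data, t2 is an outcome false positive while t3 is a process disagreement, even though both rates come out equal.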
Relationships
- Universal Verifier — the system that demonstrated principles now codified in AgentRewardBench
- Computer Use Agents — the type of agents whose trajectories this benchmark evaluates
- CUAVerifierBench — closely related benchmark specifically for computer use agent verifiers
- WebVoyager — earlier system with 45%+ false positive rate that AgentRewardBench aims to improve upon
- WebJudge — another system with 22%+ false positive rate addressed by this benchmarking approach
- Agent Evaluation — broader category that AgentRewardBench contributes specialized verifier metrics to
- Multimodal LLMs — the underlying technology powering many verifiers evaluated by this benchmark
Sources
- sources/the-art-of-building-verifiers-for-computer-use-agents — introduced AgentRewardBench as part of comprehensive verifier evaluation framework