Process vs Outcome Rewards
Summary: A design principle for evaluating computer use agents that separates assessment of execution quality from goal achievement. Process rewards judge how well an agent performed its actions independent of environmental factors, while outcome rewards evaluate whether the user's ultimate goal was met.
Overview
Process vs Outcome Rewards is a distinction in Agent Evaluation for fairly assessing AI systems that operate in unpredictable environments. Evaluation approaches that conflate the two dimensions produce unreliable assessments when an agent executes correctly but fails because of environmental blockers such as network issues or website changes.
The process dimension evaluates execution quality—whether the agent selected appropriate actions, avoided hallucinations, and demonstrated sound reasoning throughout its trajectory. The outcome dimension focuses solely on goal achievement, asking whether the user's stated objective was ultimately satisfied regardless of execution path.
This separation gives more specific feedback for agent training and clearer diagnostic information about failure modes. An agent might receive high process scores for sound reasoning and action selection while receiving low outcome scores because of uncontrollable environmental factors. The Universal Verifier system applies this approach, reaching near-human agreement (Cohen's κ≈0.7) while cutting false positive rates from above 22% for earlier verifiers to 1-8%.
Key Details
Core Distinction:
- Process rewards: Assess execution quality independent of environment blockers and uncontrollable factors
- Outcome rewards: Judge whether the user's goal was achieved, regardless of execution path
- Divergence scenarios: Process and outcome can differ when the environment blocks success despite perfect execution
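The distinction above can be sketched as a small data structure that keeps the two scores separate and reads a coarse diagnosis from their combination. This is a minimal illustration, not the Universal Verifier's implementation: the `DualReward` class, the `diagnose` function, the 0.5 threshold, and the label names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualReward:
    """Hypothetical container for one trajectory's two scores, each in [0, 1]."""
    process: float  # execution quality, ignoring environment blockers
    outcome: float  # whether the user's goal was actually achieved


def diagnose(reward: DualReward, threshold: float = 0.5) -> str:
    """Map a (process, outcome) pair to a coarse label (threshold is an assumption)."""
    good_process = reward.process >= threshold
    good_outcome = reward.outcome >= threshold
    if good_process and good_outcome:
        return "success"
    if good_process:
        # Sound execution without goal achievement: the divergence scenario.
        return "likely_environment_blocker"
    if good_outcome:
        return "goal_met_despite_poor_execution"
    return "agent_failure"


if __name__ == "__main__":
    # A well-executed run that a site outage prevented from completing.
    print(diagnose(DualReward(process=0.9, outcome=0.0)))
```

Keeping the scores as separate fields, rather than averaging them, is what lets the divergent cases stay visible in training feedback.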
Implementation in Universal Verifier:
- Reaches agreement of Cohen's κ≈0.7 using structured rubric criteria
- Reduces the false positive rate from 45%+ (WebVoyager) and 22%+ (WebJudge) to 1-8%
- Uses Screenshot Context Management to validate agent claims against visual evidence
- Employs two-pass scoring with/without screenshots for Hallucination Detection
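One way to picture two-pass scoring is to score each criterion once from the agent's textual claims alone and once with screenshots attached, then flag criteria whose score drops once visual evidence is available. The source does not describe the exact mechanism, so the sketch below is an assumption throughout: the `Scorer` signature, the `flag_unsupported_claims` function, and the `gap` parameter are hypothetical, and the scorer stands in for whatever multimodal model would do the judging.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical scorer: (criterion, agent_claims, screenshots or None) -> score in [0, 1].
Scorer = Callable[[str, str, Optional[Sequence[bytes]]], float]


def flag_unsupported_claims(
    criteria: Sequence[str],
    agent_claims: str,
    screenshots: Sequence[bytes],
    score: Scorer,
    gap: float = 0.5,
) -> List[str]:
    """Return criteria whose score drops by at least `gap` once screenshots are shown."""
    flagged = []
    for criterion in criteria:
        # Pass 1: judge from the agent's own claims, with no visual evidence.
        text_only = score(criterion, agent_claims, None)
        # Pass 2: judge the same criterion with screenshots included.
        with_visuals = score(criterion, agent_claims, screenshots)
        if text_only - with_visuals >= gap:
            flagged.append(criterion)
    return flagged
```

A criterion that looks satisfied from the agent's narrative but not from the screenshots is a candidate hallucination and can be routed to closer inspection.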
Evaluation Benefits:
- Enables fair assessment when environmental blockers prevent goal achievement
- Provides granular feedback for agent training and improvement
- Distinguishes between agent capability and environmental limitations
- Supports conditional criteria handling when task conditions aren't met (e.g., "buy organic if available, else non-organic")
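Conditional criteria can be sketched by attaching an applicability condition to each rubric item and scoring only the items that apply. The example below works through the "buy organic if available, else non-organic" case. The `Criterion` class, the `when` field, and `score_rubric` are hypothetical names chosen for illustration, and excluding inapplicable criteria from the denominator is an assumed design, not a documented one.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Criterion:
    name: str
    # Optional (fact_name, required_value); the criterion applies only if it matches.
    when: Optional[Tuple[str, bool]] = None


def score_rubric(
    criteria: Sequence[Criterion],
    facts: Dict[str, bool],
    passed: Dict[str, bool],
) -> float:
    """Fraction of applicable criteria that passed; inapplicable ones are skipped."""
    applicable = [
        c for c in criteria
        if c.when is None or facts.get(c.when[0]) == c.when[1]
    ]
    if not applicable:
        raise ValueError("no applicable criteria")
    return sum(passed.get(c.name, False) for c in applicable) / len(applicable)


if __name__ == "__main__":
    rubric = [
        Criterion("item_added_to_cart"),
        Criterion("item_is_organic", when=("organic_available", True)),
        Criterion("item_is_non_organic", when=("organic_available", False)),
    ]
    # Organic was out of stock, so buying non-organic satisfies the rubric.
    facts = {"organic_available": False}
    passed = {"item_added_to_cart": True, "item_is_non_organic": True}
    print(score_rubric(rubric, facts, passed))
```

Skipping the organic criterion when organic is unavailable avoids penalizing the agent for a condition the environment ruled out.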
Benchmark Data:
- CUAVerifierBench includes both process and outcome human labels across 246 trajectories
- First benchmark specifically designed for measuring verifier quality with dual annotation structure
- Enables systematic study of process-outcome correlation patterns and verifier performance
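Measuring verifier quality against human labels typically comes down to an agreement statistic and an error rate. The sketch below computes Cohen's κ and the false positive rate for binary success labels using their standard definitions. The function names and the toy labels are illustrative only; whether CUAVerifierBench computes its metrics in exactly this way is an assumption.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def false_positive_rate(human: Sequence[int], verifier: Sequence[int]) -> float:
    """Share of human-labeled failures (0) that the verifier marks as success (1)."""
    on_failures = [v for h, v in zip(human, verifier) if h == 0]
    if not on_failures:
        raise ValueError("no human-labeled failures to evaluate")
    return sum(on_failures) / len(on_failures)


if __name__ == "__main__":
    human = [1, 1, 0, 0]      # toy outcome labels, 1 = success
    verifier = [1, 1, 1, 0]   # verifier wrongly accepts one failed trajectory
    print(cohens_kappa(human, verifier), false_positive_rate(human, verifier))
```

With dual annotation, the same two functions can be run once on process labels and once on outcome labels to compare verifier quality on each dimension.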
Relationships
- Computer Use Agents — primary application domain for this evaluation approach
- Trajectory Verification — broader framework that implements process vs outcome separation
- Universal Verifier — demonstrates a practical implementation that reaches near-human agreement
- Screenshot Context Management — provides visual evidence for validating process quality
- Hallucination Detection — key component of process evaluation that catches agent fabrications
- Rubric Design — creates structured, non-overlapping criteria for both process and outcome evaluation
- WebVoyager — previous verifier with high false positive rates that this approach improves upon
- WebJudge — another baseline verifier outperformed by process-outcome separation
- CUAVerifierBench — benchmark specifically designed to evaluate this dual-reward approach
- Inter-annotator Agreement — metric used to validate reliability of separated reward signals
- Multimodal LLMs — underlying technology that enables visual verification of process quality
Sources
- sources/the-art-of-building-verifiers-for-computer-use-agents — introduced the concept and demonstrated its effectiveness in Universal Verifier, which reaches near-human agreement while reducing false positive rates