Process vs Outcome Rewards

Summary: A design principle for evaluating computer use agents that separates assessment of execution quality from goal achievement. Process rewards judge how well an agent performed its actions independent of environmental factors, while outcome rewards evaluate whether the user's ultimate goal was met.

Overview

Process vs Outcome Rewards is a distinction in Agent Evaluation aimed at fairly assessing AI systems that operate in unpredictable environments. Traditional evaluation approaches conflate execution quality with goal achievement, which yields unreliable assessments when an agent executes correctly but fails because of environmental blockers such as network issues or website changes.

The process dimension evaluates execution quality—whether the agent selected appropriate actions, avoided hallucinations, and demonstrated sound reasoning throughout its trajectory. The outcome dimension focuses solely on goal achievement, asking whether the user's stated objective was ultimately satisfied regardless of execution path.

This separation enables more nuanced feedback for agent training and provides clearer diagnostic information about failure modes. An agent might receive a high process score for sound reasoning and action selection while receiving a low outcome score due to uncontrollable environmental factors. The Universal Verifier system applies this approach, achieving near-human agreement levels (Cohen's κ≈0.7) with a false positive rate of 1-8%, compared with 45%+ for WebVoyager and 22%+ for WebJudge.

Key Details

Core Distinction:

Process rewards: Assess execution quality independent of environment blockers and uncontrollable factors
Outcome rewards: Judge whether user's goal was achieved, regardless of execution path
Divergence scenarios: Process and outcome can differ when environment blocks success despite perfect execution
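
The distinction above can be sketched as a record with two independent judgments per trajectory. This is a minimal illustration, not the Universal Verifier's actual schema: the class name, field names, and diagnostic labels are all hypothetical, and real verifiers may use graded scores rather than booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrajectoryVerdict:
    """Two independent judgments for one agent trajectory (hypothetical schema)."""
    process_ok: bool  # execution was sound, ignoring environment blockers
    outcome_ok: bool  # the user's goal was actually achieved


def diagnose(verdict: TrajectoryVerdict) -> str:
    """Map the four process/outcome combinations to a coarse diagnostic label."""
    if verdict.process_ok and verdict.outcome_ok:
        return "success"
    if verdict.process_ok:
        # Sound execution, goal unmet: consistent with an environment blocker.
        return "sound_execution_goal_unmet"
    if verdict.outcome_ok:
        # Goal met even though the execution itself was judged flawed.
        return "goal_met_flawed_execution"
    return "flawed_execution_goal_unmet"


blocked = TrajectoryVerdict(process_ok=True, outcome_ok=False)
print(diagnose(blocked))  # sound_execution_goal_unmet
```

Keeping the two fields separate means a divergence case stays visible instead of being collapsed into a single pass/fail bit.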

Implementation in Universal Verifier:

Achieves Cohen's κ≈0.7 through structured rubric criteria
Achieves a False Positive Rate of 1-8%, compared with 45%+ for WebVoyager and 22%+ for WebJudge
Uses Screenshot Context Management to validate agent claims against visual evidence
Employs two-pass scoring, once with and once without screenshots, for Hallucination Detection
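
For readers unfamiliar with the two metrics quoted above, the following sketch computes a false positive rate and Cohen's κ for binary success labels using their standard definitions. The function names and the toy label lists are illustrative assumptions; the numbers are not benchmark data.

```python
def false_positive_rate(predicted: list[bool], human: list[bool]) -> float:
    """Share of human-labeled failures that the verifier marked as success."""
    negatives = [p for p, h in zip(predicted, human) if not h]
    if not negatives:
        raise ValueError("no human-labeled failures; FPR is undefined")
    return sum(negatives) / len(negatives)


def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa for two raters assigning binary labels to the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance if the raters labeled independently.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1 - expected)


# Toy labels for illustration only (True = task judged successful).
human = [True, True, False, False, False, True, False, True]
verifier = [True, True, False, True, False, True, False, False]

print(false_positive_rate(verifier, human))  # 0.25
print(cohens_kappa(verifier, human))         # 0.5
```

Unlike raw accuracy, κ discounts the agreement two raters would reach by chance, which matters when most trajectories share the same label.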

Evaluation Benefits:

Enables fair assessment when environmental blockers prevent goal achievement
Provides granular feedback for agent training and improvement
Distinguishes between agent capability and environmental limitations
Supports conditional criteria handling when task conditions aren't met (e.g., "buy organic if available, else non-organic")
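
One way to picture conditional criteria handling is a rubric entry that carries a condition plus two alternative requirements. This is a sketch under assumptions: the class, its fields, and the deferral behavior when the condition cannot be determined are hypothetical and are not taken from the Universal Verifier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConditionalCriterion:
    """A rubric criterion whose requirement depends on an observed condition (hypothetical)."""
    condition: str  # what must be checked in the environment
    if_met: str     # requirement when the condition holds
    otherwise: str  # fallback requirement when it does not


def active_requirement(
    criterion: ConditionalCriterion, condition_holds: Optional[bool]
) -> Optional[str]:
    """Return the requirement to score, or None if the condition is undetermined."""
    if condition_holds is None:
        return None  # no evidence either way: defer rather than guess
    return criterion.if_met if condition_holds else criterion.otherwise


organic = ConditionalCriterion(
    condition="an organic option is available",
    if_met="the agent bought the organic product",
    otherwise="the agent bought the non-organic product",
)

# The organic option was not available, so the fallback requirement applies.
print(active_requirement(organic, condition_holds=False))
```

With this shape, an agent that buys the non-organic item when no organic option exists is scored against the fallback requirement rather than penalized for missing the primary one.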

Benchmark Data:

CUAVerifierBench includes both process and outcome human labels across 246 trajectories
Presented as the first benchmark designed specifically to measure verifier quality, using a dual annotation structure
Enables systematic study of process-outcome correlation patterns and verifier performance
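
With dual labels per trajectory, process-outcome patterns can be studied with a simple contingency count. The sketch below assumes each trajectory is reduced to a hypothetical (process_ok, outcome_ok) pair; the label format and the toy data are illustrative and do not reflect CUAVerifierBench's actual schema or contents.

```python
from collections import Counter

Label = tuple[bool, bool]  # (process_ok, outcome_ok) for one trajectory


def contingency(labels: list[Label]) -> Counter:
    """Count trajectories in each (process_ok, outcome_ok) cell."""
    return Counter(labels)


def divergence_rate(labels: list[Label]) -> float:
    """Fraction of trajectories where the process and outcome labels disagree."""
    if not labels:
        raise ValueError("no labels provided")
    return sum(p != o for p, o in labels) / len(labels)


# Toy labels for illustration only, not benchmark data.
toy_labels = [(True, True), (True, False), (False, False), (True, True)]

table = contingency(toy_labels)
print(table[(True, False)])         # 1
print(divergence_rate(toy_labels))  # 0.25
```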

Relationships

Computer Use Agents — primary application domain for this evaluation approach
Trajectory Verification — broader framework that implements process vs outcome separation
Universal Verifier — demonstrates practical implementation achieving human-level agreement
Screenshot Context Management — provides visual evidence for validating process quality
Hallucination Detection — key component of process evaluation that catches agent fabrications
Rubric Design — creates structured, non-overlapping criteria for both process and outcome evaluation
WebVoyager — previous verifier with high false positive rates that this approach improves upon
WebJudge — another baseline verifier outperformed by process-outcome separation
CUAVerifierBench — benchmark specifically designed to evaluate this dual-reward approach
Inter-annotator Agreement — metric used to validate reliability of separated reward signals
Multimodal LLMs — underlying technology that enables visual verification of process quality

Sources

sources/the-art-of-building-verifiers-for-computer-use-agents — introduced the concept and demonstrated its effectiveness in Universal Verifier achieving human-level agreement while reducing false positive rates