Reward Design

Summary: A framework for designing reward signals in reinforcement learning systems. It covers both deterministic tasks with verifiable outcomes and open-ended scenarios that require generative evaluation models.

Overview

Reward design is a critical component of Reinforcement Learning systems that determines how agents learn to optimize their behavior. The framework encompasses two primary approaches: verifiable rewards for tasks with clear, deterministic success criteria, and generative outcome reward models for complex, open-ended scenarios where success is subjective or multi-faceted.

In deterministic tasks, verifiable rewards provide binary or scalar feedback based on objective criteria, such as whether a file operation completed or the agent reached a specific webpage. These rewards are computed automatically by checking the final state against the expected outcome.
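A minimal sketch of a verifiable reward, assuming the environment exposes its final state as a flat dictionary. The names `TaskSpec`, `partial_credit_reward`, and `verifiable_reward` are hypothetical and chosen only for illustration; they are not taken from any specific framework.

```python
from dataclasses import dataclass


@dataclass
class TaskSpec:
    """Hypothetical task description: the key/value pairs the final state must contain."""
    expected_state: dict


def partial_credit_reward(final_state: dict, spec: TaskSpec) -> float:
    """Scalar reward: fraction of expected key/value pairs present in the final state."""
    expected = spec.expected_state
    if not expected:
        raise ValueError("spec must list at least one expected key")
    matched = sum(
        1 for key, value in expected.items()
        if key in final_state and final_state[key] == value
    )
    return matched / len(expected)


def verifiable_reward(final_state: dict, spec: TaskSpec) -> float:
    """Binary reward: 1.0 only if every expected key/value pair matches, else 0.0."""
    return 1.0 if partial_credit_reward(final_state, spec) == 1.0 else 0.0
```

Keys in the final state that the spec does not mention are ignored, so the check only constrains what the task actually requires.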

For open-ended scenarios, generative outcome reward models use a learned evaluator, often a Vision-Language Model or a specialized scoring network, to assess the quality of agent performance on tasks that lack clear success metrics. This approach is particularly important for GUI Agents and complex interactive environments, where evaluating an outcome requires human-like judgment.
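The idea can be sketched as a thin wrapper around a judge model. This is an illustration under stated assumptions, not the method of any particular system: the judge is any callable that maps a prompt string to free-form text, and the prompt wording, the 0 to 10 scale, and the `Score:` reply format are all invented for this example.

```python
import re
from typing import Callable

# Hypothetical judge interface: takes a prompt string, returns free-form text.
# In practice this would wrap a vision-language model or a scoring network.
Judge = Callable[[str], str]

PROMPT_TEMPLATE = (
    "Task: {task}\n"
    "Agent outcome: {outcome}\n"
    "Rate how well the outcome satisfies the task on a scale of 0 to 10.\n"
    "Answer with 'Score: <number>'."
)


def generative_reward(task: str, outcome: str, judge: Judge) -> float:
    """Ask the judge for a 0-10 score and normalize it to [0, 1].

    Returns 0.0 when no score can be parsed from the reply, so a malformed
    judgment never produces a positive reward.
    """
    reply = judge(PROMPT_TEMPLATE.format(task=task, outcome=outcome))
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        return 0.0
    score = float(match.group(1))
    # Clamp out-of-range scores before normalizing.
    return min(max(score, 0.0), 10.0) / 10.0


def stub_judge(prompt: str) -> str:
    """Stand-in judge for local testing; a real judge would call a model."""
    return "The final screen matches the request. Score: 8"
```

Passing the judge in as a parameter keeps the reward function testable with a stub and independent of any specific model API.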

Key Details

Deterministic Task Rewards: Binary success/failure signals or scalar scores based on objective completion criteria, automatically verifiable through state comparison
Generative Reward Models: Neural networks trained to evaluate complex outcomes, often using human preference data or expert demonstrations as training signals
Multi-Turn Applications: Reward design must account for delayed and sparse reward signals in Multi-Turn Reinforcement Learning scenarios, where feedback often arrives only at the end of a multi-step episode
Domain Adaptation: Different reward structures required for specialized domains like browser navigation, mobile interactions, and desktop GUI tasks
Reward Shaping: Techniques that provide intermediate feedback signals to guide learning without changing which policy is optimal
Verification Systems: Automated checking mechanisms for deterministic tasks that can validate completion without human intervention
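To make the reward shaping and sparse-reward points above concrete, the following sketch uses potential-based shaping, a standard technique in which the term gamma * phi(s') - phi(s) is added to each step's reward. It is shown here as one well-known option, not as the approach used by the source. The function names and the progress-based potential in the usage note are assumptions made for illustration.

```python
from typing import Callable, Sequence


def shaped_rewards(
    states: Sequence,
    env_rewards: Sequence[float],
    potential: Callable[[object], float],
    gamma: float = 0.99,
) -> list:
    """Return r_t + gamma * phi(s_{t+1}) - phi(s_t) for each step.

    `states` must hold one more entry than `env_rewards`, because it
    includes the state reached after the final step.
    """
    if len(states) != len(env_rewards) + 1:
        raise ValueError("expected len(states) == len(env_rewards) + 1")
    return [
        r + gamma * potential(s_next) - potential(s)
        for r, s, s_next in zip(env_rewards, states, states[1:])
    ]


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Sum of gamma**t * r_t, accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

For example, with states counting completed subgoals (`[0, 1, 2, 3]`), a sparse environment reward of `[0.0, 0.0, 1.0]`, and a potential of `s / 3`, every step receives a nonzero signal. Because the shaping terms telescope, the shaped return differs from the original return only by a quantity that depends on the first and last states.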

Relationships

Multi-Turn Reinforcement Learning — requires careful reward timing and sparse signal handling
GUI Agents — rely on both verifiable GUI state rewards and generative evaluation for complex task completion
Data Flywheel — reward quality directly impacts the filtering and selection of training trajectories
Proximal Policy Optimization — reward signals drive the policy gradient updates in PPO-based training
Vision-Language Models — often used as components in generative reward models for visual task evaluation
Interactive Environments — provide the context and state information needed for reward computation
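As an illustration of the Data Flywheel relationship above, the sketch below filters collected trajectories by their reward before they are reused for training. The `Trajectory` fields, the function name, and the 0.5 threshold are hypothetical; the source does not specify a selection rule.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical record of one episode and the reward it received."""
    task_id: str
    steps: list = field(default_factory=list)
    reward: float = 0.0


def select_for_training(trajectories: list, min_reward: float = 0.5) -> list:
    """Keep trajectories scoring at least `min_reward`, highest reward first."""
    kept = [t for t in trajectories if t.reward >= min_reward]
    return sorted(kept, key=lambda t: t.reward, reverse=True)
```

A noisy or miscalibrated reward changes which trajectories pass this filter, which is why reward quality feeds directly into the quality of the next round of training data.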

Sources

sources/ui-tars-2-technical-report — framework implementation in GUI agent training with examples of both verifiable and generative reward approaches