Checklist-Based VLM Verification

Summary: A structured evaluation methodology that uses vision-language models to assess task completion through detailed rubrics with weighted subtasks and partial credit scoring. Enhanced with privileged information extraction from setup scripts, this approach enables systematic verification of complex multi-step processes in computer-use agent environments.

Overview

Checklist-Based VLM Verification is an evaluation framework that breaks down complex tasks into granular, weighted components that can be systematically verified using vision-language models. Rather than relying on simple binary pass/fail metrics, this approach provides detailed rubrics with subtasks that can receive partial credit, enabling more nuanced assessment of agent performance.
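The weighted, partial-credit idea can be sketched as a weighted average. This is a minimal illustration, not the scoring rule of any particular benchmark: the `Subtask` type, the `checklist_score` function, the normalization by total weight, and the example subtasks are all assumptions made for the sake of the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Subtask:
    """One rubric item (hypothetical structure, for illustration only)."""
    name: str
    weight: float  # relative importance; weights need not sum to 1
    credit: float  # fraction completed, expected in [0, 1]


def checklist_score(subtasks: List[Subtask]) -> float:
    """Return the weighted average of per-subtask credit, in [0, 1]."""
    total_weight = sum(s.weight for s in subtasks)
    if total_weight <= 0:
        raise ValueError("rubric must have positive total weight")
    # Clamp each credit so an out-of-range judgment cannot skew the score.
    earned = sum(s.weight * min(max(s.credit, 0.0), 1.0) for s in subtasks)
    return earned / total_weight


# Hypothetical rubric: the agent opened the file, half-finished the chart,
# and never exported the result.
example = [
    Subtask("open the spreadsheet", weight=1.0, credit=1.0),
    Subtask("create the chart", weight=2.0, credit=0.5),
    Subtask("export as PDF", weight=1.0, credit=0.0),
]
score = checklist_score(example)  # (1*1.0 + 2*0.5 + 1*0.0) / 4 = 0.5
```

A binary metric would report this run as a plain failure; the weighted score records that half of the weighted work was completed.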

The methodology is particularly valuable for evaluating Computer-Use Agents where tasks may involve multiple steps, intermediate states, and complex visual outputs that require sophisticated verification beyond simple string matching or pixel comparison. By leveraging VLMs' ability to understand both visual and textual context, the system can assess whether specific requirements have been met even when the exact implementation details vary.

A key enhancement is Privileged Information Verification: ground-truth data extracted from setup scripts gives the verifier context that agents cannot access during task execution. Cross-referencing observable outcomes against these expected internal states makes the evaluation more reliable.
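One way to picture this is a verifier prompt that includes ground-truth facts the agent never sees. The sketch below is an assumption about how such a prompt might be assembled; the function name `build_verifier_prompt`, the prompt wording, and the example fields are hypothetical and are not taken from the source.

```python
import json
from typing import Any, Dict


def build_verifier_prompt(criterion: str, privileged: Dict[str, Any]) -> str:
    """Combine one rubric criterion with ground truth known only to the verifier."""
    # sort_keys keeps the prompt stable across runs for the same facts.
    facts = json.dumps(privileged, indent=2, sort_keys=True)
    return (
        "You are verifying a screenshot of the final application state.\n"
        f"Criterion: {criterion}\n"
        "Ground truth from the setup script (hidden from the agent):\n"
        f"{facts}\n"
        "Reply with a credit between 0 and 1."
    )


# Hypothetical values that a setup script might have seeded into the environment.
privileged = {"customer_name": "Acme Corp", "expected_invoice_total": "1234.56"}
prompt = build_verifier_prompt(
    "The invoice total shown on screen matches the expected total.",
    privileged,
)
```

The agent's own instruction would be built separately and would omit `privileged`, so the ground truth can only influence grading, not task execution.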

Key Details

  • Weighted Subtasks: Tasks are decomposed into multiple components with assigned weights reflecting their relative importance to overall completion
  • Partial Credit Scoring: Agents can receive credit for partially completing complex tasks, providing more informative feedback than binary evaluation
  • VLM-Based Assessment: Uses vision-language models to interpret screenshots and assess completion criteria that may be difficult to verify programmatically
  • Detailed Rubrics: Structured evaluation criteria that specify exactly what constitutes successful completion for each subtask component
  • Privileged Information Integration: Leverages ground-truth data from setup scripts that agents don't see during execution for enhanced verification accuracy
  • Cross-Software Applicability: Can be applied across different software environments within benchmarks such as CUA-World Benchmark
  • Multi-Agent Verification: Works in conjunction with Creation-Audit Loop systems where audit agents verify both environment quality and task completion
  • Performance Insights: Evaluation against comprehensive checklists shows that even frontier models like Gemini-3-Flash achieve only modest success rates
  • Contamination Prevention: Includes systematic filtering to prevent data leakage between training and evaluation sets

The approach addresses limitations of traditional automated testing by handling visual verification tasks that require understanding of UI elements, data relationships, and contextual appropriateness of outputs across diverse software applications.
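The per-criterion verification loop described above might look roughly like the following. The VLM call is deliberately abstracted behind a `judge` callable, because the source does not specify a model API; `evaluate_checklist`, `stub_judge`, and the example criteria are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List

# A judge takes (screenshot bytes, criterion text) and returns a credit.
# In practice this would wrap a vision-language model call.
Judge = Callable[[bytes, str], float]


def evaluate_checklist(
    screenshot: bytes, criteria: List[str], judge: Judge
) -> Dict[str, float]:
    """Ask the judge about each criterion and clamp each credit to [0, 1]."""
    results: Dict[str, float] = {}
    for criterion in criteria:
        raw = judge(screenshot, criterion)
        results[criterion] = min(max(float(raw), 0.0), 1.0)
    return results


def stub_judge(screenshot: bytes, criterion: str) -> float:
    """Stand-in for a VLM: a fixed rule so the sketch runs offline."""
    return 1.0 if "chart" in criterion else 0.3


credits = evaluate_checklist(
    b"",  # placeholder for real screenshot bytes
    ["A bar chart is visible", "The title reads 'Q3 Sales'"],
    stub_judge,
)
```

Keeping the judge pluggable lets the same rubric run against different models, or against a deterministic stub in tests.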

Relationships

Sources

  • sources/arxiv-260406126 — introduced as core evaluation methodology within the Gym-Anything framework, demonstrating integration with privileged information verification and application across diverse software environments in CUA-World benchmark