Test-Time Auditing

Summary: A verification technique in which an independent agent reviews a completed trajectory and reports missing work: incomplete subtasks, missed requirements, and premature completion claims. The resulting corrective feedback loop raised Gemini-3-Flash success on CUA-World-Long tasks from 11.5% to 14.0%.

Overview

Test-Time Auditing is a performance enhancement technique introduced in the Gym-Anything framework for improving Computer-Use Agents. It addresses a failure mode in which an agent declares a task complete without fulfilling all requirements, producing false success claims and incomplete work.

The auditing process employs an independent model that examines the entire trajectory of the primary agent's actions. Because the auditor did not produce the trajectory, it is less prone to the confirmation bias that affects self-review. The audit agent checks whether all task requirements have been met, identifies incomplete work, and detects cases where the primary agent incorrectly claims task completion.

When discrepancies are identified, the audit agent provides targeted feedback to guide the primary agent toward proper task completion, creating a verification loop. In CUA-World evaluations, test-time auditing improved Gemini-3-Flash performance on long-horizon tasks (CUA-World-Long) from 11.5% to 14.0%, a gain of 2.5 percentage points, or about 22% in relative terms.
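
The relative figure follows directly from the two reported success rates. A short check of the arithmetic, where the helper name relative_improvement is chosen here for illustration and is not part of Gym-Anything:

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change of `improved` over `baseline`, as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline


# Success rates in percent, as reported for Gemini-3-Flash.
baseline, audited = 11.5, 14.0

absolute_gain = audited - baseline                      # percentage points
relative_gain = relative_improvement(baseline, audited)  # fraction of baseline

print(f"absolute: {absolute_gain:.1f} points, relative: {relative_gain:.1%}")
# prints "absolute: 2.5 points, relative: 21.7%"
```

The 21.7% result rounds to the 22% quoted; the absolute gain is 2.5 percentage points.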

The technique proves particularly valuable for Long-Horizon Task Planning scenarios involving hundreds of interaction steps, where complex multi-step processes can easily be abandoned before full completion. The independent review process catches these premature terminations and provides corrective guidance to ensure thorough task fulfillment.
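
The loop described in this overview can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Gym-Anything implementation: the types Trajectory and AuditReport, the function run_with_audit, the max_rounds cap, and the toy agent and auditor are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    """Full record of the primary agent's actions and its completion claim."""
    actions: List[str] = field(default_factory=list)
    claims_done: bool = False


@dataclass
class AuditReport:
    """Auditor verdict: the requirements it judges to be unmet."""
    missing: List[str]

    @property
    def passed(self) -> bool:
        return not self.missing


# Hypothetical interfaces: the agent continues a trajectory given feedback,
# and the auditor (a separate model) reviews the whole trajectory.
Agent = Callable[[str, Trajectory, List[str]], Trajectory]
Auditor = Callable[[str, Trajectory], AuditReport]


def run_with_audit(task: str, agent: Agent, auditor: Auditor,
                   max_rounds: int = 3) -> Tuple[Trajectory, AuditReport]:
    """Run the agent, audit the result, and feed gaps back until it passes."""
    trajectory = agent(task, Trajectory(), [])
    report = auditor(task, trajectory)
    rounds = 0
    while not report.passed and rounds < max_rounds:
        # The completion claim was premature: resume with targeted feedback.
        trajectory = agent(task, trajectory, report.missing)
        report = auditor(task, trajectory)
        rounds += 1
    return trajectory, report


# Toy stand-ins so the sketch runs end to end (no real models involved).
REQUIREMENTS = ["open file", "edit cell", "save file"]


def toy_agent(task: str, trajectory: Trajectory, feedback: List[str]) -> Trajectory:
    # First pass stops after one step; later passes do what the audit lists.
    todo = feedback if feedback else REQUIREMENTS[:1]
    return Trajectory(actions=trajectory.actions + list(todo), claims_done=True)


def toy_auditor(task: str, trajectory: Trajectory) -> AuditReport:
    # Checks every action in the trajectory, not just the final claim.
    return AuditReport([r for r in REQUIREMENTS if r not in trajectory.actions])


final_trajectory, final_report = run_with_audit(
    "update spreadsheet", toy_agent, toy_auditor)
print(final_report.passed, final_trajectory.actions)
```

In the toy run the agent claims completion after a single step, the auditor reports the two missing requirements, and the second pass completes them. The cap on rounds is a design choice of this sketch that keeps the loop from running indefinitely when the auditor never passes the work.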

Key Details

Independent Agent Architecture: Uses a model separate from the primary agent, so the review is not shaped by the primary agent's own reasoning and confirmation bias is reduced
Complete Trajectory Analysis: Reviews the full sequence of agent actions rather than evaluating only final states or outcomes
Premature Termination Detection: Specifically designed to identify cases where agents claim completion while requirements remain unfulfilled
Targeted Feedback Mechanism: Provides specific corrective guidance to help primary agents complete unfinished work
Demonstrated Performance Gains: Improved Gemini-3-Flash from 11.5% to 14.0% success rate on CUA-World-Long tasks (22% relative improvement)
Multi-Step Process Verification: Particularly effective for complex tasks requiring 500+ interaction steps where completion verification is challenging
Integration with Creation-Audit Loop: Applies principles similar to the Creation-Audit Loop methodology used in Multi-Agent Environment Creation
Scalable Implementation: Works across diverse software environments and task types without requiring task-specific customization
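
One way to picture how complete trajectory analysis and targeted feedback fit together is a prompt that hands the auditor every step rather than only the final state. The sketch below is an assumption for illustration: the function build_audit_prompt, its parameters, and the prompt wording are hypothetical and do not come from the Gym-Anything framework.

```python
from typing import Sequence


def build_audit_prompt(task: str, steps: Sequence[str], final_message: str) -> str:
    """Assemble a review prompt containing the task and the full trajectory."""
    # Number every step so the auditor can refer to specific actions.
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        "You are auditing another agent's work. You did not perform it.\n\n"
        f"Task:\n{task}\n\n"
        f"Trajectory ({len(steps)} steps):\n{numbered}\n\n"
        f"Agent's final message:\n{final_message}\n\n"
        "List every task requirement that is not yet satisfied and say "
        "what remains to be done. If nothing is missing, reply PASS."
    )


prompt = build_audit_prompt(
    task="Export the report as PDF and email it to the team.",
    steps=["open report", "click Export", "choose PDF"],
    final_message="Task complete.",
)
print(prompt)
```

Here the trajectory shows no email step even though the agent reports completion, which is the kind of gap the auditor is meant to surface.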

Relationships

Computer-Use Agents — primary beneficiary of audit-based performance improvements and false completion detection
Multi-Agent Environment Creation — employs similar creation-audit loop principles where separate agents build and verify environments
Gym-Anything — framework where test-time auditing was developed, validated, and integrated as a core performance enhancement
CUA-World — benchmark used to demonstrate auditing effectiveness with measurable performance improvements
CUA-World-Long — specific benchmark subset (500+ step tasks) where auditing showed 22% relative improvement
Long-Horizon Task Planning — task category that particularly benefits from auditing verification due to complexity and multi-step requirements
Privileged Information Verification — complementary evaluation approach using ground-truth data embedded in setup scripts for assessment
Checklist-Based VLM Verification — related structured evaluation technique using vision-language models with detailed assessment rubrics
Trajectory Distillation — benefits from improved trajectory quality through audit-verified completions for training smaller models
Creation-Audit Loop — foundational methodology that inspired the test-time auditing approach for trajectory verification

Sources

sources/arxiv-260406126 — introduced test-time auditing as a performance improvement technique in the Gym-Anything framework; demonstrated a 22% relative improvement on CUA-World-Long tasks