Test-Time Auditing
Summary: A verification technique in which an independent agent reviews a primary agent's completed trajectory and reports missing work: incomplete tasks, missed requirements, and premature completion claims. This multi-agent approach creates a corrective feedback loop that raised task completion rates on long-horizon computer-use tasks in the reported evaluation.
Overview
Test-Time Auditing is a performance enhancement technique introduced in the Gym-Anything framework for improving Computer-Use Agents. It addresses a fundamental failure mode: agents declare a task complete before fulfilling all of its requirements, which produces false success claims and leaves work unfinished.
The auditing process uses an independent model that examines the entire trajectory of the primary agent's actions. Keeping the auditor separate from the primary agent is meant to make the evaluation objective and to limit confirmation bias. The audit agent checks whether all task requirements have been met, identifies incomplete work, and detects cases where the primary agent incorrectly claims task completion.
When discrepancies are identified, the audit agent provides targeted feedback to guide the primary agent toward proper task completion, closing a verification loop between the two agents. In CUA-World evaluations, test-time auditing improved Gemini-3-Flash from 11.5% to 14.0% on CUA-World-Long tasks, a gain of 2.5 percentage points (about 22% relative).
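The relative figure follows directly from the two reported success rates. A minimal sketch of the arithmetic, where the function name `relative_improvement` is illustrative and not part of Gym-Anything:

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a fraction of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Success rates in percent, taken from the reported Gemini-3-Flash results.
baseline, audited = 11.5, 14.0

absolute_gain = audited - baseline                       # percentage points
relative_gain = relative_improvement(baseline, audited)  # fraction of baseline

print(f"absolute: {absolute_gain:.1f} points, relative: {relative_gain:.1%}")
```

This prints `absolute: 2.5 points, relative: 21.7%`, which rounds to the 22% relative improvement quoted for this result.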
The technique is particularly valuable for Long-Horizon Task Planning scenarios involving hundreds of interaction steps, such as the 500+ step tasks in CUA-World-Long, where a complex multi-step process can easily be abandoned before it is finished. The independent review catches these premature terminations and returns corrective guidance so the primary agent can finish the remaining work.
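The loop described in this overview can be sketched roughly as follows. This is an illustration under stated assumptions, not the Gym-Anything implementation: the names `Trajectory`, `AuditReport`, `run_with_audit`, and `max_rounds` are hypothetical, the number of audit rounds is an assumption, and the toy agent and auditor stand in for real models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Trajectory:
    actions: List[str] = field(default_factory=list)
    claimed_done: bool = False


@dataclass
class AuditReport:
    complete: bool
    missing: List[str] = field(default_factory=list)


# Hypothetical interfaces: the agent continues a trajectory, optionally using
# audit feedback; the auditor reviews the whole trajectory against the task.
Agent = Callable[[str, Trajectory, Optional[str]], Trajectory]
Auditor = Callable[[str, Trajectory], AuditReport]


def run_with_audit(task: str, agent: Agent, auditor: Auditor,
                   max_rounds: int = 3) -> Tuple[Trajectory, AuditReport]:
    """Alternate agent runs and independent audits until the audit passes."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    trajectory, feedback = Trajectory(), None
    for _ in range(max_rounds):
        trajectory = agent(task, trajectory, feedback)
        report = auditor(task, trajectory)  # reviews the full trajectory
        if report.complete:
            break
        feedback = "Still missing: " + "; ".join(report.missing)
    return trajectory, report


# Toy stand-ins (assumptions for the demo, not real models).
REQUIREMENTS = ["open file", "edit cell", "save file"]


def toy_agent(task: str, trajectory: Trajectory,
              feedback: Optional[str]) -> Trajectory:
    # First pass stops early; later passes do whatever the feedback names.
    if feedback is None:
        new_actions = ["open file", "edit cell"]
    else:
        new_actions = [r for r in REQUIREMENTS if r in feedback]
    return Trajectory(trajectory.actions + new_actions, claimed_done=True)


def toy_auditor(task: str, trajectory: Trajectory) -> AuditReport:
    missing = [r for r in REQUIREMENTS if r not in trajectory.actions]
    return AuditReport(complete=not missing, missing=missing)


final, report = run_with_audit("update spreadsheet", toy_agent, toy_auditor)
print(final.actions, report.complete)
```

In this toy run the agent claims completion after two actions, the auditor reports the missing save step, and the second round finishes the task. Capping the rounds is a design choice made here so the loop always terminates.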
Key Details
- Independent Agent Architecture: Uses a completely separate model from the primary agent to limit confirmation bias and keep the evaluation objective
- Complete Trajectory Analysis: Reviews the full sequence of agent actions rather than evaluating only final states or outcomes
- Premature Termination Detection: Specifically designed to identify cases where agents claim completion while requirements remain unfulfilled
- Targeted Feedback Mechanism: Provides specific corrective guidance to help primary agents complete unfinished work
- Demonstrated Performance Gains: Improved Gemini-3-Flash from 11.5% to 14.0% success rate on CUA-World-Long tasks (22% relative improvement)
- Multi-Step Process Verification: Particularly effective for complex tasks requiring 500+ interaction steps where completion verification is challenging
- Integration with Creation-Audit Loop: Applies principles similar to those of the Creation-Audit Loop methodology used in Multi-Agent Environment Creation
- Scalable Implementation: Works across diverse software environments and task types without requiring task-specific customization
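Two of the points above, complete trajectory analysis and premature termination detection, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the prompt wording, and the hard-coded `unmet` list, which stands in for the output of an auditor model.

```python
from typing import List, Sequence


def build_audit_prompt(task: str, steps: Sequence[str]) -> str:
    """Serialize the task and every recorded step, not only the final state."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        f"Task: {task}\n"
        f"Trajectory ({len(steps)} steps):\n"
        f"{numbered}\n"
        "List every task requirement the trajectory leaves unfulfilled."
    )


def is_premature_completion(claimed_done: bool, unmet: List[str]) -> bool:
    """True when the agent claims success while requirements remain open."""
    return claimed_done and bool(unmet)


def format_feedback(unmet: List[str]) -> str:
    """Turn the auditor's findings into targeted guidance for the primary agent."""
    if not unmet:
        return "No missing work found."
    return "Unfinished requirements:\n" + "\n".join(f"- {item}" for item in unmet)


steps = ["open report.xlsx", "fill column B", "declare task complete"]
prompt = build_audit_prompt("Fill column B and export as PDF", steps)

unmet = ["export as PDF"]  # what a hypothetical auditor model might return
if is_premature_completion(claimed_done=True, unmet=unmet):
    print(format_feedback(unmet))
```

The prompt includes the whole step sequence so the auditor can see where the work stopped, and the feedback names only the unfinished requirements rather than restating the full task.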
Relationships
- Computer-Use Agents — primary beneficiary of audit-based performance improvements and false completion detection
- Multi-Agent Environment Creation — employs similar creation-audit loop principles where separate agents build and verify environments
- Gym-Anything — framework where test-time auditing was developed, validated, and integrated as a core performance enhancement
- CUA-World — benchmark used to demonstrate auditing effectiveness with measurable performance improvements
- CUA-World-Long — specific benchmark subset (500+ step tasks) where auditing showed 22% relative improvement
- Long-Horizon Task Planning — task category that particularly benefits from auditing verification due to complexity and multi-step requirements
- Privileged Information Verification — complementary evaluation approach using ground-truth data embedded in setup scripts for assessment
- Checklist-Based VLM Verification — related structured evaluation technique using vision-language models with detailed assessment rubrics
- Trajectory Distillation — benefits from improved trajectory quality through audit-verified completions for training smaller models
- Creation-Audit Loop — foundational methodology that inspired the test-time auditing approach for trajectory verification
Sources
- sources/arxiv-260406126 — introduced test-time auditing as performance improvement technique in Gym-Anything framework, demonstrated 22% relative improvement on CUA-World-Long tasks