source: "raw/articles/cua-suite-massive-human-annotated-video-demonstrations-for-computer-use-agents.md"
Summary: CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents
TL;DR: CUA-Suite introduces the largest open corpus of expert video demonstrations for desktop computer use: roughly 55 hours of continuous 30fps recordings covering 10,000 tasks and 87 professional applications, with dense annotations for training and evaluating computer-use agents.
Key Points
- Scale: VideoCUA provides ~55 hours of continuous 30fps expert video demonstrations (about 6 million frames) across 10,000 tasks and 87 applications, 2.5x larger than existing datasets
- Quality: Human-curated trajectories with multi-layered reasoning annotations averaging 497 words per step, including observations, thought chains, action descriptions, and reflections
- Coverage: Focuses on professional desktop applications across 12 categories (development tools, creative software, productivity suites, etc.) where current models struggle most
- Evaluation Results: Current foundation action models achieve only 37.7% accuracy at a 50-pixel threshold and 57.6% human-verified stepwise accuracy on desktop tasks
- Three-Component Ecosystem:
  - VideoCUA: Continuous video demonstrations with dense annotations
  - GroundCUA: 56K annotated screenshots with 3.6M UI element annotations for grounding
  - UI-Vision: 450-task benchmark testing grounding and planning capabilities
- Key Bottleneck: Spatial grounding is identified as the primary limitation: models struggle with complex multi-panel desktop interfaces despite recent progress (top models reach 47.7% on UI-Vision versus a previous 25.5%)
- Format Compatibility: Video data can be losslessly converted to formats used by existing frameworks (OpenCUA, ScaleCUA) while preserving temporal dynamics
- Open Source: All data, benchmarks, and models released publicly to accelerate research
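The 50-pixel threshold in the Evaluation Results point can be read as follows: a predicted click counts as correct when it lands within 50 pixels of the ground-truth location. The sketch below is a minimal illustration under that reading. It assumes Euclidean distance over (x, y) pixel coordinates; the function name `accuracy_at_threshold` and the sample coordinates are hypothetical, and the paper's exact metric definition may differ.

```python
import math

def accuracy_at_threshold(predicted, target, threshold_px=50.0):
    """Fraction of predicted points within threshold_px of their targets.

    Assumption: distance is Euclidean in screen pixels. This is an
    illustrative reading of the metric, not the paper's reference code.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not predicted:
        return 0.0
    hits = sum(
        1
        for (px, py), (tx, ty) in zip(predicted, target)
        if math.hypot(px - tx, py - ty) <= threshold_px
    )
    return hits / len(predicted)

# Made-up coordinates for illustration only (not data from the paper).
preds = [(100, 100), (400, 300), (10, 10), (800, 600)]
truth = [(120, 110), (400, 380), (40, 40), (805, 598)]
print(accuracy_at_threshold(preds, truth))  # 0.75
```

In this toy example only the second prediction misses (it is 80 pixels from its target), so three of four predictions count as correct.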
Concepts Covered
- Computer-Use Agents — comprehensive training and evaluation framework for desktop automation agents
- Video-Based Agent Training — continuous 30fps recordings vs sparse screenshot approaches for preserving temporal dynamics
- UI Element Grounding — pixel-precise localization of interface elements, identified as primary bottleneck
- Multi-layered Reasoning Annotations — rich supervisory signal with observations, thoughts, actions, and reflections
- Professional Desktop Applications — focus on complex software (IDEs, creative tools, CAD) where agents struggle most
- Foundation Action Models — evaluation of OpenCUA and similar models on desktop tasks
- Visual World Models — enabling action-conditioned video generation for lookahead planning
- Continuous Spatial Control — learning human-like cursor movements vs discrete coordinate prediction
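To make the contrast between continuous video and sparse screenshots concrete, the sketch below checks the frame-count arithmetic and shows one way sparse keyframes could be picked out of a 30fps recording at the moments actions occur. It is only an illustration: the helper names and the action timestamps are hypothetical, and it does not reproduce the conversion pipeline the authors use for OpenCUA or ScaleCUA formats.

```python
import math

FPS = 30  # frame rate stated in the summary

def total_frames(hours, fps=FPS):
    """Number of frames in a recording of the given length."""
    return round(hours * 3600 * fps)

def frame_index(timestamp_s, fps=FPS):
    """Index of the frame on screen at timestamp_s seconds from the start."""
    if timestamp_s < 0:
        raise ValueError("timestamp must be non-negative")
    return math.floor(timestamp_s * fps)

def keyframes_for_actions(action_times_s, fps=FPS):
    """Map action timestamps to frame indices (a sparse-screenshot view)."""
    return [frame_index(t, fps) for t in action_times_s]

# 55 hours at 30fps is 5.94 million frames, consistent with "~6 million".
print(total_frames(55))  # 5940000

# Hypothetical action timestamps in seconds.
print(keyframes_for_actions([0.0, 1.5, 4.0]))  # [0, 45, 120]
```

A sparse-screenshot dataset keeps only frames like these three, while the continuous recording also retains everything in between, such as cursor motion and intermediate UI states.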
Figures and Images
- Figure 1: CUA-Suite overview showing data collection pipeline from video recording to three-component ecosystem
- Figure 2: Representative prediction failures showing cross-panel confusion in professional applications (Krita, FreeCAD, Inkscape, OBS Studio)
- Table 1: Element grounding performance on UI-Vision benchmark across multiple models
- Table 2: Comprehensive comparison of VideoCUA with existing GUI trajectory datasets
- Table 3: Action prediction results for OpenCUA models on VideoCUA tasks
- Appendix D: Complete trajectory examples showing detailed annotations for Krita and GIMP tasks