source: "raw/articles/cua-suite-massive-human-annotated-video-demonstrations-for-computer-use-agents.md"

Summary: CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

TL;DR: CUA-Suite introduces the largest open expert video corpus for desktop computer use: roughly 55 hours of continuous 30fps recordings covering 10,000 tasks across 87 professional applications, with dense annotations for training and evaluating computer-use agents.

Key Points

Scale: VideoCUA provides ~55 hours of continuous 30fps expert video demonstrations (6 million frames) across 10,000 tasks and 87 applications - 2.5x larger than existing datasets
Quality: Human-curated trajectories with multi-layered reasoning annotations averaging 497 words per step, including observations, thought chains, action descriptions, and reflections
Coverage: Focuses on professional desktop applications across 12 categories (development tools, creative software, productivity suites, etc.) where current models struggle most
Evaluation Results: Current foundation action models achieve only 37.7% accuracy at a 50-pixel threshold and 57.6% human-verified stepwise accuracy on desktop tasks
Three-Component Ecosystem:
- VideoCUA: Continuous video demonstrations with dense annotations
- GroundCUA: 56K annotated screenshots with 3.6M UI element annotations for grounding
- UI-Vision: 450-task benchmark testing grounding and planning capabilities
Key Bottleneck: Spatial grounding is identified as the primary limitation. Models struggle with complex multi-panel desktop interfaces despite recent progress (top models reach 47.7% on UI-Vision, up from a previous 25.5%)
Format Compatibility: Video data can be losslessly converted to formats used by existing frameworks (OpenCUA, ScaleCUA) while preserving temporal dynamics
Open Source: All data, benchmarks, and models released publicly to accelerate research
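
The "accuracy at a 50-pixel threshold" figure in the evaluation results can be illustrated with a minimal sketch. This summary does not say how the distance is measured, so the code assumes Euclidean distance between a predicted click point and a single ground-truth point; the function names and the sample coordinates are hypothetical and not taken from the paper.

```python
import math


def within_threshold(pred, target, threshold_px=50.0):
    """True if the predicted (x, y) lies within threshold_px of the target.

    Assumption: Euclidean pixel distance to one ground-truth point.
    """
    return math.dist(pred, target) <= threshold_px


def threshold_accuracy(preds, targets, threshold_px=50.0):
    """Fraction of predictions that land within threshold_px of their target."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(
        within_threshold(p, t, threshold_px) for p, t in zip(preds, targets)
    )
    return hits / len(preds)


# Hypothetical predictions and ground-truth click points, in screen pixels.
example_preds = [(100, 100), (200, 200), (0, 0)]
example_targets = [(130, 140), (200, 260), (0, 0)]
example_accuracy = threshold_accuracy(example_preds, example_targets)
```

In the sample data the first prediction is exactly 50 pixels off and counts as a hit, the second is 60 pixels off and counts as a miss, and the third is exact, so two of three predictions pass.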

Concepts Covered

Computer-Use Agents — comprehensive training and evaluation framework for desktop automation agents
Video-Based Agent Training — continuous 30fps recordings vs sparse screenshot approaches for preserving temporal dynamics
UI Element Grounding — pixel-precise localization of interface elements, identified as primary bottleneck
Multi-layered Reasoning Annotations — rich supervisory signal with observations, thoughts, actions, and reflections
Professional Desktop Applications — focus on complex software (IDEs, creative tools, CAD) where agents struggle most
Foundation Action Models — evaluation of OpenCUA and similar models on desktop tasks
Visual World Models — enabling action-conditioned video generation for lookahead planning
Continuous Spatial Control — learning human-like cursor movements vs discrete coordinate prediction
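
To make the annotation layers and the video-versus-screenshot distinction concrete, here is a small sketch of how one annotated step might be represented and how a sparse screenshot view could be derived from a continuous 30fps recording. The class name, field names, and timestamp-based layout are assumptions for illustration only; they do not describe the actual CUA-Suite schema or the OpenCUA and ScaleCUA formats.

```python
from dataclasses import dataclass

FPS = 30  # frame rate stated in the summary


@dataclass
class AnnotatedStep:
    """One step of a trajectory (hypothetical layout, not the real schema)."""
    timestamp_s: float  # when the action occurs in the recording (assumed field)
    observation: str
    thought: str
    action: str
    reflection: str


def frame_index(timestamp_s, fps=FPS):
    """Map a timestamp in seconds to the nearest frame index."""
    return int(round(timestamp_s * fps))


def keyframes(steps, fps=FPS):
    """Frame indices at which each step occurs: a sparse view of the video."""
    return [frame_index(s.timestamp_s, fps) for s in steps]


# Sanity check on the reported scale: 55 hours at 30fps.
total_frames = 55 * 3600 * FPS  # 5,940,000, i.e. roughly 6 million frames

steps = [
    AnnotatedStep(0.0, "Canvas is empty", "Need the brush tool",
                  "click brush icon", "Brush tool is now selected"),
    AnnotatedStep(2.5, "Brush tool active", "Draw a stroke",
                  "drag across canvas", "Stroke appears on canvas"),
]
step_frames = keyframes(steps)
```

Keeping the full recording and deriving keyframes on demand is one way a screenshot-style view can coexist with the continuous video, since the frames between steps remain available for tasks that depend on temporal dynamics.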

Figures and Images

Figure 1: CUA-Suite overview showing data collection pipeline from video recording to three-component ecosystem
Figure 2: Representative prediction failures showing cross-panel confusion in professional applications (Krita, FreeCAD, Inkscape, OBS Studio)
Table 1: Element grounding performance on UI-Vision benchmark across multiple models
Table 2: Comprehensive comparison of VideoCUA with existing GUI trajectory datasets
Table 3: Action prediction results for OpenCUA models on VideoCUA tasks
Appendix D: Complete trajectory examples showing detailed annotations for Krita and GIMP tasks
