source: "raw/articles/agentsynth-scalable-task-generation-for-generalist-computer-use-agents.md"
Summary: AgentSynth - Scalable Task Generation for Generalist Computer-Use Agents
TL;DR: AgentSynth is an automated pipeline that exploits information asymmetry to generate challenging multi-step computer tasks. It chains simple subtasks into long-horizon tasks, producing over 6,000 diverse tasks at $0.60 per trajectory.
Key Points
- Core Innovation: Exploits information asymmetry - tasks are easy to generate step-by-step but hard to solve all at once
- Pipeline Architecture: Uses 6 LLM-based agents (task proposer, executor, verifier, reviser, follow-up proposer, summarizer)
- Scalable Generation: Produces complex long-horizon tasks by iteratively chaining simple, solvable subtasks
- Cost Efficiency: Achieves $0.60 per trajectory vs. $4-425 for human-annotated datasets
- Difficulty Control: Fine-grained complexity control by varying number of summarized subtasks (levels 1-6)
- Performance Results: SOTA agents' success rate drops from 18% at level 1 to 4% at level 6, indicating that the benchmark remains hard for current agents
- Quality Metrics: 88-94% human evaluation scores across feasibility, coherence, persona relevance, and verifier accuracy
- Environment: Built on OSWorld desktop environment with 1920×1080 screenshots and pyautogui actions
- Task Diversity: Spans multiple software applications (60%+ use 2+ apps), with realistic multi-step workflows
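The chaining and difficulty-control ideas above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `synthesize_tasks`, the callable signatures, and the exact control flow (revise on a failed verification, then continue) are assumptions made for this example, and the six agents are replaced by trivial stand-in functions rather than LLM calls.

```python
from typing import Callable, Dict, List


def synthesize_tasks(
    persona: str,
    propose: Callable[[str], str],
    execute: Callable[[str], List[str]],
    verify: Callable[[str, List[str]], bool],
    revise: Callable[[str, List[str]], str],
    follow_up: Callable[[str, List[str]], str],
    summarize: Callable[[List[str]], str],
    max_level: int = 6,
) -> Dict[int, str]:
    """Chain simple subtasks, then summarize the first n into a level-n task."""
    subtasks: List[str] = []
    task = propose(persona)  # first simple subtask, conditioned on the persona
    while True:
        actions = execute(task)  # hypothetical: returns the action trace
        if not verify(task, actions):
            # Assumed behavior: rewrite the description to match what happened.
            task = revise(task, actions)
        subtasks.append(task)
        if len(subtasks) == max_level:
            break
        task = follow_up(persona, subtasks)  # next subtask builds on prior ones
    # Difficulty level n is the summary of the first n chained subtasks.
    return {n: summarize(subtasks[:n]) for n in range(1, max_level + 1)}


# Stand-in agents (all hypothetical) so the sketch runs without any model.
propose = lambda persona: "subtask 1"
execute = lambda task: [f"click:{task}"]
verify = lambda task, actions: True
revise = lambda task, actions: task + " (revised)"
follow_up = lambda persona, done: f"subtask {len(done) + 1}"
summarize = lambda parts: " then ".join(parts)

tasks = synthesize_tasks(
    "student", propose, execute, verify, revise, follow_up, summarize, max_level=3
)
for level, description in tasks.items():
    print(level, description)
```

Each subtask is easy to propose and execute in isolation, while the level-3 summary asks a solver to complete all three steps from a single instruction. That gap is the information asymmetry the pipeline relies on.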
Concepts Covered
- Synthetic Data Generation — automated creation of training/evaluation datasets using LLMs
- Information Asymmetry — core principle where forward generation is easier than reverse inference
- Multi-Modal Agents — agents that process visual (screenshots) and text inputs for computer control
- Task Decomposition — breaking complex tasks into simple, executable subtasks
- Long-Horizon Planning — tasks requiring extended action sequences with memory and context
- Computer-Use Agents — AI systems that interact with desktop environments via mouse/keyboard
- Agent Evaluation — automated verification of task completion using LLM-based judges
- Benchmark Construction — systematic creation of evaluation datasets with controllable difficulty
- Human-Computer Interaction — realistic simulation of user workflows across software applications
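To make the link between agent evaluation and controllable difficulty concrete, here is a small sketch of how binary verifier verdicts might be aggregated into a success rate per difficulty level. The function name `success_rate_by_level` and the `(level, passed)` record format are assumptions for illustration, and the sample records are made-up toy data, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rate_by_level(results: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Aggregate (difficulty_level, verifier_passed) records into success rates."""
    totals: Dict[int, int] = defaultdict(int)
    passes: Dict[int, int] = defaultdict(int)
    for level, passed in results:
        totals[level] += 1
        passes[level] += int(passed)
    return {level: passes[level] / totals[level] for level in sorted(totals)}


# Toy verdicts (hypothetical): two level-1 tasks and two level-6 tasks.
toy_results = [(1, True), (1, False), (6, False), (6, False)]
rates = success_rate_by_level(toy_results)
print(rates)
```

Grouping by level rather than reporting a single aggregate number is what lets a benchmark of this kind show how performance degrades as more subtasks are chained.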
Figures and Images
- Figure 1: Complete AgentSynth pipeline diagram showing 6-agent workflow with persona input
- Figure 2: Verifier calibration charts showing binary agreement and completion score correlation
- Figure 3: Dataset statistics showing task complexity scaling and software distribution
- Figure 4: Model performance results across difficulty levels for multiple SOTA agents
- Figure 5: Comparison of bare LLMs vs. Agent S3 scaffolding performance