source: "raw/articles/agentsynth-scalable-task-generation-for-generalist-computer-use-agents.md"

Summary: AgentSynth - Scalable Task Generation for Generalist Computer-Use Agents

TL;DR: AgentSynth introduces an automated pipeline that exploits information asymmetry to generate challenging multi-step computer tasks by chaining simple subtasks together, creating over 6,000 diverse tasks at $0.6 per trajectory.
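
The chaining loop described in the TL;DR can be sketched roughly as follows. This is a minimal illustration, not AgentSynth's code. Each agent role is assumed to be a plain Python callable (in the paper these roles are LLM-based agents), and the names generate_chain, propose, execute, verify, revise and follow_up are hypothetical. The sketch also assumes that when verification fails, the reviser relabels the task to match what the executor actually did, so the work is kept rather than discarded.

```python
from typing import Callable, List


def generate_chain(
    persona: str,
    propose: Callable[[str], str],                  # task proposer: persona -> first subtask
    execute: Callable[[str], List[str]],            # executor: subtask -> action trajectory
    verify: Callable[[str, List[str]], bool],       # verifier: did the trajectory complete the subtask?
    revise: Callable[[str, List[str]], str],        # reviser: rewrite subtask to match the trajectory
    follow_up: Callable[[str, List[str]], str],     # follow-up proposer: next subtask given the chain so far
    num_subtasks: int = 6,
) -> List[str]:
    """Return a chain of subtasks, each executed and checked before the next is proposed (hypothetical sketch)."""
    if num_subtasks < 1:
        raise ValueError("num_subtasks must be at least 1")
    chain: List[str] = []
    task = propose(persona)
    while len(chain) < num_subtasks:
        trajectory = execute(task)
        if not verify(task, trajectory):
            # Assumption: relabel the subtask to what was actually accomplished.
            task = revise(task, trajectory)
        chain.append(task)
        if len(chain) < num_subtasks:
            task = follow_up(persona, chain)
    return chain


if __name__ == "__main__":
    # Toy stubs standing in for LLM calls.
    print(generate_chain(
        persona="data analyst",
        propose=lambda p: f"open a spreadsheet as a {p}",
        execute=lambda t: [f"do: {t}"],
        verify=lambda t, traj: "chart" not in t,
        revise=lambda t, traj: t + " (revised)",
        follow_up=lambda p, ch: "make a chart" if len(ch) == 1 else f"step {len(ch) + 1}",
        num_subtasks=3,
    ))
```

Each subtask is solvable on its own because it is proposed only after the previous one has been executed and checked. This step-by-step construction is what makes generation easier than solving the combined task in one pass.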

Key Points

  • Core Innovation: Exploits information asymmetry; tasks are easy to generate step by step but hard to solve all at once
  • Pipeline Architecture: Uses 6 LLM-based agents (task proposer, executor, verifier, reviser, follow-up proposer, summarizer)
  • Scalable Generation: Produces complex long-horizon tasks by iteratively chaining simple, solvable subtasks
  • Cost Efficiency: Achieves $0.6 per trajectory vs. $4 to $425 for human-annotated datasets
  • Difficulty Control: Fine-grained control over task complexity by varying the number of subtasks summarized into the final task (levels 1 to 6)
  • Performance Results: SOTA agents' success rate drops from 18% at level 1 to 4% at level 6, showing that difficulty rises with level
  • Quality Metrics: 88-94% human evaluation scores across feasibility, coherence, persona relevance, and verifier accuracy
  • Environment: Built on OSWorld desktop environment with 1920×1080 screenshots and pyautogui actions
  • Task Diversity: Spans multiple software applications (over 60% of tasks use two or more apps) with realistic multi-step workflows
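
The difficulty-control idea, where one chain of verified subtasks yields tasks at levels 1 to 6, can be illustrated with a short sketch. It is hedged: build_levels and join_summary are hypothetical names. join_summary simply concatenates subtasks, whereas AgentSynth's summarizer agent writes a single natural-language task with an LLM. Here level k is assumed to be the task that summarizes the first k subtasks of the chain.

```python
from typing import Callable, Dict, List


def build_levels(
    chain: List[str],
    summarize: Callable[[List[str]], str],
    max_level: int = 6,
) -> Dict[int, str]:
    """Map difficulty level k to one task summarizing the first k subtasks (hypothetical sketch)."""
    if not chain:
        raise ValueError("chain must contain at least one verified subtask")
    top = min(max_level, len(chain))
    return {k: summarize(chain[:k]) for k in range(1, top + 1)}


def join_summary(subtasks: List[str]) -> str:
    # Placeholder summarizer: joins subtask descriptions instead of calling an LLM.
    return "; then ".join(subtasks)


if __name__ == "__main__":
    chain = ["open the sales sheet", "add a total column", "email it to the team"]
    for level, task in build_levels(chain, join_summary).items():
        print(level, task)
```

Under this sketch, only the level-k task would be shown to the agent being evaluated. The intermediate subtasks stay with the generator, which is the information asymmetry the pipeline relies on.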

Figures and Images

  • Figure 1: Complete AgentSynth pipeline diagram showing 6-agent workflow with persona input
  • Figure 2: Verifier calibration charts showing binary agreement and completion score correlation
  • Figure 3: Dataset statistics showing task complexity scaling and software distribution
  • Figure 4: Model performance results across difficulty levels for multiple SOTA agents
  • Figure 5: Comparison of bare LLMs vs. Agent S3 scaffolding performance

Related Concepts