source: "raw/articles/insta-towards-internet-scale-training-for-agents.md"

Summary: InSTA: Towards Internet-Scale Training For Agents

TL;DR: InSTA is a pipeline that uses LLMs to annotate 150k websites with tasks and to train web navigation agents without human supervision. The resulting agents are competitive with frontier models while being up to 235x smaller.

Key Points

  • Developed InSTA pipeline with 3 stages: LLM task generation, agent trajectory collection, and LLM-based filtering
  • Annotated 150,000 websites (from top 1M sites) with tasks, 750x more than existing datasets (~200 sites)
  • Safety filter achieves 97% accuracy in detecting harmful content with 98-100% recall
  • Generated 2.2M screenshots, 2.2M action traces, and 150k judge evaluations
  • LLM judge evaluates trajectory success with 82.6% accuracy
  • Trained Qwen 3 1.7B model achieves 56.9% success rate, outperforming Qwen 3 235B and Llama 4 Maverick (400B params)
  • Models zero-shot transfer to WebVoyager benchmark without using any WebVoyager training data
  • Performance scales with test-time compute (reasoning budget)
  • Reaches 94.7% of the performance of Gemini 2.5 Flash with the 1.7B model; the 235x size gap is relative to Llama 4 Maverick (400B vs. 1.7B params)
  • Data collection cost: $521.55 using 1,200 V100 GPU hours
  • Released complete pipeline, models, and data at data-for-agents.github.io
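
The three stages listed above (task generation, trajectory collection, judge-based filtering) can be sketched as a simple loop. This is only an illustrative outline, not the released InSTA code: the names `run_pipeline`, `Trajectory`, the `toy_*` stand-ins, and the 0.5 threshold are all assumptions made for the example, and the real stages would call LLMs and a browser rather than toy functions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Trajectory:
    site: str
    task: str
    actions: list[str]
    judge_score: float  # judge's estimated probability that the task succeeded


def run_pipeline(
    sites: Iterable[str],
    propose_task: Callable[[str], Optional[str]],
    run_agent: Callable[[str, str], list[str]],
    judge: Callable[[str, str, list[str]], float],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> list[Trajectory]:
    """Keep only trajectories that the judge scores at or above the threshold."""
    kept = []
    for site in sites:
        # Stage 1: task generation; None means the site was filtered out.
        task = propose_task(site)
        if task is None:
            continue
        # Stage 2: the agent attempts the task and records its actions.
        actions = run_agent(site, task)
        # Stage 3: the judge scores the attempt; low scores are discarded.
        score = judge(site, task, actions)
        if score >= threshold:
            kept.append(Trajectory(site, task, actions, score))
    return kept


# Toy stand-ins so the sketch runs without any model or browser.
def toy_propose_task(site: str) -> Optional[str]:
    return None if "unsafe" in site else f"find the contact page on {site}"


def toy_run_agent(site: str, task: str) -> list[str]:
    return ["goto", "click"]


def toy_judge(site: str, task: str, actions: list[str]) -> float:
    return 0.9 if site.endswith(".org") else 0.2


demo = run_pipeline(
    ["a.org", "unsafe.example", "b.com"],
    toy_propose_task,
    toy_run_agent,
    toy_judge,
)
```

With these stand-ins, `unsafe.example` is dropped at stage 1 and `b.com` is dropped by the judge, so only the `a.org` trajectory remains as training data.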

Concepts Covered

  • Language Model Agents — Web navigation agents using LLMs with Playwright API for browser automation
  • Task Generation — Automated pipeline for creating challenging web tasks using LLM feedback loops
  • Safety Filtering — LLM-based content moderation achieving 97% accuracy in detecting harmful websites
  • LLM Judges — Language models evaluating agent performance with 82.6% accuracy on trajectory success
  • Supervised Fine-tuning — Training smaller models on LLM-generated data to match frontier performance
  • Zero-shot Transfer — Models trained on InSTA data transferring to unseen benchmarks without additional training
  • Test-time Compute Scaling — Performance improvements with increased reasoning token budgets
  • Internet-scale Data — Leveraging 300M+ websites from Common Crawl for agent training
  • Multimodal Agents — Agents processing both HTML/text and screenshot data
  • Function Calling — Agents generating JSON function calls for browser interactions
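
As a rough illustration of the function-calling idea above, the sketch below parses a JSON function call and dispatches it to a page object. The JSON layout (`name` / `arguments`), the allow-list, and the `RecordingPage` stand-in are assumptions made for this example, not the format used by InSTA. The method names `goto`, `click`, and `fill` mirror methods on Playwright's `Page`, but a recording stand-in is used here so the example runs without a browser.

```python
import json

# Hypothetical allow-list of actions the agent may request.
ALLOWED_ACTIONS = {"goto", "click", "fill"}


def parse_action(raw: str) -> tuple[str, dict]:
    """Parse a JSON function call and validate it against the allow-list."""
    call = json.loads(raw)
    name = call["name"]
    args = call.get("arguments", {})
    if name not in ALLOWED_ACTIONS:
        raise ValueError(f"action not allowed: {name}")
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    return name, args


class RecordingPage:
    """Stand-in for a browser page that records calls instead of executing them."""

    def __init__(self) -> None:
        self.log: list[tuple] = []

    def goto(self, url: str) -> None:
        self.log.append(("goto", url))

    def click(self, selector: str) -> None:
        self.log.append(("click", selector))

    def fill(self, selector: str, value: str) -> None:
        self.log.append(("fill", selector, value))


def execute(page, raw: str) -> None:
    """Validate the model's output, then call the matching page method."""
    name, args = parse_action(raw)
    getattr(page, name)(**args)


page = RecordingPage()
execute(page, '{"name": "goto", "arguments": {"url": "https://example.com"}}')
execute(page, '{"name": "fill", "arguments": {"selector": "#q", "value": "insta"}}')
```

Validating the action name before dispatch keeps malformed or unexpected model output from reaching the browser layer.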

Figures and Images

  • Figure 1: Overview of InSTA pipeline showing progression from 1M sites to 150k annotated tasks
  • Figure 2: Task generation process filtering 85% of websites for safety
  • Figure 3: Word cloud of most frequent task terms showing diversity
  • Figure 4: Automatic evaluation pipeline with LLM judges
  • Figure 5: Judge accuracy analysis by PageRank and confidence levels
  • Figure 6: Dataset statistics showing 2.2M traces with 50% success rate
  • Figure 7: Performance scaling with data size and judge filtering
  • Figure 8: Zero-shot transfer results on WebVoyager benchmark
  • Figure 9: Static benchmark improvements with InSTA data
  • Figure 10: Test-time compute scaling for reasoning budgets

Related Concepts