source: "raw/articles/insta-towards-internet-scale-training-for-agents.md"

Summary: InSTA: Towards Internet-Scale Training For Agents

TL;DR: A pipeline that uses LLMs to automatically annotate 150k websites with tasks and to train web navigation agents without human supervision. The resulting agents are competitive with frontier models while being up to 235x smaller.

Key Points

Developed InSTA pipeline with 3 stages: LLM task generation, agent trajectory collection, and LLM-based filtering
Annotated 150,000 websites (from top 1M sites) with tasks, 750x more than existing datasets (~200 sites)
Safety filter achieves 97% accuracy in detecting harmful content with 98-100% recall
Generated 2.2M screenshots, 2.2M action traces, and 150k judge evaluations
LLM judge evaluates trajectory success with 82.6% accuracy
A Qwen 3 1.7B model trained on InSTA data achieves a 56.9% success rate, outperforming Qwen 3 235B and Llama 4 Maverick (400B params)
Models zero-shot transfer to WebVoyager benchmark without using any WebVoyager training data
Performance scales with test-time compute (reasoning budget)
Achieves 94.7% of the performance of Gemini 2.5 Flash with a 235x smaller model
Data collection cost: $521.55 using 1,200 V100 GPU hours
Released complete pipeline, models, and data at data-for-agents.github.io
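
The ratios quoted above can be cross-checked with simple arithmetic. The sketch below uses only figures from this summary. Two readings are assumptions and are not stated in the source: that the 235x figure compares the 1.7B model against the 400B Llama 4 Maverick, and that each task contributes one trajectory with one screenshot per step.

```python
# Figures copied from the Key Points above; no new data is introduced.
top_sites = 1_000_000        # "top 1M sites"
annotated_sites = 150_000    # sites that received a task
prior_sites = 200            # "~200 sites" in existing datasets
student_params_b = 1.7       # Qwen 3 1.7B
maverick_params_b = 400.0    # Llama 4 Maverick (400B params)
cost_usd = 521.55
gpu_hours = 1_200
screenshots = 2_200_000

site_ratio = annotated_sites / prior_sites            # 750x more sites
# Assumption: 235x refers to the size gap against Llama 4 Maverick.
size_ratio = maverick_params_b / student_params_b     # about 235.3
usd_per_gpu_hour = cost_usd / gpu_hours               # about 0.43 USD
filtered_fraction = 1 - annotated_sites / top_sites   # 0.85, as in Figure 2
# Assumption: one trajectory per task, one screenshot per step.
avg_steps_per_task = screenshots / annotated_sites    # about 14.7

print(f"{site_ratio:.0f}x sites, {size_ratio:.1f}x smaller, "
      f"${usd_per_gpu_hour:.4f}/GPU-hour, {filtered_fraction:.0%} filtered, "
      f"{avg_steps_per_task:.1f} steps/task")
```

The 750x and 85% figures fall out exactly. The 235x figure is consistent with the Llama 4 Maverick comparison, although the summary does not say which model it refers to.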

Concepts Covered

Language Model Agents — Web navigation agents using LLMs with Playwright API for browser automation
Task Generation — Automated pipeline for creating challenging web tasks using LLM feedback loops
Safety Filtering — LLM-based content moderation achieving 97% accuracy in detecting harmful websites
LLM Judges — Language models evaluating agent performance with 82.6% accuracy on trajectory success
Supervised Fine-tuning — Training smaller models on LLM-generated data to match frontier performance
Zero-shot Transfer — Models trained on InSTA data transferring to unseen benchmarks without additional training
Test-time Compute Scaling — Performance improvements with increased reasoning token budgets
Internet-scale Data — Leveraging 300M+ websites from Common Crawl for agent training
Multimodal Agents — Agents processing both HTML/text and screenshot data
Function Calling — Agents generating JSON function calls for browser interactions
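
As a rough illustration of how the concepts above fit together, the three stages (task generation with safety filtering, trajectory collection via JSON function calls, and judge-based filtering) can be sketched as below. This is a minimal sketch, not the released InSTA code. Every name here (`Trajectory`, `parse_action`, `run_pipeline`, the `name`/`arguments` JSON schema, the 0.5 threshold, and the demo stages) is hypothetical and chosen for illustration only.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Trajectory:
    site: str
    task: str
    actions: list = field(default_factory=list)  # parsed function calls
    success_prob: float = 0.0                     # filled in by the judge


def parse_action(raw: str) -> Optional[dict]:
    """Parse one JSON function call from the agent; return None if malformed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or "name" not in call:
        return None
    # Hypothetical schema: {"name": ..., "arguments": {...}}
    return {"name": call["name"], "arguments": call.get("arguments", {})}


def run_pipeline(sites, propose_task, run_agent, judge, threshold=0.5):
    """Run the three stages and keep trajectories the judge rates as successful."""
    kept = []
    for site in sites:
        task = propose_task(site)            # stage 1: None means "unsafe, skip"
        if task is None:
            continue
        raw_actions = run_agent(site, task)  # stage 2: agent emits JSON calls
        actions = [a for a in map(parse_action, raw_actions) if a is not None]
        traj = Trajectory(site, task, actions)
        traj.success_prob = judge(traj)      # stage 3: judge scores the trajectory
        if traj.success_prob >= threshold:
            kept.append(traj)
    return kept


# Stand-in stages so the sketch runs offline; a real system would call
# LLMs and drive a browser instead.
def demo_propose(site):
    return None if site == "unsafe.example" else f"Find the contact page on {site}"


def demo_agent(site, task):
    return ['{"name": "click", "arguments": {"selector": "a#contact"}}', "not json"]


def demo_judge(traj):
    return 0.9 if traj.actions else 0.1


kept = run_pipeline(["shop.example", "unsafe.example"],
                    demo_propose, demo_agent, demo_judge)
print([(t.site, len(t.actions)) for t in kept])  # [('shop.example', 1)]
```

Passing the stages in as callables keeps the control flow separate from any particular model or browser backend, which is a design choice of this sketch rather than something the summary specifies.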

Figures and Images

Figure 1: Overview of InSTA pipeline showing progression from 1M sites to 150k annotated tasks
Figure 2: Task generation process filtering 85% of websites for safety
Figure 3: Word cloud of most frequent task terms showing diversity
Figure 4: Automatic evaluation pipeline with LLM judges
Figure 5: Judge accuracy analysis by PageRank and confidence levels
Figure 6: Dataset statistics showing 2.2M traces with 50% success rate
Figure 7: Performance scaling with data size and judge filtering
Figure 8: Zero-shot transfer results on WebVoyager benchmark
Figure 9: Static benchmark improvements with InSTA data
Figure 10: Test-time compute scaling for reasoning budgets
