source: "raw/articles/arxiv-250411543.md"

Summary: REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

TL;DR: REAL introduces 11 high-fidelity, deterministic replicas of popular websites (Airbnb, Amazon, Uber, etc.) and 112 realistic tasks for evaluating web agents. The best frontier model tested succeeds on only about 41% of the tasks.

Key Points

  • Created 11 deterministic website replicas across e-commerce, travel, communication, and professional networking domains
  • Designed 112 practical multi-turn tasks requiring both information retrieval and state-changing actions
  • Built with a modern web stack (React, Next.js) and publicly hosted for easy access
  • Supports both high-level (Playwright) and low-level (Chrome DevTools Protocol, CDP) browser interaction modes
  • Evaluation combines programmatic state verification and LLM-judge rubric assessment
  • Frontier model task success rates: Claude 3.7 Sonnet (Thinking) 41.07%, Gemini 2.5 Pro 38.39%, o3 34.82%, GPT-4o 14.29%
  • Common failure modes: inadequate state verification and navigation dead ends with poor recovery
  • Framework designed for both evaluation and RL training with well-defined reward signals
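
The evaluation and reward points above can be illustrated with a minimal sketch. It is not REAL's actual harness: the `Episode` record, the `reward` function, and the `keyword_judge` stand-in (used here in place of a real LLM judge) are hypothetical names, and the sketch assumes a task passes only when every check that applies to it passes, which yields a binary reward usable for both scoring and RL training.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping, Optional, Sequence


@dataclass(frozen=True)
class Episode:
    """Hypothetical record of one finished agent run on one task."""
    final_state: Mapping[str, Any]  # site state after the agent stops
    answer: str                     # agent's final text response


def state_matches(final_state: Mapping[str, Any], expected: Mapping[str, Any]) -> bool:
    """Programmatic check: every expected key must hold the expected value."""
    return all(k in final_state and final_state[k] == v for k, v in expected.items())


def keyword_judge(answer: str, rubric: str) -> bool:
    """Stand-in for an LLM judge (assumption): pass if the rubric text appears in the answer."""
    return rubric.lower() in answer.lower()


def reward(
    episode: Episode,
    expected_state: Optional[Mapping[str, Any]] = None,
    rubric: Optional[str] = None,
    judge: Callable[[str, str], bool] = keyword_judge,
) -> float:
    """Binary reward: 1.0 only if every applicable check passes (assumed combination rule)."""
    state_ok = expected_state is None or state_matches(episode.final_state, expected_state)
    answer_ok = rubric is None or judge(episode.answer, rubric)
    return 1.0 if state_ok and answer_ok else 0.0


def success_rate(rewards: Sequence[float]) -> float:
    """Percentage of tasks that received reward 1.0."""
    return 100.0 * sum(rewards) / len(rewards)


# Illustrative data only, not taken from the benchmark.
ep = Episode(
    final_state={"cart": ["usb-c cable"], "order_placed": True},
    answer="The total was $12.99.",
)
print(reward(ep, expected_state={"order_placed": True}, rubric="$12.99"))  # 1.0
print(reward(ep, expected_state={"order_placed": False}))                  # 0.0
print(round(success_rate([1.0] * 46 + [0.0] * 66), 2))                     # 41.07
```

The last line is plain arithmetic: 46 passes out of 112 tasks gives 41.07%, which matches the top reported rate. The per-model pass counts themselves are not stated in this summary, so the 46 is an inference from the percentage, not a reported figure.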

Concepts Covered

Related Concepts