source: "raw/articles/arxiv-250411543.md"
Summary: REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites
TL;DR: REAL introduces 11 high-fidelity, deterministic replicas of popular websites (Airbnb, Amazon, Uber, etc.) with 112 realistic tasks for evaluating web agents. The best frontier model tested reaches a success rate of only about 41%.
Key Points
- Created 11 deterministic website replicas across e-commerce, travel, communication, and professional networking domains
- Designed 112 practical multi-turn tasks requiring both information retrieval and state-changing actions
- Built with a modern web stack (React, Next.js) and publicly hosted for easy access
- Supports both high-level (Playwright) and low-level (CDP) browser interaction modes
- Evaluation combines programmatic state verification and LLM-judge rubric assessment
- Frontier model performance: Claude 3.7-Sonnet Thinking (41.07%), Gemini 2.5 Pro (38.39%), o3 (34.82%), GPT-4o (14.29%)
- Common failure modes: inadequate state verification and navigation dead ends with poor recovery
- Framework designed for both evaluation and RL training with well-defined reward signals
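
The reported percentages line up with whole-task counts out of 112. The check below is a rough sketch that assumes each score is a single pass over all 112 tasks; the summary does not say how runs were aggregated, so the implied counts are an inference, not figures taken from the paper. The helper names are illustrative.

```python
TOTAL_TASKS = 112  # task count stated in the summary

# Success rates (percent) as listed in the key points above.
reported = {
    "Claude 3.7-Sonnet Thinking": 41.07,
    "Gemini 2.5 Pro": 38.39,
    "o3": 34.82,
    "GPT-4o": 14.29,
}

def implied_successes(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Return the whole number of solved tasks closest to the given rate."""
    return round(rate_percent / 100 * total)

def matches_reported(rate_percent: float, total: int = TOTAL_TASKS) -> bool:
    """True if the nearest whole-task count reproduces the rate to 2 decimals."""
    solved = implied_successes(rate_percent, total)
    return abs(solved / total * 100 - rate_percent) < 0.005

# Assumed single-run task counts implied by each reported percentage.
implied = {model: implied_successes(rate) for model, rate in reported.items()}
```

Under that assumption, the gap between the top two models is only a few tasks, which is worth keeping in mind when comparing scores on a 112-task benchmark.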
Concepts Covered
- Web Agent Benchmarking — provides comprehensive evaluation framework for autonomous web navigation
- Deterministic Simulation — eliminates reproducibility issues through fixed data and timestamps
- Browser Automation — supports Playwright and Chrome DevTools Protocol interaction modes
- Multi-turn Task Evaluation — assesses complex workflows requiring planning and state management
- Reinforcement Learning for Agents — enables RL training through state tracking and reward functions
- LLM-based Evaluation — uses rubric-guided judgment for information retrieval tasks
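
The two evaluation paths and the reward signal described above can be sketched roughly as follows. This is a minimal illustration, not REAL's actual harness: `Task`, `check_state`, `stub_judge`, and `reward` are hypothetical names, site state is modeled as a plain dict, the reward is assumed to be binary, and the LLM judge is replaced by a keyword-matching stub so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class Task:
    """Hypothetical task record covering action and retrieval tasks."""
    goal: str
    # Key/value pairs expected in the site's final state (action tasks).
    expected_state: Dict[str, Any] = field(default_factory=dict)
    # Phrases the agent's answer must contain (retrieval tasks).
    rubric: List[str] = field(default_factory=list)

def check_state(final_state: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """Programmatic verification: every expected key must match exactly."""
    return all(final_state.get(key) == value for key, value in expected.items())

def stub_judge(answer: str, rubric: List[str]) -> bool:
    """Stand-in for an LLM judge: passes if every rubric phrase appears."""
    text = answer.lower()
    return all(item.lower() in text for item in rubric)

def reward(task: Task, final_state: Dict[str, Any], answer: str,
           judge: Callable[[str, List[str]], bool] = stub_judge) -> float:
    """Assumed binary reward: 1.0 only if every applicable check passes."""
    state_ok = check_state(final_state, task.expected_state)
    answer_ok = judge(answer, task.rubric) if task.rubric else True
    return 1.0 if state_ok and answer_ok else 0.0

# Action task: success depends on the final site state.
booking = Task(goal="Book a ride to the airport",
               expected_state={"ride_booked": True, "destination": "airport"})
# Retrieval task: success depends on the content of the answer.
lookup = Task(goal="Find the cheapest listing", rubric=["$120", "downtown loft"])

booking_reward = reward(booking, {"ride_booked": True, "destination": "airport"}, "")
lookup_reward = reward(lookup, {}, "The cheapest is the Downtown Loft at $120.")
failed_reward = reward(booking, {"ride_booked": False}, "Done!")
```

Because the simulated sites are deterministic, a check like `check_state` returns the same result for the same action sequence, which is what makes the signal usable as an RL reward. In a real setup the `judge` argument would wrap a model call instead of string matching.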