source: "raw/articles/arxiv-250411543.md"

Summary: REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

TL;DR: REAL introduces 11 high-fidelity, deterministic replicas of popular websites (Airbnb, Amazon, Uber, etc.) with 112 realistic tasks for evaluating web agents. The best frontier model tested succeeds on only about 41% of the tasks.

Key Points

Created 11 deterministic website replicas across e-commerce, travel, communication, and professional networking domains
Designed 112 practical multi-turn tasks requiring both information retrieval and state-changing actions
Built with a modern web stack (React, Next.js) and publicly hosted for easy access
Supports both high-level (Playwright) and low-level (CDP) browser interaction modes
Evaluation combines programmatic state verification and LLM-judge rubric assessment
Frontier model performance: Claude 3.7-Sonnet Thinking (41.07%), Gemini 2.5 Pro (38.39%), o3 (34.82%), GPT-4o (14.29%)
Common failure modes: inadequate state verification and navigation dead ends with poor recovery
Framework designed for both evaluation and RL training with well-defined reward signals
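
The evaluation scheme in the points above (programmatic state checks for state-changing actions, a rubric-guided judge for retrieved information, and a reward signal usable for RL) can be sketched roughly as follows. This is a minimal illustration and not REAL's actual harness: `Task`, `state_matches`, `score_task`, and `stub_judge` are hypothetical names, the example task and values are made up, and the judge is a simple substring check standing in for an LLM call.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class Task:
    """Hypothetical task record; the field names are illustrative only."""
    goal: str
    # Key/value pairs the site's final state must contain (action tasks).
    expected_state: Dict[str, Any] = field(default_factory=dict)
    # Rubric handed to the judge (information-retrieval tasks).
    rubric: Optional[str] = None


def state_matches(final_state: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """Programmatic check: every expected key is present with the expected value."""
    return all(k in final_state and final_state[k] == v for k, v in expected.items())


def score_task(task: Task, final_state: Dict[str, Any], answer: str,
               judge: Callable[[str, str], bool]) -> float:
    """Return a binary reward: 1.0 only if every applicable check passes."""
    checks = []
    if task.expected_state:
        checks.append(state_matches(final_state, task.expected_state))
    if task.rubric is not None:
        checks.append(judge(task.rubric, answer))
    return 1.0 if checks and all(checks) else 0.0


def stub_judge(rubric: str, answer: str) -> bool:
    """Stand-in for an LLM judge (assumption): passes if the rubric text appears in the answer."""
    return rubric.lower() in answer.lower()


# Made-up example: the agent must book a ride and report the fare.
task = Task(goal="Book a ride and report the fare",
            expected_state={"ride_booked": True},
            rubric="$18.50")
reward_ok = score_task(task, {"ride_booked": True}, "The fare is $18.50.", stub_judge)
reward_bad = score_task(task, {"ride_booked": False}, "The fare is $18.50.", stub_judge)
print(reward_ok, reward_bad)
```

Returning a single binary value keeps the same function usable both as a pass/fail evaluation result and as an episode-level reward. A real setup might prefer partial credit, which this sketch does not attempt.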

Concepts Covered

Web Agent Benchmarking — provides a comprehensive evaluation framework for autonomous web navigation
Deterministic Simulation — eliminates reproducibility issues through fixed data and timestamps
Browser Automation — supports Playwright and Chrome DevTools Protocol interaction modes
Multi-turn Task Evaluation — assesses complex workflows requiring planning and state management
Reinforcement Learning for Agents — enables RL training through state tracking and reward functions
LLM-based Evaluation — uses rubric-guided judgment for information retrieval tasks
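
To make the deterministic-simulation idea concrete, the sketch below shows one way fixed data and timestamps yield identical site state on every run. It is an assumption-laden illustration, not how REAL's replicas are implemented: `FIXED_NOW`, `build_listings`, the seed, and the listing fields are all hypothetical.

```python
import random
from datetime import datetime, timezone

# Arbitrary frozen "current time" (assumption): the simulated site never reads the real clock.
FIXED_NOW = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)


def build_listings(seed: int = 0, n: int = 3) -> list:
    """Generate the same hypothetical listings on every call for a given seed."""
    rng = random.Random(seed)  # private, seeded generator; no global random state
    return [
        {"id": i, "price": rng.randint(50, 500), "listed_at": FIXED_NOW.isoformat()}
        for i in range(n)
    ]


first = build_listings()
second = build_listings()
print(first == second)
```

Because neither the data nor the clock varies between runs, any difference in outcome across two evaluations can be attributed to the agent rather than to the environment.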
