source: "raw/articles/real-benchmarking-autonomous-agents-on-deterministic-simulations.md"
Summary: REAL: Benchmarking Autonomous Agents on Deterministic Simulations
TL;DR: REAL is a benchmark framework that uses deterministic replicas of 11 real websites to evaluate autonomous agents on 112 practical tasks. Even the best frontier language models reach a success rate of at most 41%.
Key Points
- Framework includes high-fidelity, deterministic replicas of 11 widely-used websites across e-commerce, travel, communication, and professional networking domains
- Benchmark consists of 112 practical tasks that mirror everyday, complex user interactions and require both information retrieval and state-changing actions
- Evaluation combines programmatic checks of website state for action-based tasks with rubric-guided LLM-based judgments for information retrieval
- Supports both open-source and proprietary agent systems through a flexible evaluation harness that accommodates black-box commands in browser environments
- Frontier language models achieve a success rate of at most 41%, highlighting critical gaps in autonomous web navigation capabilities
- Framework enables reproducible evaluation and scalable post-training data generation
- All interactions occur in a controlled setting, eliminating safety risks
- Websites, framework, and leaderboard are available at realevals.xyz and on GitHub
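The two-track evaluation described in the key points (programmatic state checks for action tasks, rubric-guided judgments for retrieval tasks) can be sketched roughly as follows. This is a minimal illustration, not REAL's actual harness: `Task`, `check_state`, `evaluate`, and `keyword_judge` are hypothetical names, and the keyword matcher is only a stand-in for a real LLM judge.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Task:
    """Hypothetical task record; field names are assumptions, not REAL's schema."""
    task_id: str
    kind: str  # "action" or "retrieval"
    expected_state: Optional[Dict[str, Any]] = None  # used by action tasks
    rubric: Optional[str] = None  # used by retrieval tasks


def check_state(final_state: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """Pass only if every expected key is present with the expected value."""
    return all(k in final_state and final_state[k] == v for k, v in expected.items())


def evaluate(
    task: Task,
    final_state: Dict[str, Any],
    answer: str,
    judge: Callable[[str, str], bool],
) -> bool:
    """Route a task to the programmatic check or to the rubric-guided judge."""
    if task.kind == "action":
        return check_state(final_state, task.expected_state or {})
    if task.kind == "retrieval":
        return judge(task.rubric or "", answer)
    raise ValueError(f"unknown task kind: {task.kind}")


def keyword_judge(rubric: str, answer: str) -> bool:
    """Stand-in for an LLM judge: passes if the rubric text appears in the answer."""
    return rubric.lower() in answer.lower()


# Toy data, invented for illustration only.
action_task = Task("cart-1", "action", expected_state={"cart_items": 2})
retrieval_task = Task("price-1", "retrieval", rubric="$19.99")

results = [
    evaluate(action_task, {"cart_items": 2, "user": "demo"}, "", keyword_judge),
    evaluate(retrieval_task, {}, "The listed price is $19.99.", keyword_judge),
]
success_rate = sum(results) / len(results)
```

Passing the judge in as a callable keeps the routing logic testable without a model call; in practice the judge would wrap an LLM prompt built from the rubric.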
Concepts Covered
- Autonomous Agent Evaluation — primary focus on benchmarking agent capabilities
- Deterministic Simulation — uses deterministic website replicas for reproducible testing
- Multi-turn Agent Interactions — evaluates complex, multi-step user task completion
- Web Navigation — tests agent ability to navigate and interact with web interfaces
- LLM-based Evaluation — uses language models for rubric-guided assessment
- Agent Benchmarking — establishes standardized testing framework for autonomous systems
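To make the deterministic-simulation concept concrete, here is a small sketch of why determinism gives reproducible evaluation: replaying the same action sequence from the same initial state always ends in the same final state. `ToyShop` and `run_episode` are hypothetical and far simpler than the website replicas the benchmark describes.

```python
import hashlib
import json
from typing import Dict, List


class ToyShop:
    """Hypothetical stand-in for a deterministic site replica (not REAL's code)."""

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        # Fixed initial state: no clock, randomness, or network dependence.
        self.state: Dict[str, int] = {"cart_items": 0, "orders": 0}

    def step(self, action: str) -> None:
        # State transitions depend only on the current state and the action.
        if action == "add_to_cart":
            self.state["cart_items"] += 1
        elif action == "checkout" and self.state["cart_items"] > 0:
            self.state["orders"] += 1
            self.state["cart_items"] = 0


def run_episode(actions: List[str]) -> str:
    """Replay actions on a fresh site and return a hash of the final state."""
    site = ToyShop()
    for action in actions:
        site.step(action)
    payload = json.dumps(site.state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


actions = ["add_to_cart", "add_to_cart", "checkout"]
first = run_episode(actions)
second = run_episode(actions)
```

Because `first` and `second` are equal, any difference in outcomes across runs can be attributed to the agent rather than to the environment.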