source: "raw/articles/real-benchmarking-autonomous-agents-on-deterministic-simulations.md"

Summary: REAL: Benchmarking Autonomous Agents on Deterministic Simulations

TL;DR: REAL is a benchmark and framework built on deterministic replicas of 11 real websites. It evaluates autonomous agents on 112 practical tasks and finds that frontier language models reach a success rate of at most 41%.

Key Points

  • Framework includes high-fidelity, deterministic replicas of 11 widely-used websites across e-commerce, travel, communication, and professional networking domains
  • Benchmark consists of 112 practical tasks that mirror complex everyday user interactions, requiring both information retrieval and state-changing actions
  • Evaluation combines programmatic checks of website state (for action-based tasks) with rubric-guided LLM-based judgments (for information-retrieval tasks)
  • Supports both open-source and proprietary agent systems through a flexible evaluation harness that accommodates black-box commands in browser environments
  • Frontier language models achieve a success rate of at most 41%, highlighting critical gaps in autonomous web navigation capabilities
  • Framework enables reproducible evaluation and scalable post-training data generation
  • All interactions occur in a controlled setting, eliminating safety risks
  • Websites, framework, and leaderboard are available at realevals.xyz and on GitHub
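
The two-part evaluation described in the key points (state checks for action tasks, rubric-based judgments for retrieval tasks) can be illustrated with a minimal sketch. This is not REAL's actual harness or API: the names Task, check_state, evaluate, and keyword_judge are hypothetical, and the judge is a simple substring match standing in for a rubric-guided LLM call.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

State = Dict[str, Any]
Judge = Callable[[str, str], bool]  # (rubric, agent_answer) -> pass/fail


@dataclass
class Task:
    """Hypothetical task record; not the benchmark's real schema."""
    task_id: str
    kind: str                               # "action" or "retrieval"
    expected_state: Optional[State] = None  # used by action tasks
    rubric: Optional[str] = None            # used by retrieval tasks


def check_state(final_state: State, expected: State) -> bool:
    """Programmatic check: every expected key must exist with an equal value."""
    return all(k in final_state and final_state[k] == v
               for k, v in expected.items())


def keyword_judge(rubric: str, answer: str) -> bool:
    """Stand-in for an LLM judge (assumption): passes if the rubric text
    appears in the answer, ignoring case."""
    return rubric.lower() in answer.lower()


def evaluate(task: Task, final_state: State, answer: str, judge: Judge) -> bool:
    """Route a task to the check that matches its kind."""
    if task.kind == "action":
        return check_state(final_state, task.expected_state or {})
    if task.kind == "retrieval":
        return judge(task.rubric or "", answer)
    raise ValueError(f"unknown task kind: {task.kind}")


def success_rate(results: List[bool]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty list."""
    return sum(results) / len(results) if results else 0.0
```

Because the simulated sites are deterministic, a state comparison like check_state gives the same verdict for the same agent trajectory on every run, which is what makes the action-task scores reproducible. In a real setup the judge argument would wrap a model call that scores the answer against the rubric.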

Concepts Covered

Related Concepts