source: "raw/articles/halluminate-rl-environments-for-financial-services.md"
Summary: Westworld - Simulated Web Environments for Agent Evaluation
TL;DR: Halluminate and Yutori built Westworld, a suite of 5 realistic web simulators with 100 tasks for evaluating and training web agents, achieving more reproducible benchmarking than real websites while enabling RL training.
Key Points
- Real website evaluation suffers from CAPTCHAs, authentication blocks, changing data, and UI drift that create evaluation noise
- Westworld includes 5 simulators: Noodle Flights, Travelpedia (travel), GoodBuy, Azora, Megamart (ecommerce)
- Uses a task-centric rather than app-centric simulation approach: it focuses on core workflows instead of rebuilding entire sites
- Employs three types of verifiable rewards: state-based unit tests, component-level verification, real-time ground truth calculation
- Performance results show Yutori n1 (trained on Westworld) achieved an 86% average, vs 67.7% for Claude Sonnet 4.5
- Common failure modes include UI grounding (calendar date picking), reasoning errors on multi-step tasks, unfamiliarity with site-specific navigation patterns
- Simulators require significant upfront engineering and ongoing maintenance as real sites evolve
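
The state-based flavor of verifiable reward listed above can be pictured as a unit test run against the simulator's final state. The following is a minimal sketch under assumptions, not Westworld's actual implementation: `CartState`, `state_based_reward`, and the SKU values are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CartState:
    """Hypothetical snapshot of a simulated store's backend state after an episode."""
    items: dict[str, int] = field(default_factory=dict)  # sku -> quantity
    order_placed: bool = False


def state_based_reward(final_state: CartState, expected_items: dict[str, int]) -> float:
    """Return 1.0 only if an order was placed containing exactly the expected items."""
    if not final_state.order_placed:
        return 0.0
    return 1.0 if final_state.items == expected_items else 0.0


# Example: the (hypothetical) task asked for two units of SKU "A100".
success = CartState(items={"A100": 2}, order_placed=True)
abandoned = CartState(items={"A100": 2}, order_placed=False)
```

Because the check reads the final state rather than the agent's click trajectory, any path that ends in the correct order earns the reward, which is what makes the signal deterministic and usable for RL.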
Concepts Covered
- Web Agent Evaluation — comparison of simulator vs real website benchmarking approaches
- Computer Use Models — testing CUA models from OpenAI, Anthropic, Google on web tasks
- Reinforcement Learning — using simulators as training environments for agent post-training
- Task-Centric Simulation — design philosophy focusing on specific workflows rather than complete site recreation
- Verifiable Rewards — deterministic evaluation methods using unit tests and ground truth calculation
- Flight Booking Automation — simulated airline booking workflows with multi-leg routes and constraints
- E-commerce Automation — simulated shopping and checkout flows across multiple retail environments
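
Ground truth calculation, as named under Verifiable Rewards, can be sketched as recomputing the correct answer from the simulator's own data at evaluation time and comparing it with the agent's answer. This is an illustrative sketch only: the `Flight` record, its fields, and the "cheapest flight within a stop limit" task are assumptions, not details taken from the article.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Flight:
    """Hypothetical flight record as a simulator might store it."""
    flight_id: str
    price: float
    stops: int


def cheapest_flight_id(flights: Sequence[Flight], max_stops: int) -> Optional[str]:
    """Compute the ground-truth answer from the simulator's current data."""
    eligible = [f for f in flights if f.stops <= max_stops]
    if not eligible:
        return None
    # On price ties, min() keeps the first eligible flight in the sequence.
    return min(eligible, key=lambda f: f.price).flight_id


def ground_truth_reward(agent_answer: str, flights: Sequence[Flight], max_stops: int) -> float:
    """Return 1.0 if the agent picked the flight the ground truth identifies."""
    return 1.0 if agent_answer == cheapest_flight_id(flights, max_stops) else 0.0


flights = [
    Flight("NF101", 420.0, 1),
    Flight("NF202", 310.0, 2),
    Flight("NF303", 365.0, 0),
]
```

Computing the answer at evaluation time, instead of hard-coding it in the task, means the reward stays correct even if the simulated inventory or prices are regenerated.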
Images and Figures
- Noodle Flights interface
- Noodle Flights simulator
- Travelpedia simulator
- GoodBuy simulator
- Azora simulator
- Megamart simulator
- Error analysis examples showing calendar date picking failures, flight selection reasoning errors, and site navigation challenges
- Japan Airlines redesign comparison showing maintenance challenges for simulators