Online-Mind2Web

Summary: A dataset of 52 human-annotated records for evaluating the performance of web agents. It serves as a benchmark for comparing web interaction approaches, particularly DOM-based versus GUI-based methods.

Overview

Online-Mind2Web is an evaluation dataset for assessing web agent performance across a range of web-based tasks. Its human-annotated records provide the ground truth for measuring how effectively different interaction methods accomplish real-world web navigation and interaction scenarios.

The dataset has been used in research comparing DOM Snapshots versus GUI Snapshots for LLM-Based Interaction, where it served as the test bed for approaches such as DOM Downsampling algorithms. Each record represents a web interaction task, with human annotations that establish the expected outcome and the criteria for successful completion.

Key Details

  • Dataset Size: 52 annotated records
  • Purpose: Benchmarking web agent performance across different interaction modalities
  • Annotation Type: Human-generated ground truth labels
  • Primary Use Case: Evaluating Web Agents that use large language models as backends
  • Evaluation Metrics: Success rates measured against human annotations
  • Research Application: Comparative analysis of snapshot techniques; reported results show DOM-based methods reaching success rates of 67% to 73%, versus 65% for GUI baselines

The dataset enables researchers to test various approaches to web interaction, including traditional screenshot-based methods with visual grounding cues and newer DOM-based approaches that leverage structured HTML representations.
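
To make the evaluation metric concrete, the following is a minimal sketch of how a per-method success rate could be computed from human annotations. The record layout (task_id, method, human_success), the method labels "dom" and "gui", and the toy data are all assumptions made for illustration; they are not the dataset's actual schema or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    """One evaluated task run. Field names are hypothetical, not the dataset's schema."""
    task_id: str
    method: str          # illustrative label, e.g. "dom" or "gui"
    human_success: bool  # human annotator's judgement of the agent's run


def success_rate(records: list[EvalRecord], method: str) -> float:
    """Fraction of a method's records that human annotators marked successful."""
    selected = [r for r in records if r.method == method]
    if not selected:
        raise ValueError(f"no records for method {method!r}")
    return sum(r.human_success for r in selected) / len(selected)


# Toy data for illustration only; these are NOT results from the dataset.
toy_records = [
    EvalRecord("t1", "dom", True),
    EvalRecord("t2", "dom", True),
    EvalRecord("t3", "dom", False),
    EvalRecord("t4", "dom", True),
    EvalRecord("t1", "gui", True),
    EvalRecord("t2", "gui", False),
    EvalRecord("t3", "gui", False),
    EvalRecord("t4", "gui", True),
]

dom_rate = success_rate(toy_records, "dom")
gui_rate = success_rate(toy_records, "gui")
print(f"DOM: {dom_rate:.0%}, GUI: {gui_rate:.0%}")
```

With only 52 records, a single task changes the success rate by roughly two percentage points, so small gaps between methods should be read with that granularity in mind.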

Relationships

  • DOM Downsampling — Evaluated using this dataset to measure algorithm effectiveness
  • Web Agents — Performance benchmarked against human annotations in this dataset
  • GUI Snapshots — Baseline approach tested and compared using these records
  • LLM-Based Interaction — Web agent architectures evaluated through dataset tasks
  • D2Snap Algorithm — Specific downsampling technique validated using dataset metrics
  • Element Extraction — Alternative DOM processing methods compared against dataset ground truth

Sources