D2Snap Algorithm

Summary: A three-phase DOM downsampling algorithm that consolidates HTML nodes based on UI feature semantics while maintaining valid HTML structure. It reduces DOM size by ~96% (from the order of 1MB to the order of 10KB) while matching or improving the task success of LLM-based web agents, using semantic-aware element classification and selective retention.

Overview

D2Snap addresses a fundamental problem: DOM Snapshots are typically too large (>1MB) for LLM Web Agents, even though they offer several advantages over GUI Screenshots, including better element targeting, no visual artifacts, faster transfer, and use of LLMs' native HTML understanding. The algorithm is presented as the first approach to consolidate DOM nodes based on UI Feature Semantics rather than simple structural pruning.

The core innovation is a three-phase hierarchical approach that handles different element types with specialized strategies:

  1. Container Phase: Uses configurable retention parameters for structural elements
  2. Content Phase: Applies Markdown conversion for better text representation
  3. Interactive Phase: Preserves critical interactive elements while using the TextRank Algorithm to summarize text nodes
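
Sentence-level downsampling with TextRank can be sketched as follows. This is a minimal, dependency-free illustration and not D2Snap's implementation: the naive sentence splitter, the damping factor, the iteration count, and the `keep_ratio` parameter name are all assumptions made for the example. The similarity measure is the word overlap normalized by log sentence lengths used in the original TextRank formulation.

```python
import math
import re


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    """Lowercased set of word tokens in a sentence."""
    return set(re.findall(r"\w+", sentence.lower()))


def similarity(a, b):
    """Word overlap normalized by the log lengths of both sentences."""
    wa, wb = words(a), words(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids log(1) + log(1) == 0 in the denominator
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by power iteration over the weighted similarity graph."""
    n = len(sentences)
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def downsample_text(text, keep_ratio):
    """Keep the top-ranked fraction of sentences, in their original order."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    keep = max(1, round(len(sentences) * keep_ratio))
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(top))
```

Sentences that share no vocabulary with the rest of the node receive only the base score `1 - damping`, so they are the first to be dropped as `keep_ratio` shrinks.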

Critical to the algorithm's success is maintaining valid HTML structure throughout downsampling, so that the resulting DOM remains parseable and semantically meaningful. Classifying elements into container, content, interactive, and other categories drives the selective consolidation decisions, with interactive elements receiving the highest preservation priority.

Key Details

Performance Metrics:

  • Achieves a 67% success rate, comparable to the grounded GUI snapshot baseline (65%)
  • Best configuration (D2Snap.6,.9,.3) outperforms the baseline by 8 percentage points (73% vs 65%)
  • Reduces mean byte size from the order of 1e6 to the order of 1e4 (~96% reduction)
  • Strong correlation between byte and token size (r=0.9994)
  • Evaluated on 52 web task records, with snapshot size constrained to the 1e3 token order of magnitude
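
The metrics above mix two kinds of ratio: the gap between 73% and 65% is a difference in percentage points, while the ~96% figure is a relative reduction. The short sketch below makes the arithmetic explicit. It assumes the byte sizes are order-of-magnitude figures rather than exact means, and the function names are illustrative only.

```python
def relative_reduction(before, after):
    """Fraction of the original size that was removed."""
    return (before - after) / before


def size_after(before, reduction):
    """Size remaining after removing `reduction` (a fraction) of `before`."""
    return before * (1 - reduction)


def point_gain(candidate, baseline):
    """Difference in percentage points between two success rates."""
    return candidate - baseline


def relative_gain(candidate, baseline):
    """Improvement expressed relative to the baseline."""
    return (candidate - baseline) / baseline


# A drop from exactly 1e6 to exactly 1e4 bytes would be a 99% reduction.
exact_two_orders = relative_reduction(1e6, 1e4)

# A 96% reduction of 1e6 bytes leaves about 4e4 bytes, still on the order of 1e4.
remaining_bytes = round(size_after(1e6, 0.96))

# 73% vs 65% is 8 percentage points, or roughly 12% relative to the baseline.
points = point_gain(73, 65)
relative = relative_gain(73, 65)
```

Read this way, the ~96% reduction and the 1e6 to 1e4 order-of-magnitude description are compatible.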

Technical Specifications:

  • Three configurable retention parameters for different element categories
  • Incorporates TextRank Algorithm for sentence-level text downsampling within nodes
  • AdaptiveD2Snap variant downsamples ~67% of DOMs to below 8K tokens and 100% to below 32K tokens
  • Uses iterative parameter adjustment for adaptive token limit compliance
  • Ground truth semantic ratings derived from GPT-4o classifications
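
The iterative parameter adjustment can be pictured as a loop that tightens the downsampling until the snapshot fits a token budget. The sketch below is a hedged illustration, not the AdaptiveD2Snap procedure: it collapses the three retention parameters into a single hypothetical `strength` value, takes the downsampler as a caller-supplied function, and estimates tokens with a rough 4-bytes-per-token heuristic instead of a real tokenizer.

```python
def estimate_tokens(html):
    """Rough estimate (assumption): about 4 UTF-8 bytes per token."""
    return len(html.encode("utf-8")) // 4


def adaptive_downsample(html, downsample, max_tokens, step=0.1):
    """Increase `strength` until the result fits within `max_tokens`.

    `downsample(html, strength)` is a hypothetical callable where 0.0 means
    no reduction and 1.0 means maximum reduction.
    """
    strength = 0.0
    result = html
    while estimate_tokens(result) > max_tokens and strength < 1.0:
        # Rounding keeps repeated float additions from drifting.
        strength = min(1.0, round(strength + step, 10))
        result = downsample(html, strength)
    return result, strength


def toy_downsample(html, strength):
    """Stand-in downsampler for the demo: keep a leading fraction of lines."""
    lines = html.splitlines()
    keep = max(1, round(len(lines) * (1.0 - strength)))
    return "\n".join(lines[:keep])
```

A usage example with the stand-in downsampler: `adaptive_downsample(page, toy_downsample, max_tokens=8000)` returns the reduced text together with the strength that was needed. Because the loop always restarts from the original input, each iteration is independent of the previous one.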

UI Feature Analysis:

  • Hierarchy emerges as most valuable UI feature for LLMs among those tested
  • Removing hierarchy causes greater performance degradation than other feature removals
  • Vision Capabilities show minimal impact: grounded text-only snapshots (63%) perform nearly as well as full grounded GUI snapshots (65%)
  • DOM-based targeting via CSS Selectors enables relative positioning without visual grounding
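
To illustrate how a CSS Selector can target an element purely from DOM structure, the sketch below derives a selector path from tag names, `:nth-of-type` positions, and `id` attributes. It is an illustrative helper and not part of D2Snap. It assumes a well-formed XHTML-style fragment so that the standard-library XML parser can read it, and it does not escape special characters in `id` values.

```python
import xml.etree.ElementTree as ET


def css_path(root, target):
    """Build a CSS selector for `target` by walking up to `root`.

    Stops early at the nearest ancestor (or the target itself) that has an id.
    """
    parents = {child: parent for parent in root.iter() for child in parent}
    parts = []
    node = target
    while node is not root:
        element_id = node.get("id")
        if element_id:
            parts.append(f"#{element_id}")
            break
        parent = parents[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        index = same_tag.index(node) + 1  # CSS positions are 1-based
        parts.append(f"{node.tag}:nth-of-type({index})")
        node = parent
    else:
        # Reached the root without finding an id anchor.
        parts.append(root.tag)
    return " > ".join(reversed(parts))


doc = ET.fromstring(
    "<body>"
    "<nav><a href='/'>Home</a><a href='/cart'>Cart</a></nav>"
    "<main id='content'><form><input name='q'/><button>Search</button></form></main>"
    "</body>"
)
cart_link = doc.findall(".//a")[1]
button = doc.find(".//button")

print(css_path(doc, cart_link))  # body > nav:nth-of-type(1) > a:nth-of-type(2)
print(css_path(doc, button))     # #content > form:nth-of-type(1) > button:nth-of-type(1)
```

Selectors of this kind depend on the hierarchy being intact, which is in line with the observation above that hierarchy is the most valuable feature among those tested.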

Element Classification Framework:

  • Container elements: provide structural organization and hierarchy
  • Content elements: text, images, media requiring Markdown conversion
  • Interactive elements: buttons, links, form controls (highest retention priority)
  • Other elements: remaining HTML nodes handled with basic consolidation
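
The four categories can be sketched as a simple tag lookup. The tag sets below are illustrative assumptions chosen for the example; the source does not list which tags D2Snap assigns to each category. Interactive tags are checked first to mirror their highest retention priority.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed tag sets, for illustration only.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
CONTENT_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "img", "video", "audio", "blockquote"}
CONTAINER_TAGS = {"div", "section", "article", "nav", "header", "footer", "main", "aside"}


def classify(tag):
    """Map a tag name to one of the four categories."""
    tag = tag.lower()
    if tag in INTERACTIVE_TAGS:
        return "interactive"
    if tag in CONTENT_TAGS:
        return "content"
    if tag in CONTAINER_TAGS:
        return "container"
    return "other"


class CategoryCounter(HTMLParser):
    """Counts how many start tags fall into each category."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[classify(tag)] += 1


counter = CategoryCounter()
counter.feed("<div><p>Hi</p><button>Go</button><canvas></canvas></div>")
print(dict(counter.counts))
```

A per-category count like this gives a quick view of how much of a page is structural wrapping versus content or controls, which is the distinction the retention parameters act on.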
