Screenshot Context Management

Summary: Screenshot context management is a critical component of computer use agent evaluation that processes visual evidence efficiently across long interaction sequences. Rather than passing every screenshot to the verifier or simply truncating the sequence, it selects the most relevant visual evidence for each evaluation criterion.

Overview

Screenshot Context Management addresses a fundamental challenge in evaluating computer use agents: how to handle the extensive visual evidence generated during multi-step interactions. Traditional approaches either truncate the screenshot sequence to fit the context window (losing potentially crucial evidence) or attempt to process every screenshot (leading to information overload and computational inefficiency).

The key innovation is using a screenshot relevance matrix that maps each evaluation criterion to the most relevant screenshots from the trajectory. This selective approach ensures that verifiers have access to the specific visual evidence needed to assess each aspect of agent performance, while maintaining computational efficiency and avoiding the dilution of important information in overly long contexts.
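The selection step can be illustrated with a minimal Python sketch. It assumes the relevance scores already exist (the source does not say how they are computed), and the names `select_top_k`, `relevance`, and the two criterion labels are hypothetical, chosen only for illustration.

```python
from typing import Dict, List


def select_top_k(relevance: Dict[str, List[float]], k: int) -> Dict[str, List[int]]:
    """Return, per criterion, the indices of the k highest-scoring screenshots.

    `relevance[criterion][i]` is an assumed, precomputed score for how relevant
    screenshot i is to that criterion. Selected indices are re-sorted into
    trajectory order so the verifier sees them chronologically.
    """
    selected: Dict[str, List[int]] = {}
    for criterion, scores in relevance.items():
        # Rank screenshot indices from most to least relevant.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        # Keep the top k, then restore chronological order.
        selected[criterion] = sorted(ranked[:k])
    return selected


# Hypothetical scores for a six-screenshot trajectory and two rubric criteria.
relevance = {
    "task_completed": [0.1, 0.2, 0.1, 0.3, 0.9, 0.95],
    "used_search_box": [0.2, 0.9, 0.8, 0.1, 0.1, 0.05],
}
chosen = select_top_k(relevance, k=2)
print(chosen)
```

In this made-up example the outcome criterion picks the last two screenshots while the process criterion picks two early ones, which is the behavior a per-criterion matrix is meant to allow and a single chronological cutoff cannot.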

Key Details

  • Screenshot relevance matrix: Selects top-k most relevant screenshots per rubric criterion rather than using chronological order or arbitrary limits
  • Criterion-specific selection: Different evaluation criteria may require different screenshots; for example, task completion might need only the final screens, while process evaluation needs sequences of interactions
  • Context window optimization: Manages the trade-off between comprehensive evidence and practical token limits in large language models
  • Multi-modal evidence integration: Combines visual screenshots with action logs and text observations for comprehensive evaluation
  • Temporal relevance weighting: More recent screenshots often carry higher relevance for outcome assessment, while process evaluation may require distributed sampling
  • Hallucination prevention: Supplying relevant visual evidence reduces the tendency of both agents and verifiers to fabricate or assume unsupported facts
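The temporal weighting and distributed sampling ideas above might be sketched as follows. This is only one plausible reading: the exponential decay form, the `decay` parameter, and the helper names `recency_weights` and `evenly_spaced` are assumptions, not something the source specifies.

```python
from typing import List


def recency_weights(n: int, decay: float = 0.8) -> List[float]:
    """Assumed exponential recency weighting for outcome-style criteria.

    Screenshot i of n gets weight decay ** (n - 1 - i), so the final
    screenshot has weight 1.0 and earlier ones are progressively discounted.
    """
    return [decay ** (n - 1 - i) for i in range(n)]


def evenly_spaced(n: int, k: int) -> List[int]:
    """Pick up to k indices spread across a trajectory of n screenshots.

    Intended for process-style criteria, where coverage of the whole
    interaction matters more than recency.
    """
    if k <= 0 or n <= 0:
        return []
    if k >= n:
        return list(range(n))
    if k == 1:
        return [n - 1]
    # Includes the first and last screenshot and spaces the rest between them.
    return [round(i * (n - 1) / (k - 1)) for i in range(k)]


# Outcome criterion: multiply raw relevance scores by recency weights.
raw_scores = [0.6, 0.6, 0.6]
weighted = [s * w for s, w in zip(raw_scores, recency_weights(len(raw_scores), decay=0.5))]

# Process criterion: sample four screenshots across a ten-step trajectory.
sampled = evenly_spaced(10, 4)
```

The weighted scores could then feed the same top-k selection used for any other criterion, while the evenly spaced indices bypass scoring entirely. Which strategy applies to which criterion is a design decision left to the rubric author.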

Relationships

Sources