source: "raw/articles/molmopoint-better-pointing-architecture-for-vision-language-models-or-ai2.md"

Summary: MolmoPoint: Better Pointing Architecture for Vision-Language Models

TL;DR: AI2 introduces MolmoPoint, which replaces coordinate-based pointing in vision-language models with a coarse-to-fine grounding mechanism using special tokens, achieving state-of-the-art results while being more efficient and easier to train.

Key Points

  • Traditional vision-language models point by generating coordinates as text or tokens, which is inefficient and unnatural
  • MolmoPoint uses three special tokens for coarse-to-fine grounding instead of coordinate generation
  • Reduces pointing from 8 tokens to 3 tokens per point while improving accuracy
  • Three model variants released: MolmoPoint-8B (general), MolmoPoint-GUI-8B (software interfaces), MolmoPoint-Vid-4B (video)
  • Achieves 70.7% on PointBench (vs 68.7% for Molmo 2) and 89.2 F1 on PixMo-Points (vs 85.2 for Molmo 2)
  • GUI model reaches 61.1 on ScreenSpot-Pro and 70.0 on OSWorldG, state-of-the-art among open models
  • Includes rotary embeddings to encode patch distances and a no-more-points class to stop pointing
  • Ships with two new datasets: MolmoPoint-GUISyn (36K screenshots, 2M points) and MolmoPoint-TrackData
  • Easier to train than coordinate-based models, reaching peak performance faster with fewer examples
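
To make the coarse-to-fine idea in the key points concrete, here is a minimal sketch of how three successive discrete choices (a patch, a region inside that patch, and a position inside that region) could be turned into a pixel coordinate. This is an illustration only, not AI2's implementation: the grid sizes, the `GridSpec` and `decode_point` names, and the row-major index layout are all assumptions, and the sketch does not model how the network picks each index. The only figures taken from the summary are the 8 and 3 tokens per point used in `tokens_saved`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridSpec:
    """Hypothetical geometry; none of these sizes come from the article."""
    patch_px: int = 12        # side length of one image patch, in pixels
    patches_per_row: int = 4  # number of patches along the image width
    sub_per_side: int = 2     # each patch splits into 2 x 2 subpatches
    loc_per_side: int = 3     # each subpatch splits into 3 x 3 cells


def decode_point(patch_idx, subpatch_idx, location_idx, spec=GridSpec()):
    """Map three coarse-to-fine indices to an (x, y) pixel position.

    All indices are assumed to be row-major within their parent region.
    The returned point is the center of the finest selected cell.
    """
    # Coarse step: which patch of the image.
    patch_row, patch_col = divmod(patch_idx, spec.patches_per_row)
    # Finer step: which subpatch inside that patch.
    sub_row, sub_col = divmod(subpatch_idx, spec.sub_per_side)
    # Finest step: which cell inside that subpatch.
    loc_row, loc_col = divmod(location_idx, spec.loc_per_side)

    sub_px = spec.patch_px / spec.sub_per_side   # subpatch side in pixels
    cell_px = sub_px / spec.loc_per_side         # cell side in pixels

    x = patch_col * spec.patch_px + sub_col * sub_px + (loc_col + 0.5) * cell_px
    y = patch_row * spec.patch_px + sub_row * sub_px + (loc_row + 0.5) * cell_px
    return x, y


def tokens_saved(num_points, coord_tokens=8, grounding_tokens=3):
    """Output tokens saved when each point costs 3 tokens instead of 8."""
    return num_points * (coord_tokens - grounding_tokens)


if __name__ == "__main__":
    print(decode_point(patch_idx=5, subpatch_idx=3, location_idx=4))
    print(tokens_saved(num_points=10))
```

Under these assumed sizes, each step narrows the search region by a fixed factor, so every token only has to choose among a small set of options rather than spell out a number digit by digit. The token arithmetic also shows why the savings grow linearly with the number of points in a response.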

Concepts Covered

Related Concepts