source: "raw/articles/molmopoint-better-pointing-architecture-for-vision-language-models-or-ai2.md"
Summary: MolmoPoint: Better Pointing Architecture for Vision-Language Models
TL;DR: AI2 introduces MolmoPoint, which replaces coordinate-based pointing in vision-language models with a coarse-to-fine grounding mechanism using special tokens, achieving state-of-the-art results while being more efficient and easier to train.
Key Points
- Traditional vision-language models point by generating coordinates as text or tokens, which is inefficient and unnatural
- MolmoPoint uses three special tokens for coarse-to-fine grounding instead of coordinate generation
- Reduces pointing from 8 tokens to 3 tokens per point while improving accuracy
- Three model variants released: MolmoPoint-8B (general), MolmoPoint-GUI-8B (software interfaces), MolmoPoint-Vid-4B (video)
- Achieves 70.7% on PointBench (vs 68.7% for Molmo 2) and 89.2 F1 on PixMo-Points (vs 85.2 for Molmo 2)
- GUI model reaches 61.1 on ScreenSpot-Pro and 70.0 on OSWorldG, state-of-the-art among open models
- Includes rotary embeddings to encode distances between patches and a no-more-points class that tells the model when to stop pointing
- Ships with two new datasets: MolmoPoint-GUISyn (36K screenshots, 2M points) and MolmoPoint-TrackData
- Easier to train than coordinate-based models, reaching peak performance faster with fewer examples
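The coarse-to-fine idea in the key points can be illustrated with a small sketch. It assumes one plausible reading: each of the three tokens narrows the location by picking a cell in a progressively finer grid (patch, then subpatch, then location). The grid sizes, the `Grid` class, and the `decode_point` helper are hypothetical and chosen only for illustration; the summary does not specify how MolmoPoint resolves its tokens to pixels. Only the 8-versus-3 token counts come from the source.

```python
from dataclasses import dataclass

# Token counts per point, as stated in the summary.
TOKENS_PER_POINT_COORDS = 8
TOKENS_PER_POINT_MOLMOPOINT = 3


@dataclass(frozen=True)
class Grid:
    """Hypothetical three-level grid; all sizes here are illustrative assumptions."""
    image_w: int
    image_h: int
    patches_per_side: int      # coarse level
    subpatches_per_side: int   # cells inside one patch
    locations_per_side: int    # cells inside one subpatch


def split_index(index: int, per_side: int) -> tuple[int, int]:
    """Convert a flat row-major cell index into (row, col)."""
    if not 0 <= index < per_side * per_side:
        raise ValueError(f"index {index} outside a {per_side}x{per_side} grid")
    return divmod(index, per_side)


def decode_point(grid: Grid, patch_idx: int, subpatch_idx: int,
                 location_idx: int) -> tuple[float, float]:
    """Narrow the region level by level, then return the centre of the final cell."""
    x0, y0 = 0.0, 0.0
    w, h = float(grid.image_w), float(grid.image_h)
    levels = (
        (patch_idx, grid.patches_per_side),
        (subpatch_idx, grid.subpatches_per_side),
        (location_idx, grid.locations_per_side),
    )
    for index, per_side in levels:
        row, col = split_index(index, per_side)
        w /= per_side          # the region shrinks at every level
        h /= per_side
        x0 += col * w
        y0 += row * h
    return (x0 + w / 2, y0 + h / 2)


def tokens_saved(n_points: int) -> int:
    """Output tokens saved when emitting n_points with 3 instead of 8 tokens each."""
    return n_points * (TOKENS_PER_POINT_COORDS - TOKENS_PER_POINT_MOLMOPOINT)


grid = Grid(image_w=448, image_h=448, patches_per_side=14,
            subpatches_per_side=2, locations_per_side=4)
print(decode_point(grid, patch_idx=15, subpatch_idx=3, location_idx=5))
print(tokens_saved(100))
```

Under these assumed sizes, each level divides the current region further, so three discrete choices are enough to reach a 4-pixel cell in a 448-pixel image. The token arithmetic is independent of the grid assumptions: at 5 fewer tokens per point, 100 points need 300 output tokens instead of 800.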
Concepts Covered
- Vision-Language Models — core architecture being improved
- Multimodal Grounding — the task of connecting language to visual locations
- Coarse-to-Fine Processing — the hierarchical approach used for pointing
- Visual Transformers — underlying architecture providing visual tokens
- GUI Automation — specialized application domain for interface interaction
- Object Tracking — video-specific grounding task
- Synthetic Data Generation — method for creating GUI training data
- Rotary Position Embeddings — technique for encoding spatial relationships
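To make the rotary position embedding concept concrete, here is a minimal sketch of the standard one-dimensional formulation: pairs of vector components are rotated by an angle proportional to position, so the dot product of two rotated vectors depends only on how far apart their positions are. How MolmoPoint applies this to image patches is not described in this summary, so the `rotate` helper, the `base` value, and the example vectors are illustrative assumptions rather than the model's implementation.

```python
import math


def rotate(vec: list[float], position: float, base: float = 10000.0) -> list[float]:
    """Apply a standard rotary embedding to vec at the given position."""
    dim = len(vec)
    if dim % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        # Each consecutive pair of components gets its own rotation frequency.
        angle = position * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend((x * c - y * s, x * s + y * c))
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Positions (3, 7) and (13, 17) are both 4 apart, so the scores should match.
near = dot(rotate(q, 3), rotate(k, 7))
shifted = dot(rotate(q, 13), rotate(k, 17))
print(math.isclose(near, shifted, abs_tol=1e-9))
```

The two scores agree because rotating both vectors by the same extra amount cancels out in the dot product, which is what lets an attention score reflect relative distance instead of absolute position.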