source: "raw/articles/code2world-a-gui-world-model-via-renderable-code-generation.md"

Summary: Code2World: A GUI World Model via Renderable Code Generation

TL;DR: Code2World is a vision-language model that predicts next GUI states by generating renderable HTML code instead of pixels, achieving high visual fidelity while enabling fine-grained structural control for autonomous GUI agents.

Key Points

Code2World generates renderable HTML code to simulate next visual states rather than using pixel-based or text-based approaches
The AndroidCode dataset contains over 80K high-quality screen-action pairs, created by translating GUI trajectories into HTML
Uses a visual-feedback revision loop to refine the synthesized code, ensuring a SigLIP score > 0.9 for strict visual alignment
Two-stage training: SFT cold start followed by Render-Aware Reinforcement Learning (RARL) with dual rewards
RARL uses Group Relative Policy Optimization (GRPO) with visual semantic and action consistency rewards
Code2World-8B rivals the performance of GPT-5 and Gemini-3-Pro-Image on next UI prediction
Improves downstream navigation, raising the success rate of Gemini-2.5-Flash on AndroidWorld by +9.5%
Implements a "Propose, Simulate, Select" pipeline for GUI agent enhancement
Evaluation on the Android Control (in-distribution, ID) and GUI Odyssey (out-of-distribution, OOD) benchmarks shows superior generalization
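
As a rough illustration of how the dual rewards in RARL could feed GRPO, the sketch below combines a visual semantic reward and an action consistency reward for each sampled candidate, then normalizes the totals within the group to get group-relative advantages. The equal weighting, the example reward values, and the function name are assumptions made for illustration; the article summary does not say how the two rewards are combined.

```python
from statistics import mean, pstdev


def group_relative_advantages(visual_rewards, action_rewards,
                              w_visual=0.5, w_action=0.5, eps=1e-8):
    """Return one advantage per candidate, normalized within the group.

    visual_rewards / action_rewards: per-candidate scores for the same prompt.
    w_visual / w_action: hypothetical weights (not taken from the source).
    """
    if len(visual_rewards) != len(action_rewards):
        raise ValueError("both reward lists must cover the same candidates")
    if not visual_rewards:
        raise ValueError("the group must contain at least one candidate")

    # Combine the two reward signals into one scalar per candidate.
    totals = [w_visual * v + w_action * a
              for v, a in zip(visual_rewards, action_rewards)]

    # GRPO-style normalization: subtract the group mean, divide by the group std.
    mu = mean(totals)
    sigma = pstdev(totals)
    return [(t - mu) / (sigma + eps) for t in totals]


if __name__ == "__main__":
    # Four sampled HTML candidates for one (screenshot, action) prompt.
    advantages = group_relative_advantages(
        visual_rewards=[0.9, 0.5, 0.7, 0.3],
        action_rewards=[1.0, 0.0, 1.0, 0.0],
    )
    for i, adv in enumerate(advantages):
        print(f"candidate {i}: advantage {adv:+.3f}")
```

Candidates that score above the group average get a positive advantage and the rest get a negative one, so no separate value model is needed to form the baseline.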

Concepts Covered

GUI World Models — Core concept of predicting future interface states for autonomous agents
Renderable Code Generation — Novel approach using HTML generation instead of pixel prediction
Vision-Language Models — Adaptation of VLMs for GUI state prediction tasks
Render-Aware Reinforcement Learning (RARL) — Second-stage training that rewards the rendered output for visual semantic fidelity and action consistency
GUI Agent Navigation — Downstream application showing practical benefits
Cross-Platform Generalization — Testing robustness across different devices and applications
Visual Semantic Alignment — Ensuring generated code produces visually accurate results
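
The visual-feedback revision loop used during data synthesis can be sketched roughly as follows. Here `score_fn` stands in for rendering the HTML and computing a SigLIP similarity against the target screenshot, and `revise_fn` stands in for a model call that rewrites the code; both are hypothetical callables, and the round limit and the rejection of samples that never pass are assumptions rather than details from the source. Only the 0.9 threshold comes from the summary above.

```python
from typing import Callable, Tuple


def revise_until_aligned(code: str,
                         score_fn: Callable[[str], float],
                         revise_fn: Callable[[str, float], str],
                         threshold: float = 0.9,
                         max_rounds: int = 3) -> Tuple[str, float, bool]:
    """Revise `code` until its visual score exceeds `threshold`.

    Returns (final_code, final_score, accepted).
    """
    score = score_fn(code)
    rounds = 0
    while score <= threshold and rounds < max_rounds:
        # Feed the current score back so the reviser can react to it.
        code = revise_fn(code, score)
        score = score_fn(code)
        rounds += 1
    return code, score, score > threshold


if __name__ == "__main__":
    # Toy stand-ins: each revision adds a marker, and more markers score higher.
    marker = "<!--fix-->"
    toy_scores = [0.6, 0.8, 0.95]

    def toy_score(html: str) -> float:
        return toy_scores[min(html.count(marker), len(toy_scores) - 1)]

    def toy_revise(html: str, score: float) -> str:
        return html + marker

    final_code, final_score, accepted = revise_until_aligned(
        "<html><body>Inbox</body></html>", toy_score, toy_revise)
    print(final_score, accepted)
```

In a real pipeline the scoring step would render the HTML in a headless browser and compare image embeddings, which is the expensive part of each round.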

Images and Figures

![code2world-a-gui-world-model-via-renderable-code-generation/img-0.png] — Project icon
![code2world-a-gui-world-model-via-renderable-code-generation/img-1.png] — Framework illustration showing input GUI + action → renderable code → predicted screenshot
![code2world-a-gui-world-model-via-renderable-code-generation/img-2.png] — Data synthesis pipeline and two-stage model optimization methodology
![code2world-a-gui-world-model-via-renderable-code-generation/img-3.png] — "Propose, Simulate, Select" pipeline for GUI agent enhancement
![code2world-a-gui-world-model-via-renderable-code-generation/img-4.png] — Quantitative comparison table across benchmarks
![code2world-a-gui-world-model-via-renderable-code-generation/img-5.png] through ![code2world-a-gui-world-model-via-renderable-code-generation/img-8.png] — Qualitative comparison examples showing email app launch, news app navigation, reminder completion, and e-commerce filtering
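
The "Propose, Simulate, Select" pipeline pictured in img-3 can be sketched in a few lines. In this minimal version the agent proposes candidate actions, the world model simulates the screen each action would produce, and the candidate with the best-scoring predicted screen is selected. The three callables (`propose`, `simulate`, `score`) and the string-based toy screens are hypothetical placeholders; the summary does not describe how the selection step scores candidates.

```python
from typing import Callable, List


def propose_simulate_select(screen: str,
                            goal: str,
                            propose: Callable[[str, str], List[str]],
                            simulate: Callable[[str, str], str],
                            score: Callable[[str, str], float]) -> str:
    """Pick the proposed action whose simulated next screen scores highest."""
    candidates = propose(screen, goal)
    if not candidates:
        raise ValueError("the agent proposed no candidate actions")

    best_action, best_score = candidates[0], float("-inf")
    for action in candidates:
        predicted_screen = simulate(screen, action)  # world-model rollout
        value = score(predicted_screen, goal)        # how well it serves the goal
        if value > best_score:
            best_action, best_score = action, value
    return best_action


if __name__ == "__main__":
    # Toy stand-ins: screens are strings, and a screen is good if it names the goal.
    def toy_propose(screen: str, goal: str) -> List[str]:
        return ["tap:settings", "tap:mail", "scroll"]

    def toy_simulate(screen: str, action: str) -> str:
        return f"{screen}>{action}"

    def toy_score(predicted: str, goal: str) -> float:
        return 1.0 if goal in predicted else 0.0

    print(propose_simulate_select("home", "mail",
                                  toy_propose, toy_simulate, toy_score))
```

Because the simulation happens before any action is executed on the device, a poor candidate can be discarded without touching the real interface.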
