source: "raw/articles/code2world-a-gui-world-model-via-renderable-code-generation.md"
Summary: Code2World: A GUI World Model via Renderable Code Generation
TL;DR: Code2World is a vision-language model that predicts next GUI states by generating renderable HTML code instead of pixels, achieving high visual fidelity while enabling fine-grained structural control for autonomous GUI agents.
Key Points
- Code2World generates renderable HTML code to simulate next visual states rather than using pixel-based or text-based approaches
- AndroidCode dataset contains over 80K high-quality screen-action pairs created by translating GUI trajectories into HTML
- Uses a visual-feedback revision loop to refine the synthesized code, keeping only samples whose rendered output reaches a SigLIP score > 0.9 against the target screenshot
- Two-stage training: SFT cold start followed by Render-Aware Reinforcement Learning (RARL) with dual rewards
- RARL uses Group Relative Policy Optimization (GRPO) with visual semantic and action consistency rewards
- Code2World-8B rivals the performance of GPT-5 and Gemini-3-Pro-Image on next UI prediction
- Improves downstream navigation: +9.5% success rate for Gemini-2.5-Flash on AndroidWorld
- Implements "Propose, Simulate, Select" pipeline for GUI agent enhancement
- Evaluation on the Android Control (in-distribution, ID) and GUI Odyssey (out-of-distribution, OOD) benchmarks shows strong generalization beyond the training data
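The RARL points above mention GRPO with two rewards. As a rough illustration of the group-relative part only, the sketch below normalizes the rewards of several HTML candidates sampled for the same (screen, action) input. The equal weighting of the two rewards, the example reward values, and the function names are assumptions for illustration and are not taken from the source.

```python
import statistics

def combined_reward(visual_reward, action_reward, w_visual=0.5, w_action=0.5):
    """Mix the two reward signals. The 0.5/0.5 weights are an assumption."""
    return w_visual * visual_reward + w_action * action_reward

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group, as in GRPO.

    Population standard deviation is used here; implementations differ
    on population vs. sample standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical (visual, action) rewards for four candidates of one prompt.
group = [combined_reward(v, a)
         for v, a in [(0.9, 1.0), (0.6, 0.0), (0.8, 1.0), (0.4, 0.0)]]
advantages = group_relative_advantages(group)
```

Candidates that score above the group mean receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.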
Concepts Covered
- GUI World Models — Core concept of predicting future interface states for autonomous agents
- Renderable Code Generation — Novel approach using HTML generation instead of pixel prediction
- Vision-Language Models — Adaptation of VLMs for GUI state prediction tasks
- Render-Aware Reinforcement Learning — RARL methodology for training with feedback from the rendered output
- GUI Agent Navigation — Downstream application showing practical benefits
- Cross-Platform Generalization — Testing robustness across different devices and applications
- Visual Semantic Alignment — Ensuring generated code produces visually accurate results
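The visual-feedback revision loop listed under Key Points can be sketched as a render, compare, revise cycle. This is a minimal sketch under stated assumptions: `render`, `similarity`, and `revise` are hypothetical callables standing in for an HTML renderer, a SigLIP-based image similarity, and a model that rewrites the code. Only the 0.9 threshold comes from the summary; the round limit is invented for illustration.

```python
from typing import Callable, Optional

def revise_until_aligned(
    html: str,
    target_image: object,
    render: Callable[[str], object],              # hypothetical: HTML -> image
    similarity: Callable[[object, object], float],  # hypothetical: SigLIP-style score
    revise: Callable[[str, float], str],          # hypothetical: rewrite the HTML
    threshold: float = 0.9,                       # threshold stated in the summary
    max_rounds: int = 3,                          # assumed revision budget
) -> Optional[str]:
    """Return HTML whose rendering scores above the threshold, else None."""
    for round_idx in range(max_rounds + 1):
        score = similarity(render(html), target_image)
        if score > threshold:
            return html  # accepted sample
        if round_idx < max_rounds:
            html = revise(html, score)
    return None  # discard samples that never reach the threshold
```

Returning `None` for samples that never pass is one possible policy. The source does not say what happens to samples that fail the check.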
Images and Figures
- ![code2world-a-gui-world-model-via-renderable-code-generation/img-0.png] — Project icon
- ![code2world-a-gui-world-model-via-renderable-code-generation/img-1.png] — Framework illustration showing input GUI + action → renderable code → predicted screenshot
- ![code2world-a-gui-world-model-via-renderable-code-generation/img-2.png] — Data synthesis pipeline and two-stage model optimization methodology
- ![code2world-a-gui-world-model-via-renderable-code-generation/img-3.png] — "Propose, Simulate, Select" pipeline for GUI agent enhancement
- ![code2world-a-gui-world-model-via-renderable-code-generation/img-4.png] — Quantitative comparison table across benchmarks
- ![code2world-a-gui-world-model-via-renderable-code-generation/img-5.png] through ![code2world-a-gui-world-model-via-renderable-code-generation/img-8.png] — Qualitative comparison examples showing email app launch, news app navigation, reminder completion, and e-commerce filtering
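The "Propose, Simulate, Select" pipeline shown in img-3 can be outlined in a few lines. The sketch below is only one plausible reading: the agent proposes candidate actions, the world model predicts the next screen for each one, and a scorer picks the action whose predicted screen best fits the goal. The callables `propose`, `simulate`, and `score`, the toy stand-ins, and the selection criterion are all hypothetical and not taken from the source.

```python
from typing import Callable, Sequence

def propose_simulate_select(
    screen: str,
    goal: str,
    propose: Callable[[str, str], Sequence[str]],  # hypothetical agent policy
    simulate: Callable[[str, str], str],           # hypothetical world model: predicted HTML
    score: Callable[[str, str], float],            # hypothetical goal-fit scorer
) -> str:
    """Pick the candidate action whose simulated next screen scores highest."""
    candidates = propose(screen, goal)
    if not candidates:
        raise ValueError("no candidate actions proposed")
    predicted = {action: simulate(screen, action) for action in candidates}
    return max(candidates, key=lambda action: score(predicted[action], goal))

# Toy stand-ins so the sketch runs end to end.
def toy_propose(screen, goal):
    return ["tap:Settings", "tap:Inbox", "scroll:down"]

def toy_simulate(screen, action):
    target = action.split(":", 1)[1]
    return f"<html><body><h1>{target}</h1></body></html>"

def toy_score(predicted_html, goal):
    return 1.0 if goal.lower() in predicted_html.lower() else 0.0

best = propose_simulate_select(
    "<html>home</html>", "Inbox", toy_propose, toy_simulate, toy_score
)
```

In a real agent the toy functions would be replaced by the navigation model, the world model plus a renderer, and a learned or model-based judge.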