source: "raw/articles/gtr-guided-thought-reinforcement-prevents-thought-collapse-in-rl-based-vlm-agent.md"
Summary: GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
TL;DR: GTR addresses "thought collapse" in RL-trained VLM agents by using an automated corrector to guide the reasoning process, achieving 3-5x higher success rates than existing methods on complex visual decision-making tasks.
Key Points
- Thought Collapse Problem: When training VLM agents with RL on complex tasks, the agent's chain-of-thought reasoning becomes state-irrelevant, incomplete, and rigid, leading to invalid actions and negative rewards
- GTR Framework: Combines automated thought correction with RL training. A VLM corrector (GPT-4o) evaluates and refines the agent's thoughts at each step; the agent is then trained with SFT on the corrected thoughts and PPO on the actions
- Performance Results: On the 24-point card game, GTR achieved a 17.5% success rate versus 2.5% for the RL4VLM baseline, and outperformed much larger models such as Qwen2-VL-72B (4.5%)
- Technical Innovation: Uses Dataset Aggregation (DAgger) to handle distribution shift in thought cloning, adds format rewards and repetition penalties, and lets the corrector use tools for task-specific corrections
- Generalization: Validated on both card games (gym_cards) and embodied tasks (ALFWorld), showing consistent improvements across different visual environments
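The split in the GTR Framework point (SFT on thoughts, PPO on actions) can be illustrated with a minimal per-step loss. This is a sketch under stated assumptions, not the paper's implementation: the function names, the `sft_weight` coefficient, and the way the two terms are summed are hypothetical, and a real setup would compute the log-probabilities from the VLM's token outputs.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped PPO surrogate for a single action (higher is better)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def thought_sft_loss(corrected_token_logps):
    """Mean negative log-likelihood of the corrector's thought tokens."""
    return -sum(corrected_token_logps) / len(corrected_token_logps)


def gtr_step_loss(logp_new, logp_old, advantage, corrected_token_logps,
                  sft_weight=1.0):
    """Assumed combination: PPO loss on the action plus weighted SFT loss
    on the corrected thought. The weighting scheme is hypothetical."""
    ppo_loss = -ppo_clip_objective(logp_new, logp_old, advantage)
    return ppo_loss + sft_weight * thought_sft_loss(corrected_token_logps)
```

The point of the split is that the environment reward only ever flows through the action term, while the thought tokens receive a dense supervised signal from the corrector, so the reasoning is not left to drift under a sparse final-action reward.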
Concepts Covered
- Reinforcement Learning — core training paradigm that suffers from thought collapse when rewards only consider final actions
- Vision-Language Models — VLMs like LLaVA-7B used as decision-making agents in visual environments
- Chain-of-Thought Reasoning — intermediate reasoning steps that deteriorate during RL training without process guidance
- Process Supervision — guidance during intermediate reasoning steps, implemented via automated thought correction
- Dataset Aggregation (DAgger) — imitation learning technique used to mitigate distribution shift in thought cloning
- Proximal Policy Optimization (PPO) — RL algorithm used for action optimization while SFT handles thought correction
- Multi-step Decision Making — sequential action planning in environments like 24-point games and household tasks
- Thought Correction — automated evaluation and refinement of agent reasoning using external VLM corrector
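The DAgger and Thought Correction entries above can be sketched as a data-collection loop: the agent's own rollouts supply the states, a corrector relabels the thoughts, and the relabeled pairs accumulate across rounds. Everything here is a toy assumption for illustration: `toy_corrector` is a hypothetical rule-based stand-in for the VLM corrector, and states are reduced to tuples of card values rather than images.

```python
def toy_corrector(cards, thought):
    """Hypothetical stand-in for a VLM corrector: keep the thought if it
    starts by reading the cards correctly, otherwise replace it."""
    expected = f"Cards: {sorted(cards)}."
    if thought.startswith(expected):
        return thought
    return f"{expected} Pick numbers and operators that move toward 24."


def dagger_round(buffer, rollout, corrector):
    """Append corrector-labeled thoughts for states the agent itself visited.

    Earlier pairs stay in the buffer, so later SFT sees data aggregated
    over all rounds rather than only the latest policy's states.
    """
    for cards, thought in rollout:
        buffer.append((cards, corrector(cards, thought)))
    return buffer


# Usage: two rounds of rollouts from a (pretend) improving agent.
buffer = []
dagger_round(buffer, [((3, 8, 1, 2), "I see cards 5 and 5.")], toy_corrector)
dagger_round(buffer, [((4, 6, 1, 1), "Cards: [1, 1, 4, 6]. Try 4 * 6.")],
             toy_corrector)
```

Because the states come from the agent's current policy rather than from an expert's trajectories, the supervised thought data tracks the distribution the agent actually encounters, which is the distribution-shift problem DAgger is meant to address.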