source: "raw/articles/gtr-guided-thought-reinforcement-prevents-thought-collapse-in-rl-based-vlm-agent.md"

Summary: GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training

TL;DR: GTR addresses "thought collapse" in RL-trained VLM agents by using an automated corrector to guide the reasoning process, achieving 3-5x higher success rates than existing methods on complex visual decision-making tasks.

Key Points

Thought Collapse Problem: When training VLM agents with RL on complex tasks, the agent's chain-of-thought reasoning becomes state-irrelevant, incomplete, and rigid, leading to invalid actions and negative rewards
GTR Framework: Combines automated thought correction with RL training. A VLM corrector (GPT-4o) evaluates and refines the agent's thoughts at each step; the agent is then trained with SFT on the corrected thoughts and PPO on the actions
Performance Results: On the 24-point card game, GTR achieved a 17.5% success rate vs. 2.5% for the RL4VLM baseline, and also outperformed much larger models such as Qwen2-VL-72B (4.5%)
Technical Innovation: Uses Dataset Aggregation (DAgger) to handle distribution shift in thought cloning, incorporates format rewards and repetition penalties, and lets the corrector use tools for task-specific corrections
Generalization: Validated on both card games (gym_cards) and embodied tasks (ALFWorld), showing consistent improvements across different visual environments
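
The split described under GTR Framework (SFT on thoughts, PPO on actions) can be illustrated with a small sketch. This is not the paper's implementation: the function names, the `sft_weight` coefficient, and the use of plain Python lists of log-probabilities are assumptions made for illustration. It only shows how a clipped PPO term on action tokens and a negative log-likelihood term on corrector-provided thought tokens might be summed into one objective.

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard clipped PPO surrogate loss, averaged over action samples."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += -min(ratio * adv, clipped * adv)
    return total / len(advantages)

def sft_loss(corrected_thought_logps):
    """Negative mean log-likelihood of the corrected thought tokens."""
    return -sum(corrected_thought_logps) / len(corrected_thought_logps)

def gtr_style_loss(logp_new, logp_old, advantages,
                   corrected_thought_logps, sft_weight=1.0):
    """Hypothetical combined objective: PPO on actions plus weighted SFT on thoughts."""
    return (ppo_clip_loss(logp_new, logp_old, advantages)
            + sft_weight * sft_loss(corrected_thought_logps))
```

In this sketch the action reward never touches the thought tokens directly; the thoughts are shaped only by the imitation term, which is the property the summary attributes to GTR.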

Concepts Covered

Reinforcement Learning — core training paradigm that suffers from thought collapse when rewards only consider final actions
Vision-Language Models — VLMs like LLaVA-7B used as decision-making agents in visual environments
Chain-of-Thought Reasoning — intermediate reasoning steps that deteriorate during RL training without process guidance
Process Supervision — guidance during intermediate reasoning steps, implemented via automated thought correction
Dataset Aggregation (DAgger) — imitation learning technique used to mitigate distribution shift in thought cloning
Proximal Policy Optimization (PPO) — RL algorithm used for action optimization while SFT handles thought correction
Multi-step Decision Making — sequential action planning in environments like 24-point games and household tasks
Thought Correction — automated evaluation and refinement of agent reasoning using external VLM corrector
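
To make the DAgger entry concrete, here is a toy sketch of the aggregation loop. Everything in it is a hypothetical stand-in: the integer states, the lookup-table "agent", the `corrector` function (which here simply returns a reference thought instead of calling a VLM), and the toy dynamics. The point it illustrates is general to DAgger: the agent's own rollouts decide which states get labeled, and labels from all rounds are kept and reused.

```python
def corrector(state, thought):
    """Hypothetical stand-in for the VLM corrector.

    A real corrector would evaluate `thought`; this one just returns
    a reference thought for the given state.
    """
    return f"state={state}: next add {state % 3}"

def rollout(policy, start, horizon):
    """Run the agent's own policy and return the states it visits."""
    states, state = [], start
    for _ in range(horizon):
        states.append(state)
        thought = policy.get(state, "unknown")
        # Toy dynamics: a wrong thought pushes the agent to an off-path state.
        state = state + 1 if thought == corrector(state, thought) else state + 2
    return states

def dagger(start=0, horizon=4, rounds=3):
    dataset = {}  # aggregated state -> corrected thought pairs
    policy = {}   # lookup-table "agent"
    for _ in range(rounds):
        for state in rollout(policy, start, horizon):
            thought = policy.get(state, "unknown")
            dataset[state] = corrector(state, thought)  # label visited states
        policy = dict(dataset)  # stand-in for an SFT update on all data so far
    return policy, dataset
```

In the first round the untrained agent drifts through off-path states, so those are the states that get corrected; later rounds add labels for the states the improved agent actually reaches, which is how aggregation counters distribution shift.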
