source: "raw/articles/gtr-guided-thought-reinforcement-prevents-thought-collapse-in-rl-based-vlm-agent.md"

Summary: GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training

TL;DR: GTR addresses "thought collapse" in RL-trained VLM agents by using an automated corrector to guide the agent's reasoning during training, achieving 3-5x higher task success rates than existing methods on complex visual decision-making tasks.

Key Points

  • Thought Collapse Problem: When training VLM agents with RL on complex tasks, the agent's chain-of-thought reasoning becomes state-irrelevant, incomplete, and rigid, leading to invalid actions and negative rewards
  • GTR Framework: Combines automated thought correction with RL training. A VLM corrector (GPT-4o) evaluates and refines the agent's thoughts at each step; the agent is then trained with SFT on the corrected thoughts and PPO on actions
  • Performance Results: On the 24-point card game, GTR achieved a 17.5% success rate vs. 2.5% for the RL4VLM baseline, and also outperformed much larger models such as Qwen2-VL-72B (4.5%)
  • Technical Innovation: Uses Dataset Aggregation (DAgger) to handle distribution shift in thought cloning, incorporates format rewards and repetition penalties, and enables tool usage for task-specific corrections
  • Generalization: Validated on both card games (gym_cards) and embodied tasks (ALFWorld), showing consistent improvements across different visual environments
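
The data flow described in the key points can be sketched roughly as follows. This is a minimal, framework-free illustration and not the authors' implementation: the names `ThoughtBuffer`, `is_well_formed`, `shaped_reward`, `gtr_step`, and `combined_loss` are hypothetical, the JSON output format with "thoughts" and "action" keys is an assumption, the bonus and penalty values are placeholders, and `corrector` is any callable standing in for the VLM corrector (GPT-4o in the summary above).

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ThoughtBuffer:
    """DAgger-style dataset: corrected thoughts for states the agent itself visited."""
    pairs: List[Tuple[str, str]] = field(default_factory=list)

    def aggregate(self, observation: str, corrected_thought: str) -> None:
        # Append rather than overwrite, so SFT sees data gathered under all past policies.
        self.pairs.append((observation, corrected_thought))


def is_well_formed(output: str) -> bool:
    """Assumed format check: output is a JSON object with 'thoughts' and 'action' keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and {"thoughts", "action"} <= parsed.keys()


def shaped_reward(env_reward: float, output: str, recent_outputs: List[str],
                  format_bonus: float = 0.5, repeat_penalty: float = 0.5) -> float:
    """Environment reward plus a format reward, minus a repetition penalty."""
    reward = env_reward
    if is_well_formed(output):
        reward += format_bonus
    if output in recent_outputs:
        reward -= repeat_penalty
    return reward


def gtr_step(observation: str, output: str, env_reward: float,
             corrector: Callable[[str, str], str],
             buffer: ThoughtBuffer, recent_outputs: List[str]) -> float:
    """Handle one agent step: correct the thought, aggregate it, shape the reward."""
    thought = json.loads(output)["thoughts"] if is_well_formed(output) else output
    buffer.aggregate(observation, corrector(observation, thought))
    reward = shaped_reward(env_reward, output, recent_outputs)
    recent_outputs.append(output)
    return reward


def combined_loss(ppo_action_loss: float, sft_thought_loss: float,
                  sft_weight: float = 1.0) -> float:
    """PPO loss on action tokens plus a weighted SFT loss on thought tokens."""
    return ppo_action_loss + sft_weight * sft_thought_loss
```

In a real training loop the two loss terms would be tensors computed over the action tokens and the thought tokens respectively; the scalar version here only shows how the PPO and SFT objectives are combined, and the weighting between them is an assumed hyperparameter.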

Concepts Covered

Related Concepts