source: "raw/articles/intrinsic-credit-assignment-for-long-horizon-interaction.md"

Summary: Intrinsic Credit Assignment for Long Horizon Interaction

TL;DR: ΔBelief-RL uses a language model's internal belief changes about target solutions as dense rewards for reinforcement learning, significantly improving information-seeking performance in long-horizon tasks.

Key Points

  • Proposes the ΔBelief-RL framework, which leverages the agent's intrinsic belief updates for credit assignment in multi-turn interactions
  • Uses the change in log-probability assigned to the target solution as a dense reward signal: ΔBelief_t = log(b_t / b_{t-1}), where b_t is the probability the model assigns to the target solution after turn t
  • CIA (Credit assignment via Intrinsic Assessment) models at 1.7B-4B parameters outperform DeepSeek-V3.2 (670B) by 10-19% on 20 Questions task
  • Performance continues improving when test-time interactions extend beyond 20-turn training horizon (up to 50 turns)
  • Strong generalization to out-of-distribution tasks: Customer Service, User Personalization, Guess My City, Murder Mystery
  • Uses Turn-wise GRPO for RL training with per-turn reward clipping at zero to avoid penalizing temporary confidence decreases
  • Validation through best-of-8 sampling shows ΔBelief maximization significantly improves success rates
  • Training dynamics show faster reduction in episode length and fewer repeated questions compared to standard GRPO
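
To make the reward definition above concrete, here is a minimal sketch of the per-turn ΔBelief reward with clipping at zero. It assumes the log-probabilities of the target solution have already been computed after each turn; the function name `delta_belief_rewards` and its arguments are hypothetical and not taken from the paper. How the paper normalizes these rewards or combines them with an outcome reward inside Turn-wise GRPO is not covered by this summary and is not modeled here.

```python
import math
from typing import List


def delta_belief_rewards(log_beliefs: List[float], clip_at_zero: bool = True) -> List[float]:
    """Per-turn rewards from a sequence of log-beliefs (illustrative helper).

    log_beliefs[0] is log b_0, the log-probability of the target before any
    interaction; log_beliefs[t] is log b_t after turn t.
    """
    rewards = []
    for prev, curr in zip(log_beliefs, log_beliefs[1:]):
        delta = curr - prev  # equals log(b_t / b_{t-1})
        # Clipping at zero means a temporary drop in confidence is not penalized.
        rewards.append(max(delta, 0.0) if clip_at_zero else delta)
    return rewards


# Toy belief trajectory (made-up numbers): belief rises, dips, then rises again.
beliefs = [0.1, 0.2, 0.1, 0.4]
rewards = delta_belief_rewards([math.log(b) for b in beliefs])
```

With these toy numbers, the second turn lowers the belief, so its reward is clipped to zero, while the other two turns receive log 2 and log 4. Without clipping, the per-turn rewards telescope to log(b_T / b_0), so the sum depends only on the initial and final beliefs.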

Concepts Covered

Related Concepts