source: "raw/articles/intrinsic-credit-assignment-for-long-horizon-interaction.md"
Summary: Intrinsic Credit Assignment for Long Horizon Interaction
TL;DR: ΔBelief-RL uses a language model's internal belief changes about target solutions as dense rewards for reinforcement learning, significantly improving information-seeking performance in long-horizon tasks.
Key Points
- Proposes ΔBelief-RL framework that leverages agent's intrinsic belief updates for credit assignment in multi-turn interactions
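- Uses the change in log-probability assigned to the target solution as a dense per-turn reward: ΔBelief_t = log(b_t / b_{t-1}) = log b_t - log b_{t-1}, where b_t is the probability the model assigns to the target solution after turn t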
- CIA (Credit assignment via Intrinsic Assessment) models at 1.7B-4B parameters outperform DeepSeek-V3.2 (670B) by 10-19% on 20 Questions task
- Performance continues improving when test-time interactions extend beyond 20-turn training horizon (up to 50 turns)
- Strong generalization to out-of-distribution tasks: Customer Service, User Personalization, Guess My City, Murder Mystery
- Uses Turn-wise GRPO for RL training with per-turn reward clipping at zero to avoid penalizing temporary confidence decreases
- Validation through best-of-8 sampling shows ΔBelief maximization significantly improves success rates
- Training dynamics show faster reduction in episode length and fewer repeated questions compared to standard GRPO
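
The reward defined in the key points above can be illustrated with a minimal sketch. It assumes the per-turn beliefs b_0..b_T (the probability the model assigns to the target solution after each turn) have already been extracted; the function name `delta_belief_rewards` and the example numbers are hypothetical and do not come from the article.

```python
import math

def delta_belief_rewards(beliefs, clip_at_zero=True):
    """Per-turn rewards from a belief trajectory b_0, b_1, ..., b_T.

    beliefs: probabilities (each > 0) assigned to the target solution,
             one value before the first turn and one after every turn.
    Returns one reward per turn: log(b_t / b_{t-1}), clipped at zero
    when clip_at_zero is True so that belief drops are not penalized.
    """
    rewards = []
    for prev, curr in zip(beliefs, beliefs[1:]):
        delta = math.log(curr / prev)  # ΔBelief_t = log b_t - log b_{t-1}
        rewards.append(max(delta, 0.0) if clip_at_zero else delta)
    return rewards

# Hypothetical trajectory: belief rises, dips once, then rises again.
beliefs = [0.01, 0.02, 0.015, 0.06]
rewards = delta_belief_rewards(beliefs)
raw = delta_belief_rewards(beliefs, clip_at_zero=False)
```

With clipping enabled, the turn where belief dips from 0.02 to 0.015 receives a reward of zero instead of a negative value, which matches the stated goal of not penalizing temporary decreases in confidence. Without clipping, the raw rewards telescope to log(b_T / b_0).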
Concepts Covered
- Reinforcement Learning — core training methodology with dense intrinsic rewards
- Credit Assignment — main problem being solved through belief change tracking
- Long Horizon Planning — application domain for multi-turn interaction tasks
- Language Model Beliefs — foundation for extracting internal probability distributions
- Information Seeking — task category including 20 Questions and diagnostic scenarios
- Test-Time Scaling — performance improvement with increased interaction budgets
- Out-of-Distribution Generalization — transfer to unseen task domains
- Turn-wise GRPO — modified RL algorithm for per-turn advantage computation
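
The summary says only that Turn-wise GRPO computes advantages per turn; it does not give the formula. The sketch below is one plausible reading, not the article's confirmed method: it assumes that rewards at the same turn index are normalized across a group of rollouts for the same task, in the style of GRPO's group-relative baseline. The function name `turnwise_advantages`, the `eps` parameter, and the handling of rollouts of different lengths are all assumptions.

```python
from statistics import mean, pstdev

def turnwise_advantages(group_rewards, eps=1e-8):
    """Group-normalized advantage for every turn of every rollout.

    group_rewards: list of per-rollout reward lists; rollouts may have
                   different lengths (episodes can end early).
    At each turn index t, only rollouts that reached turn t contribute
    to the mean and standard deviation (an assumption of this sketch).
    """
    max_turns = max(len(r) for r in group_rewards)
    advantages = [[0.0] * len(r) for r in group_rewards]
    for t in range(max_turns):
        active = [i for i, r in enumerate(group_rewards) if len(r) > t]
        values = [group_rewards[i][t] for i in active]
        mu = mean(values)
        sigma = pstdev(values)  # population std; 0.0 if all values are equal
        for i in active:
            advantages[i][t] = (group_rewards[i][t] - mu) / (sigma + eps)
    return advantages

# Hypothetical group of three rollouts with clipped ΔBelief rewards.
group = [[1.0, 0.0], [0.0, 0.0], [2.0]]
adv = turnwise_advantages(group)
```

Under this reading, a turn earns a positive advantage only when its belief gain exceeds what the other rollouts achieved at the same turn, so credit is assigned to individual questions rather than spread evenly over the whole episode.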