Next-Token Prediction (NTP)

Summary: The fundamental training objective for autoregressive language models, in which the model learns to predict the next token in a sequence from the preceding context. The Test-Time Training framework aligns its adaptation objective with NTP to enable dynamic adaptation during inference.

Overview

Next-Token Prediction is the task underlying modern autoregressive language models. During training, the model learns to output a probability distribution over the vocabulary for the next token, conditioned on all previous tokens in the sequence. Because predicting the next token well requires tracking both local patterns and long-range context, this simple objective drives models to learn language patterns, context dependencies, and reasoning capabilities.
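Written out, the objective described above is the average negative log-likelihood of each next token given its prefix. The symbols here are notation introduced for illustration and are not taken from the source: x_t is the token at position t, T is the sequence length, and θ denotes the model parameters. The first token is treated as given context.

```latex
\mathcal{L}_{\mathrm{NTP}}(\theta)
  = -\frac{1}{T-1} \sum_{t=1}^{T-1} \log P_\theta\left(x_{t+1} \mid x_1, \dots, x_t\right)
```

Minimizing this quantity is equivalent to maximizing the probability the model assigns to the observed continuation at every position.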

Because a model generates text at inference time from the same next-token distribution it was trained to produce, the NTP objective keeps the training and inference phases consistent. When a language model adapts during inference through Test-Time Training, an objective aligned with NTP strengthens the model's core predictive capability rather than introducing a conflicting optimization target.
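As a rough illustration of adapting on the same objective the model was trained with, the sketch below takes one gradient step on the NTP loss over tokens seen at inference time. It is a generic toy, not the update rule or target construction from the In-Place Test-Time Training source: the "model" is a hypothetical bigram logit table, and the names `ntp_loss`, `adapt_step`, and the learning rate are assumptions made for the example.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ntp_loss(weights, tokens):
    """Mean next-token cross-entropy of a bigram logit table on `tokens`."""
    total = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        total -= math.log(softmax(weights[prev])[nxt])
    return total / (len(tokens) - 1)

def adapt_step(weights, tokens, lr=0.5):
    """One full-batch gradient-descent step on the NTP loss.

    Returns a new table; the input table is left unchanged.
    """
    n_pairs = len(tokens) - 1
    new_weights = [row[:] for row in weights]
    for prev, nxt in zip(tokens, tokens[1:]):
        probs = softmax(weights[prev])
        for k, p in enumerate(probs):
            # d(cross-entropy)/d(logit_k) = p_k - 1[k == nxt]
            grad = p - (1.0 if k == nxt else 0.0)
            new_weights[prev][k] -= lr * grad / n_pairs
    return new_weights

vocab_size = 3
weights = [[0.0] * vocab_size for _ in range(vocab_size)]  # toy untrained table
context = [0, 1, 2, 0, 1, 2, 0]                            # tokens seen at inference

loss_before = ntp_loss(weights, context)
adapted = adapt_step(weights, context)
loss_after = ntp_loss(adapted, context)
print(f"NTP loss before: {loss_before:.4f}, after one step: {loss_after:.4f}")
```

Because the adaptation step minimizes the same quantity used in training, lowering it on the current context directly improves next-token prediction on that context.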

Key Details

Core Mechanism: Models learn P(token_{t+1} | token_1, ..., token_t) for each position t in a sequence
Training Process: Uses teacher forcing, where the ground-truth tokens (not the model's own predictions) form the context and provide supervision at each step
Alignment Importance: In-Place Test-Time Training demonstrates that NTP-aligned objectives outperform generic reconstruction targets
Theoretical Properties: NTP-aligned targets provably increase correct token logits while keeping incorrect token logits unchanged, unlike reconstruction-based alternatives
Implementation: Typically uses cross-entropy loss between the predicted next-token distribution and the actual next token, treated as a one-hot target
Scaling Benefits: Next-token prediction loss generally decreases as model size and training data grow
Context Dependency: Success depends on the model's ability to use preceding context effectively through Attention Mechanisms
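The core mechanism, teacher forcing, and cross-entropy loss listed above can be sketched together in a few lines. This is a minimal illustration using only the standard library. The `logits_fn` interface and the toy models are hypothetical stand-ins for a real language model, not an API from the source.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ntp_loss(tokens, logits_fn):
    """Mean cross-entropy of next-token predictions under teacher forcing.

    tokens    : list of integer token ids
    logits_fn : hypothetical model interface; maps a prefix (list of ids)
                to a list of logits, one per vocabulary entry
    """
    losses = []
    for t in range(1, len(tokens)):
        prefix = tokens[:t]            # ground-truth context, not model samples
        probs = softmax(logits_fn(prefix))
        target = tokens[t]             # the actual next token
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy stand-in over a 3-token vocabulary: favors "previous id + 1 (mod 3)".
def toy_logits(prefix):
    logits = [0.0, 0.0, 0.0]
    logits[(prefix[-1] + 1) % 3] = 2.0
    return logits

# Baseline that assigns equal probability to every token.
def uniform_logits(prefix):
    return [0.0, 0.0, 0.0]

sequence = [0, 1, 2, 0, 1]
loss_toy = ntp_loss(sequence, toy_logits)
loss_uniform = ntp_loss(sequence, uniform_logits)
print(f"toy model: {loss_toy:.4f}, uniform baseline: {loss_uniform:.4f}")
```

The uniform baseline scores log(3), the loss of guessing blindly over three tokens. The toy model scores lower because it puts more probability on the token that actually follows.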

Relationships

Test-Time Training — aligns adaptation objectives with NTP for consistent optimization
In-Place Test-Time Training — uses NTP-aligned targets to improve dynamic adaptation performance
Fast Weights — updated during inference to better support NTP objectives in new contexts
Autoregressive Models — fundamental architecture that relies on NTP as primary training objective
Transformer Architecture — enables effective NTP through self-attention and positional encoding
In-Context Learning — emergent capability that arises from NTP training on diverse contexts
Long Context Modeling — benefits from NTP alignment when adapting to extended sequences
Induction Heads — attention patterns that emerge from NTP training to handle repetitive sequences

Sources

sources/in-place-test-time-training — demonstrates importance of NTP alignment for test-time adaptation and provides theoretical analysis of NTP-aligned objectives