source: "raw/articles/in-place-test-time-training.md"
Summary: In-Place Test-Time Training
TL;DR: A framework that adds test-time training capabilities to LLMs by repurposing existing MLP blocks as adaptable fast weights, enabling dynamic parameter updates at inference time without architectural changes.
Key Points
- Problem: Traditional LLMs follow a static "train then deploy" paradigm that prevents dynamic adaptation to new information during inference
- Core Innovation: Treats the final projection matrix of MLP blocks as "fast weights" that can be updated in-place during inference
- Drop-in Design: Requires no architectural modifications, so it can be added to pre-trained models without costly retraining
- Efficiency: Uses chunk-wise updates instead of per-token updates, compatible with context parallelism
- LM-Aligned Objective: Replaces generic reconstruction targets with next-token prediction aligned objectives
- Theoretical Foundation: Proves that LM-aligned targets increase the correct token's logit while leaving the other logits unchanged
- Results: A 4B-parameter model achieves superior performance on contexts up to 128k tokens; models trained from scratch also show consistent improvements
- Scalability: Tested on models from 500M to 14B parameters with consistent gains
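The chunk-wise apply-then-update cycle described in the key points can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name `in_place_ttt_forward`, the squared-error loss, the single gradient step per chunk, the learning rate, and the `targets` array (a stand-in for the LM-aligned targets, whose exact construction this summary does not give) are all assumptions made for the example.

```python
import numpy as np


def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def in_place_ttt_forward(x, w_up, w_down, targets, chunk_size=4, lr=0.1):
    """Hypothetical chunk-wise apply-then-update pass over one sequence.

    x:       (seq_len, d_model) inputs to the MLP block
    w_up:    (d_model, d_hidden) slow weights, kept frozen
    w_down:  (d_hidden, d_model) final projection, treated as fast weights
    targets: (seq_len, d_model) placeholder per-token targets (assumption)
    """
    fast = w_down.copy()  # copy so the caller's weights are not mutated
    outputs = np.empty_like(x)
    for start in range(0, x.shape[0], chunk_size):
        sl = slice(start, start + chunk_size)
        h = gelu(x[sl] @ w_up)
        # Apply: the fast weights reflect only earlier chunks, so the
        # output for this chunk never depends on its own targets.
        outputs[sl] = h @ fast
        # Update: one gradient step on the chunk-averaged squared error
        # 0.5 * ||h @ fast - target||^2 (assumed loss, for illustration).
        err = outputs[sl] - targets[sl]
        fast -= lr * (h.T @ err) / h.shape[0]
    return outputs, fast
```

Updating once per chunk rather than once per token means each chunk is processed with a single batched matrix product, which is the efficiency argument the summary attributes to chunk-wise updates. With `chunk_size` equal to the sequence length, the sketch reduces to an ordinary static MLP forward pass.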
Concepts Covered
- Test-Time Training — Dynamic parameter adaptation during inference using fast weights
- Fast Weights — Small subset of model parameters updated on-the-fly to store contextual information
- Context Parallelism — Parallel processing of sequence chunks while maintaining causal ordering
- MLP Repurposing — Using existing MLP blocks as adaptable memory rather than adding new components
- Next-Token Prediction — Autoregressive language modeling objective that the framework aligns with
- Chunk-wise Updates — Processing sequences in blocks for computational efficiency
- In-Context Learning — Model adaptation through input context rather than parameter updates
- Long-Context Modeling — Handling sequences beyond typical context window limits
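To give some intuition for how a fast-weight update aligned with next-token prediction could raise only the correct token's logit, here is a toy rank-one argument. It is not the paper's proof. It assumes a hidden activation $h$, a fast-weight matrix $W$ with block output $Wh$, unembedding vectors $u_j$ read directly from that output (ignoring residual connections and normalization), a step size $\eta > 0$, and mutually orthogonal unembedding vectors; all of these symbols and assumptions are introduced here for illustration only.

```latex
\begin{aligned}
z_j &= u_j^{\top} W h
  && \text{(logit of token } j\text{)} \\
\Delta W &= \eta\, t\, h^{\top}
  && \text{(rank-one update toward target } t\text{)} \\
\Delta z_j &= u_j^{\top} \Delta W\, h = \eta\, \lVert h \rVert^{2}\, u_j^{\top} t \\
t = u_y,\quad u_j^{\top} u_y = 0 \ (j \neq y)
  \;&\Longrightarrow\;
  \Delta z_y = \eta\, \lVert h \rVert^{2} \lVert u_y \rVert^{2} > 0,
  \qquad \Delta z_j = 0 \ (j \neq y)
\end{aligned}
```

Under these idealized assumptions, choosing the target along the correct token's unembedding direction increases that token's logit for the same activation $h$ and leaves every other logit untouched. A generic reconstruction target has no such alignment with the output vocabulary.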
Figures and Data
- Figure 1: Framework overview showing sequential chunk processing with apply-then-update cycles
- Figure 2: Sliding window perplexity comparisons showing consistent improvements across context lengths
- Figure 3: Ablation studies on state size, chunk size, and objective components
- Figure 4: Efficiency analysis showing minimal computational overhead
- Table 1: RULER benchmark results showing gains especially at longer contexts (128k+)
- Tables 2-3: Extension results across different model families and scales