source: "raw/articles/in-place-test-time-training.md"

Summary: In-Place Test-Time Training

TL;DR: A framework that adds test-time training capabilities to LLMs by repurposing existing MLP blocks as adaptable fast weights, enabling dynamic parameter updates at inference time without architectural changes.

Key Points

Problem: Traditional LLMs follow a static "train then deploy" paradigm that prevents dynamic adaptation to new information during inference
Core Innovation: Treats the final projection matrix of MLP blocks as "fast weights" that can be updated in-place during inference
Drop-in Design: No architectural modifications are needed; the mechanism can be added to pre-trained models without costly retraining
Efficiency: Uses chunk-wise updates instead of per-token updates, which keeps the method compatible with context parallelism
LM-Aligned Objective: Replaces generic reconstruction targets with next-token prediction aligned objectives
Theoretical Foundation: Proves that LM-aligned targets increase the logit of the correct token while keeping the other logits unchanged
Results: A 4B-parameter model achieves superior performance on contexts up to 128k tokens; the approach also shows consistent improvements when trained from scratch
Scalability: Tested on models from 500M to 14B parameters with consistent gains
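
The apply-then-update cycle described in the key points above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the paper's implementation: the function name `chunkwise_fast_weight_forward`, the squared-error loss, the `targets` array, and the learning rate `lr` are all assumptions made for the example (the paper uses an LM-aligned objective whose exact form is not given in this summary). What the sketch does show is the structure: each chunk is processed with weights that have only seen earlier chunks, and the down-projection matrix is then updated once per chunk rather than once per token.

```python
import numpy as np

def chunkwise_fast_weight_forward(h, targets, w_down, chunk_size, lr):
    """Hypothetical chunk-wise apply-then-update loop.

    h:       (T, d_hidden) MLP intermediate activations
    targets: (T, d_model)  placeholder regression targets (an assumption)
    w_down:  (d_hidden, d_model) final MLP projection, used as fast weights
    """
    w = w_down.copy()  # leave the pre-trained weights untouched
    outputs = np.empty((h.shape[0], w.shape[1]))
    for start in range(0, h.shape[0], chunk_size):
        stop = start + chunk_size
        h_c, t_c = h[start:stop], targets[start:stop]
        # Apply: the current weights reflect only earlier chunks (causal).
        out_c = h_c @ w
        outputs[start:stop] = out_c
        # Update: one gradient step on a squared-error loss, averaged over the chunk.
        grad = h_c.T @ (out_c - t_c) / h_c.shape[0]
        w -= lr * grad
    return outputs, w

# Example usage with random data (shapes are arbitrary).
rng = np.random.default_rng(0)
h = rng.normal(size=(10, 6))
targets = rng.normal(size=(10, 4))
w0 = rng.normal(size=(6, 4))
out, w_final = chunkwise_fast_weight_forward(h, targets, w0, chunk_size=4, lr=0.1)
```

Because the update happens after the chunk's outputs are computed, changing the targets of a later chunk cannot alter any earlier output, which is the causal property that chunk-wise processing needs to preserve.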

Concepts Covered

Test-Time Training — Dynamic parameter adaptation during inference using fast weights
Fast Weights — Small subset of model parameters updated on-the-fly to store contextual information
Context Parallelism — Parallel processing of sequence chunks while maintaining causal ordering
MLP Repurposing — Using existing MLP blocks as adaptable memory rather than adding new components
Next-Token Prediction — Autoregressive language modeling objective that the framework aligns with
Chunk-wise Updates — Processing sequences in blocks for computational efficiency
In-Context Learning — Model adaptation through input context rather than parameter updates
Long-Context Modeling — Handling sequences beyond typical context window limits
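
To make the link between fast weights and next-token prediction more concrete, the following sketch checks numerically how a single fast-weight update can raise only the correct token's logit. It rests on assumptions that are ours, not confirmed details of the paper: the update is taken to be a rank-one outer product between the activation `z` and the embedding of the correct token, the output embeddings in `embed` are assumed to be exactly orthonormal, and the function name `logit_shift_after_update` is hypothetical. Under those assumptions the logit change is `lr * ||z||^2` for the correct token and zero for every other token.

```python
import numpy as np

def logit_shift_after_update(z, w, embed, correct_id, lr):
    """Change in logits for activation z after one rank-one fast-weight update.

    z:     (d_hidden,) MLP intermediate activation
    w:     (d_hidden, d_model) fast-weight matrix
    embed: (vocab, d_model) output embeddings, assumed orthonormal rows
    """
    before = (z @ w) @ embed.T
    # Assumed update form: push the output for z toward the correct token's embedding.
    w_new = w + lr * np.outer(z, embed[correct_id])
    after = (z @ w_new) @ embed.T
    return after - before

# Example: build orthonormal embeddings via QR and inspect the shift.
rng = np.random.default_rng(1)
q, _ = np.linalg.qr(rng.normal(size=(8, 5)))
embed = q.T  # 5 tokens, model width 8, orthonormal rows
z = rng.normal(size=6)
w = rng.normal(size=(6, 8))
delta = logit_shift_after_update(z, w, embed, correct_id=2, lr=0.5)
```

With real embeddings, which are not exactly orthogonal, the other logits would shift by an amount proportional to their inner product with the correct token's embedding, so the "unchanged" property holds only to the extent that the orthogonality assumption does.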

Figures and Data

Figure 1: Framework overview showing sequential chunk processing with apply-then-update cycles
Figure 2: Sliding window perplexity comparisons showing consistent improvements across context lengths
Figure 3: Ablation studies on state size, chunk size, and objective components
Figure 4: Efficiency analysis showing minimal computational overhead
Table 1: RULER benchmark results showing gains especially at longer contexts (128k+)
Tables 2-3: Extension results across different model families and scales
