MLP Blocks

Summary: Standard feed-forward components in Transformer architectures that can serve dual roles as both static knowledge storage (slow weights) and dynamic contextual adaptation (fast weights). In the In-Place Test-Time Training framework, their final projection matrices (Wdown) become adaptable parameters that update during inference without requiring architectural changes.

Overview

MLP Blocks are the feed-forward components of the Transformer Architecture. They process token representations through linear transformations and nonlinear activations. Each block typically consists of an up-projection (Wup), an activation function, and a down-projection (Wdown), which together transform the hidden states passed between layers.
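
The up-projection, activation, and down-projection described above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the source: the toy dimensions, the GELU activation, and the names `mlp_block`, `w_up`, and `w_down` are all assumptions. Many production models use a gated variant with an extra gate projection, which is omitted here.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU; the choice of activation is an assumption
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_block(x, w_up, w_down):
    """Map hidden states of shape (seq_len, d_model) back to (seq_len, d_model)."""
    hidden = gelu(x @ w_up)   # up-projection + activation: (seq_len, d_ff)
    return hidden @ w_down    # down-projection: (seq_len, d_model)

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5          # hypothetical toy sizes
w_up = rng.normal(scale=0.1, size=(d_model, d_ff))
w_down = rng.normal(scale=0.1, size=(d_ff, d_model))
x = rng.normal(size=(seq_len, d_model))

y = mlp_block(x, w_up, w_down)
print(y.shape)  # (5, 8)
```

In this layout `w_down` is the only matrix that maps from the wide hidden space back to the model dimension, which is the matrix the framework treats as fast weights.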

The In-Place Test-Time Training framework repurposes these existing components as adaptive memory units by treating their final projection matrices (Wdown) as Fast Weights while keeping all other parameters as slow weights. This dual-weight system enables dynamic adaptation without specialized memory layers or architectural modifications: the same components handle both general knowledge storage and context-specific adaptation.

The framework processes the sequence in chunks of 512-1024 tokens, with MLP blocks following an "apply-then-update" cycle on each chunk. During the apply phase, the current fast weights process the input chunk as in a standard forward pass. During the update phase, the fast weights adapt according to an objective aligned with Next-Token Prediction, which uses a 1D convolution to incorporate future token information. This cycle allows continuous adaptation to new contextual patterns, and it stays computationally efficient because it is compatible with Context Parallelism based on associative parallel scan algorithms.
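
The apply-then-update cycle can be sketched as follows. This is a toy illustration under stated assumptions, not the framework's implementation: the update rule (an outer product of hidden activations and targets), the ReLU activation, the look-ahead kernel, and the use of next-token embeddings as targets are all hypothetical stand-ins, since the source does not spell out these details here. What the sketch does show is the ordering: each chunk is processed with fast weights learned from earlier chunks only, and its own update is applied afterwards.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def lookahead_conv1d(v, kernel):
    """1D convolution along the sequence axis that mixes each row with the rows
    after it in the same chunk; kernel[0] weights the row itself."""
    out = np.zeros_like(v)
    for offset, weight in enumerate(kernel):
        if offset >= len(v):
            break
        out[: len(v) - offset] += weight * v[offset:]
    return out

def apply_then_update(x, next_emb, w_up, w_down, chunk_len, lr, kernel):
    fast = w_down.copy()                      # fast weights start from pre-trained W_down
    outputs = []
    for start in range(0, len(x), chunk_len):
        xc = x[start:start + chunk_len]
        hidden = relu(xc @ w_up)              # slow weights stay fixed
        outputs.append(hidden @ fast)         # apply: uses updates from earlier chunks only
        target = lookahead_conv1d(next_emb[start:start + chunk_len], kernel)
        fast = fast + lr * hidden.T @ target / len(xc)   # update (hypothetical rule)
    return np.concatenate(outputs), fast

rng = np.random.default_rng(0)
d_model, d_ff, seq_len, chunk_len = 8, 16, 12, 4          # toy sizes, not 512-1024
w_up = rng.normal(scale=0.1, size=(d_model, d_ff))
w_down = rng.normal(scale=0.1, size=(d_ff, d_model))
x = rng.normal(size=(seq_len, d_model))
next_emb = rng.normal(size=(seq_len, d_model))            # stand-in for next-token embeddings
kernel = np.array([0.5, 0.3, 0.2])
lr = 0.1

out, fast_final = apply_then_update(x, next_emb, w_up, w_down, chunk_len, lr, kernel)
print(out.shape)
```

Because the update for a chunk lands only after that chunk's outputs are computed, the first chunk's output matches the unmodified base model, and changing a later token cannot affect earlier outputs.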

Theoretical analysis shows that LM-aligned targets increase the logit of the correct token while leaving the logits of incorrect tokens unchanged, in contrast to the generic reconstruction targets used in traditional Test-Time Training approaches. Within the assumptions of that analysis, an update therefore does not raise any competing token's score, which is the basis for the claim that adaptation improves performance without degrading existing capabilities.
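
The intuition behind this result can be illustrated with a short worked derivation. It is a simplified sketch under strong assumptions, not the proof from the source: the update is taken to be a single rank-one step, the target v is assumed to equal the unembedding vector u_y of the correct token, unembedding vectors of different tokens are assumed orthogonal, and everything between the MLP output and the logits (residual stream, normalization) is ignored. Here h and h' are hypothetical hidden activations at a context position and a later position, and η is a step size.

```latex
% Illustrative only: notation and assumptions are hypothetical, not the source's proof.
\begin{align*}
\Delta W_{\text{down}} &= \eta\, h\, v^{\top}
  && \text{(rank-one update from one context position)} \\
\Delta o' &= \Delta W_{\text{down}}^{\top} h' = \eta\,(h^{\top} h')\, v
  && \text{(change in MLP output at a later position)} \\
\Delta z_j &= u_j^{\top} \Delta o' = \eta\,(h^{\top} h')\,(u_j^{\top} v)
  && \text{(change in the logit of token } j\text{)} \\
v = u_y,\; u_j^{\top} u_y = 0 \ (j \neq y)
  &\;\Longrightarrow\;
  \Delta z_y = \eta\,(h^{\top} h')\,\lVert u_y \rVert^{2}, \quad \Delta z_j = 0 .
\end{align*}
```

Under these assumptions the correct token's logit rises whenever the two hidden activations are positively aligned (h^T h' > 0), while every other logit is untouched. A generic reconstruction target has no such alignment with the unembedding vectors, so the same argument does not go through.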

Key Details

  • Dual Weight System: Slow weights preserve pre-trained knowledge while fast weights (Wdown matrices) adapt to context
  • No Architecture Changes: Drop-in enhancement that preserves all existing pre-trained parameters and model compatibility
  • Optimal Chunk Sizes: 512-1024 tokens balance adaptation capability with computational efficiency
  • Selective Updates: Only Wdown projection matrices update during inference, minimizing memory overhead
  • Theoretical Guarantees: LM-aligned objectives are proven to increase correct token logits while keeping incorrect token logits unchanged
  • Scale Compatibility: Effective across model sizes; validated on Qwen3-4B, LLaMA-3.1-8B, and Qwen3-14B models
  • Context Performance: Improves performance on contexts up to 128k tokens and extrapolates to 256k
  • Training Efficiency: Outperforms competitive TTT baselines when trained from scratch at 500M, 1.5B, and 4B parameter scales
  • Implementation Efficiency: Compatible with Context Parallelism and maintains strict causality requirements
  • Memory Requirements: Minimal additional memory overhead compared to the base model
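
The compatibility with associative parallel scans mentioned above can be illustrated with a small sketch. It rests on an assumption that is not confirmed by the text here: that each chunk's update to Wdown is an additive delta that does not depend on the current fast weights. Under that assumption, the fast weights seen by chunk i are the initial weights plus the sum of all earlier deltas, an exclusive prefix sum. Addition is associative, so the prefix sum can be computed by a parallel scan. The function names are hypothetical, and `np.cumsum` stands in for a real parallel scan.

```python
import numpy as np

def sequential_fast_weights(w0, deltas):
    """Fast weights seen by each chunk when updates are applied one chunk at a time."""
    states, w = [], w0.copy()
    for delta in deltas:
        states.append(w.copy())   # chunk i is processed before its own update lands
        w = w + delta
    return np.stack(states)

def scanned_fast_weights(w0, deltas):
    """Same states via an exclusive prefix sum over the per-chunk deltas."""
    inclusive = np.cumsum(deltas, axis=0)   # stand-in for an associative parallel scan
    exclusive = inclusive - deltas          # drop each chunk's own delta
    return w0 + exclusive

rng = np.random.default_rng(0)
n_chunks, d_ff, d_model = 6, 16, 8          # hypothetical toy sizes
w0 = rng.normal(size=(d_ff, d_model))
deltas = rng.normal(scale=0.01, size=(n_chunks, d_ff, d_model))

seq_states = sequential_fast_weights(w0, deltas)
scan_states = scanned_fast_weights(w0, deltas)
print(np.allclose(seq_states, scan_states))
```

If the per-chunk delta did depend on the current fast weights, this simple prefix-sum view would no longer apply and a different scan operator would be needed.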

Relationships

  • Fast Weights — MLP projection matrices (Wdown) serve as the adaptable fast weight parameters that update during inference
  • Test-Time Training — Core paradigm enabling dynamic updating of MLP components during inference without gradient-based optimization
  • Transformer Architecture — Base architecture where MLP blocks are standard feed-forward components between attention layers
  • Next-Token Prediction — Objective function that guides how MLP fast weights are updated using future token information via 1D convolution
  • Chunk-wise Updates — Processing strategy that determines how MLP blocks adapt over sequential token chunks for computational efficiency
  • In-Context Learning — Alternative approach that MLP block adaptation can augment or replace, especially beyond context window constraints
  • Long Context Modeling — Primary application domain where adaptive MLP blocks provide significant performance benefits
  • Context Parallelism — Parallel processing technique using associative parallel scans that remains compatible with MLP fast weight updates
  • Induction Heads — Theoretical framework used to analyze the benefits of MLP block adaptation with LM-aligned objectives
  • Dynamic Adaptation — Core capability enabled by repurposing MLP blocks for contextual memory and learning

Sources

  • sources/in-place-test-time-training — Framework description, dual-weight system design, theoretical analysis proving logit guarantees, experimental validation across multiple model scales, and computational efficiency comparisons