Transformer Architecture

Summary: Neural network architecture based on self-attention that processes all positions of a sequence in parallel rather than one at a time. Introduced in "Attention Is All You Need" (2017), it became the foundation for large language models such as GPT and BERT by combining multi-head attention with positional encoding, which makes training on long sequences parallelizable.

Overview

The Transformer architecture revolutionized sequence modeling by replacing recurrent and convolutional layers with self-attention mechanisms. Unlike RNNs that process tokens sequentially, Transformers can attend to all positions in a sequence simultaneously, enabling parallel computation and better capture of long-range dependencies.

The core innovation is the attention mechanism that computes relationships between all pairs of positions in a sequence. Each position can directly attend to any other position, allowing information to flow across arbitrary distances in a single step rather than through multiple sequential operations.

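The architecture consists of stacked encoder and decoder blocks (or decoder blocks only, for autoregressive models), each built from two main sub-layers: multi-head self-attention and a position-wise feed-forward network (an MLP; see MLP Repurposing). In encoder-decoder models, each decoder block also has a cross-attention sub-layer over the encoder output. Residual connections and layer normalization around each sub-layer stabilize training of deep networks.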
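
The block structure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes the pre-norm arrangement (the original 2017 model used post-norm), omits the learned scale and shift of layer normalization, and uses a uniform-averaging stand-in for attention. The names pre_norm_block, layer_norm, feed_forward, and uniform_attention are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    # Assumption: the learned scale and shift parameters are omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, w1, b1, w2, b2):
    # Position-wise MLP: the same two-layer network is applied to every token.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def pre_norm_block(x, attention_fn, ffn_fn):
    # Sub-layer 1: attention, wrapped in a residual connection.
    x = x + attention_fn(layer_norm(x))
    # Sub-layer 2: feed-forward network, wrapped in a residual connection.
    x = x + ffn_fn(layer_norm(x))
    return x

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32
x = rng.normal(size=(seq_len, d_model))
w1 = rng.normal(size=(d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)

def uniform_attention(h):
    # Stand-in for self-attention: every position receives the mean of all positions.
    return np.broadcast_to(h.mean(axis=0, keepdims=True), h.shape)

out = pre_norm_block(x, uniform_attention, lambda h: feed_forward(h, w1, b1, w2, b2))
print(out.shape)  # (4, 8)
```

Because each sub-layer only adds to its input, a block whose sub-layers output zeros reduces to the identity, which is one intuition for why residual connections make deep stacks easier to train.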

Key Details

Core Components:

  • Multi-Head Attention: Parallel attention heads that capture different types of relationships between sequence positions
  • Feed-Forward Networks: Position-wise MLPs that process each token independently after attention
  • Positional Encoding: Sinusoidal or learned embeddings that inject position information, since self-attention on its own is permutation-equivariant and has no notion of token order
  • Layer Normalization: Applied before each sub-layer (pre-norm) or after (post-norm) for training stability
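
As an illustration of the positional encoding component, the sketch below builds the sinusoidal variant in NumPy, using sine on even dimensions and cosine on odd dimensions with a base of 10000. The function name sinusoidal_positional_encoding is hypothetical, d_model is assumed to be even, and the zero token embeddings are placeholders.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Assumption: d_model is even, so sine and cosine columns pair up exactly.
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)

# Position information is added to the token embeddings before the first block.
token_embeddings = np.zeros((6, 8))  # placeholder embeddings for illustration
inputs = token_embeddings + pe
print(inputs.shape)  # (6, 8)
```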

Attention Mechanism:

  • Computes attention weights as the softmax of scaled dot products between queries and keys, then uses them to take a weighted sum of the values
  • Uses multiple heads to capture different representation subspaces
  • Divides the dot products by √(d_k) to prevent softmax saturation in high dimensions
  • O(n²) time and memory in sequence length n, limiting context window size
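
The points above can be made concrete with a small NumPy sketch of scaled dot-product attention and a multi-head wrapper. It is a sketch under stated assumptions: a single unbatched sequence, no masking, no biases, and randomly initialized projection matrices. The names w_q, w_k, w_v, and w_o are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    d_k = q.shape[-1]
    # Every query is compared with every key: an n x n score matrix per head.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    n, d_model = x.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(t):
        # (n, d_model) -> (num_heads, n, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    heads, weights = scaled_dot_product_attention(q, k, v)
    # Concatenate the heads back to (n, d_model) and mix them with w_o.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)  # (5, 16) (4, 5, 5)
```

The weights array holds n × n entries per head, which is where the quadratic cost in sequence length comes from.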

Architectural Variants:

  • Encoder-Decoder: Full bidirectional encoder with causal decoder (original Transformer, T5)
  • Decoder-Only: Causal self-attention only (GPT family, most modern LLMs)
  • Encoder-Only: Bidirectional attention, suited to understanding tasks such as classification (BERT)
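
The difference between bidirectional and causal attention comes down to a mask applied to the score matrix before the softmax. The sketch below uses all-zero scores so the effect of the mask is easy to read; the function name attention_weights is hypothetical.

```python
import numpy as np

def attention_weights(scores, causal):
    n = scores.shape[-1]
    if causal:
        # Mask out positions to the right of the diagonal (future tokens).
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Softmax over keys; masked entries get exactly zero weight.
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # equal scores, so weights are uniform over allowed positions

bidirectional = attention_weights(scores, causal=False)
causal = attention_weights(scores, causal=True)

print(bidirectional[0])  # [0.25 0.25 0.25 0.25]
print(causal[0])         # [1. 0. 0. 0.]
print(causal[1])         # [0.5 0.5 0.  0. ]
```

Encoder-only models use the unmasked form throughout, decoder-only models use the causal form throughout, and encoder-decoder models use the unmasked form in the encoder and the causal form in the decoder's self-attention.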

Training Characteristics:

  • Highly parallelizable compared to RNNs, enabling efficient GPU utilization
  • Requires large datasets and compute to achieve strong performance
  • Benefits from techniques like gradient accumulation, mixed precision, and distributed training
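
Gradient accumulation, one of the techniques listed above, can be illustrated without a deep learning framework. The sketch below uses a toy linear regression with a mean-squared-error loss (an assumption chosen for brevity, not a Transformer) to show that averaging gradients over equal-sized micro-batches reproduces the full-batch gradient, which is what lets a large effective batch fit in limited memory.

```python
import numpy as np

def mse_gradient(w, x, y):
    # Gradient of 0.5 * mean((x @ w - y) ** 2) with respect to w.
    residual = x @ w - y
    return x.T @ residual / len(y)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 3))
y = rng.normal(size=32)
w = np.zeros(3)

# Gradient computed on the full batch of 32 examples at once.
full_grad = mse_gradient(w, x, y)

# The same gradient accumulated over 4 micro-batches of 8 examples each.
num_micro_batches = 4
accumulated = np.zeros_like(w)
for xb, yb in zip(np.split(x, num_micro_batches), np.split(y, num_micro_batches)):
    accumulated += mse_gradient(w, xb, yb) / num_micro_batches

print(np.allclose(full_grad, accumulated))  # True
```

The equality relies on the micro-batches having equal size; with unequal sizes each micro-batch gradient would need to be weighted by its share of the examples.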

Relationships

  • Test-Time Training — Can be enhanced with dynamic parameter adaptation during inference using fast weights
  • Attention Mechanisms — Core component enabling parallel sequence processing and long-range dependencies
  • Next-Token Prediction — Primary training objective for autoregressive Transformer variants like GPT
  • Long-Context Modeling — Limited by quadratic attention complexity, addressed by techniques like sliding windows
  • In-Context Learning — Emergent capability where Transformers adapt to new tasks through input context alone
  • Parameter Efficient Fine-tuning — Methods like LoRA that adapt pre-trained Transformers with minimal parameter updates
  • Memory Augmented Networks — Extensions that add external memory to overcome context window limitations
  • Linear Attention — Approximation methods that reduce attention complexity from quadratic to linear
  • State Space Models — Alternative architectures like Mamba that achieve similar capabilities with linear complexity
  • Context Parallelism — Training technique that processes long sequences in parallel chunks

Sources

  • sources/in-place-test-time-training — Demonstrated how existing MLP blocks in Transformers can be repurposed as adaptable fast weights for test-time training without architectural changes