Attention Mechanisms

Summary: Neural network components that allow models to selectively focus on relevant parts of input sequences by computing weighted representations. They enable models to dynamically determine which information is most important for the current task, forming the foundation of modern transformer architectures.

Overview

Attention mechanisms solve the fundamental problem of how neural networks can focus on specific parts of their input when making predictions. Rather than treating all input elements equally, attention computes a set of weights that indicate the relative importance of each input element for the current context.

The core concept involves three main components:

  • Query (Q): Represents what information is being sought
  • Key (K): Represents the available information that can be matched against
  • Value (V): Contains the actual information content to be retrieved

The attention mechanism computes similarity scores between queries and keys, converts these to probability distributions via softmax, then uses these weights to create a weighted combination of values. This allows the model to create dynamic, context-dependent representations.

In transformer architectures, attention operates through the scaled dot-product attention formula:

Attention(Q,K,V) = softmax(QK^T/√d_k)V
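
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration for a single unbatched sequence, not a production implementation; the helper names (softmax, scaled_dot_product_attention) and the toy shapes are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).

    Returns the output (n_q, d_v) and the attention weights (n_q, n_k).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # each row is a distribution over the keys
    return weights @ V, weights          # weighted combination of the values

# Toy inputs (hypothetical sizes): 3 queries, 5 key/value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of weights sums to 1, so every output row is a convex combination of the rows of V.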

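Multi-head attention extends this by running several attention operations in parallel, each with its own learned projections of the queries, keys, and values. The head outputs are concatenated and projected back to the model dimension, which lets different heads capture different types of relationships in the data.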
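
A minimal sketch of multi-head self-attention follows, assuming the common layout in which the model dimension is split evenly across heads and a final output projection mixes the concatenated heads. The weight matrices here are random placeholders rather than trained parameters, and all names and sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (n, d_model); all W_*: (d_model, d_model). Assumes d_model % num_heads == 0."""
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, n, n)
    weights = softmax(scores, axis=-1)                     # one attention map per head
    heads = weights @ V                                    # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate the heads
    return concat @ W_o, weights

# Toy setup (hypothetical sizes): 6 tokens, d_model = 8, 2 heads.
rng = np.random.default_rng(0)
n, d_model, num_heads = 6, 8, 2
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
```

Because each head works in a lower-dimensional subspace (d_model / num_heads), the total cost stays comparable to a single full-width attention operation.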

Key Details

Computational Properties:

  • Attention scores are computed as dot products between query and key vectors
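  • Dividing the scores by √d_k keeps their variance roughly independent of the key dimension d_k, which prevents softmax saturation (and the resulting small gradients) in high dimensions
  • Multi-head attention typically uses 8-16 heads in parallel in base-sized models (the original Transformer used 8); larger models often use more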
  • Self-attention occurs when queries, keys, and values come from the same sequence
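
The role of the √d_k scaling can be checked numerically. The sketch below assumes query and key components are independent with zero mean and unit variance, which is an idealization rather than a property of trained models. Under that assumption the raw dot product has variance close to d_k, while the scaled score has variance close to 1, and the unscaled softmax is at least as peaked as the scaled one.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 256

# Variance of raw vs. scaled dot products over many random (query, key) pairs.
q = rng.normal(size=(10000, d_k))
k = rng.normal(size=(10000, d_k))
raw = np.sum(q * k, axis=1)
scaled = raw / np.sqrt(d_k)
raw_var, scaled_var = raw.var(), scaled.var()

# Peak softmax weight for one query against 16 keys, with and without scaling.
query = rng.normal(size=d_k)
keys = rng.normal(size=(16, d_k))
scores = keys @ query
peak_raw = softmax(scores).max()
peak_scaled = softmax(scores / np.sqrt(d_k)).max()

print(f"var(raw)={raw_var:.1f}  var(scaled)={scaled_var:.2f}")
print(f"peak weight raw={peak_raw:.3f}  scaled={peak_scaled:.3f}")
```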

Attention Patterns:

  • Causal/Masked Attention: Prevents attending to future tokens in autoregressive models
  • Bidirectional Attention: Allows attending to entire sequence (used in BERT-style models)
  • Cross-Attention: Attends between different sequences (encoder-decoder architectures)
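
Causal masking can be sketched by setting the scores of future positions to negative infinity before the softmax, so they receive zero weight. This is a minimal single-head illustration with assumed function names. Bidirectional attention corresponds to omitting the mask, and cross-attention to taking Q from one sequence and K, V from another.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_self_attention(Q, K, V):
    """Q, K: (n, d_k), V: (n, d_v). Position i attends only to positions j <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True strictly above the diagonal marks "future" positions.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)   # exp(-inf) = 0, so masked entries get zero weight
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 5, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out, weights = causal_self_attention(Q, K, V)
```

The diagonal is never masked, so every row keeps at least one finite score and the softmax stays well defined. The first token can attend only to itself.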

Efficiency Considerations:

  • Standard attention has O(n²) time and memory complexity in the sequence length n, because it scores every query against every key
  • Various approximations exist for long sequences (Linear Attention, sparse attention)
  • Context Parallelism splits a long sequence across multiple devices so that attention over it can be computed in a distributed fashion
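
To illustrate how linear attention avoids the n × n score matrix, the sketch below replaces the softmax with a positive feature map φ and reorders the matrix products. The choice φ(x) = elu(x) + 1 is one option from the linear-attention literature and is an assumption here. This variant is a substitute for softmax attention, not a way to compute it exactly, so its outputs differ from those of standard attention.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: strictly positive, so the normalizer below is never zero.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention_quadratic(Q, K, V):
    # Reference form: builds the full (n, n) similarity matrix.
    phi_q, phi_k = feature_map(Q), feature_map(K)
    A = phi_q @ phi_k.T
    return (A @ V) / A.sum(axis=1, keepdims=True)

def linear_attention_linear(Q, K, V):
    # Same result via associativity: phi(Q) @ (phi(K).T @ V), no (n, n) matrix.
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv = phi_k.T @ V              # (d_k, d_v) summary of all keys and values
    z = phi_k.sum(axis=0)         # (d_k,) normalizer statistics
    return (phi_q @ kv) / (phi_q @ z)[:, None]

rng = np.random.default_rng(0)
n, d_k, d_v = 64, 8, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out_quadratic = linear_attention_quadratic(Q, K, V)
out_linear = linear_attention_linear(Q, K, V)
```

Both functions return the same values up to floating-point error, but the second costs O(n · d_k · d_v) instead of O(n²) in the sequence length.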

Memory and Adaptation:

  • Attention weights can be viewed as a form of dynamic memory access
  • Fast Weights approaches use attention-like mechanisms for rapid adaptation
  • Test-Time Training frameworks can leverage attention for dynamic parameter updates
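
The dynamic-memory view can be made concrete by treating attention as a soft dictionary lookup: keys act as addresses, values as stored content, and the query selects a blend of entries. The sketch below is a toy illustration; the temperature parameter is a hypothetical knob added only to show how the lookup sharpens toward a hard retrieval.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def soft_lookup(query, keys, values, temperature=1.0):
    """Return a weighted blend of values; lower temperature gives a sharper lookup."""
    scores = keys @ query / temperature
    weights = softmax(scores)
    return weights @ values, weights

# Three orthogonal "addresses", each storing a 2-dimensional value.
keys = np.eye(3)
values = np.array([[10.0, 0.0],
                   [0.0, 20.0],
                   [30.0, 30.0]])
query = np.array([0.0, 1.0, 0.0])   # matches the second key

soft_out, soft_w = soft_lookup(query, keys, values, temperature=1.0)
sharp_out, sharp_w = soft_lookup(query, keys, values, temperature=0.01)
```

At temperature 1.0 the result is a blend dominated by the second entry. At temperature 0.01 the weights are nearly one-hot and the output is numerically equal to the stored value for the matching key.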

Relationships

Sources