Linear Attention
Summary: An efficient attention mechanism that achieves linear computational complexity O(n) with respect to sequence length, compared to the quadratic O(n²) complexity of standard attention. Linear attention enables processing of much longer sequences while maintaining the core benefits of attention-based modeling.
Overview
Linear attention represents a fundamental redesign of the attention mechanism to address the computational bottleneck that prevents transformers from scaling to very long sequences. Traditional attention computes pairwise interactions between all tokens, resulting in quadratic complexity that becomes prohibitive for sequences beyond a few thousand tokens.
The key insight behind linear attention is to approximate the attention computation through low-rank decompositions or kernel methods that avoid explicitly computing the full attention matrix. Instead of computing attention weights as softmax(QK^T/√d), linear attention typically uses alternative formulations like:
- Kernel-based attention using φ(q)φ(k)^T where φ is a feature mapping
- Low-rank approximations of the attention matrix
- Reformulations that leverage associativity to change the order of operations
This reformulation enables the processing of sequences that would be computationally intractable with standard attention, making it particularly valuable for applications requiring Long-Context Modeling such as document analysis, code generation, and scientific text processing.
Key Details
Computational Complexity: Reduces attention complexity from O(n²d) to O(nd²) or O(nd) depending on the specific variant, where n is sequence length and d is feature dimension.
Memory Requirements: Linear attention significantly reduces memory usage by avoiding the storage of the full n×n attention matrix, enabling processing of sequences with 100k+ tokens on standard hardware.
Performance Trade-offs: While more efficient, linear attention may sacrifice some modeling capability compared to full attention, particularly for tasks requiring precise long-range dependencies or complex positional reasoning.
Variants: Multiple approaches exist including:
- Performer (using random Fourier features)
- Linear Transformer (using kernel approximations)
- Linformer (using low-rank projections)
- FNet (replacing attention with Fourier transforms)
Training Considerations: Models with linear attention can often be trained on longer sequences from the start, unlike approaches that extend context length post-training.
Relationships
- Attention Mechanisms — Linear attention is a computationally efficient variant of the core attention mechanism
- State Space Models — Share the goal of linear complexity for long sequences, often used as alternatives to linear attention
- Context Parallelism — Linear attention enables more effective parallel processing of long contexts
- Fast Weights — Some linear attention variants use fast weight concepts for efficient state updates
- Test-Time Training — Linear attention's efficiency makes it more feasible to perform parameter updates during inference
- Memory Augmented Networks — Both address the challenge of processing and remembering information from long sequences
- Parameter Efficient Fine-tuning — Linear attention models often require different fine-tuning strategies due to their architectural differences
Sources
- raw/articles/in-place-test-time-training — Referenced linear attention as a related concept for efficient long-context processing and scalable model architectures