Linear Attention

Summary: An efficient attention mechanism that achieves linear computational complexity O(n) with respect to sequence length, compared to the quadratic O(n²) complexity of standard attention. Linear attention enables processing of much longer sequences while maintaining the core benefits of attention-based modeling.

Overview

Linear attention represents a fundamental redesign of the attention mechanism to address the computational bottleneck that prevents transformers from scaling to very long sequences. Traditional attention computes pairwise interactions between all tokens, resulting in quadratic complexity that becomes prohibitive for sequences beyond a few thousand tokens.

The key insight behind linear attention is to approximate the attention computation through low-rank decompositions or kernel methods that avoid explicitly computing the full attention matrix. Instead of computing attention weights as softmax(QK^T/√d), linear attention typically uses alternative formulations like:

Kernel-based attention using φ(q)φ(k)^T where φ is a feature mapping
Low-rank approximations of the attention matrix
Reformulations that leverage associativity to change the order of operations

This reformulation enables the processing of sequences that would be computationally intractable with standard attention, making it particularly valuable for applications requiring Long-Context Modeling such as document analysis, code generation, and scientific text processing.

Key Details

Computational Complexity: Reduces attention complexity from O(n²d) to O(nd²) or O(nd) depending on the specific variant, where n is sequence length and d is feature dimension.

Memory Requirements: Linear attention significantly reduces memory usage by avoiding the storage of the full n×n attention matrix, enabling processing of sequences with 100k+ tokens on standard hardware.

Performance Trade-offs: While more efficient, linear attention may sacrifice some modeling capability compared to full attention, particularly for tasks requiring precise long-range dependencies or complex positional reasoning.

Variants: Multiple approaches exist including:

Performer (using random Fourier features)
Linear Transformer (using kernel approximations)
Linformer (using low-rank projections)
FNet (replacing attention with Fourier transforms)

Training Considerations: Models with linear attention can often be trained on longer sequences from the start, unlike approaches that extend context length post-training.

Relationships

Attention Mechanisms — Linear attention is a computationally efficient variant of the core attention mechanism
State Space Models — Share the goal of linear complexity for long sequences, often used as alternatives to linear attention
Context Parallelism — Linear attention enables more effective parallel processing of long contexts
Fast Weights — Some linear attention variants use fast weight concepts for efficient state updates
Test-Time Training — Linear attention's efficiency makes it more feasible to perform parameter updates during inference
Memory Augmented Networks — Both address the challenge of processing and remembering information from long sequences
Parameter Efficient Fine-tuning — Linear attention models often require different fine-tuning strategies due to their architectural differences

Sources

raw/articles/in-place-test-time-training — Referenced linear attention as a related concept for efficient long-context processing and scalable model architectures