Chapter 134: Linear Attention and State Space Models (SSM)
Overview
For years, the deep learning community treated Attention mechanisms (which power Transformers) and State Space Models (which trace back to continuous-time control theory and Kalman filters) as two fundamentally different approaches to sequence modeling. The 2024 paper “Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality” (Dao & Gu) demonstrated a deep mathematical duality between the two families.
In algorithmic trading—specifically when processing high-frequency Limit Order Book (LOB) tick data—standard Attention breaks down: recomputing $O(N^2)$ pairwise comparisons every time a new bid or ask arrives is computationally prohibitive.
This chapter introduces the Structured State Space Duality (SSD) and Gated Linear Attention (GLA), explaining how to reformulate the Transformer into an $O(N)$ Recurrent Neural Network (RNN) that maintains a constant memory matrix. You learn how to feed millions of ticks from exchanges like Bybit into an endless memory stream that forgets and adapts instantly.
Table of Contents
- Introduction to Linear Attention vs Standard Attention
- Mathematical Foundations: Transformers are SSMs
- Gated Linear Attention (GLA) & Forgetting Mechanisms
- Trading Context: High-Frequency Inference
- Python Implementations
- Rust Engine Implementation
- Backtesting Framework
- References
Introduction to Linear Attention vs Standard Attention
The $O(N^2)$ Problem
Standard Attention relies on the Softmax kernel:
$$ \text{Output} = \text{softmax}(Q \cdot K^T) \cdot V $$
Because Softmax is applied to the combination of $Q$ (Query) and $K$ (Key), you must calculate $Q \cdot K^T$ before doing anything else. For a sequence length $N$, this requires an $N \times N$ matrix. If $N$ is 100,000 ticks, you run out of GPU memory.
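To make the cost concrete, here is a minimal sketch of standard softmax attention; the shapes are illustrative assumptions, not taken from this chapter's code. Note where the $N \times N$ score matrix appears.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # scores is N x N: this is the intermediate that explodes with sequence length
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

N, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, N, d))
out = softmax_attention(Q, K, V)
print(out.shape)  # (8, 4), but the hidden scores matrix was (8, 8)
```

Because the softmax couples every query with every key, the $N \times N$ intermediate cannot be avoided by reordering the multiplications.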
The $O(N)$ Solution
Linear Attention drops the Softmax and applies a positive feature map $\phi(x)$ to $Q$ and $K$ separately:
$$ \text{Output} = \phi(Q) \cdot (\phi(K)^T \cdot V) $$
By the associative property of matrix multiplication, we can compute $(\phi(K)^T \cdot V)$ first. This product is a fixed-size $d \times d$ state matrix $S$, so compute and memory scale linearly with the sequence length $N$.
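The reassociation can be verified numerically. The sketch below uses the common ELU+1 feature map as an assumed choice for $\phi$; with softmax removed, both groupings produce identical outputs.

```python
import numpy as np

def phi(x):
    # ELU(x) + 1: a common positive feature map (an assumed choice here)
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(1)
N, d = 16, 4
Q, K, V = rng.normal(size=(3, N, d))

quadratic = (phi(Q) @ phi(K).T) @ V  # O(N^2) grouping: builds an N x N matrix
S = phi(K).T @ V                     # fixed d x d state matrix
linear = phi(Q) @ S                  # O(N) grouping: never materializes N x N
print(np.allclose(quadratic, linear))  # True
```

The two groupings are exactly equal; only the cost of computing them differs.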
Mathematical Foundations: Transformers are SSMs
When we drop the Softmax and unroll linear attention chronologically over sequence steps $t$, a profound link emerges: the attention sum can be rewritten exactly as a recurrence:
$$ S_t = S_{t-1} + \phi(K_t)^T \cdot V_t $$ $$ O_t = \phi(Q_t) \cdot S_t $$
This implies that Linear Attention is an RNN, with the matrix $S_t$ serving as the hidden state. Dao and Gu generalized this recurrence and showed that structured state space models (SSMs) and attention mechanisms are dual views of the same computation.
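The duality between the parallel (attention-style) and recurrent (SSM-style) forms can be checked directly. This sketch compares causal masked linear attention against the step-by-step recurrence above, again using an assumed ELU+1 feature map.

```python
import numpy as np

def phi(x):
    # ELU(x) + 1: an assumed positive feature map
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(2)
N, d = 12, 4
Q, K, V = rng.normal(size=(3, N, d))

# Parallel causal form: lower-triangular mask on phi(Q) phi(K)^T
masked = np.tril(phi(Q) @ phi(K).T)
parallel = masked @ V

# Recurrent form: a d x d state updated once per step
S = np.zeros((d, d))
recurrent = np.empty_like(parallel)
for t in range(N):
    S += np.outer(phi(K[t]), V[t])  # S_t = S_{t-1} + phi(K_t)^T V_t
    recurrent[t] = phi(Q[t]) @ S    # O_t = phi(Q_t) S_t
print(np.allclose(parallel, recurrent))  # True
```

Same outputs, two cost profiles: the parallel form suits training on a GPU, the recurrent form suits $O(1)$-per-step inference.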
Gated Linear Attention (GLA) & Forgetting Mechanisms
If we add a data-dependent exponential decay factor $A$ (which mirrors the transition matrix in SSMs), we get Structured State Space Duality (SSD) or Gated Linear Attention:
$$ S_t = A \cdot S_{t-1} + K_t^T \cdot V_t $$
In trading, $A$ lets the model selectively “forget”. When a massive news event hits the market, the model’s gates shrink the exponential $A$ factor, erasing the outdated history matrix in milliseconds and allowing it to adapt to a violently shifting volatility regime without resetting.
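A minimal sketch of one gated step, assuming a scalar data-dependent decay $a_t \in (0, 1)$ produced by a sigmoid gate; the function and variable names are illustrative, not the chapter's actual implementation.

```python
import numpy as np

def gla_step(S, k_t, v_t, q_t, a_t):
    """One recurrent step: S_t = a_t * S_{t-1} + k_t^T v_t, then O_t = q_t S_t."""
    S = a_t * S + np.outer(k_t, v_t)
    return S, q_t @ S

d = 4
rng = np.random.default_rng(3)
S = np.zeros((d, d))
for _ in range(100):
    k, v, q = rng.normal(size=(3, d))
    a = 1.0 / (1.0 + np.exp(-rng.normal()))  # sigmoid gate; near 0 => fast forgetting
    S, o = gla_step(S, k, v, q, a)
print(S.shape, o.shape)  # (4, 4) (4,)
```

When the gate outputs a value near zero, the previous state is almost entirely erased in a single step, which is exactly the regime-shift behavior described above.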
Trading Context: High-Frequency Inference
Limit Order Books (LOBs)
Because the state matrix $S$ is constant in size, you can maintain a live continuous trading bot.
- Subscribe to Bybit Websocket streams.
- Every 1ms, receive a new LOB matrix $x_t$.
- Extract features $K_t$ and $V_t$, and fold them into the persistent memory matrix $S$ via a decayed rank-one update.
- $O_t$ provides the future probabilistic price trend. The total inference time per tick is $O(1)$.
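The steps above can be sketched as a tiny online loop. The feature extraction and decay gate here are stand-ins (in practice $K_t$, $V_t$, $Q_t$ come from learned projections of the LOB snapshot); the point is that each tick costs a fixed $O(d^2)$, independent of history length.

```python
import numpy as np

d = 8
S = np.zeros((d, d))
rng = np.random.default_rng(4)

def extract_features(x):
    # Placeholder projections: learned linear maps in a real model
    return x[:d], x[d:2 * d], x[:d]

for tick in range(1_000):
    x = rng.normal(size=2 * d)      # stand-in for a new LOB snapshot
    k, v, q = extract_features(x)
    a = 0.99                        # stand-in decay gate
    S = a * S + np.outer(k, v)      # constant-size memory update
    trend = q @ S                   # O_t: the trend signal for this tick
print(trend.shape)  # (8,)
```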
Python Implementations
The python/ directory contains neural network architectures implementing the structured state space duality.
01. The Core Duality Cell
- File: python/model.py
This module contains LinearAttentionSSMCell. It explicitly breaks down the matrix multiplications: dropping softmax, enforcing a continuous decay factor $\exp(A)$, and yielding the exact dimension mappings proven in the ICML 2024 paper.
Run: python python/model.py
02. The Training Loop
- File: python/train.py
A module generating simulated multidimensional LOB tick-streams representing Bybit feeds. It incorporates gradient clipping for RNN stabilization and optimizes the linear attention memory against a Mean Squared Error objective.
Run: python python/train.py
03. Example Notebook
- File: python/notebooks/example.ipynb
A Jupyter playground for comparing the exploding Softmax context against the fast, constant-size State Space state.
Rust Engine Implementation
Live High-Frequency Trading systems cannot afford Python’s Garbage Collection pauses.
The rust/ directory demonstrates exactly how to carry the continuous State Matrix forward over raw Float streams.
- Core Library: rust/src/lib.rs
- Live Inference Execution: rust/src/main.rs
The code demonstrates how the SSD state matrix updates in place, without allocating new memory, providing microsecond-level reactions to LOB imbalances.
Run: cd rust && cargo run
Backtesting Framework
Because Linear Attention is sequential, you don’t calculate arrays of indicators. You stream a single price line through the hidden state.
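A minimal streaming backtest skeleton in this spirit: one pass over prices, hidden state carried forward, no indicator arrays. The feature mapping, signal rule, and sizing below are illustrative placeholders, not the logic of python/backtest.py.

```python
import numpy as np

rng = np.random.default_rng(5)
prices = 100 + np.cumsum(rng.normal(scale=0.1, size=10_000))

d = 4
S = np.zeros((d, d))
position, pnl = 0.0, 0.0
for t in range(1, len(prices)):
    r = prices[t] - prices[t - 1]
    pnl += position * r                 # mark the previous position to market
    k = v = q = np.full(d, r)           # stand-in features built from the return
    S = 0.99 * S + np.outer(k, v)       # decayed state update
    signal = float(q @ S @ np.ones(d))  # stand-in trend score from the state
    position = np.sign(signal)          # enter long/short off the state
print(np.isfinite(pnl))  # True
```

Real metrics (Sharpe, drawdown) would aggregate the per-step PnL series instead of a single running total.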
- File: python/backtest.py
By carrying the hidden state uniformly across 10,000 steps, the script triggers entries off internal SSM probabilities and reports standard metrics (Sharpe, drawdown, PnL).
Run: python python/backtest.py
References
- Dao, T., & Gu, A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. ICML 2024. arXiv:2405.21060
- Yang, S., Wang, B., Shen, Y., Panda, R., & Kim, Y. (2024). Gated Linear Attention Transformers with Hardware-Efficient Training. ICML 2024. arXiv:2312.06635
- Chapter 133: HiPPO Framework - the polynomial-basis foundation for continuous State Space projections.
- Chapter 135: Bidirectional Mamba