
Chapter 134: Linear Attention and State Space Models (SSM)


Overview

For years, the deep learning community saw Attention mechanisms (which power Transformers) and State Space Models (which trace back to continuous-time control theory and Kalman filters) as two fundamentally different approaches to sequence modeling. However, the 2024 paper “Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality” (Dao & Gu) demonstrated that the two are mathematically dual: a broad class of SSMs can be rewritten as a form of masked linear attention, and vice versa.

In algorithmic trading, specifically when processing high-frequency Limit Order Book (LOB) tick data, classic attention becomes a bottleneck: recomputing $O(N^2)$ pairwise comparisons every time a new bid or ask arrives is computationally prohibitive.

This chapter introduces Structured State Space Duality (SSD) and Gated Linear Attention (GLA), explaining how to reformulate the Transformer as an $O(N)$ Recurrent Neural Network (RNN) that maintains a constant-size memory matrix. You will learn how to stream millions of ticks from exchanges like Bybit through a fixed-size memory that selectively forgets and adapts in real time.

Table of Contents

  1. Introduction to Linear Attention vs Standard Attention
  2. Mathematical Foundations: Transformers are SSMs
  3. Gated Linear Attention (GLA) & Forgetting Mechanisms
  4. Trading Context: High-Frequency Inference
  5. Python Implementations
  6. Rust Engine Implementation
  7. Backtesting Framework
  8. References

Introduction to Linear Attention vs Standard Attention

The $O(N^2)$ Problem

Standard Attention relies on the Softmax kernel: $$ \text{Output} = \text{softmax}(Q \cdot K^T) \cdot V $$

Because the softmax is applied to the full $Q \cdot K^T$ product, you must materialize that matrix before doing anything else. For a sequence length $N$, this is an $N \times N$ matrix: at $N = 100{,}000$ ticks, a single float32 attention matrix occupies roughly 40 GB, exhausting GPU memory.

The $O(N)$ Solution

Linear Attention drops the Softmax and applies a positive feature map $\phi(x)$ to $Q$ and $K$ separately:

$$ \text{Output} = \phi(Q) \cdot (\phi(K)^T \cdot V) $$

By the associative property of matrix multiplication, we can compute $(\phi(K)^T \cdot V)$ first. This product is a fixed-size $d \times d$ matrix, the state $S$, so both compute and memory scale linearly in $N$.
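The regrouping can be checked numerically. The sketch below, assuming an $\mathrm{elu}(x)+1$ feature map (one common positive feature map; the chapter's own code may use a different one), computes both groupings and confirms they produce identical outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 16                      # sequence length, head dimension

def phi(x):
    # elu(x) + 1: a simple positive feature map
    return np.where(x > 0, x + 1.0, np.exp(x))

Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Quadratic grouping: (phi(Q) @ phi(K).T) materializes an N x N matrix
out_quadratic = (phi(Q) @ phi(K).T) @ V

# Linear grouping: phi(K).T @ V is a fixed d x d state, independent of N
S = phi(K).T @ V
out_linear = phi(Q) @ S

assert np.allclose(out_quadratic, out_linear)
```

Without the softmax there is no approximation here: the two groupings are exactly equal, but the second never allocates anything larger than $d \times d$.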


Mathematical Foundations: Transformers are SSMs

When we drop the softmax and expand linear attention step by step over sequence positions $t$, a profound link emerges. The attention sum can be rewritten exactly as a recurrence:

$$ S_t = S_{t-1} + \phi(K_t)^T \cdot V_t $$

$$ O_t = \phi(Q_t) \cdot S_t $$

This means Linear Attention is an RNN: the matrix $S_t$ serves as its hidden state. Dao and Gu showed that this recurrence is precisely a (discretized) state space model, establishing that SSMs and attention mechanisms are dual views of the same computation.
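The duality between the two views can be verified directly: the parallel "attention" form with a causal mask and the sequential "RNN" form produce the same outputs. A minimal sketch, again assuming the $\mathrm{elu}(x)+1$ feature map:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 8

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))

Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
fQ, fK = phi(Q), phi(K)

# Parallel (attention) view with a causal mask: position t attends to t' <= t
scores = np.tril(fQ @ fK.T)          # N x N lower-triangular score matrix
out_parallel = scores @ V

# Recurrent (SSM) view: S_t = S_{t-1} + phi(K_t)^T V_t,  O_t = phi(Q_t) S_t
S = np.zeros((d, d))
out_recurrent = np.empty_like(V)
for t in range(N):
    S += np.outer(fK[t], V[t])       # rank-1 update of the hidden state
    out_recurrent[t] = fQ[t] @ S     # read-out against the hidden state

assert np.allclose(out_parallel, out_recurrent)
```

The parallel form is what you train on a GPU; the recurrent form is what you run at inference time, one tick at a time.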

Gated Linear Attention (GLA) & Forgetting Mechanisms

If we add a data-dependent decay factor $A_t$ (which mirrors the transition matrix of an SSM), we obtain Structured State Space Duality (SSD), also known as Gated Linear Attention:

$$ S_t = A_t \cdot S_{t-1} + K_t^T \cdot V_t $$

In trading, $A_t$ lets the model selectively “forget”. When a major news event hits the market, the gates shrink $A_t$, rapidly down-weighting the stale history held in the state matrix and letting the model adapt to a violently shifting volatility regime without being reset.
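The effect of the gate can be demonstrated with a toy experiment. The sketch below uses a hypothetical scalar gate (real GLA learns vector- or matrix-valued data-dependent gates) and crashes it at a simulated "news event": the resulting state is almost indistinguishable from one built only from post-event ticks.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, event = 500, 8, 250
K, V = rng.standard_normal((N, d)), rng.standard_normal((N, d))

def run(start, gate_at_event):
    # S_t = A_t * S_{t-1} + K_t^T V_t, with a scalar gate A_t in (0, 1)
    S = np.zeros((d, d))
    for t in range(start, N):
        a_t = gate_at_event if t == event else 0.99
        S = a_t * S + np.outer(K[t], V[t])
    return S

S_gated = run(0, 0.05)       # gate collapses at the "news event"
S_fresh = run(event, 0.05)   # state built only from post-event ticks

# The pre-event history is almost entirely erased by the small gate
rel_err = np.linalg.norm(S_gated - S_fresh) / np.linalg.norm(S_fresh)
assert rel_err < 0.05
```

One small gate value multiplies the entire accumulated history, which is why the forgetting is effectively instantaneous.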


Trading Context: High-Frequency Inference

Limit Order Books (LOBs)

Because the state matrix $S$ has constant size, you can run a live trading bot with constant per-tick memory and compute:

  1. Subscribe to Bybit Websocket streams.
  2. Every 1ms, receive a new LOB matrix $x_t$.
  3. Extract features $K_t$ and $V_t$ and fold them into the persistent state matrix $S$ after applying the decay factor.
  4. Read out $O_t = \phi(Q_t) \cdot S_t$ as a probabilistic estimate of the short-term price trend. Inference cost per tick is $O(1)$.
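The loop above can be sketched as a streaming cell. The feature projections here are hypothetical random matrices standing in for learned weights, and the simulated feed stands in for a real Bybit websocket subscription; the point is that per-tick cost and memory never grow with history.

```python
import numpy as np

class StreamingSSMCell:
    """O(1)-per-tick inference: the state S never grows with the history."""

    def __init__(self, n_features, d, seed=0):
        rng = np.random.default_rng(seed)
        # Hypothetical random projections standing in for learned weights
        self.Wq = rng.standard_normal((n_features, d)) / np.sqrt(n_features)
        self.Wk = rng.standard_normal((n_features, d)) / np.sqrt(n_features)
        self.Wv = rng.standard_normal((n_features, d)) / np.sqrt(n_features)
        self.S = np.zeros((d, d))

    def step(self, x, a=0.99):
        """Consume one LOB feature vector x; return the read-out O_t."""
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        self.S = a * self.S + np.outer(k, v)   # decay + rank-1 state update
        return q @ self.S

# Simulated tick feed standing in for a live websocket stream
rng = np.random.default_rng(42)
cell = StreamingSSMCell(n_features=20, d=16)
for _ in range(10_000):
    out = cell.step(rng.standard_normal(20))

assert out.shape == (16,) and cell.S.shape == (16, 16)
```

After 10,000 ticks the cell still holds exactly one $16 \times 16$ state matrix, which is what makes millisecond-scale reaction times feasible.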

Python Implementations

The python/ directory contains neural network implementations built around the structured state space duality.

01. The Core Duality Cell

  • File: python/model.py This module contains LinearAttentionSSMCell. It spells out the matrix multiplications explicitly, drops the softmax, applies an exponential decay factor $\exp(A)$, and follows the dimension mappings from the ICML 2024 paper.
```sh
python python/model.py
```

02. The Training Loop

  • File: python/train.py A module generating simulated multidimensional LOB tick streams representing Bybit feeds. It applies gradient clipping for RNN stability and trains the linear-attention memory with a mean squared error objective.
```sh
python python/train.py
```

03. Example Notebook

  • File: python/notebooks/example.ipynb A Jupyter playground for comparing the exploding context cost of softmax attention against the constant-size state of the SSM formulation.

Rust Engine Implementation

Live high-frequency trading systems cannot afford the latency jitter introduced by Python’s garbage collector.

The rust/ directory demonstrates exactly how to carry the continuous State Matrix forward over raw Float streams.

The code shows how the SSD state matrix is updated in place, without allocating new memory, providing microsecond-level reactions to LOB imbalances.

```sh
cd rust
cargo run
```

Backtesting Framework

Because linear attention is sequential, you do not precompute arrays of indicators over the full history. You stream a single price series, tick by tick, through the hidden state.

By retaining memory across 10,000 steps, the script triggers entries from the internal SSM probabilities and reports standard metrics (Sharpe ratio, maximum drawdown, PnL).
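A minimal sketch of this pattern, not the repository's python/backtest.py: it streams simulated log returns through a scalar gated state, takes positions from the sign of the state (an illustrative stand-in for the model's probability output), and computes the reported metrics. The annualization factor assumes 1-minute bars.

```python
import numpy as np

rng = np.random.default_rng(7)
prices = 100 * np.exp(np.cumsum(0.001 * rng.standard_normal(10_000)))

# Scalar version of S_t = A * S_{t-1} + input; the fixed gate and sign rule
# are illustrative stand-ins for the trained model's outputs.
a, s = 0.97, 0.0
positions = np.zeros_like(prices)
for t in range(1, len(prices)):
    r = np.log(prices[t] / prices[t - 1])
    s = a * s + r                     # gated state accumulating returns
    positions[t] = np.sign(s)         # +1 long, -1 short

rets = np.diff(np.log(prices))
pnl = positions[:-1] * rets           # position at t applied to return t->t+1
sharpe = pnl.mean() / pnl.std() * np.sqrt(252 * 24 * 60)  # 1-minute bars
equity = np.cumsum(pnl)
drawdown = (np.maximum.accumulate(equity) - equity).max()
print(f"Sharpe={sharpe:.2f}  PnL={equity[-1]:.4f}  MaxDD={drawdown:.4f}")
```

Note that the position at tick $t$ only uses information available through tick $t$, so the loop is free of look-ahead bias.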

Terminal window
python python/backtest.py

References

  1. Dao, T., & Gu, A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. ICML 2024. arXiv:2405.21060
  2. Yang, S., Wang, B., Shen, Y., Panda, R., & Kim, Y. (2024). Gated Linear Attention Transformers with Hardware-Efficient Training. ICML 2024.
  3. Chapter 133: HIPPO Framework - Fundamental mathematical basis polynomials for State Space continuous projections.
  4. Chapter 135: Bidirectional Mamba