Chapter 282: Masked Autoencoders for Financial Time Series
Introduction
Masked Autoencoders (MAE) have emerged as a powerful self-supervised pre-training strategy, originally popularized in computer vision by He et al. (2021). The core idea is remarkably simple: randomly mask a large portion of input data, then train a model to reconstruct the missing pieces. This forces the model to learn rich, generalizable representations of the underlying data distribution without requiring any labels.
In finance, labeled data is notoriously scarce and expensive, while unlabeled price and volume data is abundant. MAE provides a natural bridge: pre-train on vast amounts of unlabeled market data, then fine-tune on small labeled datasets for downstream tasks such as trend prediction, volatility forecasting, or anomaly detection. This chapter explores the theoretical foundations of MAE, adapts the architecture for financial time series, and provides a complete Rust implementation with Bybit market data integration.
Theoretical Foundations
The Masked Autoencoder Framework
A Masked Autoencoder operates on a sequence of input tokens (or patches). Given an input sequence $\mathbf{X} = [x_1, x_2, \ldots, x_T]$, the MAE process involves three stages:
1. Masking: A random subset $\mathcal{M} \subset \{1, 2, \ldots, T\}$ of positions is selected for masking. The masking ratio $r = |\mathcal{M}| / T$ is typically high (60-80%). The visible set is $\mathcal{V} = \{1, \ldots, T\} \setminus \mathcal{M}$.
2. Encoding: Only the visible (unmasked) tokens are passed through the encoder:
$$\mathbf{H}_\mathcal{V} = f_{\text{enc}}(\mathbf{X}_\mathcal{V})$$
This asymmetric design (encoding only visible tokens) dramatically reduces computation during pre-training.
3. Decoding: The decoder takes the encoded visible tokens plus learnable mask tokens $\mathbf{m}$ and reconstructs the full sequence:
$$\hat{\mathbf{X}} = f_{\text{dec}}([\mathbf{H}_\mathcal{V}; \mathbf{m}_\mathcal{M}])$$
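The decoder input can be assembled by placing each encoded visible token back at its original position and filling every other slot with the shared mask token. The following is a minimal sketch using plain `Vec<f64>` rows rather than the chapter's ndarray types; the function name `assemble_decoder_input` and its signature are illustrative assumptions, not the chapter's actual API.

```rust
/// Build the full-length decoder input from encoded visible tokens.
/// Hypothetical helper: `encoded_visible[j]` is the encoder output for the
/// token that originally sat at position `visible_idx[j]`.
pub fn assemble_decoder_input(
    encoded_visible: &[Vec<f64>],
    visible_idx: &[usize],
    mask_token: &[f64],
    seq_len: usize,
) -> Vec<Vec<f64>> {
    assert_eq!(
        encoded_visible.len(),
        visible_idx.len(),
        "one encoded token per visible position"
    );
    // Start with the mask token everywhere, then overwrite visible slots.
    let mut full = vec![mask_token.to_vec(); seq_len];
    for (h, &pos) in encoded_visible.iter().zip(visible_idx) {
        full[pos] = h.clone();
    }
    full
}
```

Restoring the original order matters because positional information is attached per position; the mask token itself carries no content.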
Reconstruction Loss
The training objective minimizes the Mean Squared Error (MSE) over the masked positions only:
$$\mathcal{L}_{\text{MAE}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \| x_i - \hat{x}_i \|^2$$
Computing loss only on masked positions forces the model to truly predict missing information rather than simply copying visible inputs.
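As a rough illustration of this objective, the sketch below sums the squared error of each masked patch and divides by the number of masked patches. It assumes patches are stored as `Vec<f64>` rows; the name `masked_mse` is hypothetical.

```rust
/// Reconstruction loss over masked patches only.
/// `original` and `reconstructed` hold one flattened patch per row;
/// `masked` lists the indices of the masked patches.
pub fn masked_mse(original: &[Vec<f64>], reconstructed: &[Vec<f64>], masked: &[usize]) -> f64 {
    if masked.is_empty() {
        return 0.0;
    }
    let total: f64 = masked
        .iter()
        .map(|&i| {
            // Squared L2 norm of the error for patch i.
            original[i]
                .iter()
                .zip(&reconstructed[i])
                .map(|(x, y)| (x - y).powi(2))
                .sum::<f64>()
        })
        .sum();
    total / masked.len() as f64
}
```

Errors on visible patches never enter the sum, so a reconstruction that is wrong only at visible positions still yields zero loss.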
Positional Encoding
Position information is critical for time series data. We use sinusoidal positional encoding:
$$PE_{(pos, 2k)} = \sin\left(\frac{pos}{10000^{2k/d}}\right)$$
$$PE_{(pos, 2k+1)} = \cos\left(\frac{pos}{10000^{2k/d}}\right)$$
where $pos$ is the position index, $k$ indexes the sine/cosine dimension pairs, and $d$ is the model (embedding) dimension.
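The encoding can be computed once and reused for every sequence. A minimal sketch, assuming one row per patch position and a hypothetical function name `positional_encoding`:

```rust
/// Sinusoidal positional encoding table with `num_positions` rows and `d` columns.
/// Even columns use sine, odd columns use cosine, and each pair shares a frequency.
pub fn positional_encoding(num_positions: usize, d: usize) -> Vec<Vec<f64>> {
    (0..num_positions)
        .map(|pos| {
            (0..d)
                .map(|j| {
                    // Columns 2k and 2k+1 both use pair index k.
                    let k = (j / 2) as f64;
                    let angle = pos as f64 / 10000.0_f64.powf(2.0 * k / d as f64);
                    if j % 2 == 0 {
                        angle.sin()
                    } else {
                        angle.cos()
                    }
                })
                .collect()
        })
        .collect()
}
```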
Adaptation for Financial Time Series
Financial time series present unique challenges compared to images:
- Temporal ordering matters: Unlike image patches where spatial relationships are relatively uniform, financial data has strict temporal causality.
- Multi-scale patterns: Markets exhibit patterns at multiple time scales (intraday, daily, weekly, monthly).
- Non-stationarity: Statistical properties of financial data change over time, requiring robust feature extraction.
- Feature heterogeneity: OHLCV data combines price levels, ranges, and volume information with different scales and distributions.
Our financial MAE adaptation addresses these by:
- Using temporal patches of consecutive candles to capture local structure
- Normalizing features within each patch to handle non-stationarity
- Applying learnable linear projections for patch embedding
- Using a high masking ratio (75%) to prevent trivial interpolation
Patch Construction for OHLCV Data
Given OHLCV time series $\mathbf{X} \in \mathbb{R}^{T \times 5}$, we create patches of size $P$:
$$\text{patch}_i = \text{flatten}(\mathbf{X}[iP : (i+1)P, :]) \in \mathbb{R}^{5P}$$
Each patch is then projected to the model dimension $d$:
$$\mathbf{e}_i = \mathbf{W}_{\text{proj}} \cdot \text{patch}_i + \mathbf{b}_{\text{proj}} + \mathbf{PE}_i$$
where $\mathbf{W}_{\text{proj}} \in \mathbb{R}^{d \times 5P}$ is a learnable projection matrix.
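The two formulas above can be sketched as follows. This version uses plain slices instead of ndarray and drops any trailing candles that do not fill a whole patch, which is an assumption the text does not spell out; `make_patches` and `embed_patch` are hypothetical names.

```rust
/// Split a T x 5 OHLCV series into flattened patches of `patch_size` candles,
/// matching patch_i = flatten(X[iP : (i+1)P, :]). Leftover candles are dropped.
pub fn make_patches(candles: &[[f64; 5]], patch_size: usize) -> Vec<Vec<f64>> {
    assert!(patch_size > 0, "patch_size must be positive");
    candles
        .chunks_exact(patch_size)
        .map(|chunk| chunk.iter().flat_map(|c| c.iter().copied()).collect())
        .collect()
}

/// Compute e_i = W_proj * patch_i + b_proj + PE_i for a single patch.
/// `w` has d rows of length 5P; `b` and `pe` have length d.
pub fn embed_patch(w: &[Vec<f64>], b: &[f64], pe: &[f64], patch: &[f64]) -> Vec<f64> {
    w.iter()
        .zip(b)
        .zip(pe)
        .map(|((row, bias), p)| {
            let dot: f64 = row.iter().zip(patch).map(|(wi, x)| wi * x).sum();
            dot + bias + p
        })
        .collect()
}
```

With $P = 4$, each patch has $5P = 20$ entries, so $\mathbf{W}_{\text{proj}}$ has 20 columns.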
Architecture Details
Encoder
The encoder consists of $L_e$ transformer blocks operating on visible tokens only:
$$\mathbf{H}^{(l+1)} = \text{TransformerBlock}(\mathbf{H}^{(l)})$$
Each transformer block contains:
- Multi-Head Self-Attention (MHSA)
- Feed-Forward Network (FFN) with GELU activation
- Layer normalization and residual connections
Decoder
The decoder is a lightweight network with $L_d$ transformer blocks ($L_d < L_e$). It receives the full sequence (encoded visible tokens + mask tokens at masked positions) and produces per-patch reconstructions:
$$\hat{\mathbf{p}}_i = \mathbf{W}_{\text{out}} \cdot \mathbf{D}_i^{(L_d)} + \mathbf{b}_{\text{out}}$$
where $\hat{\mathbf{p}}_i \in \mathbb{R}^{5P}$ is the reconstructed patch.
Fine-Tuning for Trading
After pre-training, the decoder is discarded and the encoder is used for downstream tasks:
- Trend Prediction: Attach a classification head to predict up/down movement.
- Volatility Forecasting: Attach a regression head for future volatility estimation.
- Anomaly Detection: Use reconstruction error as an anomaly score — high reconstruction error indicates unusual market behavior.
The fine-tuning loss for classification:
$$\mathcal{L}_{\text{fine-tune}} = -\sum_{c} y_c \log(\hat{y}_c)$$
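For a single sample, the classification loss can be sketched as a softmax over the head's logits followed by cross-entropy against a one-hot label. The function names and the small constant added inside the logarithm are assumptions made for this illustration.

```rust
/// Numerically stable softmax: subtract the maximum logit before exponentiating.
pub fn softmax(logits: &[f64]) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|z| (z - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Cross-entropy -sum_c y_c * ln(y_hat_c) for one sample.
/// EPS guards against ln(0); it is an implementation detail, not part of the formula.
pub fn cross_entropy(y_true: &[f64], y_pred: &[f64]) -> f64 {
    const EPS: f64 = 1e-12;
    -y_true
        .iter()
        .zip(y_pred)
        .map(|(y, p)| y * (p + EPS).ln())
        .sum::<f64>()
}
```

For an up/down trend label, `y_true` would be `[1.0, 0.0]` or `[0.0, 1.0]` and `y_pred` the softmax output of the two-class head.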
Implementation in Rust
Our Rust implementation provides production-grade performance for MAE-based trading systems. Key components include:
Core Data Structures
The implementation uses ndarray for efficient matrix operations. The MaskedAutoencoder struct encapsulates the entire architecture:
- PatchEmbedding: Converts raw OHLCV patches into embeddings
- MaskedAutoencoder: Full MAE with encoder, decoder, masking logic, and training
- TradingStrategy: Wraps a pre-trained MAE for generating trading signals
Random Masking Strategy
The masking function generates a random permutation of the $N$ patch indices and selects the first $\lfloor r \cdot N \rfloor$ as masked positions. Every sample therefore has exactly $\lfloor r \cdot N \rfloor$ masked patches, so the realized masking ratio matches $r$ up to rounding.
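A dependency-free sketch of this procedure is shown below. To keep it self-contained it uses a small SplitMix64 generator as a stand-in for whatever random number source the actual implementation uses (for example the `rand` crate); the type and function names are hypothetical.

```rust
/// Minimal SplitMix64 generator, used here only to avoid external crates.
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

/// Returns (masked, visible) index lists, each sorted in ascending order.
pub fn random_mask(
    num_patches: usize,
    ratio: f64,
    rng: &mut SplitMix64,
) -> (Vec<usize>, Vec<usize>) {
    let mut idx: Vec<usize> = (0..num_patches).collect();
    // Fisher-Yates shuffle. The modulo introduces a tiny bias that is
    // acceptable for an illustration.
    for i in (1..num_patches).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        idx.swap(i, j);
    }
    // The first floor(r * N) shuffled indices become the masked set.
    let n_masked = (ratio.clamp(0.0, 1.0) * num_patches as f64).floor() as usize;
    let n_masked = n_masked.min(num_patches);
    let mut masked = idx[..n_masked].to_vec();
    let mut visible = idx[n_masked..].to_vec();
    masked.sort_unstable();
    visible.sort_unstable();
    (masked, visible)
}
```

With 16 patches and a ratio of 0.75, every call yields 12 masked and 4 visible indices; only which indices fall in each set changes with the seed.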
Bybit Integration
The implementation fetches real market data from Bybit’s public API:
```rust
let url = format!(
    "https://api.bybit.com/v5/market/kline?category=linear&symbol={}&interval={}&limit={}",
    symbol, interval, limit
);
```

The response is parsed into OHLCV candles used for both pre-training and strategy evaluation.
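The parsing step might look like the sketch below. It assumes each kline row arrives as an array of strings in the order start time, open, high, low, close, volume (followed by turnover), which is how Bybit's v5 kline endpoint is documented at the time of writing; verify this against the current API reference. JSON decoding is left out, and the `Candle` struct and `candle_from_row` function are hypothetical names.

```rust
/// One OHLCV candle. Hypothetical struct for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Candle {
    pub start_ms: u64,
    pub open: f64,
    pub high: f64,
    pub low: f64,
    pub close: f64,
    pub volume: f64,
}

/// Convert one kline row of string fields into a Candle.
/// Returns None if the row is too short or any field fails to parse.
pub fn candle_from_row(row: &[&str]) -> Option<Candle> {
    if row.len() < 6 {
        return None;
    }
    Some(Candle {
        start_ms: row[0].parse().ok()?,
        open: row[1].parse().ok()?,
        high: row[2].parse().ok()?,
        low: row[3].parse().ok()?,
        close: row[4].parse().ok()?,
        volume: row[5].parse().ok()?,
    })
}
```

The endpoint is documented as returning rows newest first, so the parsed candles would typically be reversed into chronological order before patching.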
Training Loop
Pre-training iterates over the dataset with random masking each epoch:
- Select random masking pattern
- Extract visible patches and pass through encoder
- Insert mask tokens and pass through decoder
- Compute MSE loss on masked positions
- Update weights via gradient descent
Trading Signal Generation
After pre-training, the MAE encoder generates feature representations. The reconstruction error on recent data serves as an anomaly indicator:
- Low reconstruction error: Market behaves normally — follow trend signals
- High reconstruction error: Unusual market conditions — reduce position size or go to cash
Practical Applications
1. Pre-Training on Historical Data
Use years of historical OHLCV data to pre-train the MAE. The model learns:
- Typical candlestick patterns
- Volume-price relationships
- Mean-reversion and momentum signatures
- Volatility clustering
2. Few-Shot Learning for New Markets
A MAE pre-trained on BTC/USDT can be fine-tuned with minimal labeled data for ETH/USDT or other pairs, transferring knowledge about general market microstructure.
3. Regime Detection
Tracking reconstruction error over time reveals market regime changes. A sudden spike in error indicates the market is behaving unlike historical norms.
4. Data Augmentation
The MAE decoder can generate synthetic market scenarios by varying the mask positions and reconstructions, useful for stress testing trading strategies.
Why High Masking Ratios Work for Finance
In financial time series, adjacent patches are highly correlated (prices are continuous, volatility clusters). With low masking ratios (e.g., 25%), the model can trivially interpolate. At 75% masking, the model must:
- Understand the underlying volatility regime
- Infer trend direction from sparse observations
- Model the joint distribution of OHLCV features
- Capture long-range dependencies across the visible patches
This creates a challenging pretext task that yields rich representations.
Preprocessing for Financial MAE
Financial data requires careful normalization:
- Log returns: Convert prices to log returns for stationarity: $r_t = \ln(P_t / P_{t-1})$
- Volume normalization: Apply log transform and z-score normalization
- Per-patch normalization: Subtract patch mean and divide by patch standard deviation before computing reconstruction loss
- Handling missing data: Mark truly missing data separately from artificial masking
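The first three steps can be sketched with a few small helpers. The use of `ln(1 + v)` for volume (so that zero-volume candles stay finite) and the floor on the standard deviation are assumptions of this sketch, and all function names are hypothetical.

```rust
/// Log returns r_t = ln(P_t / P_{t-1}); the output is one element shorter than the input.
pub fn log_returns(prices: &[f64]) -> Vec<f64> {
    prices.windows(2).map(|w| (w[1] / w[0]).ln()).collect()
}

/// Z-score using the population standard deviation.
/// The deviation is floored at 1e-8 so constant inputs do not divide by zero.
pub fn zscore(values: &[f64]) -> Vec<f64> {
    if values.is_empty() {
        return Vec::new();
    }
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let var = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let std = var.sqrt().max(1e-8);
    values.iter().map(|v| (v - mean) / std).collect()
}

/// Log-transform volumes, then z-score them.
pub fn normalize_volume(volumes: &[f64]) -> Vec<f64> {
    let logged: Vec<f64> = volumes.iter().map(|v| v.ln_1p()).collect();
    zscore(&logged)
}
```

Per-patch normalization amounts to calling `zscore` on each flattened patch, or on each feature column within a patch, before the loss is computed.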
Advanced Masking Strategies for Finance
Beyond uniform random masking, financial-specific strategies include:
- Block masking: Mask contiguous time blocks to force longer-range prediction
- Feature masking: Mask specific features (e.g., all volume data) to learn cross-feature dependencies
- Regime-aware masking: Increase masking during volatile periods where reconstruction is harder
- Multi-scale masking: Apply different masking ratios at different temporal scales
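Block masking, for instance, can be expressed as a boolean mask over patch positions. In this sketch the block start positions are supplied by the caller (in practice they would be drawn at random), and the names `block_mask` and `mask_ratio` are hypothetical.

```rust
/// Mark contiguous blocks of `block_len` patches as masked, one block per start index.
/// Blocks are clipped at the end of the sequence; overlapping blocks simply merge.
pub fn block_mask(num_patches: usize, block_len: usize, starts: &[usize]) -> Vec<bool> {
    let mut mask = vec![false; num_patches];
    for &s in starts {
        let end = s.saturating_add(block_len).min(num_patches);
        for m in mask.iter_mut().take(end).skip(s) {
            *m = true;
        }
    }
    mask
}

/// Fraction of positions that are masked.
pub fn mask_ratio(mask: &[bool]) -> f64 {
    if mask.is_empty() {
        return 0.0;
    }
    mask.iter().filter(|&&m| m).count() as f64 / mask.len() as f64
}
```

Because blocks can overlap or be clipped, the realized ratio should be checked with `mask_ratio` rather than assumed from the number of blocks.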
Transfer to Downstream Tasks
Regime Detection
The pretrained encoder captures latent market states. Fine-tuning for regime detection:
- Freeze the MAE encoder (or use low learning rate)
- Add a classification head: $\hat{y} = \text{softmax}(W_c \cdot h_{\text{[CLS]}} + b_c)$
- Train on labeled regime data (bull/bear/sideways)
The pretrained representations typically require 10x fewer labeled samples than training from scratch.
Anomaly Detection
The reconstruction error itself serves as an anomaly score:
$$\text{anomaly\_score}(X) = \frac{1}{N} \sum_{i=1}^{N} \| x_i - \hat{x}_i \|_2^2$$
Anomalous market conditions (flash crashes, circuit breakers, extreme events) produce high reconstruction errors because they deviate from learned normal patterns. This provides a natural, unsupervised anomaly detector.
Return Prediction
Fine-tune the encoder for next-window return prediction:
- Use the encoder’s output representation
- Add a regression head for predicting returns over the next $k$ periods
- The pretrained features capture market dynamics that improve prediction accuracy
Risk Assessment
The decoder’s uncertainty in reconstruction correlates with market uncertainty:
- Patches that are consistently hard to reconstruct indicate unpredictable market regimes
- This provides a data-driven volatility/risk indicator
Performance Considerations
| Aspect | Detail |
|---|---|
| Pre-training data | 10,000+ candles recommended |
| Masking ratio | 75% (optimal for financial data) |
| Patch size | 4-8 candles per patch |
| Encoder depth | 4-6 transformer blocks |
| Decoder depth | 2 blocks (lightweight) |
| Embedding dimension | 64-128 |
| Learning rate | 1e-4 with cosine schedule |
| Batch size | 32-64 |
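These settings could be gathered into a single configuration struct. The sketch below picks the lower end of each range in the table as a default, which is an arbitrary choice for illustration; the `MaeConfig` type and its field names are hypothetical and not taken from the chapter's implementation.

```rust
/// Hyperparameters from the table above. Hypothetical struct for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct MaeConfig {
    pub patch_size: usize,
    pub mask_ratio: f64,
    pub encoder_depth: usize,
    pub decoder_depth: usize,
    pub embed_dim: usize,
    pub learning_rate: f64,
    pub batch_size: usize,
}

impl Default for MaeConfig {
    fn default() -> Self {
        Self {
            patch_size: 4,
            mask_ratio: 0.75,
            encoder_depth: 4,
            decoder_depth: 2,
            embed_dim: 64,
            learning_rate: 1e-4,
            batch_size: 32,
        }
    }
}

impl MaeConfig {
    /// Number of whole patches that fit into `num_candles`.
    pub fn num_patches(&self, num_candles: usize) -> usize {
        assert!(self.patch_size > 0, "patch_size must be positive");
        num_candles / self.patch_size
    }

    /// Number of masked patches, floor(mask_ratio * num_patches).
    pub fn num_masked(&self, num_candles: usize) -> usize {
        (self.mask_ratio * self.num_patches(num_candles) as f64).floor() as usize
    }
}
```

With these defaults, 10,000 candles give 2,500 patches, of which 1,875 are masked and 625 reach the encoder.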
Risk Management Integration
The MAE reconstruction error provides a natural uncertainty measure:
$$\text{uncertainty}_t = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \| x_i^{(t)} - \hat{x}_i^{(t)} \|^2$$
This can be incorporated into position sizing:
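$$\text{position\_size}_t = \text{base\_size} \times \frac{\sigma_{\text{target}}}{\sigma_t + \epsilon} \times \mathbb{1}[\text{uncertainty}_t < \tau]$$
where $\sigma_t$ is the current volatility estimate, $\sigma_{\text{target}}$ is the target volatility, $\epsilon$ is a small constant that avoids division by zero, and $\tau$ is an uncertainty threshold at or above which no position is taken.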
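The sizing rule translates directly into a small function. This is a sketch under the assumption that volatility and uncertainty have already been computed elsewhere; the function name and the value of the stabilizing constant are illustrative.

```rust
/// Volatility-targeted position size, gated by reconstruction uncertainty.
/// Returns 0.0 whenever `uncertainty` is at or above the threshold `tau`.
pub fn position_size(
    base_size: f64,
    sigma_target: f64,
    sigma_t: f64,
    uncertainty: f64,
    tau: f64,
) -> f64 {
    const EPS: f64 = 1e-8; // avoids division by zero when sigma_t is 0
    if uncertainty < tau {
        base_size * sigma_target / (sigma_t + EPS)
    } else {
        0.0
    }
}
```

For example, with a target volatility of 2% and a current volatility of 4%, the position is scaled to roughly half of `base_size` as long as the uncertainty stays below `tau`.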
Key Takeaways
- Masked Autoencoders provide a powerful self-supervised pre-training framework that learns rich representations from unlabeled financial data by reconstructing randomly masked portions of the input.
- High masking ratios (75%) are beneficial for financial time series: they prevent the model from relying on trivial interpolation and force it to learn meaningful patterns.
- Asymmetric encoder-decoder design (a deep encoder with a shallow decoder) yields efficient pre-training where the encoder captures the most important features.
- Reconstruction error as anomaly score provides a built-in uncertainty measure for risk management, enabling the strategy to reduce exposure during unusual market conditions.
- Transfer learning is a key benefit: a MAE pre-trained on one instrument can be fine-tuned with minimal data for new instruments, addressing the perennial problem of limited labeled data in finance.
- Patch-based tokenization of OHLCV data captures local temporal structure while enabling the transformer architecture to model long-range dependencies across the sequence.
- Production Rust implementation delivers the performance needed for real-time trading applications, with efficient matrix operations via ndarray and live data integration through Bybit’s API.
References
- He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked Autoencoders Are Scalable Vision Learners. arXiv:2111.06377.
- Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.
- Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017.
- Assran, M., et al. (2022). Masked Siamese Networks for Label-Efficient Learning. ECCV 2022.
- Nie, Y., et al. (2022). A Time Series is Worth 64 Words: Long-term Forecasting with Transformers (PatchTST). ICLR 2023.
- Zerveas, G., et al. (2021). A Transformer-based Framework for Multivariate Time Series Representation Learning. KDD 2021.