
Diffusion Models for Synthetic Time Series and Forecasting


This chapter introduces diffusion models for financial time series applications. Originally developed for image generation (Stable Diffusion, DALL-E), diffusion models have emerged as a powerful alternative to GANs for generating synthetic financial data and probabilistic forecasting.

Contents

  1. Diffusion Models: From Images to Time Series
  2. Key Architectures for Time Series
  3. Code Examples
  4. Rust Implementation
  5. Practical Considerations
  6. Resources

Diffusion Models: From Images to Time Series

Diffusion models are a class of generative models that learn to generate data by reversing a gradual noising process. Unlike GANs, which learn through adversarial training, diffusion models learn to denoise data step by step.

The Intuition Behind Diffusion

The key insight is that it’s easier to learn small denoising steps than to generate data in one shot:

  1. Forward Process: Gradually add Gaussian noise to data until it becomes pure noise
  2. Reverse Process: Learn to undo each noise step, recovering the original data

This approach offers several advantages over GANs:

  • Training stability: No adversarial dynamics, mode collapse is rare
  • Quality: Often produces higher-quality samples
  • Uncertainty quantification: Natural probabilistic interpretation

Forward Process: Adding Noise

The forward diffusion process is a Markov chain that gradually adds Gaussian noise:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)$$

Where:

  • $x_0$ is the original data
  • $x_t$ is the noised data at step $t$
  • $\beta_t$ is the noise schedule (typically $\beta_t \in [0.0001, 0.02]$)
  • $T$ is the total number of diffusion steps (typically 1000)

A key property allows direct sampling at any timestep:

$$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) I)$$

where $\bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)$
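The closed-form property above can be sketched in a few lines of NumPy; this is an illustrative toy (not taken from the chapter's notebooks) that noises a synthetic series directly to any timestep:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)           # linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)               # \bar{\alpha}_t

def q_sample(x0, t, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) in one shot, without iterating."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

x0 = np.sin(np.linspace(0, 4 * np.pi, 128))  # toy "time series"
x_mid = q_sample(x0, t=500)                  # partially noised
x_end = q_sample(x0, t=T - 1)                # essentially pure noise
```

Because $\bar{\alpha}_T \approx 0$ under this schedule, the signal is almost entirely destroyed by the final step.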

Reverse Process: Denoising

The reverse process learns to denoise:

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$

In practice, we train a neural network $\epsilon_\theta(x_t, t)$ to predict the noise added at each step. The training objective is:

$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]$$
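A minimal sketch of this objective in NumPy, assuming a generic noise predictor `eps_model(x_t, t)` (in practice a neural network conditioned on the timestep):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)

def ddpm_loss(eps_model, x0):
    """One batch's training objective for x0 of shape (batch, length)."""
    t = rng.integers(0, T, size=x0.shape[0])        # random timestep per sample
    eps = rng.standard_normal(x0.shape)             # the noise to be predicted
    ab = alpha_bar[t][:, None]                      # broadcast over sequence dim
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)  # MSE on predicted noise

# A trivial stand-in predictor; it always predicts zero noise, so the
# loss should sit near E[eps^2] = 1.
zero_model = lambda x_t, t: np.zeros_like(x_t)
loss = ddpm_loss(zero_model, rng.standard_normal((32, 128)))
```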

Noise Schedules

The choice of noise schedule $\beta_t$ significantly affects performance:

| Schedule | Formula | Characteristics |
|----------|---------|-----------------|
| Linear | $\beta_t = \beta_1 + \frac{t-1}{T-1}(\beta_T - \beta_1)$ | Simple, works well for images |
| Cosine | $\bar{\alpha}_t = \frac{f(t)}{f(0)}$, $f(t) = \cos^2(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2})$ | Better for smaller images/sequences |
| Sigmoid | $\beta_t = \sigma(-6 + 12\frac{t}{T})$ | Smooth transitions |
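The three schedules can be sketched as follows. The cosine offset `s = 0.008` is the standard choice; the sigmoid variant here rescales into $[\beta_1, \beta_T]$, which is the usual practical form of the table's formula:

```python
import numpy as np

def linear_betas(T, b1=1e-4, bT=0.02):
    """Linear schedule: betas interpolated from b1 to bT."""
    return np.linspace(b1, bT, T)

def cosine_alpha_bar(T, s=0.008):
    """Cosine schedule, defined directly on alpha_bar rather than beta."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    return f / f[0]

def sigmoid_betas(T, b1=1e-4, bT=0.02):
    """Sigmoid schedule, rescaled into [b1, bT]."""
    t = np.arange(1, T + 1)
    s = 1.0 / (1.0 + np.exp(6 - 12 * t / T))   # sigma(-6 + 12 t/T)
    return b1 + (bT - b1) * s
```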

Key Architectures for Time Series

TimeGrad: Autoregressive Diffusion

TimeGrad (Rasul et al., 2021) combines autoregressive modeling with diffusion:

Input: x_{1:t} (historical observations)
├── RNN Encoder → Hidden state h_t
├── Diffusion Process conditioned on h_t
└── Output: p(x_{t+1:t+τ} | x_{1:t})

Key features:

  • Uses RNN/LSTM to encode history
  • Diffusion generates future conditioned on hidden state
  • Autoregressive: generates one step at a time

Limitations: Cumulative errors in autoregressive generation, slow inference.
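The autoregressive loop above can be made concrete with a schematic sketch; `rnn_step` and `diffusion_sample` here are hypothetical stand-ins for the RNN encoder and the conditional reverse-diffusion sampler, not TimeGrad's actual components:

```python
import numpy as np

rng = np.random.default_rng(0)
rnn_step = lambda x, h: 0.9 * h + 0.1 * x          # toy hidden-state update
diffusion_sample = lambda h: h + 0.01 * rng.standard_normal(h.shape)

def forecast(history, horizon, hidden_dim=1):
    """Encode the past, then generate the future one step at a time."""
    h = np.zeros(hidden_dim)
    for x in history:                               # encode observed history
        h = rnn_step(x, h)
    preds = []
    for _ in range(horizon):
        x_next = diffusion_sample(h)                # conditional diffusion draw
        preds.append(x_next)
        h = rnn_step(x_next, h)                     # feed the prediction back in
    return np.array(preds)

path = forecast(np.ones(24), horizon=12)
```

Feeding each generated step back into the encoder is exactly where the cumulative-error limitation comes from.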

CSDI: Conditional Score-based Diffusion

CSDI (Tashiro et al., NeurIPS 2021) uses attention-based conditioning:

Input: Partially observed time series with mask
├── Temporal Attention (across time)
├── Feature Attention (across variables)
├── Score-based diffusion conditioned on observed values
└── Output: Imputed/forecasted values with uncertainty

Key features:

  • Self-supervised training with random masking
  • Handles both imputation and forecasting
  • 40-65% improvement over existing probabilistic methods
  • Generates probabilistic forecasts (multiple samples)
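The self-supervised masking idea can be sketched as a mask split; the function name and the 20% ratio are illustrative choices, not CSDI's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_mask(obs_mask, missing_ratio=0.2):
    """Split an observed-value mask into conditioning and target masks."""
    rand = rng.random(obs_mask.shape)
    target_mask = obs_mask & (rand < missing_ratio)   # entries to reconstruct
    cond_mask = obs_mask & ~target_mask               # entries the model may see
    return cond_mask, target_mask

obs = np.ones((4, 48), dtype=bool)    # e.g. a fully observed OHLCV window
cond, target = make_training_mask(obs)
```

During training the diffusion model only denoises the `target` entries while attending to the `cond` entries, so the same model handles imputation and forecasting (forecasting is just masking the future).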

Diffusion-TS: Decomposed Representations

Diffusion-TS (ICLR 2024) introduces interpretable decomposition:

Input: Time series x
├── Encoder-Decoder Transformer
├── Decomposition:
│   ├── Trend: Polynomial regression
│   └── Seasonal: Fourier series
├── Diffusion in decomposed space
└── Output: Interpretable synthetic series

Key features:

  • Reconstructs samples directly (not noise)
  • Fourier-based loss for spectral accuracy
  • Same architecture for generation, forecasting, imputation
  • State-of-the-art on Stocks, Energy, ETTh datasets
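The trend/seasonal split that Diffusion-TS diffuses in can be approximated with a polynomial fit plus a truncated Fourier series; the orders below are illustrative, not the paper's settings:

```python
import numpy as np

def decompose(x, poly_deg=2, n_freqs=3):
    """Split a series into polynomial trend, dominant cycles, and remainder."""
    t = np.arange(len(x))
    trend = np.polyval(np.polyfit(t, x, poly_deg), t)   # polynomial trend
    resid = x - trend
    F = np.fft.rfft(resid)
    weak = np.argsort(np.abs(F))[:-n_freqs]             # all but the top freqs
    F[weak] = 0
    seasonal = np.fft.irfft(F, n=len(x))                # dominant cycles only
    return trend, seasonal, x - trend - seasonal        # remainder is the rest

x = np.sin(np.linspace(0, 6 * np.pi, 200)) + 0.01 * np.arange(200)
trend, seasonal, rem = decompose(x)
```

The three components sum back to the original series exactly, which is what makes the decomposition interpretable: the model can diffuse each part in its own space.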

TimeDiff and Recent Advances (2024-2025)

Recent developments have addressed key limitations:

| Method | Innovation | Performance |
|--------|------------|-------------|
| TimeDiff | Conditional diffusion at past-future boundary | 9-47% improvement over baselines |
| ARMD | Auto-regressive moving diffusion | Best MSE on Exchange Rates |
| SimDiff (2025) | Simplified architecture, faster inference | Competitive with 10x fewer params |
| MG-TSD | Multi-granularity temporal structures | SOTA on long-term forecasting |
| S²DBM | Brownian Bridge dynamics | Natural boundary conditions |

Code Examples

01: Diffusion Fundamentals

The notebook 01_diffusion_fundamentals.ipynb covers:

  • Forward and reverse diffusion processes
  • Noise schedules visualization
  • ELBO derivation and loss functions
  • Simple 1D diffusion example

02: DDPM Implementation from Scratch

The notebook 02_ddpm_from_scratch.ipynb implements:

  • Complete DDPM training loop
  • U-Net architecture for time series
  • Sampling algorithms (DDPM, DDIM)
  • Visualization of denoising process

03: TimeGrad for Cryptocurrency Forecasting

The notebook 03_timegrad_crypto.ipynb demonstrates:

  • TimeGrad architecture with RNN encoder
  • Training on Bitcoin/Ethereum hourly data
  • Probabilistic forecasting with confidence intervals
  • Comparison with LSTM baselines

04: CSDI for Imputation and Forecasting

The notebook 04_csdi_imputation_forecasting.ipynb shows:

  • CSDI implementation with attention
  • Missing data imputation in OHLCV series
  • Probabilistic forecasting
  • Evaluation metrics (CRPS, calibration)

05: Diffusion-TS for Synthetic Data Generation

The notebook 05_diffusion_ts_synthetic.ipynb covers:

  • Generating synthetic cryptocurrency data
  • Trend-seasonal decomposition
  • Quality evaluation (discriminative score, FID)
  • Comparison with TimeGAN (Chapter 21)

06: Diffusion vs GANs Comparison

The notebook 06_diffusion_vs_gans.ipynb compares:

  • Training stability (diffusion vs GAN)
  • Sample quality metrics
  • Diversity vs fidelity trade-offs
  • Computational requirements

07: Complete Bitcoin Forecasting Pipeline

The notebook 07_bitcoin_pipeline.ipynb provides:

  • End-to-end production pipeline
  • Technical indicator features (RSI, volatility)
  • Monte Carlo uncertainty estimation
  • Backtesting with realistic constraints

Rust Implementation

The rust_diffusion_crypto directory contains a Rust implementation using the tch-rs (PyTorch bindings) and burn frameworks:

rust_diffusion_crypto/
├── Cargo.toml
├── README.md
├── src/
│   ├── lib.rs
│   ├── main.rs
│   ├── data/       # Bybit API client, preprocessing
│   ├── model/      # DDPM, U-Net, noise schedules
│   ├── training/   # Training loop, losses
│   └── utils/      # Config, checkpoints
└── examples/
    ├── fetch_data.rs
    ├── train_ddpm.rs
    └── forecast.rs

See rust_diffusion_crypto/README.md for detailed usage.

Practical Considerations

When to Use Diffusion Models

Good use cases:

  • Synthetic data generation for backtesting
  • Probabilistic forecasting with uncertainty
  • Missing data imputation
  • Scenario generation for risk analysis
  • When you need diversity in generated samples

Not ideal for:

  • Real-time/low-latency predictions (slow inference)
  • Limited computational resources
  • When model interpretability is critical
  • Very short sequences (<24 timesteps)

Computational Requirements

| Task | GPU Memory | Training Time | Inference Time |
|------|------------|---------------|----------------|
| Simple DDPM | 4GB | 2-4 hours | 100ms/sample |
| TimeGrad | 8GB | 8-12 hours | 500ms/sample |
| CSDI | 12GB | 12-24 hours | 200ms/sample |
| Diffusion-TS | 8GB | 6-10 hours | 150ms/sample |

Optimization Techniques

  1. DDIM Sampling: Reduce steps from 1000 to 50-100 with minimal quality loss
  2. Token Merging: merging redundant tokens (e.g., via tomesd) reports ~1.24x speedup
  3. Distillation: Train smaller student models
  4. Caching: Cache attention computations across steps
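The first technique, DDIM, runs the deterministic reverse update on a strided subset of the training timesteps. A minimal sketch, where `eps_model` is a hypothetical noise predictor (a zero stand-in is used here just to exercise the loop):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def ddim_sample(eps_model, shape, n_steps=50, rng=np.random.default_rng(0)):
    """Deterministic DDIM sampling over n_steps of the T training steps."""
    steps = np.linspace(T - 1, 0, n_steps).astype(int)   # e.g. 50 of 1000
    x = rng.standard_normal(shape)                        # start from pure noise
    for t, t_prev in zip(steps[:-1], steps[1:]):
        eps = eps_model(x, t)
        # Predict x_0 from the current noisy sample, then jump to t_prev.
        x0_hat = (x - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        x = np.sqrt(alpha_bar[t_prev]) * x0_hat + np.sqrt(1 - alpha_bar[t_prev]) * eps
    return x

sample = ddim_sample(lambda x, t: np.zeros_like(x), (128,))
```

Because each update is deterministic given the noise prediction, the step count can be cut from 1000 to 50-100 with little quality loss, which is what makes diffusion models usable closer to production latencies.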

Resources

Papers

Implementations

Tutorials & Guides