
Diffusion Models for Synthetic Time Series and Forecasting


This chapter introduces diffusion models for financial time series applications. Originally developed for image generation (Stable Diffusion, DALL-E), diffusion models have emerged as a powerful alternative to GANs for generating synthetic financial data and probabilistic forecasting.

Contents

  1. Diffusion Models: From Images to Time Series
  2. Key Architectures for Time Series
  3. Code Examples
  4. Rust Implementation
  5. Practical Considerations
  6. Resources

Diffusion Models: From Images to Time Series

Diffusion models are a class of generative models that learn to generate data by reversing a gradual noising process. Unlike GANs, which learn through adversarial training, diffusion models learn to denoise data step by step.

The Intuition Behind Diffusion

The key insight is that it’s easier to learn small denoising steps than to generate data in one shot:

  1. Forward Process: Gradually add Gaussian noise to data until it becomes pure noise
  2. Reverse Process: Learn to undo each noise step, recovering the original data

This approach offers several advantages over GANs:

  • Training stability: No adversarial dynamics, mode collapse is rare
  • Quality: Often produces higher-quality samples
  • Uncertainty quantification: Natural probabilistic interpretation

Forward Process: Adding Noise

The forward diffusion process is a Markov chain that gradually adds Gaussian noise:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)$$

Where:

  • $x_0$ is the original data
  • $x_t$ is the noised data at step $t$
  • $\beta_t$ is the noise schedule (typically $\beta_t \in [0.0001, 0.02]$)
  • $T$ is the total number of diffusion steps (typically 1000)

A key property allows direct sampling at any timestep:

$$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) I)$$

where $\bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)$
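The closed-form property above can be sketched in a few lines of NumPy; this is an illustrative toy (not taken from the chapter's notebooks) that noises a synthetic series directly to any timestep:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)           # linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)               # \bar{\alpha}_t

def q_sample(x0, t, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) in one shot, without iterating."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

x0 = np.sin(np.linspace(0, 4 * np.pi, 128))  # toy "time series"
x_mid = q_sample(x0, t=500)                  # partially noised
x_end = q_sample(x0, t=T - 1)                # essentially pure noise
```

Because $\bar{\alpha}_T \approx 0$ under this schedule, the signal is almost entirely destroyed by the final step.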

Reverse Process: Denoising

The reverse process learns to denoise:

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$

In practice, we train a neural network $\epsilon_\theta(x_t, t)$ to predict the noise added at each step. The training objective is:

$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]$$
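A minimal sketch of this objective in NumPy, assuming a generic noise predictor `eps_model(x_t, t)` (in practice a neural network conditioned on the timestep):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)

def ddpm_loss(eps_model, x0):
    """One batch's training objective for x0 of shape (batch, length)."""
    t = rng.integers(0, T, size=x0.shape[0])        # random timestep per sample
    eps = rng.standard_normal(x0.shape)             # the noise to be predicted
    ab = alpha_bar[t][:, None]                      # broadcast over sequence dim
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)  # MSE on predicted noise

# A trivial stand-in predictor; it always predicts zero noise, so the
# loss should sit near E[eps^2] = 1.
zero_model = lambda x_t, t: np.zeros_like(x_t)
loss = ddpm_loss(zero_model, rng.standard_normal((32, 128)))
```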

Noise Schedules

The choice of noise schedule $\beta_t$ significantly affects performance:

| Schedule | Formula | Characteristics |
|----------|---------|-----------------|
| Linear | $\beta_t = \beta_1 + \frac{t-1}{T-1}(\beta_T - \beta_1)$ | Simple, works well for images |
| Cosine | $\bar{\alpha}_t = \frac{f(t)}{f(0)}$, $f(t) = \cos^2(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2})$ | Better for smaller images/sequences |
| Sigmoid | $\beta_t = \sigma(-6 + 12\frac{t}{T})$ | Smooth transitions |
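The three schedules can be sketched as follows. The cosine offset `s = 0.008` is the standard choice; the sigmoid variant here rescales into $[\beta_1, \beta_T]$, which is the usual practical form of the table's formula:

```python
import numpy as np

def linear_betas(T, b1=1e-4, bT=0.02):
    """Linear schedule: betas interpolated from b1 to bT."""
    return np.linspace(b1, bT, T)

def cosine_alpha_bar(T, s=0.008):
    """Cosine schedule, defined directly on alpha_bar rather than beta."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    return f / f[0]

def sigmoid_betas(T, b1=1e-4, bT=0.02):
    """Sigmoid schedule, rescaled into [b1, bT]."""
    t = np.arange(1, T + 1)
    s = 1.0 / (1.0 + np.exp(6 - 12 * t / T))   # sigma(-6 + 12 t/T)
    return b1 + (bT - b1) * s
```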

Key Architectures for Time Series

TimeGrad: Autoregressive Diffusion

TimeGrad (Rasul et al., 2021) combines autoregressive modeling with diffusion:

Input: x_{1:t} (historical observations)
├── RNN Encoder → Hidden state h_t
├── Diffusion Process conditioned on h_t
└── Output: p(x_{t+1:t+τ} | x_{1:t})

Key features:

  • Uses RNN/LSTM to encode history
  • Diffusion generates future conditioned on hidden state
  • Autoregressive: generates one step at a time

Limitations: Cumulative errors in autoregressive generation, slow inference.
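The autoregressive loop above can be made concrete with a schematic sketch; `rnn_step` and `diffusion_sample` here are hypothetical stand-ins for the RNN encoder and the conditional reverse-diffusion sampler, not TimeGrad's actual components:

```python
import numpy as np

rng = np.random.default_rng(0)
rnn_step = lambda x, h: 0.9 * h + 0.1 * x          # toy hidden-state update
diffusion_sample = lambda h: h + 0.01 * rng.standard_normal(h.shape)

def forecast(history, horizon, hidden_dim=1):
    """Encode the past, then generate the future one step at a time."""
    h = np.zeros(hidden_dim)
    for x in history:                               # encode observed history
        h = rnn_step(x, h)
    preds = []
    for _ in range(horizon):
        x_next = diffusion_sample(h)                # conditional diffusion draw
        preds.append(x_next)
        h = rnn_step(x_next, h)                     # feed the prediction back in
    return np.array(preds)

path = forecast(np.ones(24), horizon=12)
```

Feeding each generated step back into the encoder is exactly where the cumulative-error limitation comes from.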

CSDI: Conditional Score-based Diffusion

CSDI (Tashiro et al., NeurIPS 2021) uses attention-based conditioning:

Input: Partially observed time series with mask
├── Temporal Attention (across time)
├── Feature Attention (across variables)
├── Score-based diffusion conditioned on observed values
└── Output: Imputed/forecasted values with uncertainty

Key features:

  • Self-supervised training with random masking
  • Handles both imputation and forecasting
  • 40-65% improvement over existing probabilistic methods
  • Generates probabilistic forecasts (multiple samples)
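The self-supervised masking idea can be sketched as a mask split; the function name and the 20% ratio are illustrative choices, not CSDI's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_mask(obs_mask, missing_ratio=0.2):
    """Split an observed-value mask into conditioning and target masks."""
    rand = rng.random(obs_mask.shape)
    target_mask = obs_mask & (rand < missing_ratio)   # entries to reconstruct
    cond_mask = obs_mask & ~target_mask               # entries the model may see
    return cond_mask, target_mask

obs = np.ones((4, 48), dtype=bool)    # e.g. a fully observed OHLCV window
cond, target = make_training_mask(obs)
```

During training the diffusion model only denoises the `target` entries while attending to the `cond` entries, so the same model handles imputation and forecasting (forecasting is just masking the future).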

Diffusion-TS: Decomposed Representations

Diffusion-TS (ICLR 2024) introduces interpretable decomposition:

Input: Time series x
├── Encoder-Decoder Transformer
├── Decomposition:
│   ├── Trend: Polynomial regression
│   └── Seasonal: Fourier series
├── Diffusion in decomposed space
└── Output: Interpretable synthetic series

Key features:

  • Reconstructs samples directly (not noise)
  • Fourier-based loss for spectral accuracy
  • Same architecture for generation, forecasting, imputation
  • State-of-the-art on Stocks, Energy, ETTh datasets
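The trend/seasonal split that Diffusion-TS diffuses in can be approximated with a polynomial fit plus a truncated Fourier series; the orders below are illustrative, not the paper's settings:

```python
import numpy as np

def decompose(x, poly_deg=2, n_freqs=3):
    """Split a series into polynomial trend, dominant cycles, and remainder."""
    t = np.arange(len(x))
    trend = np.polyval(np.polyfit(t, x, poly_deg), t)   # polynomial trend
    resid = x - trend
    F = np.fft.rfft(resid)
    weak = np.argsort(np.abs(F))[:-n_freqs]             # all but the top freqs
    F[weak] = 0
    seasonal = np.fft.irfft(F, n=len(x))                # dominant cycles only
    return trend, seasonal, x - trend - seasonal        # remainder is the rest

x = np.sin(np.linspace(0, 6 * np.pi, 200)) + 0.01 * np.arange(200)
trend, seasonal, rem = decompose(x)
```

The three components sum back to the original series exactly, which is what makes the decomposition interpretable: the model can diffuse each part in its own space.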

TimeDiff and Recent Advances (2024-2025)

Recent developments have addressed key limitations:

| Method | Innovation | Performance |
|--------|------------|-------------|
| TimeDiff | Conditional diffusion at past-future boundary | 9-47% improvement over baselines |
| ARMD | Auto-regressive moving diffusion | Best MSE on Exchange Rates |
| SimDiff (2025) | Simplified architecture, faster inference | Competitive with 10x fewer params |
| MG-TSD | Multi-granularity temporal structures | SOTA on long-term forecasting |
| S²DBM | Brownian Bridge dynamics | Natural boundary conditions |

Code Examples

01: Diffusion Fundamentals

The notebook 01_diffusion_fundamentals.ipynb covers:

  • Forward and reverse diffusion processes
  • Noise schedules visualization
  • ELBO derivation and loss functions
  • Simple 1D diffusion example

02: DDPM Implementation from Scratch

The notebook 02_ddpm_from_scratch.ipynb implements:

  • Complete DDPM training loop
  • U-Net architecture for time series
  • Sampling algorithms (DDPM, DDIM)
  • Visualization of denoising process

03: TimeGrad for Cryptocurrency Forecasting

The notebook 03_timegrad_crypto.ipynb demonstrates:

  • TimeGrad architecture with RNN encoder
  • Training on Bitcoin/Ethereum hourly data
  • Probabilistic forecasting with confidence intervals
  • Comparison with LSTM baselines

04: CSDI for Imputation and Forecasting

The notebook 04_csdi_imputation_forecasting.ipynb shows:

  • CSDI implementation with attention
  • Missing data imputation in OHLCV series
  • Probabilistic forecasting
  • Evaluation metrics (CRPS, calibration)

05: Diffusion-TS for Synthetic Data Generation

The notebook 05_diffusion_ts_synthetic.ipynb covers:

  • Generating synthetic cryptocurrency data
  • Trend-seasonal decomposition
  • Quality evaluation (discriminative score, FID)
  • Comparison with TimeGAN (Chapter 21)

06: Diffusion vs GANs Comparison

The notebook 06_diffusion_vs_gans.ipynb compares:

  • Training stability (diffusion vs GAN)
  • Sample quality metrics
  • Diversity vs fidelity trade-offs
  • Computational requirements

07: Complete Bitcoin Forecasting Pipeline

The notebook 07_bitcoin_pipeline.ipynb provides:

  • End-to-end production pipeline
  • Technical indicator features (RSI, volatility)
  • Monte Carlo uncertainty estimation
  • Backtesting with realistic constraints

Rust Implementation

The rust_diffusion_crypto directory contains a Rust implementation using the tch-rs (PyTorch bindings) and burn frameworks:

rust_diffusion_crypto/
├── Cargo.toml
├── README.md
├── src/
│   ├── lib.rs
│   ├── main.rs
│   ├── data/       # Bybit API client, preprocessing
│   ├── model/      # DDPM, U-Net, noise schedules
│   ├── training/   # Training loop, losses
│   └── utils/      # Config, checkpoints
└── examples/
    ├── fetch_data.rs
    ├── train_ddpm.rs
    └── forecast.rs

See rust_diffusion_crypto/README.md for detailed usage.

Practical Considerations

When to Use Diffusion Models

Good use cases:

  • Synthetic data generation for backtesting
  • Probabilistic forecasting with uncertainty
  • Missing data imputation
  • Scenario generation for risk analysis
  • When you need diversity in generated samples

Not ideal for:

  • Real-time/low-latency predictions (slow inference)
  • Limited computational resources
  • When model interpretability is critical
  • Very short sequences (<24 timesteps)

Computational Requirements

| Task | GPU Memory | Training Time | Inference Time |
|------|------------|---------------|----------------|
| Simple DDPM | 4GB | 2-4 hours | 100ms/sample |
| TimeGrad | 8GB | 8-12 hours | 500ms/sample |
| CSDI | 12GB | 12-24 hours | 200ms/sample |
| Diffusion-TS | 8GB | 6-10 hours | 150ms/sample |

Optimization Techniques

  1. DDIM Sampling: Reduce steps from 1000 to 50-100 with minimal quality loss
  2. Token Merging: merging redundant tokens (e.g., via tomesd) reports ~1.24x speedup
  3. Distillation: Train smaller student models
  4. Caching: Cache attention computations across steps
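The first technique, DDIM, runs the deterministic reverse update on a strided subset of the training timesteps. A minimal sketch, where `eps_model` is a hypothetical noise predictor (a zero stand-in is used here just to exercise the loop):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def ddim_sample(eps_model, shape, n_steps=50, rng=np.random.default_rng(0)):
    """Deterministic DDIM sampling over n_steps of the T training steps."""
    steps = np.linspace(T - 1, 0, n_steps).astype(int)   # e.g. 50 of 1000
    x = rng.standard_normal(shape)                        # start from pure noise
    for t, t_prev in zip(steps[:-1], steps[1:]):
        eps = eps_model(x, t)
        # Predict x_0 from the current noisy sample, then jump to t_prev.
        x0_hat = (x - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        x = np.sqrt(alpha_bar[t_prev]) * x0_hat + np.sqrt(1 - alpha_bar[t_prev]) * eps
    return x

sample = ddim_sample(lambda x, t: np.zeros_like(x), (128,))
```

Because each update is deterministic given the noise prediction, the step count can be cut from 1000 to 50-100 with little quality loss, which is what makes diffusion models usable closer to production latencies.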

Resources

Papers

Implementations

Tutorials & Guides