Chapter 329: Probabilistic Forecasting for Trading
Chapter 329: Probabilistic Forecasting for Trading
Overview
Probabilistic forecasting goes beyond traditional point forecasts by providing full probability distributions over future outcomes. Instead of predicting “the price will be $100”, probabilistic methods predict “the price will be between $95 and $105 with 90% confidence, with the most likely value being $100”. This paradigm shift enables risk-aware decision-making, proper uncertainty quantification, and more robust trading strategies.
Why Probabilistic Forecasting for Trading?
The Problem with Point Forecasts
Traditional forecasting models output a single number (point forecast):
Input: Historical prices, featuresModel: LSTM, XGBoost, Linear RegressionOutput: Price tomorrow = $45,250 (single number)Problems:
- No uncertainty measure - How confident should we be in this prediction?
- Overconfidence - Point forecasts appear more precise than they are
- Poor risk management - Cannot properly size positions without uncertainty
- Ignores distribution shape - Asymmetric risks (fat tails) are invisible
The Probabilistic Solution
Input: Historical prices, featuresModel: DeepAR, Quantile Regression, Gaussian ProcessOutput: Full probability distribution - Mean: $45,250 - 10th percentile: $44,100 - 90th percentile: $46,500 - 95% CI: [$43,800, $46,900] - Distribution shape: Slightly right-skewedCore Concepts
1. Point vs Probabilistic Forecasts
POINT FORECAST:┌─────────────────────────────────────────┐│ │ ││ ● $45,250 ││ │ ││ Yesterday Today Tomorrow │└─────────────────────────────────────────┘
PROBABILISTIC FORECAST:┌─────────────────────────────────────────┐│ ╭───────╮ ││ ╱ 90% CI ╲ ││ ╱ ┌─────┐ ╲ ││ ╱ │ 50% │ ╲ ││ ●─────────╱───│ CI │─────╲ ││ └─────┘ ││ Yesterday Today Tomorrow ││ ││ We see: Most likely value + uncertainty │└─────────────────────────────────────────┘2. Quantile Regression
Quantile regression estimates specific percentiles of the conditional distribution:
# Standard regression: E[Y|X] = f(X) (mean only)# Quantile regression: Q_τ[Y|X] = f_τ(X) (any quantile τ)
# Example quantiles for trading:τ = 0.05 → 5th percentile (extreme downside)τ = 0.25 → 25th percentile (moderate downside)τ = 0.50 → 50th percentile (median)τ = 0.75 → 75th percentile (moderate upside)τ = 0.95 → 95th percentile (extreme upside)Quantile Loss (Pinball Loss):
L_τ(y, q) = (τ - 1{y < q}) × (y - q)
Where:- y = actual value- q = predicted quantile- τ = target quantile level- 1{y < q} = indicator function
For τ = 0.5: Equivalent to MAE (median regression)For τ = 0.9: Penalizes under-prediction more heavily3. Distributional Forecasting
Instead of predicting quantiles separately, model the entire distribution:
Parametric approach:┌────────────────────────────────────────────────┐│ Assume Y ~ Distribution(θ₁, θ₂, ...) ││ ││ Normal: Y ~ N(μ, σ²) → predict μ, σ ││ Student-t: Y ~ t(μ, σ, ν) → predict μ, σ, ν ││ Mixture: Y ~ Σᵢ πᵢ N(μᵢ, σᵢ²) ││ ││ Non-parametric: ││ - Histogram forecasts ││ - Kernel density estimation ││ - Normalizing flows │└────────────────────────────────────────────────┘4. DeepAR: Deep Autoregressive Recurrent Networks
DeepAR (Amazon) is a state-of-the-art probabilistic forecasting method:
Architecture:┌────────────────────────────────────────────────────────────┐│ DeepAR ││ ││ Encoder (LSTM/GRU): ││ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ││ │ h₁ │ → │ h₂ │ → │ h₃ │ → │ hₜ │ ││ └────┘ └────┘ └────┘ └────┘ ││ ↑ ↑ ↑ ↑ ││ [x₁,z₁] [x₂,z₂] [x₃,z₃] [xₜ,zₜ] ││ ││ Decoder (Autoregressive): ││ ┌────┐ ┌────┐ ┌────┐ ││ │ hₜ │ → │hₜ₊₁│ → │hₜ₊₂│ → ... ││ └────┘ └────┘ └────┘ ││ ↓ ↓ ↓ ││ θₜ₊₁ θₜ₊₂ θₜ₊₃ ││ ↓ ↓ ↓ ││ Sample Sample Sample → Monte Carlo forecasts ││ ││ xᵢ = target values (price, returns) ││ zᵢ = covariates (volume, indicators, time features) ││ θᵢ = distribution parameters (μ, σ for Gaussian) │└────────────────────────────────────────────────────────────┘Key Features:
- Learns from multiple related time series simultaneously
- Handles missing values and varying history lengths
- Produces coherent probabilistic forecasts via sampling
- Captures cross-series patterns (e.g., correlated assets)
Proper Scoring Rules
Why Scoring Rules Matter
To evaluate probabilistic forecasts, we need proper scoring rules - metrics that are optimized when the forecaster reports their true belief:
Proper scoring rule: S(P, y)- P = predicted probability distribution- y = observed outcome- Maximized (or minimized) when P equals true distribution
Key property: Cannot "game" the metric by reporting something other than your true beliefCRPS (Continuous Ranked Probability Score)
CRPS is the gold standard for evaluating probabilistic forecasts of continuous variables:
CRPS(F, y) = ∫_{-∞}^{∞} (F(x) - 1{y ≤ x})² dx
Where:- F(x) = predicted CDF at point x- y = observed value- 1{y ≤ x} = step function (0 before y, 1 after y)
Interpretation:- Measures the "distance" between predicted CDF and perfect forecast- Lower is better- Reduces to MAE when forecast is deterministic- Decomposes into: CRPS = Reliability + Resolution - UncertaintyCRPS Visualization:
┌────────────────────────────────────────────────────┐│ 1.0 ─────────────────────────●━━━━━━━━━━━━━━━━━━ ││ │ ╱ ││ │ ╱ ││ CDF │ ╱╱ ┌──────────────┐ ││ │ ╱╱ │ Shaded area │ ││ 0.0 ─●═══════════════╱ │ = CRPS │ ││ │ ↑ └──────────────┘ ││ └──────────────┼──────────────────────── ││ y=observed ││ ││ ── = Predicted CDF (F) ││ ── = Step function at observed value │└────────────────────────────────────────────────────┘Log Score (Logarithmic Score)
LogScore(p, y) = log(p(y))
Where:- p(y) = predicted probability density at observed value y- Higher is better (or -LogScore: lower is better)
Properties:- Heavily penalizes confident wrong predictions- Sensitive to tail behavior- Also called "negative log-likelihood" when minimizedComparison of Scoring Rules
| Scoring Rule | Sensitivity to Tails | Locality | Decomposable |
|---|---|---|---|
| CRPS | Moderate | Yes | Yes |
| Log Score | High | Yes | No |
| Brier Score | N/A (classification) | Yes | Yes |
| Quantile Score | Quantile-specific | Yes | Yes |
Calibration
What is Calibration?
A probabilistic forecast is calibrated if predicted probabilities match observed frequencies:
Calibration Definition:If we predict "80% chance of positive return" many times,then positive returns should occur ~80% of those times.
Formally: P(Y ≤ y | F(y) = p) = p for all p ∈ [0,1]PIT (Probability Integral Transform)
For a well-calibrated forecast:- Transform observations using predicted CDF: u = F(y)- u should be uniformly distributed on [0, 1]
Checking calibration:1. Compute PIT values: uᵢ = Fᵢ(yᵢ) for each prediction2. Plot histogram of uᵢ values3. Should be flat (uniform) if well-calibrated
Common issues:┌─────────────┐ ┌─────────────┐ ┌─────────────┐│ ████████ │ │ ██████ │ │ ██ ██ ││ ████████ │ │ ████████ │ │████ ████ ││ ████████ │ │ ██████████ │ │████████████ │└─────────────┘ └─────────────┘ └─────────────┘ Well-calibrated Overconfident Underconfident (uniform) (U-shaped) (inverse U)Calibration vs. Sharpness
Good forecasts need BOTH:
CALIBRATION: Probabilities are honestSHARPNESS: Predictions are precise (narrow intervals)
Trade-off:┌─────────────────────────────────────────────────┐│ ││ Calibrated but not sharp: "50% chance up/down" ││ (Always predict 50-50 → calibrated but useless)││ ││ Sharp but not calibrated: "99% confident" ││ (Always confident → precise but wrong) ││ ││ GOAL: Sharp AND calibrated │└─────────────────────────────────────────────────┘Decision Making Under Uncertainty
Expected Utility Theory
Optimal Action = argmax_a E[U(outcome | action a)]
Where:- U = utility function (e.g., profit, risk-adjusted return)- Expectation taken over predictive distribution
Example:- Predictive distribution: Price ~ N($100, $5²)- Action: Buy if expected profit > 0- Utility: U = Position × (Price - Entry) - Transaction_costsValue at Risk (VaR) from Probabilistic Forecasts
VaR_α = quantile_α of predicted return distribution
Example:Predicted return distribution for tomorrow:- Mean: +0.5%- Std: 2.0%- VaR(95%): -2.8% (5% chance of losing more than 2.8%)
Use for:- Position sizing- Risk limits- Margin requirementsExpected Shortfall (CVaR)
ES_α = E[Loss | Loss > VaR_α]
More conservative than VaR:- Averages over all losses beyond VaR- Better captures tail risk- Coherent risk measure (VaR is not)
From probabilistic forecast:ES = (1/(1-α)) × ∫_{-∞}^{VaR_α} x × f(x) dxKelly Criterion with Probabilistic Forecasts
Classic Kelly Criterion
f* = (p × b - q) / b
Where:- f* = optimal fraction of capital to bet- p = probability of winning- q = 1 - p = probability of losing- b = odds (net profit per unit bet if win)
Example:- 60% chance of +10% return (p=0.6, b=0.10)- 40% chance of -5% return (q=0.4)- Kelly fraction: (0.6 × 0.10 - 0.4 × 0.05) / 0.10 = 40%Kelly with Full Distribution
For continuous distributions:
f* = argmax_f E[log(1 + f × R)]
Where:- R = return random variable from predictive distribution- Expectation over predicted distribution- Solved numerically when distribution is complex
Practical implementation:1. Sample N returns from predictive distribution2. For each candidate f: compute mean log return3. Find f that maximizes mean log return4. Apply "fractional Kelly" (e.g., f*/2) for safetyKelly with Uncertainty in Probabilities
When probabilities themselves are uncertain:
Bayesian Kelly:- p has a posterior distribution (not a point estimate)- Integrate over uncertainty in p
f*_robust = E_p[Kelly(p)] or max_p min_outcome Kelly(p)
This naturally leads to more conservative positionswhen we're uncertain about our predictions!Model Architecture
┌─────────────────────────────────────────────────────────────────┐│ PROBABILISTIC FORECASTING MODEL │├─────────────────────────────────────────────────────────────────┤│ ││ INPUT LAYER ││ ┌──────────────────────────────────────────────────────────┐ ││ │ Time Series Features: │ ││ │ - Historical returns (1m, 5m, 15m, 1h, 4h, 1d) │ ││ │ - Volume profile and VWAP │ ││ │ - Volatility measures (realized, implied) │ ││ │ - Order book imbalance │ ││ │ Calendar Features: │ ││ │ - Hour of day, day of week (cyclical encoding) │ ││ │ - Time since market events │ ││ └──────────────────────────────────────────────────────────┘ ││ ↓ ││ TEMPORAL ENCODER (LSTM/Transformer) ││ ┌──────────────────────────────────────────────────────────┐ ││ │ Captures temporal patterns and dependencies │ ││ │ Output: Context vector h_t │ ││ └──────────────────────────────────────────────────────────┘ ││ ↓ ││ DISTRIBUTION HEAD ││ ┌──────────────────────────────────────────────────────────┐ ││ │ Option A: Parametric Distribution │ ││ │ h_t → Linear → [μ, log(σ), ν, ...] │ ││ │ Return distribution: Student-t(μ, σ, ν) │ ││ │ │ ││ │ Option B: Quantile Outputs │ ││ │ h_t → Linear → [q_0.05, q_0.25, q_0.50, q_0.75, q_0.95]│ ││ │ │ ││ │ Option C: Mixture Density Network │ ││ │ h_t → Linear → [π_k, μ_k, σ_k] for k=1..K components │ ││ └──────────────────────────────────────────────────────────┘ ││ ↓ ││ OUTPUT: Full Predictive Distribution ││ ┌──────────────────────────────────────────────────────────┐ ││ │ - PDF/CDF at any point │ ││ │ - Samples via Monte Carlo │ ││ │ - Quantiles at any level │ ││ │ - Mean, variance, skewness, kurtosis │ ││ └──────────────────────────────────────────────────────────┘ ││ │└─────────────────────────────────────────────────────────────────┘Trading Strategy
Signal Generation with Uncertainty
def generate_probabilistic_signals(forecast_distribution, current_price): """ Generate trading signals using probabilistic forecasts. """ # Extract distribution statistics mean_return = forecast_distribution.mean() std_return = forecast_distribution.std() var_95 = forecast_distribution.quantile(0.05) upside_prob = forecast_distribution.prob_greater_than(0)
# Kelly-optimal position sizing kelly_fraction = compute_kelly(forecast_distribution)
# Risk-adjusted signal signal = SignalType.HOLD position_size = 0.0
if upside_prob > 0.60 and mean_return > 0.005: # Strong bullish signal signal = SignalType.LONG # Scale position by confidence and Kelly position_size = min(kelly_fraction * 0.5, MAX_POSITION)
elif upside_prob < 0.40 and mean_return < -0.005: # Strong bearish signal signal = SignalType.SHORT position_size = min(-kelly_fraction * 0.5, MAX_POSITION)
# Adjust for forecast uncertainty uncertainty_penalty = std_return / BASELINE_VOLATILITY position_size *= max(0.5, 1.0 - uncertainty_penalty)
return Signal( direction=signal, size=position_size, confidence=abs(upside_prob - 0.5) * 2, var_95=var_95, expected_return=mean_return )Position Sizing with VaR Constraints
def size_position_with_var(forecast, risk_budget, current_portfolio): """ Size position such that portfolio VaR stays within budget. """ portfolio_var = compute_portfolio_var(current_portfolio, forecast)
max_size = risk_budget / forecast.var_95 kelly_size = compute_kelly(forecast)
# Take minimum of Kelly and VaR-constrained size position_size = min(kelly_size, max_size)
return position_sizeKey Metrics
Forecast Quality
| Metric | Formula | Interpretation |
|---|---|---|
| CRPS | See above | Lower = better probabilistic accuracy |
| Log Score | log(p(y)) | Higher = better; sensitive to tails |
| Calibration Error | F(y) - U | |
| Sharpness | Avg interval width | Lower = more precise |
| Coverage | % within 90% CI | Should equal 90% |
Trading Performance
| Metric | Target | Description |
|---|---|---|
| Sharpe Ratio | > 2.0 | Risk-adjusted return |
| Sortino Ratio | > 2.5 | Downside risk-adjusted |
| Max Drawdown | < 15% | Largest peak-to-trough |
| Win Rate | > 55% | Profitable trade percentage |
| Profit Factor | > 1.5 | Gross profit / Gross loss |
| Kelly Growth | > 10% | Annualized Kelly-optimal growth |
Implementation Details
Data Requirements
Cryptocurrency Market Data:├── OHLCV data (1-minute resolution minimum)│ └── Multiple timeframes for features├── Order book snapshots│ └── Depth and imbalance metrics├── Trade flow data│ └── Buy/sell pressure indicators└── Volatility data ├── Historical realized volatility └── Implied volatility (if available)
Feature Engineering:├── Returns at multiple horizons├── Rolling volatility measures├── Volume-weighted features├── Technical indicators (RSI, MACD, BB)└── Calendar/seasonal featuresTraining Configuration
model: type: "deepar" # or "quantile_regression", "mixture_density" hidden_size: 128 num_layers: 2 dropout: 0.1 distribution: "student_t" # or "gaussian", "negative_binomial"
quantiles: # if using quantile regression - 0.05 - 0.10 - 0.25 - 0.50 - 0.75 - 0.90 - 0.95
training: batch_size: 64 learning_rate: 0.001 epochs: 100 early_stopping_patience: 10
data: sequence_length: 120 # 2 hours of 1-minute data prediction_horizons: [5, 15, 60] # minutes ahead train_ratio: 0.7 val_ratio: 0.15 test_ratio: 0.15Advantages of Probabilistic Forecasting
| Aspect | Point Forecasts | Probabilistic Forecasts |
|---|---|---|
| Uncertainty | Hidden | Explicit |
| Risk Management | Manual VaR | Integrated |
| Position Sizing | Heuristic | Principled (Kelly) |
| Decision Making | Threshold-based | Expected utility |
| Calibration | N/A | Measurable |
| Tail Risk | Ignored | Captured |
| Model Selection | MSE/MAE only | CRPS, Log Score |
Directory Structure
329_probabilistic_forecasting/├── README.md # This file├── README.ru.md # Russian translation├── readme.simple.md # Beginner-friendly explanation├── readme.simple.ru.md # Russian beginner version├── python/ # Python implementation│ ├── requirements.txt│ ├── config.yaml│ ├── data_fetcher.py # Bybit data via CCXT│ ├── features.py # Feature engineering│ ├── models/│ │ ├── quantile_regression.py│ │ ├── deepar.py│ │ └── mixture_density.py│ ├── scoring.py # CRPS, Log Score, calibration│ ├── strategy.py # Trading strategy│ ├── backtest.py # Backtesting engine│ └── examples/│ ├── train_model.py│ ├── evaluate_forecasts.py│ └── run_backtest.py└── rust_probabilistic/ # Rust implementation ├── Cargo.toml ├── src/ │ ├── lib.rs │ ├── api/ # Bybit API client │ ├── distributions/ # Probability distributions │ ├── models/ # Forecasting models │ ├── scoring/ # Proper scoring rules │ ├── strategy/ # Trading strategy │ └── backtest/ # Backtesting engine └── examples/ ├── fetch_data.rs ├── quantile_forecast.rs └── backtest.rsReferences
-
Probabilistic Forecasting (Gneiting & Katzfuss, 2014)
- Annual Review of Statistics: Foundations of probabilistic forecasting
-
DeepAR: Probabilistic Forecasting with Autoregressive RNNs (Salinas et al., 2020)
-
Strictly Proper Scoring Rules, Prediction, and Estimation (Gneiting & Raftery, 2007)
- Journal of the American Statistical Association
-
Quantile Regression (Koenker, 2005)
- Cambridge University Press
-
Kelly Criterion in Portfolio Optimization (MacLean et al., 2011)
- The Kelly Capital Growth Investment Criterion
-
Calibration of Probabilistic Forecasts (Gneiting et al., 2007)
Difficulty Level
Advanced - Requires understanding of:
- Probability theory and statistics
- Time series analysis
- Bayesian inference concepts
- Risk management principles
- Deep learning architectures
Disclaimer
This chapter is for educational purposes only. Cryptocurrency trading involves substantial risk. The strategies described here have not been validated in live trading and should be thoroughly tested before any real-world application. Past performance does not guarantee future results. Probabilistic forecasts provide uncertainty estimates but cannot eliminate risk.