Chapter 303: Dreamer Trading

Introduction: DreamerV2/V3 - World Model Reinforcement Learning in Latent Space

Traditional reinforcement learning (RL) for trading faces a fundamental challenge: sample inefficiency. Learning directly from market interactions is expensive, slow, and risky. Every exploratory trade costs real money, and the feedback loop between action and outcome can span hours, days, or weeks. This is where Dreamer - a family of world model-based RL agents - offers a paradigm shift.

Dreamer, introduced through its successive versions DreamerV1 (Hafner et al., 2020), DreamerV2 (Hafner et al., 2021), and DreamerV3 (Hafner et al., 2023), learns a compact world model of the environment and then trains its policy entirely within imagination - simulated trajectories generated by the learned model. Instead of interacting with real markets to learn, the agent first understands how markets behave, then practices trading strategies in its mental simulation.

The architecture is built on three core components:

  1. World Model: Learns to predict future observations and rewards from past data using a Recurrent State Space Model (RSSM). The world model compresses raw market observations into a compact latent representation.

  2. Actor (Policy): A neural network that selects actions (buy, sell, hold, position sizing) based on the current latent state. Crucially, the actor is trained purely on imagined trajectories - sequences generated by the world model without any real market interaction.

  3. Critic (Value Function): Estimates expected future returns from any latent state, providing learning signals to the actor during imagination.

This approach is especially compelling for financial markets because:

  • Data efficiency: Markets provide limited non-stationary data. Learning a world model and training in imagination multiplies the effective sample size.
  • Risk-free exploration: The agent can explore aggressive strategies in imagination without risking capital.
  • Regime modeling: The world model can capture different market regimes (trending, mean-reverting, volatile) in its latent space.
  • Multi-step planning: Imagined rollouts enable the agent to consider long-horizon consequences of trading decisions.

DreamerV3 specifically introduced several key improvements: symlog predictions for handling varying scales (critical for financial data), free bits to prevent the KL term from over-regularizing useful information, and a universal hyperparameter configuration that works across diverse domains - making it particularly suitable for the heterogeneous nature of financial markets.

Mathematical Foundations

Recurrent State Space Model (RSSM)

The heart of Dreamer is the RSSM, which maintains a latent state with both deterministic and stochastic components. This dual structure allows the model to capture both predictable market dynamics and inherent uncertainty.

The RSSM, together with its prediction heads, comprises five components:

Sequence Model (Deterministic Path): $$h_t = f_\phi(h_{t-1}, z_{t-1}, a_{t-1})$$

where $h_t$ is the deterministic recurrent state (capturing long-term dependencies like trends), $z_{t-1}$ is the previous stochastic state, and $a_{t-1}$ is the previous action. In practice, $f_\phi$ is implemented as a GRU cell.
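As a concrete illustration, the deterministic update can be sketched as a plain GRU cell. This is a minimal, dependency-free sketch; the `GruCell` struct, its weight layout, and the helper functions are illustrative, not the chapter's actual implementation (bias terms are omitted for brevity):

```rust
// Minimal GRU cell sketch for the deterministic path
// h_t = f_phi(h_{t-1}, [z_{t-1}, a_{t-1}]). Weights are illustrative, not trained.

fn sigmoid(x: f64) -> f64 { 1.0 / (1.0 + (-x).exp()) }

// Multiply a (hidden x (hidden + input)) weight matrix by a vector.
fn matvec(w: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
    w.iter().map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum()).collect()
}

struct GruCell {
    w_z: Vec<Vec<f64>>, // update-gate weights
    w_r: Vec<Vec<f64>>, // reset-gate weights
    w_h: Vec<Vec<f64>>, // candidate-state weights
}

impl GruCell {
    fn step(&self, h: &[f64], input: &[f64]) -> Vec<f64> {
        // Concatenate previous hidden state with the input [z_{t-1}, a_{t-1}]
        let hx: Vec<f64> = h.iter().chain(input).cloned().collect();
        let z: Vec<f64> = matvec(&self.w_z, &hx).iter().map(|&v| sigmoid(v)).collect();
        let r: Vec<f64> = matvec(&self.w_r, &hx).iter().map(|&v| sigmoid(v)).collect();
        // Candidate state is computed from the reset-gated hidden state
        let rh: Vec<f64> = h.iter().zip(&r).map(|(a, b)| a * b)
            .chain(input.iter().cloned()).collect();
        let cand: Vec<f64> = matvec(&self.w_h, &rh).iter().map(|v| v.tanh()).collect();
        // Interpolate between the old state and the candidate via the update gate
        h.iter().zip(&z).zip(&cand)
            .map(|((h0, zi), c)| (1.0 - zi) * h0 + zi * c)
            .collect()
    }
}
```

Because the update gate interpolates rather than overwrites, the deterministic state can carry trend information across many steps, which is exactly the role $h_t$ plays in the RSSM.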

Encoder (Posterior): $$z_t \sim q_\phi(z_t | h_t, x_t)$$

Given the current deterministic state and the actual observation $x_t$ (market data), the encoder produces the posterior distribution over the stochastic state. This is used during training when real data is available.

Dynamics Predictor (Prior): $$\hat{z}_t \sim p_\phi(\hat{z}_t | h_t)$$

The prior predicts the stochastic state from the deterministic state alone, without access to the actual observation. This is used during imagination when generating trajectories without real data.

Decoder (Observation Model): $$\hat{x}_t \sim p_\phi(\hat{x}_t | h_t, z_t)$$

Reconstructs market observations from the full latent state $(h_t, z_t)$.

Reward Predictor: $$\hat{r}_t \sim p_\phi(\hat{r}_t | h_t, z_t)$$

Predicts the reward (trading PnL) from the latent state.

In DreamerV2/V3, the stochastic state $z_t$ uses categorical distributions rather than Gaussian, which was found to be more expressive and stable. Specifically, $z_t$ is represented as a vector of independent categorical variables, each with a finite number of classes.
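A minimal sketch of sampling such a categorical latent, assuming per-variable logits and a toy linear-congruential random source (`Lcg` and `sample_categoricals` are illustrative names, not part of the chapter's codebase):

```rust
// Sketch: sample the stochastic state z_t as independent categorical variables,
// one class index per variable, from per-variable logits. The full z_t is the
// concatenation of the resulting one-hot vectors.

// Tiny LCG for illustration instead of a real RNG crate.
struct Lcg(u64);
impl Lcg {
    fn next_f64(&mut self) -> f64 {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64 // uniform in [0, 1)
    }
}

fn softmax(logits: &[f64]) -> Vec<f64> {
    let m = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&x| (x - m).exp()).collect();
    let s: f64 = exps.iter().sum();
    exps.iter().map(|e| e / s).collect()
}

fn sample_categoricals(logits: &[Vec<f64>], rng: &mut Lcg) -> Vec<usize> {
    logits.iter().map(|l| {
        let probs = softmax(l);
        let u = rng.next_f64();
        let mut acc = 0.0;
        for (i, p) in probs.iter().enumerate() {
            acc += p;
            if u < acc { return i; } // inverse-CDF sampling
        }
        probs.len() - 1
    }).collect()
}
```

With the DreamerV2 defaults of 32 variables and 32 classes each, `z_t` is a 32x32 one-hot matrix; in training, gradients flow through the sampling step via straight-through estimation, which this sketch omits.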

World Model Training: Evidence Lower Bound (ELBO)

The world model is trained by maximizing the evidence lower bound on the log-likelihood of observed sequences:

$$\mathcal{L}(\phi) = \sum_{t=1}^{T} \Big[ \underbrace{\ln p_\phi(x_t | h_t, z_t)}_{\text{reconstruction}} + \underbrace{\ln p_\phi(r_t | h_t, z_t)}_{\text{reward prediction}} - \underbrace{\beta \cdot D_{KL}[q_\phi(z_t | h_t, x_t) \,\|\, p_\phi(z_t | h_t)]}_{\text{KL regularization}} \Big]$$

The KL divergence term serves two purposes:

  1. Regularizes the posterior: Prevents overfitting by keeping the learned representations close to the prior.
  2. Trains the prior: Ensures the dynamics predictor can accurately forecast future states.

KL Balancing

A key innovation in DreamerV2/V3 is KL balancing, which splits the KL loss into two terms whose gradients are weighted differently:

$$\mathcal{L}_{KL} = \alpha \cdot D_{KL}[\text{sg}(q_\phi) \,\|\, p_\phi] + (1 - \alpha) \cdot D_{KL}[q_\phi \,\|\, \text{sg}(p_\phi)]$$

where $\text{sg}(\cdot)$ denotes stop-gradient. With $\alpha > 0.5$ (typically $\alpha = 0.8$), the prior is updated more aggressively to match the posterior, while the posterior is given more freedom to represent the true data distribution. This prevents the common failure mode where the posterior collapses to the prior too early.

DreamerV3 also introduces free bits: the KL loss is only applied when it exceeds a threshold $\tau$ (typically $\tau = 1$ nat), preventing over-regularization of useful information:

$$\mathcal{L}_{KL}^{\text{free}} = \max(\mathcal{L}_{KL}, \tau)$$
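Numerically, the balanced KL and the free-bits clamp can be sketched as follows. Without an autograd framework the two balanced terms evaluate to the same number - the split only changes which network receives gradients, as the comments note; all function names here are illustrative:

```rust
// Sketch: KL balancing and free bits for categorical distributions.

fn kl_categorical(q: &[f64], p: &[f64]) -> f64 {
    // KL[q || p] = sum_i q_i * ln(q_i / p_i), with the 0 * ln(0) convention
    q.iter().zip(p)
        .map(|(&qi, &pi)| if qi > 0.0 { qi * (qi / pi).ln() } else { 0.0 })
        .sum()
}

fn balanced_kl(q: &[f64], p: &[f64], alpha: f64) -> f64 {
    // alpha * KL[sg(q) || p]:   gradients pull the prior toward the posterior
    // (1-alpha) * KL[q || sg(p)]: gradients regularize the posterior toward the prior
    // (numerically identical here; the stop-gradient only matters under autograd)
    alpha * kl_categorical(q, p) + (1.0 - alpha) * kl_categorical(q, p)
}

fn free_bits_kl(kl: f64, tau: f64) -> f64 {
    // Clamp the loss from below: once KL < tau nats, the loss is constant
    // and exerts no gradient pressure on the representation.
    kl.max(tau)
}
```

With $\alpha = 0.8$, 80% of the gradient magnitude goes into improving the prior's forecasts, which is what makes imagined rollouts trustworthy.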

Symlog Predictions (DreamerV3)

Financial data spans multiple orders of magnitude (price levels, volumes, returns). DreamerV3 addresses this with the symlog transformation:

$$\text{symlog}(x) = \text{sign}(x) \cdot \ln(|x| + 1)$$

$$\text{symexp}(x) = \text{sign}(x) \cdot (\exp(|x|) - 1)$$

All predictions (observations, rewards, values) are made in symlog space, which compresses the range while preserving sign information. This is particularly valuable for trading where returns can vary dramatically across assets and time periods.
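The transform pair is two lines of code; a sketch in Rust, where `signum` carries the sign so the round trip is exact up to floating-point error:

```rust
/// symlog compresses magnitudes while preserving sign (linear near zero,
/// logarithmic for large |x|); symexp is its exact inverse.
fn symlog(x: f64) -> f64 { x.signum() * (x.abs() + 1.0).ln() }
fn symexp(x: f64) -> f64 { x.signum() * (x.abs().exp() - 1.0) }
```

A BTC price of 50,000 maps to about 10.8, while a per-bar return of 0.01 maps to about 0.00995 - both land in a range a single network head can predict without per-feature normalization.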

Actor-Critic in Imagination

Once the world model is trained, the actor and critic are trained entirely within imagined trajectories:

  1. Imagination: Starting from real latent states, the world model generates trajectories of length $H$ (imagination horizon): $$\hat{z}_t \sim p_\phi(\hat{z}_t | h_t), \quad h_{t+1} = f_\phi(h_t, \hat{z}_t, a_t), \quad a_t \sim \pi_\psi(a_t | h_t, \hat{z}_t)$$

  2. Value Estimation: The critic estimates returns using $\lambda$-returns: $$V_t^\lambda = \hat{r}_t + \gamma \Big[(1-\lambda)\, v_\xi(h_{t+1}, \hat{z}_{t+1}) + \lambda V_{t+1}^\lambda \Big]$$

  3. Actor Update: The actor maximizes expected imagined returns: $$\max_\psi \mathbb{E}\Big[\sum_{t=0}^{H} V_t^\lambda\Big]$$

  4. Critic Update: The critic minimizes prediction error on the $\lambda$-returns: $$\min_\xi \mathbb{E}\Big[\sum_{t=0}^{H} (v_\xi(h_t, \hat{z}_t) - \text{sg}(V_t^\lambda))^2\Big]$$
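Steps 2-4 hinge on the backward recursion for $V_t^\lambda$; a minimal sketch, assuming `values` carries one extra bootstrap entry for the state at the horizon:

```rust
/// Compute lambda-returns backward over an imagined trajectory.
/// rewards[t] for t = 0..H; values[t] for t = 0..=H (one extra bootstrap entry).
fn lambda_returns(rewards: &[f64], values: &[f64], gamma: f64, lambda: f64) -> Vec<f64> {
    let h = rewards.len();
    assert_eq!(values.len(), h + 1);
    let mut returns = vec![0.0; h];
    // Bootstrap the recursion with the critic's value at the horizon
    let mut next = values[h];
    for t in (0..h).rev() {
        // V_t^lambda = r_t + gamma * [(1 - lambda) * v(s_{t+1}) + lambda * V_{t+1}^lambda]
        returns[t] = rewards[t] + gamma * ((1.0 - lambda) * values[t + 1] + lambda * next);
        next = returns[t];
    }
    returns
}
```

Setting $\lambda = 0$ recovers one-step TD targets, $\lambda = 1$ recovers Monte Carlo returns with a bootstrapped tail; intermediate values trade bias against variance.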

Applications: Trading in Imagined Market Scenarios

Data-Efficient Strategy Learning

The Dreamer framework transforms trading strategy development in several key ways:

1. Sample Amplification: From a limited dataset of, say, 1 year of minute-level data (~525,600 observations), the world model can generate millions of imagined trajectories. Each trajectory explores different action sequences, effectively multiplying the training data.

2. Regime-Aware Strategies: The stochastic component of RSSM naturally captures market regime uncertainty. When the model is uncertain (high entropy in $p_\phi(z_t|h_t)$), it generates diverse imagined futures, leading to robust policies that perform well across regimes.

3. Multi-Asset Portfolio Management: The latent space can jointly model correlations between assets. Imagination allows exploring portfolio rebalancing strategies across thousands of correlated scenarios.

4. Risk Management Through Imagination: By generating many imagined trajectories from the current state, the agent can estimate tail risks and adjust positions accordingly - essentially performing Monte Carlo simulation within its learned world model.
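As a sketch of the last point, empirical VaR and CVaR can be read directly off a batch of imagined trajectory returns (the function name and its quantile convention are illustrative, not from the chapter's implementation):

```rust
/// Estimate empirical Value-at-Risk and Conditional VaR at level `alpha`
/// (e.g. 0.05) from a batch of per-trajectory imagined returns.
fn var_cvar(returns: &[f64], alpha: f64) -> (f64, f64) {
    let mut sorted = returns.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Index of the alpha-quantile (lower tail of the return distribution)
    let idx = ((alpha * sorted.len() as f64).ceil() as usize).max(1) - 1;
    let var = sorted[idx];
    // CVaR: mean of the returns at or below the VaR threshold
    let cvar = sorted[..=idx].iter().sum::<f64>() / (idx + 1) as f64;
    (var, cvar)
}
```

Because the rollouts come from the learned world model rather than historical replay, the tail estimate reflects scenarios the market has not yet produced - with the obvious caveat that it is only as good as the model.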

Trading-Specific Adaptations

For financial markets, several modifications to the standard Dreamer framework are beneficial:

  • Action Space: Continuous position sizing in $[-1, 1]$ representing short-to-long exposure, discretized into categories for the actor.
  • Reward Shaping: Log returns with transaction cost penalties and drawdown regularization.
  • Observation Space: OHLCV data, technical indicators, order book features, normalized by symlog.
  • State Augmentation: Include portfolio state (current position, unrealized PnL) in the observation.
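A shaped reward along these lines might look as follows; `cost_rate` and `dd_weight` are illustrative knobs, not values from the chapter's implementation:

```rust
/// Shaped reward sketch: position-weighted log return, minus transaction
/// costs proportional to turnover, minus a drawdown penalty.
fn shaped_reward(
    position: f64,      // current exposure in [-1, 1]
    prev_position: f64, // exposure at the previous step
    price: f64,
    prev_price: f64,
    drawdown: f64,      // current drawdown fraction, >= 0
    cost_rate: f64,     // cost per unit of turnover
    dd_weight: f64,     // drawdown regularization strength
) -> f64 {
    let log_ret = (price / prev_price).ln();
    let turnover = (position - prev_position).abs();
    position * log_ret - cost_rate * turnover - dd_weight * drawdown
}
```

Charging costs on turnover rather than on position discourages the churning behavior that imagination-trained policies can otherwise exploit, since the world model does not penalize it by default.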

Comparison with Model-Free Approaches

| Aspect | Model-Free (PPO/SAC) | Dreamer |
|---|---|---|
| Sample Efficiency | Low - needs millions of steps | High - learns from imagination |
| Risk During Training | High - learns from real trades | Low - trains in imagination |
| Adaptation Speed | Slow - needs new data | Fast - updates world model |
| Interpretability | Black-box actions | Can inspect world model predictions |
| Computational Cost | Low per step | Higher - world model + imagination |

Rust Implementation

The implementation provides a simplified but functional Dreamer architecture for trading:

use dreamer_trading::{
    DreamerConfig, DreamerAgent, WorldModel,
    BybitClient, MarketData,
};

// Configure the Dreamer agent
let config = DreamerConfig {
    deterministic_size: 200,
    stochastic_size: 30,
    num_categories: 32,
    hidden_size: 200,
    imagination_horizon: 15,
    kl_balance_alpha: 0.8,
    free_bits: 1.0,
    discount: 0.99,
    lambda_gae: 0.95,
    learning_rate: 3e-4,
};

// Fetch real market data
let client = BybitClient::new();
let data = client.fetch_klines("BTCUSDT", "5", 1000).await?;

// Train the world model on real data
let mut agent = DreamerAgent::new(config);
agent.train_world_model(&data, /* epochs */ 100);

// Learn policy in imagination (no real market interaction!)
agent.train_policy_in_imagination(/* num_trajectories */ 10_000);

// Evaluate on held-out real data
let metrics = agent.evaluate(&test_data);

The Rust implementation focuses on:

  • RSSM with deterministic (GRU-like) and stochastic (categorical) states
  • World model training with reconstruction + KL loss
  • Imagination rollouts for actor-critic training
  • Bybit API integration for real BTCUSDT data

See rust/src/lib.rs for the full implementation and rust/examples/trading_example.rs for a complete workflow.

Bybit Data Integration

The implementation connects to Bybit’s public API to fetch real market data:

let client = BybitClient::new();

// Fetch 1000 5-minute klines for BTCUSDT
let klines = client.fetch_klines("BTCUSDT", "5", 1000).await?;

// Each kline contains: timestamp, open, high, low, close, volume
for k in &klines {
    println!("Time: {}, OHLCV: {}/{}/{}/{} Vol: {}",
        k.timestamp, k.open, k.high, k.low, k.close, k.volume);
}

The data pipeline:

  1. Fetches raw OHLCV data from Bybit REST API
  2. Normalizes features using symlog transformation
  3. Creates sequences for RSSM training
  4. Splits into training/validation/test sets
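Steps 2 through 4 can be sketched as follows (the function names and the 70/15/15 split are illustrative; the chronological split matters because shuffling time series before splitting leaks future information into training):

```rust
// Step 2: symlog-normalize raw features (here, just close prices).
fn symlog(x: f64) -> f64 { x.signum() * (x.abs() + 1.0).ln() }

// Step 3: slide a window over the normalized series to build RSSM
// training sequences of length `seq_len`.
fn make_sequences(closes: &[f64], seq_len: usize) -> Vec<Vec<f64>> {
    let normalized: Vec<f64> = closes.iter().map(|&c| symlog(c)).collect();
    normalized.windows(seq_len).map(|w| w.to_vec()).collect()
}

// Step 4: split chronologically - no shuffling - so validation and test
// data always lie strictly after the training data in time.
fn chrono_split<T: Clone>(seqs: &[T], train: f64, val: f64) -> (Vec<T>, Vec<T>, Vec<T>) {
    let n = seqs.len();
    let a = (n as f64 * train) as usize;
    let b = (n as f64 * (train + val)) as usize;
    (seqs[..a].to_vec(), seqs[a..b].to_vec(), seqs[b..].to_vec())
}
```

Note that adjacent windows overlap, so even the chronological split leaves a small leakage band at each boundary; a gap of `seq_len` sequences can be dropped there if strict separation is required.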

Key Takeaways

  1. Dreamer learns a world model of markets and trains trading policies entirely in imagination, eliminating the need for costly real-market exploration during training.

  2. The RSSM architecture with deterministic + stochastic states captures both predictable trends (deterministic path) and market uncertainty (stochastic state), providing a rich latent representation.

  3. KL balancing and free bits prevent common training failures where the model either ignores observations or fails to learn useful dynamics - critical for the noisy, non-stationary nature of financial data.

  4. Symlog predictions handle the multi-scale nature of financial data (prices, volumes, returns) without manual normalization tuning.

  5. Imagination-based training is inherently risk-free: the agent can explore aggressive strategies, experience drawdowns, and learn from crashes without any real capital at risk.

  6. Data efficiency is the killer feature for trading: markets provide limited non-repeatable data, and Dreamer’s ability to generate unlimited imagined experience from a learned model is uniquely suited to this constraint.

  7. The world model provides interpretability: by examining the model’s predictions and latent states, traders can understand what the agent “thinks” the market will do, building trust in the system.

  8. DreamerV3’s universal hyperparameters reduce the need for per-asset or per-market tuning, enabling faster deployment across diverse trading instruments.

References

  • Hafner, D., et al. (2020). “Dream to Control: Learning Behaviors by Latent Imagination.” ICLR 2020.
  • Hafner, D., et al. (2021). “Mastering Atari with Discrete World Models.” ICLR 2021.
  • Hafner, D., et al. (2023). “Mastering Diverse Domains through World Models.” arXiv:2301.04104.
  • Ha, D. & Schmidhuber, J. (2018). “World Models.” NeurIPS 2018.
  • Schrittwieser, J., et al. (2020). “Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model.” Nature.