
Chapter 312: Planning RL Trading

Introduction

Traditional model-free reinforcement learning (RL) approaches to trading---such as DQN, PPO, or SAC---learn policies directly from interactions with the market environment. While effective, they are notoriously sample-inefficient: they require millions of transitions to converge and cannot reason about the consequences of actions before taking them. In contrast, model-based RL with explicit planning builds an internal model of the environment (a “world model”) and uses it to simulate future trajectories, evaluate candidate action sequences, and select the best plan before executing a single real trade.

This paradigm draws inspiration from how expert human traders operate. A skilled trader does not simply react to the current price; they mentally simulate scenarios: “If I place a large buy order here, the market will move against me; if I spread the order over 10 minutes, the impact will be smaller, but I risk the price moving away.” Planning RL formalizes this reasoning loop.

The key components of planning RL for trading are:

  1. World Model Learning — A neural network that predicts the next market state and reward given the current state and action.
  2. Dyna-Q Architecture — An agent that augments real experience with simulated rollouts from the learned world model.
  3. Model Predictive Control (MPC) — A planning algorithm that optimizes action sequences by simulating future trajectories.
  4. Uncertainty Estimation — Quantifying model confidence to avoid catastrophic decisions based on inaccurate predictions.

Mathematical Foundation

World Model Learning

A world model consists of two learned functions:

Transition Model: $$\hat{s}_{t+1} = f_\theta(s_t, a_t)$$

Reward Model: $$\hat{r}_t = g_\phi(s_t, a_t)$$

where $s_t$ is the market state at time $t$ (prices, volumes, indicators), $a_t$ is the trading action (buy/sell/hold with position sizing), and $\theta, \phi$ are learnable parameters.

The transition model is trained by minimizing: $$\mathcal{L}_{\text{trans}}(\theta) = \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}} \left[ \| s_{t+1} - f_\theta(s_t, a_t) \|^2 \right]$$

The reward model is trained similarly: $$\mathcal{L}_{\text{reward}}(\phi) = \mathbb{E}_{(s_t, a_t, r_t) \sim \mathcal{D}} \left[ (r_t - g_\phi(s_t, a_t))^2 \right]$$

where $\mathcal{D}$ is the replay buffer of real transitions.
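The two regression losses above can be realized with something as simple as a linear model trained by SGD. The sketch below is illustrative only: the chapter's actual `WorldModel` is a neural network, and all names here (`LinearWorldModel`, `update`, etc.) are hypothetical.

```rust
// Illustrative linear world model: predicts next state and reward from
// (state, action) features and trains by SGD on the squared-error losses
// L_trans and L_reward defined above.
struct LinearWorldModel {
    w_trans: Vec<Vec<f64>>, // one weight row per predicted state dimension
    w_reward: Vec<f64>,
    lr: f64,
}

impl LinearWorldModel {
    fn new(state_dim: usize, lr: f64) -> Self {
        Self {
            w_trans: vec![vec![0.0; state_dim + 1]; state_dim],
            w_reward: vec![0.0; state_dim + 1],
            lr,
        }
    }

    // Concatenate the state with a scalar action into one feature vector.
    fn features(state: &[f64], action: f64) -> Vec<f64> {
        let mut x = state.to_vec();
        x.push(action);
        x
    }

    fn predict_next(&self, state: &[f64], action: f64) -> Vec<f64> {
        let x = Self::features(state, action);
        self.w_trans
            .iter()
            .map(|row| row.iter().zip(&x).map(|(w, xi)| w * xi).sum())
            .collect()
    }

    fn predict_reward(&self, state: &[f64], action: f64) -> f64 {
        let x = Self::features(state, action);
        self.w_reward.iter().zip(&x).map(|(w, xi)| w * xi).sum()
    }

    // One SGD step on both losses for a single replay-buffer transition.
    fn update(&mut self, state: &[f64], action: f64, reward: f64, next_state: &[f64]) {
        let x = Self::features(state, action);
        let pred_s = self.predict_next(state, action);
        let lr = self.lr;
        for (i, row) in self.w_trans.iter_mut().enumerate() {
            let err = pred_s[i] - next_state[i];
            for (w, xi) in row.iter_mut().zip(&x) {
                *w -= lr * 2.0 * err * xi;
            }
        }
        let err_r = self.predict_reward(state, action) - reward;
        for (w, xi) in self.w_reward.iter_mut().zip(&x) {
            *w -= lr * 2.0 * err_r * xi;
        }
    }
}
```

On transitions generated by linear dynamics the model recovers the true weights; replacing the linear maps with neural networks changes the function class but not the shape of this training loop.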

Dyna-Q Architecture

Dyna-Q, introduced by Sutton (1991), augments standard Q-learning with model-based rollouts. The algorithm alternates between:

  1. Real Experience: Execute action $a_t$ in the real environment, observe $(s_t, a_t, r_t, s_{t+1})$, and update Q-values.
  2. Simulated Experience: Sample $k$ previously visited states, use the world model to generate synthetic transitions, and update Q-values on these as well.

The Q-value update (for both real and simulated transitions) is: $$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

By performing $k$ simulated updates per real step, Dyna-Q extracts up to $k$ additional value updates from each real interaction, greatly improving sample efficiency. For trading, where each real interaction involves actual capital, this efficiency gain is critical.
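The two-phase loop above can be sketched in tabular form. For illustration, the learned world model is simplified here to a stored table of observed deterministic transitions, and planning cycles through stored (state, action) pairs instead of sampling them at random; the chapter's `DynaQAgent` uses the learned `WorldModel` instead.

```rust
use std::collections::HashMap;

// Tabular Dyna-Q sketch: real TD updates plus k model-based planning updates.
struct DynaQ {
    q: HashMap<(usize, usize), f64>,              // Q(s, a)
    model: HashMap<(usize, usize), (f64, usize)>, // (s, a) -> (r, s')
    alpha: f64,
    gamma: f64,
    n_actions: usize,
}

impl DynaQ {
    fn new(n_actions: usize, alpha: f64, gamma: f64) -> Self {
        Self { q: HashMap::new(), model: HashMap::new(), alpha, gamma, n_actions }
    }

    fn max_q(&self, s: usize) -> f64 {
        (0..self.n_actions)
            .map(|a| *self.q.get(&(s, a)).unwrap_or(&0.0))
            .fold(f64::NEG_INFINITY, f64::max)
    }

    // Standard Q-learning update, applied to real and simulated transitions alike.
    fn td_update(&mut self, s: usize, a: usize, r: f64, s2: usize) {
        let target = r + self.gamma * self.max_q(s2);
        let q = self.q.entry((s, a)).or_insert(0.0);
        *q += self.alpha * (target - *q);
    }

    // One real step followed by k simulated planning updates from the model.
    fn step(&mut self, s: usize, a: usize, r: f64, s2: usize, k: usize) {
        self.td_update(s, a, r, s2);
        self.model.insert((s, a), (r, s2));
        let keys: Vec<_> = self.model.keys().cloned().collect();
        for i in 0..k {
            let (ps, pa) = keys[i % keys.len()];
            let (pr, ps2) = self.model[&(ps, pa)];
            self.td_update(ps, pa, pr, ps2);
        }
    }
}
```

Tuning `k` is the computation-versus-samples dial: `k = 0` recovers plain Q-learning, while large `k` squeezes more value updates out of each costly real trade.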

Model Predictive Control (MPC)

MPC is a planning algorithm that optimizes over a finite horizon $H$:

$$a_t^* = \arg\max_{a_{t:t+H}} \sum_{h=0}^{H-1} \gamma^h \hat{r}_{t+h}$$

where $\hat{r}_{t+h}$ and $\hat{s}_{t+h+1}$ are generated by rolling out the world model:

$$\hat{s}_{t+h+1} = f_\theta(\hat{s}_{t+h}, a_{t+h}), \quad \hat{r}_{t+h} = g_\phi(\hat{s}_{t+h}, a_{t+h})$$

Only the first action $a_t^*$ is executed; the process repeats at the next timestep (receding horizon).

Random Shooting MPC: Sample $N$ random action sequences of length $H$, evaluate each by rolling out the world model, and select the sequence with the highest cumulative reward.
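A minimal random-shooting planner over a discrete {sell, hold, buy} action set might look as follows. The `step` closure stands in for the learned $f_\theta$ and $g_\phi$, and the tiny linear congruential generator replaces a proper RNG so the sketch needs no external crates; none of these names come from the chapter's `MPCPlanner`.

```rust
// Random-shooting MPC sketch over actions {-1, 0, 1} (sell / hold / buy).
struct Lcg(u64);

impl Lcg {
    fn next_u32(&mut self) -> u32 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 33) as u32
    }
    fn action(&mut self) -> i32 {
        (self.next_u32() % 3) as i32 - 1
    }
}

// Discounted return of one candidate plan under the model rollout.
fn rollout<F>(step: &F, s0: f64, plan: &[i32], gamma: f64) -> f64
where
    F: Fn(f64, i32) -> (f64, f64), // (next_state, reward)
{
    let mut s = s0;
    let mut ret = 0.0;
    for (h, &a) in plan.iter().enumerate() {
        let (s2, r) = step(s, a);
        ret += gamma.powi(h as i32) * r;
        s = s2;
    }
    ret
}

// Sample n random plans of horizon h; execute only the best plan's first action.
fn random_shooting<F>(step: &F, s0: f64, n: usize, h: usize, gamma: f64, seed: u64) -> i32
where
    F: Fn(f64, i32) -> (f64, f64),
{
    let mut rng = Lcg(seed);
    let mut best = (f64::NEG_INFINITY, 0);
    for _ in 0..n {
        let plan: Vec<i32> = (0..h).map(|_| rng.action()).collect();
        let ret = rollout(step, s0, &plan, gamma);
        if ret > best.0 {
            best = (ret, plan[0]);
        }
    }
    best.1
}
```

With a toy model whose state is a persistent trend and whose per-step reward is the position times the trend, the planner buys in an uptrend and sells in a downtrend, illustrating the receding-horizon loop.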

Cross-Entropy Method (CEM): An iterative refinement of random shooting:

  1. Initialize a Gaussian distribution over action sequences: $\mu_0, \sigma_0$.
  2. Sample $N$ action sequences from $\mathcal{N}(\mu, \sigma)$.
  3. Evaluate each sequence via world model rollouts.
  4. Select the top-$K$ (elite) sequences.
  5. Refit $\mu, \sigma$ to the elite set.
  6. Repeat for $M$ iterations.

CEM converges to better solutions than pure random shooting, especially in high-dimensional action spaces.
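The six CEM steps above can be sketched for continuous position sizing in $[-1, 1]$. The Box-Muller Gaussian sampler over a small LCG and the variance floor are illustrative choices of this sketch, not taken from the chapter's `MPCPlanner`.

```rust
// Cross-entropy method planner sketch with continuous actions in [-1, 1].
struct Rng64(u64);

impl Rng64 {
    fn uniform(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 11) as f64) / ((1u64 << 53) as f64)
    }
    fn gaussian(&mut self) -> f64 {
        let (u1, u2) = (self.uniform().max(1e-12), self.uniform());
        (-2.0 * u1.ln()).sqrt() * (std::f64::consts::TAU * u2).cos()
    }
}

// `reward` scores a whole plan (cumulative model reward of its rollout).
fn cem_plan<F>(reward: &F, h: usize, n: usize, k: usize, iters: usize, seed: u64) -> Vec<f64>
where
    F: Fn(&[f64]) -> f64,
{
    let mut rng = Rng64(seed);
    let (mut mu, mut sigma) = (vec![0.0; h], vec![0.5; h]);
    for _ in 0..iters {
        // Steps 2-3: sample n plans from N(mu, sigma) and score each one.
        let mut scored: Vec<(f64, Vec<f64>)> = (0..n)
            .map(|_| {
                let plan: Vec<f64> = (0..h)
                    .map(|t| (mu[t] + sigma[t] * rng.gaussian()).clamp(-1.0, 1.0))
                    .collect();
                (reward(&plan), plan)
            })
            .collect();
        // Steps 4-5: keep the top-k elites and refit mu, sigma to them.
        scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
        let elites = &scored[..k];
        for t in 0..h {
            let m = elites.iter().map(|(_, p)| p[t]).sum::<f64>() / k as f64;
            let v = elites.iter().map(|(_, p)| (p[t] - m).powi(2)).sum::<f64>() / k as f64;
            mu[t] = m;
            sigma[t] = v.sqrt().max(0.05); // floor keeps some exploration alive
        }
    }
    mu // mu[0] is the action to execute under the receding horizon
}
```

On a quadratic planning objective the mean converges to the optimum within a few iterations, which is why CEM typically needs far fewer samples than pure random shooting at the same quality.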

Model Uncertainty Estimation

A single deterministic world model can produce overconfident predictions in unfamiliar regions of state space. To address this, we use an ensemble of $B$ world models $\{f_{\theta_1}, \ldots, f_{\theta_B}\}$ and estimate uncertainty as:

$$\text{Var}[\hat{s}_{t+1}] = \frac{1}{B} \sum_{b=1}^{B} \left( f_{\theta_b}(s_t, a_t) - \bar{f}(s_t, a_t) \right)^2$$

where $\bar{f}$ is the ensemble mean. When uncertainty exceeds a threshold, the agent should prefer conservative actions or fall back to model-free behavior.
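Computing the disagreement and gating on it is straightforward. The sketch below assumes the $B$ model predictions have already been collected; the hold-on-uncertainty fallback is one possible conservative policy, not the only option.

```rust
// Ensemble disagreement as per-dimension prediction variance, plus a gate
// that falls back to a conservative action (hold) when disagreement is high.
fn ensemble_variance(preds: &[Vec<f64>]) -> Vec<f64> {
    let b = preds.len() as f64;
    let dim = preds[0].len();
    let mean: Vec<f64> = (0..dim)
        .map(|i| preds.iter().map(|p| p[i]).sum::<f64>() / b)
        .collect();
    (0..dim)
        .map(|i| preds.iter().map(|p| (p[i] - mean[i]).powi(2)).sum::<f64>() / b)
        .collect()
}

fn gated_action(planned: i32, var: &[f64], threshold: f64) -> i32 {
    let max_var = var.iter().cloned().fold(0.0, f64::max);
    if max_var > threshold { 0 } else { planned } // 0 = hold
}
```

During MPC rollouts the same gate can also be applied per step, truncating a planned trajectory once the ensemble starts to disagree.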

Applications in Trading

Multi-Step Trade Planning

Planning RL enables sophisticated multi-step strategies:

  • Entry/Exit Optimization: Instead of making isolated buy/sell decisions, the agent plans a complete trade lifecycle: entry point, position scaling, profit targets, and stop-loss placement.
  • Portfolio Rebalancing: The agent simulates the impact of rebalancing actions over multiple steps to minimize transaction costs while tracking target allocations.
  • Spread Trading: Planning across correlated assets to identify and exploit temporary mispricings.

Market Impact Modeling

For institutional-size orders, the world model explicitly captures market impact:

  • Temporary Impact: The immediate price displacement caused by an order, proportional to $\sqrt{V/\text{ADV}}$ where $V$ is order volume and ADV is average daily volume.
  • Permanent Impact: The lasting shift in the equilibrium price due to information leakage.
  • Decay Dynamics: How temporary impact decays over time.

By incorporating impact into the world model, the MPC planner can optimize execution schedules (similar to Almgren-Chriss but adaptive and learned from data).
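As a toy illustration of why planners split large orders, the square-root temporary-impact cost of a candidate execution schedule can be evaluated directly; `k_impact` below is a hypothetical calibration constant, not a value from the chapter.

```rust
// Square-root temporary impact: estimated cost of an execution schedule that
// splits a parent order into child slices, each costing v * k * sqrt(v / ADV).
fn schedule_cost(slices: &[f64], adv: f64, k_impact: f64) -> f64 {
    slices.iter().map(|&v| v * k_impact * (v / adv).sqrt()).sum()
}
```

Under this model, splitting a parent order into four equal slices halves the estimated temporary-impact cost, which is exactly the kind of trade-off against price-drift risk that the MPC planner with a learned impact model searches over.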

Regime-Aware Planning

The world model can learn regime-dependent dynamics:

  • In trending markets, plan momentum-following strategies with longer horizons.
  • In mean-reverting markets, plan contrarian entries with shorter horizons.
  • In high-volatility regimes, reduce planning horizon and position size.

Rust Implementation

The implementation in rust/src/lib.rs provides:

  • WorldModel: A neural-network-inspired transition and reward model with backpropagation-based training.
  • DynaQAgent: A tabular Q-learning agent augmented with model-based rollouts via the world model.
  • MPCPlanner: A Model Predictive Control planner supporting both random shooting and CEM optimization.
  • ModelEnsemble: An ensemble of world models for uncertainty estimation.
  • BybitClient: An HTTP client for fetching live kline data from the Bybit exchange.

Architecture

Market Data (Bybit API)
|
v
World Model Learning
|
v
Planning (MPC / Dyna-Q)
|
v
Action Selection
|
v
Execution & Feedback

The world model is trained on historical transitions (state, action, reward, next_state) constructed from OHLCV data. States include normalized price returns, volume ratios, and technical indicators. Actions are discretized into {Buy, Hold, Sell} with continuous position sizing.

Bybit Data Integration

The implementation connects to Bybit’s public REST API to fetch historical kline (candlestick) data:

GET https://api.bybit.com/v5/market/kline?category=linear&symbol=BTCUSDT&interval=60&limit=200

The response provides OHLCV data which is transformed into trading states:

  • Price features: Returns, log-returns, moving average crossovers
  • Volume features: Volume relative to moving average
  • Volatility features: Rolling standard deviation of returns

This real data feeds both the world model training pipeline and the evaluation of planned actions against actual market outcomes.
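The feature transforms listed above can be sketched directly from close and volume series. The layout is hypothetical; the chapter's actual state vector may differ.

```rust
// Log-return between consecutive closes.
fn log_return(closes: &[f64], t: usize) -> f64 {
    (closes[t] / closes[t - 1]).ln()
}

// Current volume relative to its trailing moving average.
fn volume_ratio(volumes: &[f64], t: usize, window: usize) -> f64 {
    let ma: f64 = volumes[t - window..t].iter().sum::<f64>() / window as f64;
    volumes[t] / ma
}

// Rolling standard deviation of log-returns (a simple volatility feature).
fn rolling_vol(closes: &[f64], t: usize, window: usize) -> f64 {
    let rets: Vec<f64> = (t - window + 1..=t).map(|i| log_return(closes, i)).collect();
    let m = rets.iter().sum::<f64>() / rets.len() as f64;
    (rets.iter().map(|r| (r - m).powi(2)).sum::<f64>() / rets.len() as f64).sqrt()
}
```

Normalizing features this way (ratios and returns rather than raw prices) keeps the state distribution roughly stationary across price levels, which makes the world model's job considerably easier.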

Running the Example

cd 312_planning_rl_trading/rust
cargo build
cargo run --example trading_example

The example:

  1. Fetches BTCUSDT hourly data from Bybit
  2. Constructs state representations from OHLCV data
  3. Trains a world model on historical transitions
  4. Runs MPC planning to select optimal actions
  5. Compares model-based planning against a model-free baseline
  6. Reports cumulative returns and Sharpe ratios

Key Takeaways

  1. Sample Efficiency: Model-based RL with planning can achieve the same performance as model-free methods with 10-100x fewer real environment interactions. For trading, where each interaction involves real money, this is transformative.

  2. Look-Ahead Capability: MPC enables the agent to reason about multi-step consequences before acting. This is especially valuable for execution optimization and avoiding adverse market impact.

  3. Dyna-Q as a Bridge: The Dyna architecture provides a natural bridge between model-free and model-based RL. By tuning the number of simulated rollouts per real step, practitioners can smoothly trade off computation for sample efficiency.

  4. Uncertainty Matters: Ensemble-based uncertainty estimation is not optional---it is essential. Without it, model errors compound during rollouts, leading to catastrophically overconfident plans.

  5. Planning Horizon Trade-off: Longer planning horizons enable more sophisticated strategies but are more sensitive to model errors. In practice, horizons of 5-20 steps work well for hourly trading; shorter horizons are preferred for higher frequencies.

  6. Market Impact Integration: One of the most compelling applications of planning RL is incorporating learned market impact models into the planning loop, enabling intelligent execution that model-free methods cannot achieve.

  7. Regime Adaptivity: The world model naturally captures regime-dependent dynamics. As the model is updated with recent data, the planner’s behavior adapts to the current market regime without explicit regime detection.

References

  • Sutton, R. S. (1991). “Dyna, an Integrated Architecture for Learning, Planning, and Reacting.” SIGART Bulletin.
  • Chua, K., et al. (2018). “Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models.” NeurIPS.
  • Hafner, D., et al. (2020). “Dream to Control: Learning Behaviors by Latent Imagination.” ICLR.
  • Almgren, R. & Chriss, N. (2001). “Optimal Execution of Portfolio Transactions.” Journal of Risk.
  • Schrittwieser, J., et al. (2020). “Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model.” Nature.