Chapter 31: Optimal Execution with Reinforcement Learning — Beating TWAP/VWAP
Overview
Executing a large order involves a trade-off between market impact (executing quickly) and timing risk (executing slowly). Classical TWAP/VWAP algorithms follow deterministic schedules. In this chapter we train an RL agent that adaptively chooses the execution rate.
Trading Strategy
Strategy in a nutshell: the RL agent decides how many shares to execute at each point in time, minimizing implementation shortfall (the difference between the decision price and the execution price).
State: order book state, remaining quantity, time left, recent price moves
Action: quantity to execute in the current interval (0% to 100% of remaining)
Reward: negative implementation shortfall minus a risk penalty
Edge: adaptation to current market conditions vs a static schedule
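To make the objective concrete, here is a minimal sketch of how implementation shortfall could be computed from the fills of a parent order; the function name, the sign convention, and the example numbers are illustrative assumptions, not part of the chapter's notebooks.

```python
import numpy as np

def implementation_shortfall(fill_prices, fill_sizes, arrival_price, side=1):
    """
    Shortfall of a parent order vs the arrival (decision) price.
    side = +1 for buys, -1 for sells. Returns (per-share cost, cost in bps).
    """
    fill_prices = np.asarray(fill_prices, dtype=float)
    fill_sizes = np.asarray(fill_sizes, dtype=float)
    avg_fill = np.average(fill_prices, weights=fill_sizes)
    per_share = side * (avg_fill - arrival_price)
    return per_share, 1e4 * per_share / arrival_price

# Illustrative numbers only: a buy order filled in three child orders.
print(implementation_shortfall([100.02, 100.05, 100.08], [3000, 4000, 3000], 100.0))
```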
Technical Specification
Notebooks to Create
| # | Notebook | Description |
|---|---|---|
| 1 | 01_execution_theory.ipynb | Almgren-Chriss model, implementation shortfall |
| 2 | 02_market_impact_models.ipynb | Linear, square-root, transient impact models |
| 3 | 03_twap_vwap_baselines.ipynb | Implementation of the classical algorithms (sketched below this table) |
| 4 | 04_gym_environment.ipynb | Custom Gym environment for execution |
| 5 | 05_reward_function.ipynb | Reward design: shortfall + risk penalty |
| 6 | 06_dqn_agent.ipynb | Deep Q-Network for discrete actions |
| 7 | 07_ppo_agent.ipynb | PPO для continuous action space |
| 8 | 08_training_curriculum.ipynb | Curriculum learning: easy → hard scenarios |
| 9 | 09_evaluation.ipynb | Comparison vs TWAP/VWAP/Almgren-Chriss |
| 10 | 10_robustness.ipynb | Out-of-sample, different market conditions |
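As a reference for notebook 03, a minimal sketch of the two static baselines; the function names and interfaces are assumptions for illustration.

```python
import numpy as np

def twap_schedule(total_shares, n_intervals):
    # TWAP: equal child-order size in every interval.
    return np.full(n_intervals, total_shares / n_intervals)

def vwap_schedule(total_shares, volume_profile):
    # VWAP: child-order sizes proportional to the (forecast) intraday volume profile.
    profile = np.asarray(volume_profile, dtype=float)
    return total_shares * profile / profile.sum()

# Illustrative: 100k shares over 10 intervals with a U-shaped volume profile.
print(twap_schedule(100_000, 10))
print(vwap_schedule(100_000, [3, 2, 1, 1, 1, 1, 1, 1, 2, 3]))
```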
Data Requirements
```
Order Book Data:
├── L2 snapshots (1-second frequency)
├── Trade prints with sizes
├── At least 6 months for training
└── Liquid instruments (ES futures, SPY, AAPL)
```
```
Execution Scenarios:
├── Order sizes: 1%, 5%, 10% of ADV
├── Time horizons: 5min, 30min, 1hour, 1day
├── Market conditions: normal, volatile, trending
└── Simulated orders with various urgency levels
```
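A simple way to enumerate this training grid; the values are taken from the list above, while the dictionary layout is an assumption.

```python
from itertools import product

# Scenario grid from the data requirements above.
order_sizes_pct_adv = [0.01, 0.05, 0.10]
horizons = ["5min", "30min", "1hour", "1day"]
regimes = ["normal", "volatile", "trending"]

scenarios = [
    {"order_size_pct_adv": s, "horizon": h, "regime": r}
    for s, h, r in product(order_sizes_pct_adv, horizons, regimes)
]
print(len(scenarios))  # 3 * 4 * 3 = 36 training scenarios
```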
Gym Environment Design

```python
import gymnasium as gym
import numpy as np


class ExecutionEnv(gym.Env):
    """Environment for optimal execution of a single parent order."""

    def __init__(self, order_size, time_horizon, market_data, lambda_risk=1e-6):
        self.total_shares = order_size
        self.remaining_shares = order_size
        self.time_horizon = time_horizon
        self.max_steps = time_horizon
        self.current_step = 0
        self.market_data = market_data
        self.lambda_risk = lambda_risk          # risk-aversion weight for the penalty term
        # Arrival price and volatility at decision time, simplified to scalar fields here;
        # notebook 04 derives them from the L2 feed.
        self.arrival_price = market_data['arrival_price']
        self.volatility = market_data['volatility']

        # Action: fraction of remaining shares to execute (0 to 1)
        self.action_space = gym.spaces.Box(0.0, 1.0, shape=(1,), dtype=np.float32)

        # State: [remaining_frac, time_frac, spread, volatility, imbalance, momentum, ...]
        # padded to 10 features
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(10,), dtype=np.float32)

    def step(self, action):
        frac = float(np.clip(action[0], 0.0, 1.0))
        execute_qty = frac * self.remaining_shares
        execution_price = self._simulate_execution(execute_qty)

        # Implementation shortfall of this slice (buy-side sign convention)
        shortfall = (execution_price - self.arrival_price) * execute_qty

        # Risk penalty for inventory still exposed to price moves
        risk_penalty = self.lambda_risk * self.remaining_shares * self.volatility

        reward = -shortfall - risk_penalty

        self.remaining_shares -= execute_qty
        self.current_step += 1

        terminated = self.remaining_shares <= 0
        truncated = self.current_step >= self.max_steps

        # _simulate_execution() and _get_state() are built in notebook 04
        # on top of the order book data.
        return self._get_state(), reward, terminated, truncated, {}
```
Market Impact Model

```python
import numpy as np


def temporary_impact(quantity, params):
    """
    Temporary impact: affects only the current trade.
    Linear-in-volume model.
    """
    return params['eta'] * quantity / params['adv']


def permanent_impact(quantity, params):
    """
    Permanent impact: shifts the price for future trades.
    Square-root model (more realistic).
    """
    return params['gamma'] * np.sign(quantity) * np.sqrt(abs(quantity) / params['adv'])


def total_cost(execution_schedule, params):
    """
    Total execution cost = sum of (temporary + permanent) impact costs
    relative to the arrival price.
    """
    cost = 0.0
    price = params['initial_price']
    for qty in execution_schedule:
        temp = temporary_impact(qty, params)
        perm = permanent_impact(qty, params)
        # Half of the permanent impact is paid on the trade itself
        execution_price = price + temp + perm / 2
        cost += qty * (execution_price - params['arrival_price'])
        price += perm
    return cost
```
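A quick usage sketch of the model above. The parameter values and the two candidate schedules are arbitrary illustrations; which schedule is cheaper depends entirely on the chosen impact parameters, which is exactly the trade-off the RL agent has to learn.

```python
# Hypothetical parameters chosen purely for illustration.
params = {'eta': 0.1, 'gamma': 0.05, 'adv': 1_000_000,
          'initial_price': 100.0, 'arrival_price': 100.0}

front_loaded = [50_000] + [0] * 9   # execute everything in the first interval
twap_like = [5_000] * 10            # spread evenly over ten intervals

print(total_cost(front_loaded, params))
print(total_cost(twap_like, params))
```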
RL Algorithms

```
DQN (Discrete Actions):
├── Actions: [0%, 10%, 20%, ..., 100%] of remaining
├── Network: State → Q-values for each action
├── Experience replay, target network
└── Good for: interpretability, simpler training

PPO (Continuous Actions):
├── Actions: continuous [0, 1] fraction
├── Actor-Critic architecture
├── Clipped objective for stability
└── Good for: fine-grained control

SAC (Soft Actor-Critic):
├── Entropy regularization
├── Better exploration
└── Good for: complex market dynamics
```
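For notebook 07, training could follow the standard stable-baselines3 pattern. This is a minimal sketch that assumes the ExecutionEnv above has been completed with reset() and the order-book helpers; the hyperparameters and the `train_market_data` object are placeholders.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Assumes ExecutionEnv is importable and fully implements reset()/step().
env = make_vec_env(
    lambda: ExecutionEnv(order_size=100_000, time_horizon=30,
                         market_data=train_market_data),
    n_envs=8,
)

model = PPO("MlpPolicy", env, learning_rate=3e-4, n_steps=256, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("ppo_execution_agent")
```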
Almgren-Chriss Baseline

```python
import numpy as np


def almgren_chriss_optimal(total_shares, time_horizon, risk_aversion,
                           volatility, impact_params):
    """
    Closed-form optimal execution trajectory (Almgren-Chriss).
    Trades off impact cost vs timing risk.
    Returns the number of shares to trade in each of the time_horizon intervals.
    """
    kappa = np.sqrt(risk_aversion * volatility**2 / impact_params['eta'])

    # Optimal remaining holdings: x_t = X * sinh(kappa * (T - t)) / sinh(kappa * T)
    holdings = [
        total_shares * np.sinh(kappa * (time_horizon - t)) / np.sinh(kappa * time_horizon)
        for t in range(time_horizon + 1)
    ]

    # The trade in interval t is the decrease in holdings over that interval
    trajectory = [holdings[t] - holdings[t + 1] for t in range(time_horizon)]
    return trajectory
```
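A small sanity check on the baseline (the parameter values are arbitrary): the per-interval trades should sum back to the parent order.

```python
schedule = almgren_chriss_optimal(
    total_shares=100_000, time_horizon=30,
    risk_aversion=1e-6, volatility=0.02,
    impact_params={'eta': 2.5e-6},
)
print(round(sum(schedule)))              # 100000 — the schedule exhausts the order
print([round(q) for q in schedule[:5]])  # earlier slices are slightly larger (front-loaded)
```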
Key Metrics

- Execution Quality: Implementation Shortfall, Arrival Price Slippage
- Comparison: vs TWAP, VWAP, Almgren-Chriss, Optimal IS
- Risk-Adjusted: Sharpe ratio of the execution cost distribution (see the sketch below this list)
- Robustness: Performance across different volatility regimes
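A minimal sketch of how these metrics might be aggregated across evaluation episodes; the function name is an assumption, and the per-episode shortfall arrays are placeholders standing in for replays of the agent and the TWAP baseline on the same scenarios.

```python
import numpy as np

def summarize_costs(shortfalls_bps):
    """Mean, dispersion, and a Sharpe-like ratio of the execution cost distribution."""
    costs = np.asarray(shortfalls_bps, dtype=float)
    return {
        'mean_bps': costs.mean(),
        'std_bps': costs.std(ddof=1),
        'cost_sharpe': -costs.mean() / costs.std(ddof=1),  # lower, steadier cost is better
    }

# Placeholder per-episode shortfalls (in bps) for the RL agent and the TWAP baseline.
rl_costs = [4.1, 5.3, 3.8, 6.0, 4.4]
twap_costs = [5.0, 6.1, 4.9, 7.2, 5.5]
print(summarize_costs(rl_costs))
print(summarize_costs(twap_costs))
```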
Dependencies
```
gymnasium>=0.29.0
stable-baselines3>=2.1.0
torch>=2.0.0
pandas>=1.5.0
numpy>=1.23.0
matplotlib>=3.6.0
```

Expected Outcomes
- Custom Gym environment for optimal execution
- Market impact models (linear, square-root, transient)
- RL agents (DQN, PPO) trained on the execution task
- Comparison framework vs classical algorithms
- Target: 5-15% improvement over TWAP in implementation shortfall
References
- Optimal Execution of Portfolio Transactions (Almgren-Chriss, 2000)
- Optimal Execution with Reinforcement Learning
- Deep Reinforcement Learning for Optimal Execution
Difficulty Level
⭐⭐⭐⭐☆ (Advanced)
Requires understanding of: Reinforcement Learning, Market Microstructure, Execution Algorithms