Chapter 309: Reward Shaping for Trading
Introduction
Reinforcement learning (RL) has emerged as a powerful paradigm for building autonomous trading agents. However, one of the most critical and often overlooked challenges in applying RL to financial markets is the design of the reward function. A naive reward signal — such as raw profit and loss (PnL) — can lead to agents that overfit to specific market conditions, take excessive risks, or fail to learn meaningful policies due to sparse and delayed feedback.
Reward shaping addresses this problem by augmenting the base reward signal with additional terms that guide the agent toward desirable behaviors without changing the optimal policy. The key insight from the seminal work of Ng, Harada, and Russell (1999) is that potential-based reward shaping preserves policy invariance: the optimal policy under the shaped reward is identical to the optimal policy under the original reward, but learning can be dramatically faster.
In the context of trading, reward shaping allows us to encode domain knowledge — such as preferences for risk-adjusted returns, aversion to drawdowns, and sensitivity to transaction costs — directly into the learning signal. This chapter explores the theory behind reward shaping, presents several practical reward designs for trading, and provides a complete Rust implementation with integration to Bybit exchange data.
Mathematical Foundations
The Reward Shaping Framework
Consider a Markov Decision Process (MDP) defined by the tuple (S, A, T, R, gamma), where S is the state space, A is the action space, T is the transition function, R is the reward function, and gamma is the discount factor.
A shaped reward function R' is defined as:

R'(s, a, s') = R(s, a, s') + F(s, a, s')

where F is the shaping function that provides additional feedback to the agent.
Potential-Based Shaping
The critical theoretical result is that potential-based shaping guarantees policy invariance. The shaping function takes the form:
F(s, a, s') = gamma * Phi(s') - Phi(s)

where Phi: S -> R is a real-valued potential function defined over states. This formulation ensures that:
- Policy Invariance Theorem: For any MDP M and any potential function Phi, the optimal policy under the shaped reward R' = R + F is identical to the optimal policy under the original reward R. This holds because the potential terms telescope across trajectories, contributing only boundary terms that do not affect the relative ordering of policies.
- Convergence Guarantee: Q-learning and SARSA with potential-based shaping converge to the same optimal policy as without shaping, with Q-values shifted by a constant offset per state. Specifically, Q'*(s, a) = Q*(s, a) - Phi(s), where Q'* and Q* are the optimal Q-values under shaped and unshaped rewards respectively.
- Faster Learning: While the optimal policy is preserved, the shaped reward can dramatically reduce the number of episodes needed for convergence by providing denser feedback and reducing the variance of reward estimates.
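The shaping term itself is a one-liner in code. A minimal sketch (the function name is illustrative, not a library interface):

```rust
/// Potential-based shaping: F(s, s') = gamma * Phi(s') - Phi(s).
/// `phi_s` and `phi_s_next` are the potentials of the current and next state.
fn shaping_bonus(gamma: f64, phi_s: f64, phi_s_next: f64) -> f64 {
    gamma * phi_s_next - phi_s
}

fn main() {
    // Moving toward a higher-potential state yields a positive bonus.
    let f = shaping_bonus(0.99, 1.0, 2.0);
    println!("shaping bonus: {f:.3}"); // 0.99 * 2.0 - 1.0 = 0.980
}
```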
Proof Sketch of Policy Invariance
For any trajectory tau = (s_0, a_0, s_1, a_1, ..., s_T), the discounted sum of shaping terms telescopes:

Sum_{t=0}^{T-1} gamma^t * F(s_t, a_t, s_{t+1}) = Sum_{t=0}^{T-1} gamma^t * [gamma * Phi(s_{t+1}) - Phi(s_t)] = gamma^T * Phi(s_T) - Phi(s_0)

Every intermediate potential cancels, leaving only the boundary terms. In the discounted infinite-horizon case (gamma < 1), the shaped return therefore differs from the original return by the constant -Phi(s_0), which depends only on the initial state and not on the policy. The argmax over policies is thus preserved.
Non-Potential-Based Shaping and Risks
If the shaping function F is not derived from a potential, policy invariance is not guaranteed. Common pitfalls include:
- Positive reward cycles: The agent may find loops in state space that accumulate shaping reward without making progress.
- Policy distortion: The optimal policy under shaped rewards may differ from the true optimal policy.
- Convergence to suboptimal equilibria: The agent may settle on behaviors that exploit the shaping reward rather than optimizing the true objective.
In practice, many useful trading reward modifications (such as drawdown penalties) are not strictly potential-based. We address this by analyzing the trade-offs and providing guidelines for when non-potential-based shaping is acceptable.
Reward Designs for Trading
1. Base Reward: Raw PnL
The simplest reward is the change in portfolio value:
R_pnl(t) = V(t) - V(t-1)

where V(t) is the portfolio value at time t. While intuitive, this reward:
- Provides no risk adjustment
- Can lead to high-variance policies
- Does not penalize drawdowns or excessive trading
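For a single instrument, the PnL reward reduces to the signed position times the price change. A minimal sketch (names are illustrative):

```rust
/// Raw PnL reward: change in portfolio value from holding `position`
/// units through one price step (no costs, no risk adjustment).
fn pnl_reward(position: f64, prev_price: f64, price: f64) -> f64 {
    position * (price - prev_price)
}

fn main() {
    // Long 2 units through a 1.5 move: reward = 2 * 1.5 = 3.0
    println!("{}", pnl_reward(2.0, 100.0, 101.5));
}
```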
2. Sharpe Ratio-Based Shaping
The Sharpe ratio measures risk-adjusted returns. We can incorporate it through a potential function:
Phi_sharpe(s) = k * rolling_sharpe(s, window)

where rolling_sharpe computes the Sharpe ratio over a recent window of returns, and k is a scaling constant. The shaping reward becomes:

F_sharpe(s, s') = gamma * k * sharpe(s') - k * sharpe(s)

This encourages the agent to transition to states with higher risk-adjusted returns. When the rolling Sharpe ratio improves, the agent receives a positive shaping bonus; when it deteriorates, the agent is penalized.
Design considerations:
- Window size affects sensitivity: shorter windows (20-50 steps) are more responsive but noisier
- The scaling factor k should be calibrated relative to the magnitude of PnL rewards
- Annualization factor: multiply by sqrt(252) for daily data, sqrt(252 * 24 * 60) for minute data
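One possible implementation of the rolling Sharpe computation, using the sample standard deviation and a caller-supplied annualization factor (the zero returns for short or constant windows are a design assumption, chosen so the potential is always defined):

```rust
/// Sharpe ratio over a window of per-step returns.
/// `annualization` is e.g. 252f64.sqrt() for daily bars.
fn rolling_sharpe(returns: &[f64], annualization: f64) -> f64 {
    let n = returns.len() as f64;
    if n < 2.0 {
        return 0.0; // not enough data: neutral potential
    }
    let mean = returns.iter().sum::<f64>() / n;
    // Sample variance (n - 1 denominator).
    let var = returns.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / (n - 1.0);
    if var == 0.0 {
        return 0.0; // constant returns: avoid division by zero
    }
    annualization * mean / var.sqrt()
}

fn main() {
    let rets = [0.01, -0.005, 0.02, 0.0, 0.007];
    println!("rolling sharpe: {:.3}", rolling_sharpe(&rets, 252f64.sqrt()));
}
```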
3. Drawdown Penalty Shaping
Maximum drawdown is a critical risk metric for traders. We define a drawdown penalty potential:
Phi_dd(s) = -lambda * max(0, peak_value(s) - current_value(s)) / peak_value(s)

where lambda controls the severity of the penalty and peak_value(s) is the historical peak portfolio value. The resulting shaping signal penalizes the agent when drawdown increases and rewards recovery from drawdowns.
Properties:
- The penalty grows as the portfolio moves further from its peak
- Recovery toward the peak generates positive shaping reward
- The parameter lambda allows fine-tuning the risk-return trade-off
- Note: this is approximately potential-based when the peak value changes slowly
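One way to track this potential is a small struct that keeps the running peak (the lambda value below is illustrative, not a calibrated setting):

```rust
/// Drawdown penalty potential: Phi_dd = -lambda * drawdown fraction,
/// where the drawdown is measured against the running peak value.
struct DrawdownPotential {
    lambda: f64,
    peak: f64,
}

impl DrawdownPotential {
    fn new(lambda: f64) -> Self {
        Self { lambda, peak: f64::MIN }
    }

    /// Update the peak and return the potential for the current value.
    fn potential(&mut self, value: f64) -> f64 {
        self.peak = self.peak.max(value);
        let drawdown = (self.peak - value).max(0.0) / self.peak;
        -self.lambda * drawdown
    }
}

fn main() {
    let mut phi = DrawdownPotential::new(10.0); // illustrative lambda
    for v in [100.0, 110.0, 99.0, 104.5] {
        println!("value {v:>6.1} -> Phi_dd {:+.4}", phi.potential(v));
    }
}
```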
4. Transaction Cost Penalty
Excessive trading erodes returns through commissions and slippage. A transaction cost shaping term:
F_tc(s, a, s') = -c * |position_change(a)|

where c is the cost per unit traded. This is not potential-based (it depends on the action), but it directly models the real-world cost structure and is widely used in practice.
Extensions:
- Proportional costs: c * |position_change| * price
- Tiered fee structures: different rates for maker vs taker orders
- Slippage modeling: costs that increase with order size relative to market depth
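The proportional-cost extension can be sketched as follows; the fee rate in the example is only an assumption and should be checked against the exchange's current fee schedule:

```rust
/// Proportional transaction cost penalty: fee charged on the traded
/// notional (units moved times price). Always non-positive.
fn transaction_cost_penalty(prev_position: f64, new_position: f64, price: f64, fee_rate: f64) -> f64 {
    -fee_rate * (new_position - prev_position).abs() * price
}

fn main() {
    // Flipping from long 1 to short 1 trades 2 units of notional.
    // 0.00055 is an assumed taker-style rate, not a quoted figure.
    let penalty = transaction_cost_penalty(1.0, -1.0, 100.0, 0.00055);
    println!("cost penalty: {penalty:.4}");
}
```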
5. Regime-Aware Shaping
Market regimes (trending, mean-reverting, volatile) affect optimal trading strategies. A regime-aware potential function:
Phi_regime(s) = w_regime(s) * base_potential(s)

where w_regime(s) is a regime-dependent weight. For example:
- In trending regimes: increase weight on momentum-following rewards
- In mean-reverting regimes: increase weight on contrarian rewards
- In high-volatility regimes: increase weight on risk penalties
Regime detection can be based on rolling volatility, moving average crossovers, or hidden Markov models.
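A toy version of the rolling-volatility approach: scale risk penalties up in high-volatility regimes and relax them in quiet ones. The thresholds are illustrative assumptions, not calibrated values:

```rust
/// Regime weight from realized volatility of recent returns.
/// Thresholds (2% and 0.5% per step) are illustrative only.
fn regime_weight(recent_returns: &[f64]) -> f64 {
    let n = recent_returns.len() as f64;
    let mean = recent_returns.iter().sum::<f64>() / n;
    let vol = (recent_returns.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n).sqrt();
    if vol > 0.02 {
        2.0 // high-volatility regime: double the risk penalties
    } else if vol < 0.005 {
        0.5 // quiet regime: relax them
    } else {
        1.0 // normal regime: leave the base potential unchanged
    }
}

fn main() {
    let calm = [0.001, -0.001, 0.002, 0.0];
    println!("weight in calm regime: {}", regime_weight(&calm));
}
```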
6. Composite Reward Function
In practice, we combine multiple shaping terms:
R_total(s, a, s') = R_pnl(s, a, s') + alpha * F_sharpe(s, s') + beta * F_dd(s, s') + delta * F_tc(s, a, s')

where alpha, beta, and delta are hyperparameters controlling the relative importance of each component. These can be tuned via:
- Grid search over validation episodes
- Multi-objective optimization (Pareto frontier analysis)
- Adaptive weighting based on training progress
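The composite formula maps directly to code. A sketch with the weights gathered in one struct (the names mirror the formula, not the library's actual types):

```rust
/// Weights for the composite reward R_total.
struct CompositeWeights {
    alpha: f64, // Sharpe shaping weight
    beta: f64,  // drawdown penalty weight
    delta: f64, // transaction cost weight
}

/// R_total = R_pnl + alpha * F_sharpe + beta * F_dd + delta * F_tc
fn composite_reward(pnl: f64, f_sharpe: f64, f_dd: f64, f_tc: f64, w: &CompositeWeights) -> f64 {
    pnl + w.alpha * f_sharpe + w.beta * f_dd + w.delta * f_tc
}

fn main() {
    let w = CompositeWeights { alpha: 0.5, beta: 1.0, delta: 1.0 };
    let r = composite_reward(1.2, 0.3, -0.1, -0.05, &w);
    println!("R_total = {r:.3}");
}
```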
Rust Implementation
The implementation in rust/src/lib.rs provides:
- RewardShaper trait: A common interface for all reward shaping strategies, with methods for computing shaped rewards and updating internal state.
- PnlReward: Base PnL reward computation from price changes and positions.
- SharpeShapingReward: Potential-based shaping using rolling Sharpe ratio, with configurable window size and scaling factor.
- DrawdownPenaltyReward: Drawdown-based shaping that tracks peak portfolio value and penalizes deviations.
- TransactionCostPenalty: Action-dependent cost modeling with configurable fee rates.
- CompositeReward: Combines multiple reward components with weighted aggregation.
- QLearningAgent: A tabular Q-learning agent that supports shaped rewards, with epsilon-greedy exploration and configurable learning parameters.
- BybitClient: HTTP client for fetching historical kline data from the Bybit API.
The implementation emphasizes:
- Type safety through Rust’s strong type system
- Zero-cost abstractions for reward composition
- Efficient state discretization for Q-learning
- Comprehensive unit testing
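The core update the QLearningAgent performs can be sketched with a HashMap-backed Q-table; field and method names here are illustrative, not the crate's actual API:

```rust
use std::collections::HashMap;

/// Minimal tabular Q-learning with a shaped reward.
struct QTable {
    q: HashMap<(usize, usize), f64>, // (discretized state, action) -> value
    alpha: f64,                      // learning rate
    gamma: f64,                      // discount factor
}

impl QTable {
    /// max_a Q(state, a), treating unseen entries as 0.
    fn best_value(&self, state: usize, n_actions: usize) -> f64 {
        (0..n_actions)
            .map(|a| *self.q.get(&(state, a)).unwrap_or(&0.0))
            .fold(f64::MIN, f64::max)
    }

    /// Standard Q-learning update, with `shaped_reward` = R + F.
    fn update(&mut self, s: usize, a: usize, shaped_reward: f64, s_next: usize, n_actions: usize) {
        let target = shaped_reward + self.gamma * self.best_value(s_next, n_actions);
        let entry = self.q.entry((s, a)).or_insert(0.0);
        *entry += self.alpha * (target - *entry);
    }
}

fn main() {
    let mut table = QTable { q: HashMap::new(), alpha: 0.1, gamma: 0.99 };
    // One shaped transition: state 0, action 1, shaped reward 1.5, next state 2.
    table.update(0, 1, 1.5, 2, 3);
    println!("Q(0,1) = {:.3}", table.q[&(0, 1)]);
}
```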
Bybit Data Integration
The system fetches historical OHLCV data from Bybit’s public API:
GET https://api.bybit.com/v5/market/kline?category=linear&symbol=BTCUSDT&interval=60&limit=200

The data pipeline:
- Fetch kline data with configurable symbol, interval, and limit
- Parse JSON response into typed Rust structures
- Compute derived features (returns, volatility, moving averages)
- Feed into the trading simulation environment
- Run Q-learning episodes with different reward shaping configurations
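Step 3 of this pipeline (derived features) can be sketched with standard-library code alone; fetching and parsing the kline JSON requires HTTP and JSON crates and is omitted here:

```rust
/// Log returns from a series of close prices.
fn log_returns(closes: &[f64]) -> Vec<f64> {
    closes.windows(2).map(|w| (w[1] / w[0]).ln()).collect()
}

/// Rolling volatility (population standard deviation) over `window` returns.
fn rolling_vol(returns: &[f64], window: usize) -> Vec<f64> {
    returns
        .windows(window)
        .map(|w| {
            let mean = w.iter().sum::<f64>() / window as f64;
            (w.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / window as f64).sqrt()
        })
        .collect()
}

fn main() {
    // Stand-in for parsed kline closes.
    let closes = [100.0, 101.0, 99.5, 100.2, 100.9];
    let rets = log_returns(&closes);
    let vols = rolling_vol(&rets, 3);
    println!("returns: {rets:?}\nrolling vol: {vols:?}");
}
```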
Experimental Comparison
The trading example (rust/examples/trading_example.rs) demonstrates the impact of reward shaping by training three agents:
1. Raw PnL Agent: Uses only the price-change reward. It tends to take aggressive positions and exhibits high variance in performance.
2. Sharpe-Shaped Agent: Adds the rolling Sharpe ratio potential. It learns smoother equity curves with better risk-adjusted returns.
3. Drawdown-Penalized Agent: Adds the drawdown penalty. It is more conservative, with a smaller maximum drawdown and potentially lower total return but a better Sortino ratio.
The comparison metrics include:
- Total return
- Sharpe ratio
- Maximum drawdown
- Sortino ratio
- Number of trades (turnover)
- Win rate
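Of these metrics, the Sortino ratio is the least standardized; one common formulation divides the mean return by the downside deviation below a target of zero (the zero target is an assumption of this sketch):

```rust
/// Sortino ratio with a zero target: mean return divided by the
/// downside deviation (root mean square of negative returns only).
fn sortino(returns: &[f64]) -> f64 {
    let n = returns.len() as f64;
    let mean = returns.iter().sum::<f64>() / n;
    let downside = returns.iter().map(|r| r.min(0.0).powi(2)).sum::<f64>() / n;
    if downside == 0.0 {
        return f64::INFINITY; // no losing steps at all
    }
    mean / downside.sqrt()
}

fn main() {
    let rets = [0.01, -0.02, 0.015, -0.005, 0.01];
    println!("sortino: {:.3}", sortino(&rets));
}
```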
Key Takeaways
- Reward shaping is essential for practical RL trading systems. Raw PnL alone provides insufficient guidance for learning risk-aware policies.
- Potential-based shaping preserves optimality. The policy invariance theorem guarantees that potential-based shaping functions do not alter the optimal policy while accelerating learning.
- Domain knowledge matters. Effective reward shaping encodes financial intuition — Sharpe ratios, drawdown aversion, transaction costs — into the learning signal.
- Composite rewards outperform individual components. Combining multiple shaping terms with carefully tuned weights produces the most robust trading agents.
- Non-potential-based shaping requires caution. Transaction cost penalties and some risk adjustments are not strictly potential-based. Monitor for policy distortion and validate against unshaped baselines.
- Hyperparameter sensitivity matters. The weights on shaping terms (alpha, beta, delta) significantly affect learned behavior. Use systematic tuning with out-of-sample validation.
- Regime awareness improves adaptability. Adjusting reward shaping based on detected market regimes helps agents adapt to changing market conditions.
- Rust provides performance advantages. The computational demands of RL training — many episodes, many time steps — benefit from Rust's zero-cost abstractions and efficient memory management, enabling faster experimentation cycles.
References
- Ng, A. Y., Harada, D., & Russell, S. J. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. ICML.
- Devlin, S., & Kudenko, D. (2012). Dynamic potential-based reward shaping. AAMAS.
- Moody, J., & Saffell, M. (2001). Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks.
- Almahdi, S., & Yang, S. Y. (2017). An adaptive portfolio trading system: A risk-return portfolio optimization using recurrent reinforcement learning with expected maximum drawdown. Expert Systems with Applications.
- Spooner, T., Fearnley, J., Savani, R., & Koukorinis, A. (2018). Market making via reinforcement learning. AAMAS.