Chapter 268: Spread Modeling with ML for Trading
Introduction
The bid-ask spread is one of the most fundamental quantities in market microstructure. It represents the cost of immediacy: the premium a trader pays to execute a transaction right now rather than waiting for a counterparty to arrive. For market makers, the spread is revenue; for institutional investors, it is an execution cost that erodes alpha. Understanding, predicting, and decomposing the spread is therefore critical for both sides of the liquidity provision equation.
Traditional microstructure theory provides elegant models for spread behavior. The Roll (1984) model infers the effective spread from the serial covariance of trade-price changes. The Glosten-Milgrom (1985) and Kyle (1985) models explain the spread and price impact through adverse selection, and later work such as Huang-Stoll (1997) decomposes the spread into adverse selection, inventory risk, and order processing components. However, these models rely on stylized assumptions (normally distributed order flow, constant information asymmetry, single-asset settings) that rarely hold in modern electronic markets.
Machine learning offers a powerful toolkit for extending these classical ideas. ML models can capture nonlinear relationships between spread determinants, adapt to regime changes in volatility and liquidity, and incorporate high-dimensional features from the limit order book. In this chapter, we develop a complete framework for spread modeling with ML, covering spread calculation, decomposition, prediction, and application to trading strategy evaluation.
We implement everything in Rust for performance, and demonstrate integration with Bybit’s API for real-time cryptocurrency orderbook data. The techniques apply broadly to equities, futures, FX, and any market with a visible order book.
Mathematical Foundations
Spread Definitions
The quoted spread is the difference between the best ask and best bid prices at time $t$:
$$S_t^{quoted} = P_t^{ask} - P_t^{bid}$$
The relative spread (or proportional spread) normalizes by the midpoint:
$$S_t^{relative} = \frac{P_t^{ask} - P_t^{bid}}{M_t}, \quad M_t = \frac{P_t^{ask} + P_t^{bid}}{2}$$
The effective spread measures actual execution costs using trade prices:
$$S_t^{effective} = 2 \cdot D_t \cdot (P_t^{trade} - M_t)$$
where $D_t \in \{+1, -1\}$ is the trade direction indicator ($+1$ for a buyer-initiated trade, $-1$ for a seller-initiated trade).
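The three definitions above translate directly into code. The following is a minimal sketch: the function names echo those listed later in the chapter, but the signatures and the `TradeSide` enum are assumptions made for illustration.

```rust
/// Trade direction indicator D_t (hypothetical helper type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum TradeSide {
    Buy,
    Sell,
}

impl TradeSide {
    /// +1 for a buyer-initiated trade, -1 for a seller-initiated trade.
    fn sign(self) -> f64 {
        match self {
            TradeSide::Buy => 1.0,
            TradeSide::Sell => -1.0,
        }
    }
}

/// Midpoint M_t = (ask + bid) / 2.
pub fn midpoint(bid: f64, ask: f64) -> f64 {
    (ask + bid) / 2.0
}

/// Quoted (absolute) spread: ask - bid.
pub fn absolute_spread(bid: f64, ask: f64) -> f64 {
    ask - bid
}

/// Relative spread: quoted spread divided by the midpoint.
pub fn relative_spread(bid: f64, ask: f64) -> f64 {
    absolute_spread(bid, ask) / midpoint(bid, ask)
}

/// Effective spread: 2 * D_t * (trade price - midpoint).
pub fn effective_spread(trade_price: f64, bid: f64, ask: f64, side: TradeSide) -> f64 {
    2.0 * side.sign() * (trade_price - midpoint(bid, ask))
}
```

For a book quoted at 99 / 101, a buy filled at 100.5 and a sell filled at 99.5 both give an effective spread of 1, half the quoted spread of 2, because each trade executed inside the quotes.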
Roll Spread Estimator
Roll (1984) showed that under the assumption of an efficient market with a constant spread, the effective spread can be estimated from the autocovariance of price changes. Let $\Delta P_t = P_t - P_{t-1}$ be the price change. Then:
$$\text{Cov}(\Delta P_t, \Delta P_{t-1}) = -c^2$$
where $c$ is the half-spread. The Roll estimator is:
$$\hat{S}_{Roll} = 2\hat{c} = 2\sqrt{-\text{Cov}(\Delta P_t, \Delta P_{t-1})}$$
When the autocovariance is positive (which happens in trending markets), the Roll estimator is undefined. In practice, we set $\hat{S}_{Roll} = 0$ in that case or use the absolute value with a sign adjustment.
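A minimal sketch of the estimator, assuming a plain slice of trade prices as input and the zero fallback for non-negative autocovariance described above. The function name matches the one listed later in the chapter; the `Option` return type and the use of a mean-adjusted sample covariance are choices made here, not necessarily those of the chapter's implementation.

```rust
/// Roll (1984) spread estimate from a series of trade prices.
///
/// Returns `None` when fewer than three prices are supplied, since at least
/// two price changes are needed to form one lagged pair. Returns `Some(0.0)`
/// when the first-order autocovariance is not negative.
pub fn roll_spread_estimator(prices: &[f64]) -> Option<f64> {
    if prices.len() < 3 {
        return None;
    }

    // Price changes: dp[t] = P_t - P_{t-1}.
    let dp: Vec<f64> = prices.windows(2).map(|w| w[1] - w[0]).collect();

    // Pair each change with the previous one.
    let n = dp.len() - 1;
    let current = &dp[1..];
    let lagged = &dp[..n];

    let mean_current = current.iter().sum::<f64>() / n as f64;
    let mean_lagged = lagged.iter().sum::<f64>() / n as f64;

    // Sample autocovariance at lag one.
    let cov = current
        .iter()
        .zip(lagged)
        .map(|(a, b)| (a - mean_current) * (b - mean_lagged))
        .sum::<f64>()
        / n as f64;

    if cov < 0.0 {
        Some(2.0 * (-cov).sqrt())
    } else {
        Some(0.0)
    }
}
```

A steadily trending series such as 100, 101, 102, 103 has constant price changes, so the autocovariance is zero and the sketch returns zero rather than attempting the square root of a non-positive quantity.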
Spread Decomposition
The spread can be decomposed into three economic components:
- Adverse selection component ($\alpha$): Compensation for trading with informed traders who possess private information. When a market maker transacts with an informed trader, the price moves against them. This component is larger when information asymmetry is high, for example around earnings announcements, during news events, or for illiquid securities.
- Inventory component ($\beta$): Compensation for the risk of holding an unbalanced inventory. Market makers who accumulate a large position face price risk. This component increases with volatility and position size.
- Order processing component ($\gamma$): The fixed costs of providing liquidity, such as exchange fees, technology costs, and opportunity costs of capital. This is the “base” spread that exists even without adverse selection or inventory risk.
The total spread is:
$$S = \alpha + \beta + \gamma$$
The Huang-Stoll (1997) decomposition estimates these components from the joint behavior of trade prices and quote revisions. Let $Q_t$ be the trade direction and $\Delta M_t$ be the change in the midpoint after a trade. Then:
$$\Delta M_t = \frac{S}{2}(\alpha + \beta) Q_t - \frac{S}{2} \beta Q_{t-1} + \epsilon_t$$
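By regressing midpoint changes on current and lagged trade directions, we can estimate $\alpha$ and $\beta$. In this regression they are expressed as fractions of the spread $S$ rather than as the absolute amounts in the sum above, so the order processing share follows as $\gamma = 1 - \alpha - \beta$.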
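As a sketch of how this regression could be carried out, the function below fits the equation exactly as written above, with no intercept and two regressors ($Q_t$ and $Q_{t-1}$), by solving the 2x2 normal equations. The names `SpreadComponents` and `decompose_spread` are hypothetical, and the spread $S$ is assumed to be known (for example, the average quoted spread over the sample).

```rust
/// Estimated spread shares (hypothetical container type).
pub struct SpreadComponents {
    pub alpha: f64, // adverse selection share
    pub beta: f64,  // inventory share
    pub gamma: f64, // order processing share
}

/// Regresses delta_mid[t] on q[t] and q[t-1] without an intercept and maps
/// the two slopes back to the component shares:
///   b1 = (S/2) * (alpha + beta),   b2 = -(S/2) * beta.
pub fn decompose_spread(delta_mid: &[f64], q: &[f64], spread: f64) -> Option<SpreadComponents> {
    if delta_mid.len() != q.len() || q.len() < 3 || spread <= 0.0 {
        return None;
    }

    // Accumulate the entries of X^T X and X^T y for the two regressors.
    let (mut s11, mut s12, mut s22, mut s1y, mut s2y) = (0.0, 0.0, 0.0, 0.0, 0.0);
    for t in 1..q.len() {
        let (x1, x2, y) = (q[t], q[t - 1], delta_mid[t]);
        s11 += x1 * x1;
        s12 += x1 * x2;
        s22 += x2 * x2;
        s1y += x1 * y;
        s2y += x2 * y;
    }

    // Solve the 2x2 system by Cramer's rule.
    let det = s11 * s22 - s12 * s12;
    if det.abs() < 1e-12 {
        return None; // regressors are collinear in this sample
    }
    let b1 = (s22 * s1y - s12 * s2y) / det;
    let b2 = (s11 * s2y - s12 * s1y) / det;

    let half_spread = spread / 2.0;
    let beta = -b2 / half_spread;
    let alpha = b1 / half_spread - beta;
    Some(SpreadComponents { alpha, beta, gamma: 1.0 - alpha - beta })
}
```

On real data the estimated shares are not guaranteed to fall between 0 and 1, so a caller would likely want to check the result before using it.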
Linear Regression for Spread Prediction
We model the spread as a linear function of observable features:
$$S_t = \beta_0 + \beta_1 \sigma_t + \beta_2 V_t + \beta_3 D_t + \beta_4 OI_t + \epsilon_t$$
where:
- $\sigma_t$ is recent realized volatility
- $V_t$ is recent trading volume
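- $D_t$ is order book depth (total visible liquidity); in this equation the symbol denotes depth, not the trade direction indicator used in the effective spread definition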
- $OI_t$ is order imbalance (bid size minus ask size, normalized)
The parameters are estimated by ordinary least squares:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
where $\mathbf{X}$ is the feature matrix and $\mathbf{y}$ is the vector of observed spreads.
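The normal equation can be solved without forming an explicit inverse. The sketch below builds the augmented system $[\mathbf{X}^T\mathbf{X} \mid \mathbf{X}^T\mathbf{y}]$ and applies Gauss-Jordan elimination with partial pivoting. The struct name follows the chapter's `LinearRegression`; the exact signatures, the `Option` return, and the choice of elimination method are assumptions for illustration.

```rust
pub struct LinearRegression {
    /// coefficients[0] is the intercept, followed by one weight per feature.
    pub coefficients: Vec<f64>,
}

impl LinearRegression {
    /// Fits by ordinary least squares. Each row of `x` is one observation.
    /// Returns `None` for empty or ragged input, or a singular X^T X.
    pub fn fit(x: &[Vec<f64>], y: &[f64]) -> Option<Self> {
        let n = x.len();
        if n == 0 || n != y.len() {
            return None;
        }
        let p = x[0].len() + 1; // +1 for the bias column
        if x.iter().any(|row| row.len() + 1 != p) {
            return None;
        }

        // Augmented matrix [X^T X | X^T y], accumulated row by row.
        let mut a = vec![vec![0.0; p + 1]; p];
        for (row, &target) in x.iter().zip(y) {
            let mut z = Vec::with_capacity(p);
            z.push(1.0);
            z.extend_from_slice(row);
            for i in 0..p {
                for j in 0..p {
                    a[i][j] += z[i] * z[j];
                }
                a[i][p] += z[i] * target;
            }
        }

        // Gauss-Jordan elimination with partial pivoting.
        for col in 0..p {
            let pivot = (col..p)
                .max_by(|&i, &j| a[i][col].abs().total_cmp(&a[j][col].abs()))?;
            if a[pivot][col].abs() < 1e-12 {
                return None;
            }
            a.swap(col, pivot);
            for r in 0..p {
                if r != col {
                    let factor = a[r][col] / a[col][col];
                    for c in col..=p {
                        let v = a[col][c];
                        a[r][c] -= factor * v;
                    }
                }
            }
        }

        let coefficients = (0..p).map(|i| a[i][p] / a[i][i]).collect();
        Some(Self { coefficients })
    }

    /// Predicts the spread for one feature vector.
    pub fn predict(&self, features: &[f64]) -> f64 {
        self.coefficients[0]
            + self.coefficients[1..]
                .iter()
                .zip(features)
                .map(|(w, x)| w * x)
                .sum::<f64>()
    }
}
```

Order book features are often strongly correlated, which makes $\mathbf{X}^T\mathbf{X}$ close to singular; the pivot check above only catches the exact case, so a production version would likely add regularization or use a QR or SVD based solver.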
ML Models for Spread Analysis
Regression for Spread Prediction
The simplest approach is linear regression, but spreads exhibit several properties that make naive OLS suboptimal:
- Heteroskedasticity: Spread variance increases during volatile periods
- Non-negativity: Spreads cannot be negative, but OLS can predict negative values
- Fat tails: Spread distributions are right-skewed with occasional spikes
- Regime dependence: The relationship between features and spreads changes across market conditions
To address these issues, we can use:
- Log-transformed regression: Model $\log(S_t)$ instead of $S_t$ to enforce positivity and reduce skewness.
- Quantile regression: Predict specific quantiles of the spread distribution rather than the mean, which is useful for worst-case execution cost analysis.
- Ridge/LASSO regression: Add regularization when using many correlated features from the order book.
In our implementation, we use standard linear regression as the foundation, which can be extended to these variants.
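To illustrate the log-transformed variant, the sketch below fits $\log(S_t) = a + b\,x_t$ for a single feature using the closed-form simple regression formulas, then exponentiates the prediction so that it is always positive. The function names and the single-feature restriction are assumptions made to keep the example short.

```rust
/// Fits log(S) = a + b * x by simple OLS and returns (a, b).
/// Returns `None` if any spread is non-positive or the feature has no variance.
pub fn fit_log_spread(x: &[f64], spreads: &[f64]) -> Option<(f64, f64)> {
    let n = x.len();
    if n < 2 || n != spreads.len() || spreads.iter().any(|&s| s <= 0.0) {
        return None;
    }

    let logs: Vec<f64> = spreads.iter().map(|s| s.ln()).collect();
    let mean_x = x.iter().sum::<f64>() / n as f64;
    let mean_log = logs.iter().sum::<f64>() / n as f64;

    let sxx: f64 = x.iter().map(|v| (v - mean_x).powi(2)).sum();
    if sxx < 1e-12 {
        return None;
    }
    let sxy: f64 = x
        .iter()
        .zip(&logs)
        .map(|(v, l)| (v - mean_x) * (l - mean_log))
        .sum();

    let slope = sxy / sxx;
    Some((mean_log - slope * mean_x, slope))
}

/// Back-transforms the linear prediction; the result is positive for any input.
pub fn predict_log_spread(params: (f64, f64), x: f64) -> f64 {
    (params.0 + params.1 * x).exp()
}
```

One caveat with this approach: exponentiating the fitted log value estimates the conditional median of the spread rather than its mean when the errors are roughly log-normal, so a bias correction may be needed if mean cost is the target.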
Classification for Spread Regime
Markets alternate between tight-spread (liquid) and wide-spread (illiquid) regimes. Identifying the current regime is valuable for:
- Market making: Adjust quote aggressiveness based on the regime
- Execution: Choose between aggressive and passive strategies
- Risk management: Widen stop-loss bands during illiquid regimes
We implement a threshold-based regime classifier that labels spread observations as “tight” or “wide” based on a configurable threshold (e.g., the historical median spread). A logistic regression or decision tree can then predict the regime from features.
The regime classification problem uses the same features as spread prediction but maps them to a binary outcome:
$$P(\text{wide spread}_t | \mathbf{x}_t) = \sigma(\mathbf{w}^T \mathbf{x}_t + b)$$
where $\sigma$ is the sigmoid function.
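A minimal sketch of the logistic model follows. It evaluates the probability above and performs one stochastic gradient step on the log-loss. The function names, the learning rate parameter, and the plain SGD update are illustrative assumptions; the chapter does not specify a training procedure.

```rust
/// Logistic sigmoid.
pub fn sigmoid(z: f64) -> f64 {
    1.0 / (1.0 + (-z).exp())
}

/// P(wide spread | x) = sigmoid(w . x + b).
pub fn wide_spread_probability(weights: &[f64], bias: f64, features: &[f64]) -> f64 {
    let z = bias
        + weights
            .iter()
            .zip(features)
            .map(|(w, x)| w * x)
            .sum::<f64>();
    sigmoid(z)
}

/// One stochastic gradient descent step on the log-loss for a single
/// labelled observation. The gradient with respect to w is (p - y) * x.
pub fn sgd_step(weights: &mut [f64], bias: &mut f64, features: &[f64], is_wide: bool, lr: f64) {
    let p = wide_spread_probability(weights, *bias, features);
    let error = p - if is_wide { 1.0 } else { 0.0 };
    for (w, x) in weights.iter_mut().zip(features) {
        *w -= lr * error * x;
    }
    *bias -= lr * error;
}
```

In use, a caller would loop over labelled observations (features plus the threshold-based regime label) and call `sgd_step` repeatedly, ideally after standardizing the features so that a single learning rate suits all of them.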
Applications
Market Making Profitability
A market maker who quotes at the best bid and ask captures the spread on each round-trip trade. The expected profit per unit time is:
$$\Pi = \lambda \cdot S - \lambda_{informed} \cdot \alpha \cdot S - \text{inventory risk cost}$$
where $\lambda$ is the total trade arrival rate and $\lambda_{informed}$ is the informed trade arrival rate. ML spread prediction helps the market maker:
- Set optimal quotes: Quote wider when the predicted spread (and thus adverse selection) is high
- Manage inventory: Skew quotes toward reducing inventory when the inventory component is large
- Select instruments: Focus on assets where the spread decomposition favors the order processing component
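The profit expression above is simple arithmetic, and a small helper makes the trade-off easy to explore. This is a sketch only: the function name is hypothetical, and the inventory risk cost is passed in as a single number because the chapter does not give a formula for it.

```rust
/// Expected market-making profit per unit time:
///   lambda * S - lambda_informed * alpha * S - inventory_cost
/// where `alpha` is the adverse selection share of the spread.
pub fn expected_mm_profit(
    lambda: f64,
    lambda_informed: f64,
    spread: f64,
    alpha: f64,
    inventory_cost: f64,
) -> f64 {
    lambda * spread - lambda_informed * alpha * spread - inventory_cost
}
```

For example, with 100 trades per unit time of which 20 are informed, a spread of 0.02, an adverse selection share of 0.5, and an inventory cost of 0.3, the gross spread revenue of 2.0 is reduced by 0.2 for adverse selection and 0.3 for inventory, leaving 1.5.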
Execution Cost Estimation
For institutional traders, accurate spread prediction is essential for:
- Pre-trade analysis: Estimate the expected execution cost of a planned order
- Venue selection: Route orders to the venue with the lowest predicted spread
- Algorithm selection: Use aggressive algorithms when spreads are tight and passive algorithms when spreads are wide
The total transaction cost for an order of size $Q$ can be estimated as:
$$TC(Q) = \frac{S_{predicted}}{2} + \text{market impact}(Q) + \text{timing risk}$$
Our implementation provides a transaction cost estimator that combines spread prediction with simple market impact models.
Cryptocurrency-Specific Considerations
Cryptocurrency markets have unique spread characteristics:
- 24/7 trading: No opening/closing auctions; spread patterns follow global timezone cycles
- Cross-exchange arbitrage: Spreads on one exchange are influenced by prices on others
- High volatility: Spreads widen dramatically during large price moves
- Fragmented liquidity: Different stablecoin pairs (USDT, USDC, BUSD) for the same asset
Bybit provides both spot and perpetual futures orderbooks. Perpetual futures typically have tighter spreads due to higher liquidity and the funding rate mechanism.
Rust Implementation
Our Rust implementation provides the following components:
Core Spread Calculations
- absolute_spread(): Computes the raw bid-ask spread from best bid and ask prices
- relative_spread(): Normalizes the spread by the midpoint for cross-asset comparison
- effective_spread(): Computes the actual execution cost using trade prices and directions
Roll Spread Estimator
roll_spread_estimator(): Implements the Roll (1984) model using the autocovariance of price changes. Handles the positive-covariance edge case by returning zero.
Linear Regression
- LinearRegression struct with fit() and predict() methods
- Uses the normal equation $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ for parameter estimation
- Supports multi-feature prediction with bias term
Spread Regime Classifier
- SpreadRegimeClassifier with configurable threshold
- Labels observations as Tight or Wide regimes
- Computes regime transition probabilities
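The classifier described in this list could be sketched as follows. The type and variant names come from the chapter, while the method names, the median-based constructor, and the 2x2 matrix layout for transition probabilities are assumptions for illustration.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SpreadRegime {
    Tight, // casts to index 0
    Wide,  // casts to index 1
}

pub struct SpreadRegimeClassifier {
    pub threshold: f64,
}

impl SpreadRegimeClassifier {
    /// Uses the historical median spread as the threshold.
    pub fn from_median(spreads: &[f64]) -> Option<Self> {
        if spreads.is_empty() {
            return None;
        }
        let mut sorted = spreads.to_vec();
        sorted.sort_by(|a, b| a.total_cmp(b));
        let mid = sorted.len() / 2;
        let threshold = if sorted.len() % 2 == 0 {
            (sorted[mid - 1] + sorted[mid]) / 2.0
        } else {
            sorted[mid]
        };
        Some(Self { threshold })
    }

    /// Spreads strictly above the threshold are labelled Wide.
    pub fn classify(&self, spread: f64) -> SpreadRegime {
        if spread > self.threshold {
            SpreadRegime::Wide
        } else {
            SpreadRegime::Tight
        }
    }

    /// Empirical transition matrix m[from][to], with index 0 = Tight and
    /// 1 = Wide. Rows with no observations are left as zeros.
    pub fn transition_probabilities(&self, spreads: &[f64]) -> [[f64; 2]; 2] {
        let labels: Vec<usize> = spreads
            .iter()
            .map(|&s| self.classify(s) as usize)
            .collect();

        let mut matrix = [[0.0f64; 2]; 2];
        for pair in labels.windows(2) {
            matrix[pair[0]][pair[1]] += 1.0;
        }
        for row in matrix.iter_mut() {
            let total: f64 = row.iter().sum();
            if total > 0.0 {
                for cell in row.iter_mut() {
                    *cell /= total;
                }
            }
        }
        matrix
    }
}
```

The diagonal entries of the matrix give a rough measure of regime persistence: values close to 1 suggest that the current regime is a useful predictor of the next observation's regime.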
Transaction Cost Estimator
- Combines spread prediction with market impact estimation
- Uses square-root impact model: $\text{impact} = k \cdot \sigma \cdot \sqrt{Q/V}$
- Provides total cost estimates for order planning
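Combining the cost formula from the Execution Cost Estimation section with the square-root impact model above gives a compact estimator. The sketch below is one possible shape for it: the struct and field names are hypothetical, the impact coefficient $k$ must be calibrated by the user, and timing risk is supplied by the caller because the chapter does not define a model for it.

```rust
pub struct TransactionCostEstimator {
    /// Impact coefficient k in: impact = k * sigma * sqrt(Q / V).
    pub impact_coefficient: f64,
}

/// Per-unit cost components. All three are assumed to be expressed in the
/// same units (for example, as a fraction of price).
pub struct CostBreakdown {
    pub half_spread: f64,
    pub market_impact: f64,
    pub timing_risk: f64,
}

impl CostBreakdown {
    pub fn total(&self) -> f64 {
        self.half_spread + self.market_impact + self.timing_risk
    }
}

impl TransactionCostEstimator {
    /// Returns `None` for a non-positive volume or a negative order size.
    pub fn estimate(
        &self,
        predicted_spread: f64,
        volatility: f64,
        order_size: f64,
        volume: f64,
        timing_risk: f64,
    ) -> Option<CostBreakdown> {
        if volume <= 0.0 || order_size < 0.0 {
            return None;
        }
        let market_impact =
            self.impact_coefficient * volatility * (order_size / volume).sqrt();
        Some(CostBreakdown {
            half_spread: predicted_spread / 2.0,
            market_impact,
            timing_risk,
        })
    }
}
```

With $k = 1$, a predicted spread of 0.02, volatility of 0.01, and an order equal to 1% of volume, the half-spread contributes 0.01 and the impact term 0.001, so the spread dominates for small orders while impact grows with the square root of participation.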
Bybit Integration
- fetch_bybit_orderbook(): Fetches real-time orderbook data via the Bybit v5 API
- Parses bid and ask levels with prices and quantities
- Supports any trading pair (default: BTCUSDT)
All components include comprehensive unit tests verifying correctness.
Bybit Data Integration
The implementation connects to Bybit’s v5 REST API to fetch real-time orderbook data:
GET https://api.bybit.com/v5/market/orderbook?category=spot&symbol=BTCUSDT&limit=25
The response contains arrays of bid and ask levels, each with a price and quantity. We parse this into structured OrderBookLevel objects and compute spreads directly from the top-of-book quotes.
For time-series analysis, the example application polls the orderbook at regular intervals to build a spread history, then applies the regression model and regime classifier to the resulting dataset.
Key implementation details:
- Uses reqwest with blocking client for simplicity in examples
- Handles API errors gracefully with anyhow error types
- Supports configurable depth (number of orderbook levels)
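The parsing step can be sketched without any network or JSON dependencies. The example below assumes that each level has already been extracted from the response as a pair of decimal strings (price, then quantity), and it uses only the standard library in place of the reqwest and anyhow crates mentioned above. The `OrderBookLevel` name comes from the chapter; its fields and the two helper functions are assumptions.

```rust
use std::num::ParseFloatError;

#[derive(Debug, Clone, PartialEq)]
pub struct OrderBookLevel {
    pub price: f64,
    pub quantity: f64,
}

/// Converts string-encoded [price, quantity] pairs into typed levels.
pub fn parse_levels(raw: &[[&str; 2]]) -> Result<Vec<OrderBookLevel>, ParseFloatError> {
    let mut levels = Vec::with_capacity(raw.len());
    for level in raw {
        levels.push(OrderBookLevel {
            price: level[0].parse()?,
            quantity: level[1].parse()?,
        });
    }
    Ok(levels)
}

/// Returns (absolute spread, relative spread) from the best quotes.
/// The best bid and ask are found by scanning, so the function does not
/// depend on the ordering of the levels. Returns `None` if a side is empty.
pub fn top_of_book_spread(
    bids: &[OrderBookLevel],
    asks: &[OrderBookLevel],
) -> Option<(f64, f64)> {
    let best_bid = bids
        .iter()
        .map(|l| l.price)
        .fold(f64::NEG_INFINITY, f64::max);
    let best_ask = asks.iter().map(|l| l.price).fold(f64::INFINITY, f64::min);
    if !best_bid.is_finite() || !best_ask.is_finite() {
        return None;
    }
    let absolute = best_ask - best_bid;
    let mid = (best_ask + best_bid) / 2.0;
    Some((absolute, absolute / mid))
}
```

Calling `top_of_book_spread` on each polled snapshot and storing the result with a timestamp would produce the spread history that the regression model and regime classifier consume.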
Key Takeaways
- The bid-ask spread is not a single number but a composite of adverse selection, inventory risk, and order processing costs. Understanding the decomposition is essential for both market makers and execution traders.
- The Roll estimator provides a simple estimate of the effective spread from trade prices alone, without requiring quote data. It rests on the assumptions of an efficient market and a constant spread, but it remains a valuable baseline for any spread analysis.
- ML models can predict future spreads from observable features like volatility, volume, order book depth, and order imbalance. Even linear regression can capture a meaningful portion of spread variation.
- Spread regime classification identifies periods of tight versus wide spreads. This is directly actionable: market makers should be more cautious in wide-spread regimes, while execution algorithms should be more aggressive in tight-spread regimes.
- Transaction cost estimation combines spread prediction with market impact models to provide pre-trade cost forecasts. This is critical for portfolio managers evaluating whether a trade’s expected alpha exceeds its expected cost.
- Cryptocurrency markets present unique challenges for spread modeling, including 24/7 trading, high volatility, and cross-exchange effects. Bybit’s API provides the orderbook data needed to build real-time spread models.
- The Rust implementation provides the performance needed for real-time spread computation and prediction, with low-overhead spread calculations and efficient matrix operations for regression.
References
- Roll, R. (1984). A simple implicit measure of the effective bid-ask spread in an efficient market. The Journal of Finance, 39(4), 1127-1139.
- Glosten, L., & Milgrom, P. (1985). Bid, ask, and transaction prices in a specialist market with heterogeneously informed traders. Journal of Financial Economics, 14(1), 71-100.
- Kyle, A. S. (1985). Continuous auctions and insider trading. Econometrica, 53(6), 1315-1335.
- Huang, R., & Stoll, H. (1997). The components of the bid-ask spread: A general approach. Review of Financial Studies, 10(4), 995-1034.
- Hasbrouck, J. (2009). Trading costs and returns for US equities: Estimating effective costs from daily data. The Journal of Finance, 64(3), 1445-1477.