Chapter 91: Transfer Learning for Trading
Overview
Transfer Learning is a machine learning technique where a model trained on one task (source domain) is adapted to perform a different but related task (target domain). In trading, this means leveraging knowledge learned from one market, asset class, or time period to improve predictions on another. This approach is particularly valuable when labeled financial data is scarce or expensive to obtain, or when market conditions shift.
Table of Contents
- Introduction
- Theoretical Foundation
- Types of Transfer Learning
- Domain Adaptation Methods
- Application to Financial Markets
- Cross-Market Transfer
- Implementation Strategy
- Bybit Integration
- Risk Management
- Performance Metrics
- Comparison with Traditional Approaches
- References
Introduction
Traditional machine learning models for trading face several fundamental challenges:
- Data scarcity: New assets, markets, or instruments lack sufficient historical data
- Regime changes: Models trained on one market regime fail when conditions shift
- High labeling cost: Creating accurate labels for trading signals requires domain expertise
- Non-stationarity: Financial time series distributions evolve over time
Why Transfer Learning for Trading?
```
The Transfer Learning Trading Problem

Traditional Approach:              Transfer Learning:
─────────────────────              ──────────────────
Train from scratch on              Pre-train on a data-rich source,
each new market/asset              then adapt to the target domain

Problem: insufficient data         Solution: leverage knowledge
for new or niche markets           from related domains

Traditional:                       Transfer Learning:
  Source: [S&P 500]                  Source: [S&P 500]
  Target: [New Crypto] ✗                  │ pre-train, then fine-tune
  (not enough data!)                      ▼
                                     Target: [New Crypto] ✓
                                     (adapted with little data)
```

Key Advantages
| Aspect | Traditional ML | Transfer Learning |
|---|---|---|
| Data requirement | Large per-task dataset | Small target dataset |
| Training time | Full training each time | Fast fine-tuning |
| Cold-start problem | Cannot handle | Handles well |
| Market regime adaptation | Retrain from scratch | Fine-tune existing model |
| Cross-market knowledge | No sharing | Knowledge reuse |
| New asset coverage | Needs extensive history | Works with limited history |
Theoretical Foundation
The Transfer Learning Framework
Transfer learning operates on the premise that knowledge gained from a source domain $D_S$ with task $T_S$ can improve learning in a target domain $D_T$ with task $T_T$, where $D_S \neq D_T$ or $T_S \neq T_T$.
A domain $D = \{X, P(X)\}$ consists of a feature space $X$ and a marginal probability distribution $P(X)$. A task $T = \{Y, f(\cdot)\}$ consists of a label space $Y$ and a predictive function $f(\cdot)$.
Mathematical Formulation
Objective: Given source domain data $D_S$ and learning task $T_S$, target domain data $D_T$ and learning task $T_T$, transfer learning aims to improve the target predictive function $f_T(\cdot)$ using knowledge from $D_S$ and $T_S$.
Domain Divergence: The discrepancy between source and target domains can be measured using:
$$d_A(D_S, D_T) = 2 \left(1 - 2\epsilon(h)\right)$$
where $\epsilon(h)$ is the generalization error of a classifier $h$ trained to distinguish source samples from target samples. This quantity is known as the proxy A-distance.
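As an illustration, the proxy A-distance can be estimated with any domain classifier; the sketch below uses a small logistic regression written in plain numpy (the function name, learning rate, and epoch count are illustrative choices, not part of any library API):

```python
import numpy as np

def proxy_a_distance(Xs, Xt, epochs=200, lr=0.1, seed=0):
    """Proxy A-distance: 2 * (1 - 2 * err), where err is the held-out
    error of a linear domain classifier separating source from target."""
    rng = np.random.default_rng(seed)
    X = np.vstack([Xs, Xt])
    y = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt))])
    idx = rng.permutation(len(X))
    X, y = X[idx], y[idx]
    split = len(X) // 2
    Xtr, ytr, Xte, yte = X[:split], y[:split], X[split:], y[split:]
    # Standardize, then fit logistic regression by full-batch gradient descent.
    mu, sd = Xtr.mean(0), Xtr.std(0) + 1e-8
    Xtr, Xte = (Xtr - mu) / sd, (Xte - mu) / sd
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(Xtr @ w + b)))
        g = p - ytr
        w -= lr * Xtr.T @ g / len(ytr)
        b -= lr * g.mean()
    err = np.mean(((1.0 / (1.0 + np.exp(-(Xte @ w + b)))) > 0.5) != yte)
    err = min(err, 1.0 - err)  # a worse-than-chance classifier can be flipped
    return 2.0 * (1.0 - 2.0 * err)
```

Indistinguishable domains give an error near 0.5 and a proxy A-distance near 0; well-separated domains drive the error toward 0 and the distance toward 2.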
Generalization Bound: For a hypothesis $h$ trained on source domain:
$$\epsilon_T(h) \leq \epsilon_S(h) + \frac{1}{2}d_A(D_S, D_T) + \lambda$$
where $\epsilon_T$ and $\epsilon_S$ are the target and source errors, and $\lambda$ is the error of the ideal joint hypothesis.
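As a concrete (hypothetical) illustration: with a source error $\epsilon_S(h) = 0.10$, a measured A-distance $d_A(D_S, D_T) = 0.4$, and ideal joint error $\lambda = 0.05$, the bound gives

$$\epsilon_T(h) \leq 0.10 + \frac{1}{2}(0.4) + 0.05 = 0.35$$

so reducing domain divergence directly tightens the guarantee on target error.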
Key Components
```
Transfer Learning Architecture

SOURCE DOMAIN (data-rich)               TARGET DOMAIN (data-scarce)
┌─────────────────────┐                 ┌─────────────────────┐
│ Large Dataset       │                 │ Small Dataset       │
│ (e.g., S&P 500,     │                 │ (e.g., new crypto,  │
│  5 years daily)     │                 │  3 months daily)    │
└──────────┬──────────┘                 └──────────┬──────────┘
           ▼                                       ▼
┌─────────────────────┐    TRANSFER    ┌─────────────────────┐
│ Feature Extractor   │──(weights)───→ │ Feature Extractor   │
│ (shared layers)     │                │ (frozen/fine-tuned) │
└──────────┬──────────┘                └──────────┬──────────┘
           ▼                                      ▼
┌─────────────────────┐                ┌─────────────────────┐
│ Source Classifier   │                │ Target Classifier   │
│ (source task head)  │                │ (new task head)     │
└─────────────────────┘                └─────────────────────┘

Training phase:                        Adaptation phase:
- Full training on source              - Freeze lower layers
- Learn general features               - Fine-tune upper layers
- Extract patterns                     - Train new classifier
```

Types of Transfer Learning
1. Inductive Transfer Learning
The source and target tasks are different, but related. The source domain data is used to improve the target task.
Trading Application: Train a model to predict volatility in equity markets, then transfer to predict volatility in crypto markets.
```
Source Task: Predict S&P 500 volatility (classification)
Target Task: Predict BTC/USDT volatility (classification)
Shared Knowledge: volatility patterns, mean-reversion dynamics,
                  volume-price relationships
```

2. Transductive Transfer Learning
The source and target tasks are the same, but the domains differ. The marginal distributions $P(X_S) \neq P(X_T)$.
Trading Application: A model trained on US equity data is adapted for emerging market equities where the feature distributions differ.
```
Source Domain: US Equities (high liquidity, tight spreads)
Target Domain: Emerging Market Equities (low liquidity, wide spreads)
Same Task: Price direction prediction
Different Distribution: market microstructure features differ
```

3. Unsupervised Transfer Learning
No labeled data is available in either domain. The focus is on learning representations.
Trading Application: Learn general market representations from unlabeled price data across many assets, then use these representations for downstream tasks.
```
Pre-training: Autoencoder on 10,000+ time series (no labels)
Transfer: Use learned features for anomaly detection on new assets
```

Domain Adaptation Methods
Maximum Mean Discrepancy (MMD)
MMD measures the distance between source and target distributions in a reproducing kernel Hilbert space (RKHS):
$$\mathrm{MMD}(D_S, D_T) = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s}\phi(x_i^s) - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi(x_j^t) \right\|_{\mathcal{H}}$$
By minimizing MMD during training, the model learns domain-invariant features.
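A biased RBF-kernel estimate of the squared MMD can be computed in a few lines of numpy (the function name and the bandwidth `gamma` are illustrative choices):

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=1.0):
    """Squared MMD between samples X and Y under an RBF kernel
    k(a, b) = exp(-gamma * ||a - b||^2); biased V-statistic estimate."""
    def gram(A, B):
        # Pairwise squared distances via broadcasting, then the kernel.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()
```

Samples drawn from the same distribution yield a value near zero; a shift between domains inflates it, which is what an MMD penalty pushes the feature extractor to undo.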
Correlation Alignment (CORAL)
CORAL aligns the second-order statistics (covariance) of source and target features:
$$\mathcal{L}_{CORAL} = \frac{1}{4d^2}\|C_S - C_T\|_F^2$$
where $C_S$ and $C_T$ are the feature covariance matrices of source and target domains.
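Under those definitions the loss reduces to a Frobenius norm over sample covariances; a numpy sketch (the function name is illustrative):

```python
import numpy as np

def coral_loss(Xs, Xt):
    """CORAL loss: squared Frobenius distance between source and target
    feature covariance matrices, scaled by 1 / (4 d^2)."""
    d = Xs.shape[1]
    cs = np.cov(Xs, rowvar=False)  # rows are samples, columns are features
    ct = np.cov(Xt, rowvar=False)
    return ((cs - ct) ** 2).sum() / (4.0 * d * d)
```

The loss is zero when the two feature sets share identical second-order statistics and grows as their covariances diverge.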
Adversarial Domain Adaptation
Uses a domain discriminator trained adversarially to create domain-invariant features:
```
Adversarial Domain Adaptation

Source Data ──┐
              ├──→ Feature Extractor ──┬──→ Task Classifier
Target Data ──┘        (G)             │    (predicts labels)
                                       │
                                       └──→ Domain Discriminator
                                            (source vs target?)

Training:
- Task Classifier: minimize task loss
- Domain Discriminator: maximize domain classification accuracy
- Feature Extractor: minimize task loss + maximize domain confusion

Result: features that are useful for the task but
        indistinguishable between domains
```

Fine-Tuning Strategies
Strategy 1: Feature Extraction (freeze all)
```
┌──────────────────────────────────────────┐
│ Layer 1:  [FROZEN]      ← pre-trained    │
│ Layer 2:  [FROZEN]      ← pre-trained    │
│ Layer 3:  [FROZEN]      ← pre-trained    │
│ New Head: [TRAINABLE]   ← random init    │
└──────────────────────────────────────────┘
```

Strategy 2: Partial Fine-Tuning
```
┌──────────────────────────────────────────┐
│ Layer 1:  [FROZEN]      ← pre-trained    │
│ Layer 2:  [FINE-TUNED]  ← small LR       │
│ Layer 3:  [FINE-TUNED]  ← medium LR      │
│ New Head: [TRAINABLE]   ← large LR       │
└──────────────────────────────────────────┘
```

Strategy 3: Full Fine-Tuning
```
┌──────────────────────────────────────────┐
│ Layer 1:  [FINE-TUNED]  ← very small LR  │
│ Layer 2:  [FINE-TUNED]  ← small LR       │
│ Layer 3:  [FINE-TUNED]  ← medium LR      │
│ New Head: [TRAINABLE]   ← large LR       │
└──────────────────────────────────────────┘
```

Application to Financial Markets
Cross-Market Transfer
Transfer knowledge between different markets (e.g., stocks to crypto):
- Feature Alignment: Map features from both markets to a common space
- Pattern Transfer: Recognize similar patterns (momentum, mean-reversion) across markets
- Regime Detection: Transfer regime detection models across markets
Cross-Asset Transfer
Transfer within the same market across asset classes:
```
Source Assets (data-rich):          Target Assets (data-scarce):
├── BTC/USDT (years of data)   →    ├── New DeFi token (weeks of data)
├── ETH/USDT (years of data)   →    ├── Recently listed token
└── Major forex pairs          →    └── Exotic currency pair
```

Temporal Transfer
Adapt models across different time periods or market regimes:
```
Pre-COVID Model   ──→ fine-tune ──→  Post-COVID Model
(trained 2015-2019)                  (adapted to 2020+)

Bull Market Model ──→ fine-tune ──→  Bear Market Model
(trained on uptrend)                 (adapted to downtrend)
```

Feature Extraction Pipeline
```python
# Python example: transfer learning pipeline
import torch
import torch.nn as nn


class TransferFeatureExtractor(nn.Module):
    """Feature extractor pre-trained on the source domain."""

    def __init__(self, input_dim, hidden_dim, feature_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, feature_dim),
        )

    def forward(self, x):
        return self.layers(x)


class DomainAdaptiveTrader(nn.Module):
    """Trading model with domain adaptation via MMD."""

    def __init__(self, feature_extractor, feature_dim, num_classes):
        super().__init__()
        self.feature_extractor = feature_extractor
        self.classifier = nn.Sequential(
            nn.Linear(feature_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        features = self.feature_extractor(x)
        return self.classifier(features), features

    def compute_mmd(self, source_features, target_features):
        """Linear-kernel Maximum Mean Discrepancy between domains."""
        source_mean = source_features.mean(dim=0)
        target_mean = target_features.mean(dim=0)
        return ((source_mean - target_mean) ** 2).sum()
```

Implementation Strategy
Python Implementation
The Python implementation uses PyTorch for neural networks and provides:
- TransferFeatureExtractor: pre-trainable feature extraction network
- DomainAdaptiveTrader: trading model with domain adaptation
- TransferLearningPipeline: end-to-end transfer learning pipeline
- BacktestEngine: strategy backtesting with transfer learning
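To make the pipeline concrete, the sketch below runs MMD-regularized adaptation steps with discriminative learning rates (a slow extractor, a fast classifier). `TinyTrader`, `adaptation_step`, and the `lam` weight are illustrative stand-ins, not the repository's actual API:

```python
import torch
import torch.nn as nn


class TinyTrader(nn.Module):
    """Minimal stand-in for a model that returns (logits, features)."""

    def __init__(self, input_dim=8, feature_dim=16, num_classes=3):
        super().__init__()
        self.extractor = nn.Sequential(nn.Linear(input_dim, feature_dim), nn.ReLU())
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, x):
        f = self.extractor(x)
        return self.classifier(f), f


def adaptation_step(model, opt, xs, ys, xt, lam=0.5):
    """One step: task loss on the labeled source batch plus a
    linear-kernel MMD penalty aligning target features with source."""
    logits_s, feat_s = model(xs)
    _, feat_t = model(xt)
    task_loss = nn.functional.cross_entropy(logits_s, ys)
    mmd = ((feat_s.mean(0) - feat_t.mean(0)) ** 2).sum()
    loss = task_loss + lam * mmd
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


torch.manual_seed(0)
model = TinyTrader()
# Discriminative learning rates: the pre-trained extractor moves slowly,
# the freshly initialized classifier moves fast (partial fine-tuning).
opt = torch.optim.Adam([
    {"params": model.extractor.parameters(), "lr": 1e-4},
    {"params": model.classifier.parameters(), "lr": 1e-2},
])
```

With per-group learning rates the extractor drifts only slightly while the new head learns the target task, mirroring the partial fine-tuning strategy described earlier.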
Rust Implementation
The Rust implementation provides high-performance transfer learning:
```rust
use transfer_learning_trading::{
    TransferNetwork, DomainAdapter, MarketDomain,
    FeatureExtractor, TradingStrategy,
};

// Create a transfer learning network
let network = TransferNetwork::new(
    20,   // input dimension
    128,  // hidden dimension
    64,   // feature dimension
    true, // use domain adaptation
);

// Pre-train on source domain (e.g., major crypto pairs)
let source_domain = MarketDomain::crypto("BTC/USDT", "ETH/USDT");
network.pretrain(&source_data, &source_labels, &pretrain_config);

// Adapt to target domain (e.g., new token)
let target_domain = MarketDomain::crypto("NEW/USDT");
let adapter = DomainAdapter::mmd(network.feature_extractor());
adapter.adapt(&target_data, &adapt_config);

// Generate trading signals
let strategy = TradingStrategy::new(network, adapter);
let signals = strategy.predict(&new_data);
```

Quick Start
Python:
```bash
cd 91_transfer_learning_trading/python
pip install torch numpy pandas scikit-learn
python train.py --source BTC/USDT --target ETH/USDT --method fine_tune
python backtest.py --model saved_model.pt --data target_data.csv
```

Rust:
```bash
cd 91_transfer_learning_trading
cargo run --example basic_transfer
cargo run --example domain_adaptation
cargo run --example bybit_live
```

Bybit Integration
Real-Time Data Pipeline
```rust
use transfer_learning_trading::data::BybitClient;

// Initialize client
let client = BybitClient::new(BybitConfig::default());

// Fetch source domain data (established pairs)
let btc_data = client.fetch_klines("BTCUSDT", "1h", 1000).await?;
let eth_data = client.fetch_klines("ETHUSDT", "1h", 1000).await?;

// Fetch target domain data (newer pairs)
let target_data = client.fetch_klines("NEWUSDT", "1h", 100).await?;

// Pre-train on source, adapt to target
let model = TransferNetwork::new(20, 128, 64, true);
model.pretrain_on_klines(&[btc_data, eth_data], &config);
model.adapt_to_klines(&target_data, &adapt_config);
```

Supported Endpoints
| Endpoint | Description | Use Case |
|---|---|---|
| /v5/market/kline | Historical klines | Source/target data |
| /v5/market/tickers | Current tickers | Live signals |
| /v5/market/orderbook | Order book depth | Microstructure features |
| /v5/market/recent-trade | Recent trades | Volume analysis |
Feature Engineering from Bybit Data
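A few of the price and volume transformations involved can be computed from raw close and volume arrays; a minimal numpy sketch (the helper name and window length are illustrative assumptions):

```python
import numpy as np

def basic_features(close, volume, window=14):
    """Log returns, rolling volatility of returns, and the ratio of
    volume to its rolling mean, computed with numpy only."""
    close = np.asarray(close, dtype=float)
    volume = np.asarray(volume, dtype=float)
    logret = np.diff(np.log(close))  # log returns, length n - 1
    vol = np.array([logret[max(0, i - window + 1):i + 1].std()
                    for i in range(len(logret))])  # rolling volatility
    vmean = np.array([volume[max(0, i - window + 1):i + 1].mean()
                      for i in range(len(volume))])
    vol_ratio = volume / vmean  # volume relative to its rolling mean
    return logret, vol, vol_ratio
```

In practice these arrays would be stacked column-wise into the model's input matrix alongside the other indicator families listed in this section.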
```
OHLCV Data → Feature Extraction:
├── Price features: returns, log returns, volatility
├── Volume features: VWAP, volume ratio, OBV
├── Technical indicators: RSI, MACD, Bollinger Bands
├── Microstructure: bid-ask spread, order imbalance
└── Cross-asset: correlation, beta, relative strength
```

Risk Management
Transfer-Specific Risks
1. Negative Transfer: source domain knowledge hurts target performance.
   - Mitigation: monitor target validation loss; stop adaptation if it diverges.
2. Domain Shift: the target domain drifts away from the source over time.
   - Mitigation: continuous adaptation with a sliding window.
3. Overfitting to Source: the model is too specialized to the source domain.
   - Mitigation: regularization, early stopping, domain adversarial training.
Risk Controls
```
Transfer Learning Risk Framework

Pre-Transfer Checks:
├── Domain similarity score > threshold (0.7)
├── Source model validation accuracy > minimum (60%)
└── Sufficient target data for validation (>100 samples)

During Adaptation:
├── Monitor MMD between source and target features
├── Track target validation loss (stop if increasing)
├── Limit fine-tuning epochs (prevent overfitting)
└── Gradient clipping during adaptation

Post-Transfer Trading:
├── Maximum position size: 2% of portfolio
├── Stop-loss: 1.5% per trade
├── Maximum daily drawdown: 3%
├── Confidence threshold for signals: 0.65
└── Reduce position size for low-similarity domains
```

Performance Metrics
Model Evaluation
| Metric | Description | Target |
|---|---|---|
| Target Accuracy | Classification accuracy on target domain | > 55% |
| Transfer Gain | Improvement over training from scratch | > 5% |
| A-Distance | Domain divergence measure | < 1.5 |
| MMD | Feature distribution alignment | < 0.1 |
| Adaptation Speed | Epochs to converge on target | < 50 |
Trading Metrics
| Metric | Description | Target |
|---|---|---|
| Sharpe Ratio | Risk-adjusted return | > 1.5 |
| Sortino Ratio | Downside risk-adjusted return | > 2.0 |
| Maximum Drawdown | Largest peak-to-trough decline | < 15% |
| Win Rate | Percentage of profitable trades | > 52% |
| Profit Factor | Gross profit / Gross loss | > 1.3 |
| Calmar Ratio | Annual return / Max drawdown | > 1.0 |
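Several of these metrics are direct formulas over a return or equity series; a numpy sketch (function names and the 252-period annualization are illustrative conventions, not a fixed standard):

```python
import numpy as np

def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio of a per-period return series
    (risk-free rate assumed zero, population std)."""
    r = np.asarray(returns, dtype=float)
    return np.sqrt(periods_per_year) * r.mean() / r.std()

def max_drawdown(equity):
    """Largest peak-to-trough decline of an equity curve, as a fraction."""
    eq = np.asarray(equity, dtype=float)
    peak = np.maximum.accumulate(eq)  # running high-water mark
    return ((peak - eq) / peak).max()

def profit_factor(returns):
    """Gross profit divided by gross loss."""
    r = np.asarray(returns, dtype=float)
    return r[r > 0].sum() / -r[r < 0].sum()
```

These serve as the acceptance checks in the table above: a transferred strategy that fails the thresholds on target-domain backtests should not be promoted to live trading.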
Comparison with Traditional Approaches
| Aspect | Train from Scratch | Transfer Learning | Domain Adaptation |
|---|---|---|---|
| Data requirement | High (>1000 samples) | Low (>100 samples) | Medium (>200 samples) |
| Training time | Hours | Minutes (fine-tune) | Minutes-Hours |
| New market entry | Slow | Fast | Medium |
| Regime adaptation | Poor | Good | Excellent |
| Implementation complexity | Low | Medium | High |
| Risk of negative transfer | N/A | Medium | Low |
| Cross-market generalization | None | Good | Excellent |
Project Structure
```
91_transfer_learning_trading/
├── README.md                        # This file
├── README.ru.md                     # Russian translation
├── readme.simple.md                 # Simplified English explanation
├── readme.simple.ru.md              # Simplified Russian explanation
├── README.specify.md                # Technical specification
├── Cargo.toml                       # Rust project manifest
├── src/
│   ├── lib.rs                       # Library root
│   ├── network/
│   │   ├── mod.rs                   # Network module
│   │   ├── feature_extractor.rs     # Feature extraction layers
│   │   ├── domain_adapter.rs        # Domain adaptation methods
│   │   └── transfer.rs              # Transfer learning network
│   ├── data/
│   │   ├── mod.rs                   # Data module
│   │   ├── features.rs              # Feature engineering
│   │   ├── bybit.rs                 # Bybit API integration
│   │   └── stock.rs                 # Stock data loader
│   ├── strategy/
│   │   ├── mod.rs                   # Strategy module
│   │   └── transfer_strategy.rs     # Transfer-based trading strategy
│   ├── training/
│   │   ├── mod.rs                   # Training module
│   │   └── trainer.rs               # Transfer learning trainer
│   └── utils/
│       ├── mod.rs                   # Utils module
│       └── metrics.rs               # Performance metrics
├── examples/
│   ├── basic_transfer.rs            # Basic transfer learning example
│   ├── domain_adaptation.rs         # Domain adaptation example
│   └── bybit_live.rs                # Live Bybit trading example
├── tests/
│   └── integration_tests.rs         # Integration tests
└── python/
    ├── model.py                     # PyTorch model definitions
    ├── train.py                     # Training script
    └── backtest.py                  # Backtesting script
```

References
- Pan, S.J. & Yang, Q. (2010). A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering.
- Transfer Learning for Financial Time Series. arXiv:2102.09873 (2021).
- Domain Adaptation for Financial Trading: learning to adapt across financial domains using adversarial methods.
- Yosinski, J. et al. (2014). How transferable are features in deep neural networks? NeurIPS.
- Tzeng, E. et al. (2014). Deep Domain Confusion: maximizing domain confusion for domain adaptation.
- Sun, B. & Saenko, K. (2016). CORAL: Correlation Alignment for Domain Adaptation. ECCV.