Chapter 355: ResNet for Time Series — Deep Residual Networks for Financial Forecasting
Chapter 355: ResNet for Time Series — Deep Residual Networks for Financial Forecasting
Overview
Residual Networks (ResNet) revolutionized image classification by enabling training of very deep networks through skip connections. This chapter adapts ResNet architecture for time series analysis in trading, allowing models to capture both short-term patterns and long-term dependencies in financial data.
Key Insight: Traditional deep networks suffer from vanishing gradients and degradation problems. ResNet’s skip connections allow gradients to flow directly through the network, enabling effective training of 50+ layer models for complex financial pattern recognition.
Trading Strategy
Core Strategy: Use deep ResNet architecture to learn hierarchical temporal features from OHLCV data and predict price direction with high accuracy.
Edge Factors:
- Deep feature hierarchies capture patterns at multiple time scales
- Residual connections preserve fine-grained information through deep layers
- Skip connections act as ensemble of shallow and deep features
- Robust to market regime changes through learned feature abstractions
Target Assets: Cryptocurrency pairs (BTC/USDT, ETH/USDT) from Bybit exchange
Technical Foundation
Why ResNet for Time Series?
Traditional CNNs for time series face challenges:
- Vanishing gradients in deep networks
- Loss of fine-grained temporal information
- Difficulty learning identity mappings
ResNet solves these through:
Regular Block: x → [Conv] → [BN] → [ReLU] → [Conv] → [BN] → y
Residual Block: x → [Conv] → [BN] → [ReLU] → [Conv] → [BN] → (+) → y | ↑ └──────────── Skip Connection ─────────────────┘
Output: y = F(x) + x (where F is the learned residual function)Mathematical Foundation
For a residual block:
- Input: x
- Target mapping: H(x)
- Residual function: F(x) = H(x) - x
- Output: y = F(x) + x
Why this works:
- If identity mapping is optimal, F(x) → 0 is easier to learn than H(x) → x
- Gradients flow directly through skip connections: ∂L/∂x includes direct path
- Network learns residual refinements rather than complete transformations
Time Series Adaptations
1D ResNet Block for Time Series:┌─────────────────────────────────────────────────────────────┐│ ││ Input: [batch, channels, time_steps] ││ ││ ┌─────────────────────────────────────────────────────┐ ││ │ Conv1D(kernel=3, padding=1) → BatchNorm1D → ReLU │ ││ │ Conv1D(kernel=3, padding=1) → BatchNorm1D │ ││ └─────────────────────────────────────────────────────┘ ││ ↓ ││ (add) ←── Skip Connection ←── Input ││ ↓ ││ ReLU ││ ↓ ││ Output: [batch, channels, time_steps] ││ │└─────────────────────────────────────────────────────────────┘Architecture Design
ResNet-18 for Time Series
┌──────────────────────────────────────────────────────────────────────┐│ ResNet-18 Time Series Architecture │├──────────────────────────────────────────────────────────────────────┤│ ││ Input Layer ││ ───────────── ││ [batch, 5, 256] ← 5 features (OHLCV), 256 time steps ││ ↓ ││ Conv1D(5→64, k=7, s=2, p=3) → BN → ReLU → MaxPool(k=3, s=2, p=1) ││ ↓ ││ [batch, 64, 64] ││ ││ Layer 1 (64 channels) ││ ───────────────────── ││ ResBlock × 2: [Conv1D(64→64) → BN → ReLU → Conv1D(64→64) → BN] + x ││ ↓ ││ [batch, 64, 64] ││ ││ Layer 2 (128 channels) ││ ────────────────────── ││ ResBlock × 2 with downsample: stride=2, projection for skip ││ ↓ ││ [batch, 128, 32] ││ ││ Layer 3 (256 channels) ││ ────────────────────── ││ ResBlock × 2 with downsample: stride=2, projection for skip ││ ↓ ││ [batch, 256, 16] ││ ││ Layer 4 (512 channels) ││ ────────────────────── ││ ResBlock × 2 with downsample: stride=2, projection for skip ││ ↓ ││ [batch, 512, 8] ││ ││ Output Head ││ ─────────── ││ AdaptiveAvgPool1D(1) → Flatten → Linear(512→3) ││ ↓ ││ [batch, 3] ← 3 classes: Down, Neutral, Up ││ │└──────────────────────────────────────────────────────────────────────┘Bottleneck Architecture (ResNet-50+)
Bottleneck Block:┌────────────────────────────────────────────────────┐│ ││ Input: [batch, 256, T] ││ ↓ ││ Conv1D(256→64, k=1) → BN → ReLU (reduce) ││ ↓ ││ Conv1D(64→64, k=3, p=1) → BN → ReLU (process) ││ ↓ ││ Conv1D(64→256, k=1) → BN (expand) ││ ↓ ││ (add) ←── Skip Connection ││ ↓ ││ ReLU ││ │└────────────────────────────────────────────────────┘
Benefit: 3x fewer parameters for same depthFeature Engineering for Trading
Input Features
┌────────────────────────────────────────────────────────────────┐│ Feature Engineering │├────────────────────────────────────────────────────────────────┤│ ││ Raw OHLCV Data: ││ ─────────────── ││ • Open, High, Low, Close, Volume (5 channels) ││ ││ Price-Derived Features: ││ ─────────────────────── ││ • Returns: (close[t] - close[t-1]) / close[t-1] ││ • Log Returns: log(close[t] / close[t-1]) ││ • High-Low Range: (high - low) / close ││ • Body Ratio: |close - open| / (high - low) ││ ││ Technical Indicators (as channels): ││ ──────────────────────────────────── ││ • RSI (14): Relative Strength Index ││ • MACD: Moving Average Convergence Divergence ││ • Bollinger Band %B ││ • ATR: Average True Range (normalized) ││ ││ Volume Features: ││ ──────────────── ││ • Volume Change: volume[t] / SMA(volume, 20) ││ • OBV Direction: sign of On-Balance Volume change ││ • VWAP Distance: (price - VWAP) / VWAP ││ ││ Final Input Shape: [batch, 15, 256] ││ (15 feature channels, 256 time steps) ││ │└────────────────────────────────────────────────────────────────┘Label Generation
def generate_labels(prices, forward_window=12, threshold=0.002): """ Generate 3-class labels based on future returns
Classes: - 0: Down (return < -threshold) - 1: Neutral (|return| <= threshold) - 2: Up (return > threshold) """ future_returns = prices.shift(-forward_window) / prices - 1
labels = np.where( future_returns > threshold, 2, np.where(future_returns < -threshold, 0, 1) ) return labelsTraining Procedure
Data Pipeline
┌────────────────────────────────────────────────────────────────┐│ Training Data Pipeline │├────────────────────────────────────────────────────────────────┤│ ││ 1. Fetch Data from Bybit ││ └─ BTCUSDT, ETHUSDT: 1-minute candles, 1 year ││ ││ 2. Feature Engineering ││ └─ Compute 15 feature channels per time step ││ ││ 3. Sequence Creation ││ └─ Sliding window: 256 steps input → 1 label ││ ││ 4. Train/Val/Test Split ││ └─ 70% / 15% / 15% (chronological, no leakage) ││ ││ 5. Normalization ││ └─ Z-score per channel, fitted on training only ││ ││ 6. Data Augmentation ││ ├─ Gaussian noise (σ=0.01) ││ ├─ Time warping (slight stretching/compression) ││ └─ Magnitude scaling (0.9-1.1×) ││ │└────────────────────────────────────────────────────────────────┘Training Configuration
config = { # Model 'model': 'ResNet18', 'input_channels': 15, 'num_classes': 3, 'dropout': 0.3,
# Training 'batch_size': 64, 'epochs': 100, 'learning_rate': 1e-3, 'weight_decay': 1e-4,
# Scheduler 'scheduler': 'CosineAnnealingLR', 'T_max': 100, 'eta_min': 1e-6,
# Early stopping 'patience': 15, 'min_delta': 1e-4,
# Class weights (for imbalanced data) 'class_weights': [1.2, 0.8, 1.2], # Up/Down weighted higher}Loss Functions
# Standard Cross-Entropy with class weightscriterion = nn.CrossEntropyLoss(weight=torch.tensor([1.2, 0.8, 1.2]))
# Alternative: Focal Loss for hard examplesclass FocalLoss(nn.Module): def __init__(self, alpha=0.25, gamma=2.0): super().__init__() self.alpha = alpha self.gamma = gamma
def forward(self, pred, target): ce_loss = F.cross_entropy(pred, target, reduction='none') pt = torch.exp(-ce_loss) focal_loss = self.alpha * (1 - pt) ** self.gamma * ce_loss return focal_loss.mean()Trading Strategy Implementation
Signal Generation
class ResNetTradingStrategy: def __init__(self, model, threshold=0.6, position_sizing='kelly'): self.model = model self.threshold = threshold self.position_sizing = position_sizing
def generate_signal(self, features): """ Generate trading signal from model prediction """ with torch.no_grad(): probs = F.softmax(self.model(features), dim=1)
prob_down, prob_neutral, prob_up = probs[0].numpy()
# High confidence signals only if prob_up > self.threshold: return 'LONG', prob_up elif prob_down > self.threshold: return 'SHORT', prob_down else: return 'NEUTRAL', prob_neutral
def calculate_position_size(self, signal, confidence, portfolio_value): """ Kelly Criterion position sizing """ if self.position_sizing == 'kelly': # Simplified Kelly: f = p - (1-p)/b # where p = win probability, b = win/loss ratio edge = confidence - 0.5 kelly_fraction = max(0, min(0.25, edge)) # Cap at 25% return portfolio_value * kelly_fraction else: return portfolio_value * 0.1 # Fixed 10%Risk Management
┌────────────────────────────────────────────────────────────────┐│ Risk Management Rules │├────────────────────────────────────────────────────────────────┤│ ││ Position Sizing: ││ ──────────────── ││ • Maximum position: 25% of portfolio ││ • Scale with confidence: size = base × (confidence - 0.5) × 2 ││ ││ Stop Loss: ││ ────────── ││ • Fixed: -2% from entry ││ • ATR-based: -2 × ATR(14) ││ • Trailing: -1.5 × ATR(14) from peak ││ ││ Take Profit: ││ ──────────── ││ • Fixed: +3% from entry (1.5:1 risk-reward) ││ • Dynamic: When signal reverses or confidence drops ││ ││ Time-Based Exits: ││ ───────────────── ││ • Maximum hold: 12 hours ││ • Force exit if new prediction window available ││ ││ Drawdown Protection: ││ ──────────────────── ││ • Reduce size by 50% if daily drawdown > 3% ││ • Stop trading if weekly drawdown > 10% ││ │└────────────────────────────────────────────────────────────────┘Performance Metrics
Classification Metrics
| Metric | Target | Description |
|---|---|---|
| Accuracy | > 45% | Overall prediction accuracy (3-class) |
| Precision (Up) | > 55% | True up predictions / All up predictions |
| Precision (Down) | > 55% | True down predictions / All down predictions |
| F1-Score | > 0.50 | Harmonic mean of precision and recall |
| AUC-ROC | > 0.60 | Area under ROC curve |
Trading Metrics
| Metric | Target | Description |
|---|---|---|
| Sharpe Ratio | > 1.5 | Risk-adjusted returns (annualized) |
| Sortino Ratio | > 2.0 | Downside risk-adjusted returns |
| Max Drawdown | < 15% | Maximum peak-to-trough decline |
| Win Rate | > 52% | Percentage of profitable trades |
| Profit Factor | > 1.3 | Gross profit / Gross loss |
| Calmar Ratio | > 1.0 | Annual return / Max drawdown |
Project Structure
355_resnet_timeseries/├── README.md # This file├── README.ru.md # Russian translation├── readme.simple.md # Simple explanation (English)├── readme.simple.ru.md # Simple explanation (Russian)└── rust_resnet/ # Rust implementation ├── Cargo.toml └── src/ ├── lib.rs # Library root ├── api/ # Bybit API client │ ├── mod.rs │ ├── client.rs │ └── types.rs ├── model/ # ResNet implementation │ ├── mod.rs │ ├── resnet.rs │ ├── blocks.rs │ └── layers.rs ├── data/ # Data processing │ ├── mod.rs │ ├── features.rs │ ├── dataset.rs │ └── preprocessing.rs ├── strategy/ # Trading strategy │ ├── mod.rs │ ├── signals.rs │ └── risk.rs ├── utils/ # Utilities │ ├── mod.rs │ └── metrics.rs └── bin/ # Executable examples ├── fetch_data.rs ├── train_model.rs ├── predict.rs └── backtest.rsKey Innovations
1. Multi-Scale Residual Blocks
// Process at multiple time scales simultaneouslypub struct MultiScaleResBlock { branch_3: Conv1d, // kernel_size = 3 branch_5: Conv1d, // kernel_size = 5 branch_7: Conv1d, // kernel_size = 7 fusion: Conv1d, // Combine branches}2. Attention-Enhanced Residuals
// Add channel attention to residual blockspub struct SEResBlock { conv_block: ResidualBlock, squeeze: AdaptiveAvgPool1d, excitation: Sequential<Linear, ReLU, Linear, Sigmoid>,}
// Recalibrate feature channels based on importancefn forward(&self, x: Tensor) -> Tensor { let residual = self.conv_block.forward(x); let weights = self.excitation.forward(self.squeeze.forward(&residual)); residual * weights.unsqueeze(-1) + x}3. Temporal Positional Encoding
// Inject position information for time awarenesspub fn positional_encoding(seq_len: usize, d_model: usize) -> Tensor { let positions: Vec<f32> = (0..seq_len).map(|i| i as f32).collect(); let dims: Vec<f32> = (0..d_model) .map(|i| 1.0 / 10000_f32.powf(i as f32 / d_model as f32)) .collect();
// PE(pos, 2i) = sin(pos / 10000^(2i/d)) // PE(pos, 2i+1) = cos(pos / 10000^(2i/d)) // ...}References
-
Deep Residual Learning for Image Recognition (He et al., 2015)
- Original ResNet paper: https://arxiv.org/abs/1512.03385
-
Time Series Classification from Scratch with Deep Neural Networks (Wang et al., 2017)
- ResNet for time series: https://arxiv.org/abs/1611.06455
-
InceptionTime: Finding AlexNet for Time Series Classification (Fawaz et al., 2019)
- Comprehensive comparison: https://arxiv.org/abs/1909.04939
-
Deep Learning for Time Series Forecasting (Brownlee, 2018)
- Practical guide for financial applications
-
ResNet-based Cryptocurrency Price Prediction (Academic papers 2021-2023)
- Various applications to crypto markets
Difficulty Level
Intermediate to Advanced
Prerequisites:
- Understanding of CNNs and deep learning fundamentals
- Familiarity with time series analysis concepts
- Basic knowledge of trading and backtesting
- Rust programming experience (for implementation)
Estimated Learning Time: 15-20 hours
Next Steps
- Start with data collection - Run
fetch_datato get Bybit market data - Understand the model - Study ResNet architecture in
model/resnet.rs - Train and evaluate - Use
train_modeland analyze metrics - Backtest strategy - Run
backtestto evaluate trading performance - Experiment - Try different architectures, features, and parameters