# Chapter 341: Graph Transformer Trading
## Overview
Graph Transformers combine the power of Graph Neural Networks (GNNs) with Transformer attention mechanisms to model complex relational structures in financial markets. Unlike traditional time-series models that treat assets independently, Graph Transformers capture inter-asset dependencies, market correlations, and structural patterns that emerge from the cryptocurrency ecosystem.
## Why Graph Transformers for Trading?

### The Problem with Traditional Approaches
Traditional ML models for trading (LSTMs, GRUs, standard Transformers) treat each asset as an independent time series. However, financial markets are inherently relational:
- Correlations: BTC movements affect ETH, which affects other altcoins
- Sector relationships: DeFi tokens move together, as do Layer-2 solutions
- Market structure: Exchange order flow, whale wallets, on-chain transactions form a graph
- Information propagation: News/events spread through interconnected markets
### Graph Transformer Solution
Graph Transformers model these relationships explicitly:
```
Traditional:        X_t = f(X_{t-1}, X_{t-2}, ...)   for each asset independently
Graph Transformer:  X_t = f(X_{t-1}, A, E)

where:
  A = adjacency matrix (which assets are connected)
  E = edge features (strength/type of connections)
```

## Technical Architecture
### 1. Graph Construction for Crypto Markets
```
Market Graph Structure:
├── Nodes: Individual assets (BTC, ETH, SOL, ...)
│   └── Node Features: Price, volume, volatility, order book metrics
├── Edges: Relationships between assets
│   ├── Correlation edges (rolling correlation > threshold)
│   ├── Sector edges (same category: DeFi, L2, Meme, ...)
│   ├── On-chain edges (token transfers, DEX swaps)
│   └── Order flow edges (cross-exchange arbitrage patterns)
└── Global features: Market-wide sentiment, total volume, dominance
```
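To make the correlation and sector edges concrete, here is a minimal construction sketch. It assumes a `returns` DataFrame (one column per asset) and a hypothetical `sectors` mapping; the 0.5 threshold is illustrative, not the repo's setting.

```python
import numpy as np
import pandas as pd

def build_market_graph(returns: pd.DataFrame, sectors: dict, corr_threshold: float = 0.5):
    """Sketch: build (edge_index, edge_attr) from correlations and sector labels.

    returns: DataFrame of per-asset returns, one column per asset.
    sectors: assumed mapping {asset_symbol: sector_name}.
    """
    assets = list(returns.columns)
    corr = returns.corr().to_numpy()  # [N, N] pairwise correlation matrix

    src, dst, edge_attr = [], [], []
    for i in range(len(assets)):
        for j in range(len(assets)):
            if i == j:
                continue
            si, sj = sectors.get(assets[i]), sectors.get(assets[j])
            same_sector = float(si is not None and si == sj)
            # Keep an edge if the pair is correlated enough or shares a sector
            if abs(corr[i, j]) >= corr_threshold or same_sector:
                src.append(i)
                dst.append(j)
                edge_attr.append([corr[i, j], same_sector])

    edge_index = np.array([src, dst])       # [2, E]
    return edge_index, np.array(edge_attr)  # [E, 2]
```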
### 2. Graph Transformer Layer

The core innovation combines graph structure with self-attention:
```
Standard Transformer:  Attention(Q, K, V)       = softmax(QK^T / √d) V
Graph Transformer:     Attention(Q, K, V, A, E) = softmax((QK^T + bias(A, E)) / √d) V
```
where:

- `bias(A, E)` encodes graph structure into the attention scores
- non-neighbors can receive zero attention (sparse attention)
- edge features `E` modulate attention strength
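To make the biased-attention formula concrete, here is a minimal dense, single-head sketch in PyTorch (an illustration, not the chapter's implementation). The `adj` and `edge_bias` inputs are assumptions, and every node is assumed to have at least one neighbor (e.g., via self-loops) so the softmax is well-defined:

```python
import torch
import torch.nn.functional as F

def graph_attention_dense(Q, K, V, adj, edge_bias):
    """Single-head attention with a structural bias, as in the formula above.

    Q, K, V:   [N, d] projected node features
    adj:       [N, N] boolean adjacency (True where an edge exists; includes self-loops)
    edge_bias: [N, N] bias derived from edge features E
    """
    d = Q.size(-1)
    scores = (Q @ K.T + edge_bias) / d ** 0.5          # softmax((QK^T + bias) / sqrt(d))
    scores = scores.masked_fill(~adj, float('-inf'))   # non-neighbors get zero attention
    attn = F.softmax(scores, dim=-1)
    return attn @ V
```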
### 3. Positional Encoding for Graphs

Unlike sequences, graphs have no natural node ordering, so we use:
- Laplacian Positional Encoding (LPE): Eigenvectors of graph Laplacian
- Random Walk Positional Encoding (RWPE): Landing probabilities from random walks
- Centrality Encoding: Node importance (degree, PageRank, betweenness)
```python
import numpy as np

# Laplacian Positional Encoding
L = D - A                                      # graph Laplacian (D: degree matrix, A: adjacency)
eigenvalues, eigenvectors = np.linalg.eigh(L)  # eigh: L is symmetric, eigenvalues sorted ascending
pos_encoding = eigenvectors[:, 1:k+1]          # first k non-trivial eigenvectors
```
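The Random Walk PE listed above can be sketched the same way. Under the common convention, node *i*'s encoding is the probability of a random walk returning to *i* after 1..k steps (a sketch, assuming a dense adjacency matrix):

```python
import numpy as np

def random_walk_pe(A: np.ndarray, k: int) -> np.ndarray:
    """Random Walk PE: return probabilities after 1..k steps. A: [N, N] adjacency."""
    deg = A.sum(axis=1, keepdims=True)
    RW = A / np.clip(deg, 1, None)   # row-normalized transition matrix
    pe, M = [], np.eye(A.shape[0])
    for _ in range(k):
        M = M @ RW
        pe.append(np.diag(M))        # landing-back probability at this step
    return np.stack(pe, axis=1)      # [N, k]
```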
## Model Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                     GRAPH TRANSFORMER MODEL                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  INPUT LAYER                                                    │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ Node Features (per asset):                               │  │
│  │  - Price returns (1m, 5m, 15m, 1h, 4h)                   │  │
│  │  - Volume profile (buy/sell ratio, VWAP deviation)       │  │
│  │  - Order book (bid-ask spread, depth imbalance)          │  │
│  │  - Technical indicators (RSI, MACD, Bollinger)           │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              ↓                                  │
│  POSITIONAL ENCODING                                            │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ Laplacian PE + Random Walk PE + Centrality Encoding      │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              ↓                                  │
│  GRAPH TRANSFORMER BLOCKS (×N)                                  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ ┌────────────────────────────────────────────────────┐   │  │
│  │ │ Multi-Head Graph Attention                         │   │  │
│  │ │  - Query/Key/Value projections                     │   │  │
│  │ │  - Edge-aware attention bias                       │   │  │
│  │ │  - Sparse attention (graph-guided)                 │   │  │
│  │ └────────────────────────────────────────────────────┘   │  │
│  │                          ↓                                │  │
│  │ ┌────────────────────────────────────────────────────┐   │  │
│  │ │ Feed-Forward Network                               │   │  │
│  │ │  - Linear → GELU → Linear                          │   │  │
│  │ │  - Residual connections                            │   │  │
│  │ └────────────────────────────────────────────────────┘   │  │
│  │                          ↓                                │  │
│  │ ┌────────────────────────────────────────────────────┐   │  │
│  │ │ Edge Update (optional)                             │   │  │
│  │ │  - Update edge features from node representations  │   │  │
│  │ └────────────────────────────────────────────────────┘   │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              ↓                                  │
│  OUTPUT HEADS                                                   │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ Per-Node: Direction prediction (up/down/neutral)         │  │
│  │ Per-Node: Return magnitude prediction                    │  │
│  │ Per-Edge: Correlation change prediction                  │  │
│  │ Global:   Market regime classification                   │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

## Trading Strategy
### Signal Generation
```python
def generate_signals(model, graph):
    # Forward pass through Graph Transformer
    node_embeddings = model(graph)

    # Per-asset predictions
    direction_probs = model.direction_head(node_embeddings)  # [N, 3]
    return_preds = model.return_head(node_embeddings)        # [N, 1]

    signals = []
    for i, asset in enumerate(graph.nodes):
        prob_up = direction_probs[i, 0]    # class order: [up, neutral, down]
        prob_down = direction_probs[i, 2]
        expected_return = return_preds[i]

        if prob_up > 0.6 and expected_return > 0.005:
            signals.append(Signal(asset, "LONG", confidence=prob_up))
        elif prob_down > 0.6 and expected_return < -0.005:
            signals.append(Signal(asset, "SHORT", confidence=prob_down))

    return signals
```
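The `Signal` type used above is not defined in this excerpt; a minimal placeholder consistent with the call sites (an assumed shape, not the repo's actual type) could be:

```python
from dataclasses import dataclass

@dataclass
class Signal:
    asset: str
    side: str          # "LONG" or "SHORT"
    confidence: float  # model probability backing the signal
```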
### Portfolio Construction

Graph Transformers enable graph-aware portfolio construction:
```python
def construct_portfolio(signals, graph, node_embeddings):
    # Use graph structure to diversify
    selected_assets = []

    for signal in sorted(signals, key=lambda s: -s.confidence):
        asset = signal.asset

        # Check if too correlated with already-selected assets
        correlations = get_correlations(asset, selected_assets, graph)
        # Guard the empty case: the first asset has nothing to compare against
        if not correlations or max(correlations) < 0.7:  # diversification constraint
            selected_assets.append(asset)

    # Weight by confidence and graph centrality
    weights = calculate_weights(selected_assets, signals, node_embeddings)
    return Portfolio(selected_assets, weights)
```
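`get_correlations` and `calculate_weights` are left abstract above. One plausible weighting, an assumption rather than the repo's implementation, scores each selected asset by signal confidence scaled by a crude centrality proxy and normalizes; the extra `node_index` argument (asset-to-row mapping) is also assumed:

```python
import torch

def calculate_weights(selected_assets, signals, node_embeddings, node_index):
    """Hypothetical weighting: confidence x embedding-norm centrality, summing to 1."""
    conf = {s.asset: s.confidence for s in signals}
    scores = torch.tensor([
        conf[asset] * node_embeddings[node_index[asset]].norm().item()
        for asset in selected_assets
    ])
    return (scores / scores.sum()).tolist()
```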
## Key Components

### 1. Multi-Head Graph Attention
```python
import math

import torch
import torch.nn as nn


class GraphAttention(nn.Module):
    def __init__(self, d, d_e):
        super().__init__()
        self.W_q = nn.Linear(d, d)
        self.W_k = nn.Linear(d, d)
        self.W_v = nn.Linear(d, d)
        self.edge_proj = nn.Linear(d_e, 1)

    def forward(self, x, edge_index, edge_attr):
        # x:          [N, d]   node features
        # edge_index: [2, E]   edge connectivity (src -> dst)
        # edge_attr:  [E, d_e] edge features
        N, d = x.shape

        Q = self.W_q(x)  # [N, d]
        K = self.W_k(x)  # [N, d]
        V = self.W_v(x)  # [N, d]

        # Compute attention scores only for connected node pairs
        src, dst = edge_index
        scores = (Q[dst] * K[src]).sum(dim=-1) / math.sqrt(d)  # [E]

        # Add edge bias derived from edge features
        edge_bias = self.edge_proj(edge_attr).squeeze(-1)  # [E]
        scores = scores + edge_bias

        # Sparse softmax: normalize over each node's incoming edges only
        attn_weights = sparse_softmax(scores, dst, num_nodes=N)

        # Aggregate weighted neighbor values into destination nodes
        out = scatter_add(attn_weights.unsqueeze(-1) * V[src], dst, dim=0)
        return out
```
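`sparse_softmax` and `scatter_add` correspond to helpers from the PyTorch Geometric ecosystem (`torch_geometric.utils.softmax`, `torch_scatter.scatter_add`). If those packages are unavailable, a plain-PyTorch equivalent of the segment softmax looks roughly like this sketch:

```python
import torch

def sparse_softmax(scores, index, num_nodes):
    """Softmax over edge scores, normalized per destination node.

    scores: [E] raw attention scores
    index:  [E] destination node id of each edge
    """
    # Subtract the per-node max for numerical stability
    max_per_node = torch.full((num_nodes,), float('-inf'))
    max_per_node = max_per_node.scatter_reduce(0, index, scores, reduce='amax')
    scores = (scores - max_per_node[index]).exp()

    # Normalize by the per-node sum of exponentials
    denom = torch.zeros(num_nodes).scatter_add(0, index, scores)
    return scores / (denom[index] + 1e-16)
```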
### 2. Edge Feature Updates

```python
class EdgeUpdate(nn.Module):
    def forward(self, x, edge_index, edge_attr):
        src, dst = edge_index

        # Combine source and destination node features with the current edge state
        edge_features = torch.cat([
            x[src],
            x[dst],
            edge_attr,
            x[src] - x[dst],   # difference
            x[src] * x[dst],   # interaction
        ], dim=-1)

        # Update edge features
        new_edge_attr = self.mlp(edge_features)
        return new_edge_attr
```
### 3. Graph Pooling for Global Predictions

```python
class GraphPooling(nn.Module):
    def forward(self, x, batch):
        # Attention-based pooling
        attn_scores = self.attention(x)             # [N, 1]
        attn_weights = softmax(attn_scores, batch)  # normalized per graph

        # Weighted sum per graph
        global_repr = scatter_add(attn_weights * x, batch, dim=0)
        return global_repr
```

## Implementation Details
### Data Requirements
```
Cryptocurrency Market Data:
├── OHLCV data (1-minute resolution minimum)
│   └── Multiple assets (BTC, ETH, SOL, AVAX, ...)
├── Order book snapshots (L2 data)
│   └── Bid/Ask levels with sizes
├── Trade flow data
│   └── Individual trades with timestamps
└── On-chain data (optional but valuable)
    ├── Whale wallet movements
    ├── Exchange inflows/outflows
    └── DEX trading volumes
```
```
Graph Construction Data:
├── Rolling correlations (30-day window)
├── Sector classifications
├── Market cap rankings
└── Trading pair relationships
```
### Feature Engineering

```python
features = {
    # Price features (per node)
    'returns_1m':    log_return(close, 1),
    'returns_5m':    log_return(close, 5),
    'returns_15m':   log_return(close, 15),
    'returns_1h':    log_return(close, 60),
    'volatility_1h': rolling_std(returns, 60),

    # Volume features
    'volume_ratio':   volume / volume_ma_20,
    'buy_sell_ratio': buy_volume / (buy_volume + sell_volume),
    'vwap_deviation': (close - vwap) / vwap,

    # Order book features
    'spread_bps':      (ask - bid) / mid * 10000,
    'depth_imbalance': (bid_depth - ask_depth) / (bid_depth + ask_depth),
    'ofi':             order_flow_imbalance(book_changes),

    # Technical indicators
    'rsi_14':      rsi(close, 14),
    'macd_signal': macd(close) - macd_signal(close),
    'bb_position': (close - bb_lower) / (bb_upper - bb_lower),

    # Graph-specific features
    'degree_centrality': graph.degree(node),
    'pagerank':          graph.pagerank(node),
    'clustering_coef':   graph.clustering(node),
}
```
### Training Configuration

```yaml
model:
  num_layers: 6
  hidden_dim: 256
  num_heads: 8
  dropout: 0.1
  edge_dim: 64
  use_edge_features: true
  positional_encoding: "laplacian"
  num_pe_dims: 16

training:
  batch_size: 32
  learning_rate: 0.0001
  weight_decay: 0.01
  warmup_steps: 1000
  max_epochs: 100
  early_stopping_patience: 10

data:
  train_split: 0.7
  val_split: 0.15
  test_split: 0.15
  sequence_length: 60     # 1 hour of 1-minute data
  prediction_horizon: 5   # 5 minutes ahead
```
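A skeletal training step consistent with this config might use AdamW with a linear warmup (a sketch for orientation only; the actual trainer lives in the Rust crate, and `batch.y` is an assumed label field):

```python
import torch
import torch.nn.functional as F

def train_epoch(model, loader, optimizer, scheduler):
    """One epoch of the per-node direction objective (3-class cross-entropy)."""
    model.train()
    for batch in loader:
        optimizer.zero_grad()
        direction_logits = model(batch)                    # [N, 3]
        loss = F.cross_entropy(direction_logits, batch.y)  # batch.y: [N] class labels
        loss.backward()
        optimizer.step()
        scheduler.step()                                   # per-step warmup schedule

# Optimizer and warmup matching the YAML above (assumed, not the repo's trainer):
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda s: min(1.0, (s + 1) / 1000))
```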
## Key Metrics

### Model Performance
- Node-level Accuracy: Classification accuracy per asset
- Direction Accuracy: % correct up/down predictions
- Information Coefficient (IC): Correlation between predicted and actual returns (computed as sketched after this list)
- Graph-aware IC: IC accounting for asset correlations
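IC is cheap to compute; a minimal rank-IC version using Spearman correlation (a common convention; the plain Pearson variant is analogous):

```python
from scipy.stats import spearmanr

def information_coefficient(predicted_returns, realized_returns):
    """Rank IC: Spearman correlation between predictions and realizations."""
    ic, _ = spearmanr(predicted_returns, realized_returns)
    return ic
```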
### Trading Performance
- Sharpe Ratio: Risk-adjusted returns (target > 2.0)
- Sortino Ratio: Downside risk-adjusted returns
- Maximum Drawdown: Largest peak-to-trough decline
- Win Rate: % of profitable trades
- Profit Factor: Gross profit / Gross loss
## Advantages of Graph Transformers
| Aspect | Traditional Models | Graph Transformers |
|---|---|---|
| Asset relationships | Ignored or manually engineered | Learned automatically |
| Information propagation | None | Natural via message passing |
| Market regime detection | Separate model needed | Built-in via global pooling |
| Correlation changes | Static assumptions | Dynamic, learned |
| Scalability | Linear in assets | Can be sparse (efficient) |
| Interpretability | Limited | Attention weights = explanations |
## Comparison with Other Approaches

### vs. Standard Transformers
- Standard: Treats assets as “tokens” in a sequence
- Graph Transformer: Explicitly encodes asset relationships
### vs. GCN/GAT
- GCN/GAT: Fixed aggregation patterns
- Graph Transformer: Flexible attention over full graph + structure bias
### vs. Temporal Fusion Transformer
- TFT: Temporal attention only
- Graph Transformer: Both temporal and cross-asset attention
## Production Considerations
```
Inference Pipeline:
├── Data Collection (Bybit WebSocket)
│   └── Real-time OHLCV + order book updates
├── Graph Update (every N minutes)
│   └── Recalculate correlations, update edges
├── Feature Computation
│   └── Vectorized feature calculation
├── Model Inference
│   └── GPU-accelerated forward pass
├── Signal Generation
│   └── Threshold-based signal extraction
└── Order Execution
    └── API integration with risk management
```
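A minimal asyncio skeleton of this loop, for illustration only (the repo's pipeline is Rust; `build_graph`, `compute_features`, and `execute` are placeholders for the stages above, and `feed` is an assumed async WebSocket stream):

```python
import time

GRAPH_REFRESH_SEC = 300  # rebuild edges every 5 minutes

async def inference_loop(model, feed):
    graph = await build_graph(feed)            # placeholder: correlations -> edges
    last_rebuild = time.monotonic()

    async for tick in feed:                    # assumed async market-data stream
        if time.monotonic() - last_rebuild > GRAPH_REFRESH_SEC:
            graph = await build_graph(feed)    # periodic graph update
            last_rebuild = time.monotonic()

        graph = compute_features(tick, graph)      # refresh node features
        signals = generate_signals(model, graph)   # from the strategy section above
        await execute(signals)                     # order routing with risk checks
```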
```
Latency Budget:
├── Data collection:     ~10ms (WebSocket)
├── Feature computation:  ~5ms (Rust)
├── Graph construction:  ~20ms (every 5 min)
├── Model inference:     ~15ms (GPU)
├── Signal generation:    ~1ms
└── Total:               ~50ms (excluding execution)
```
## Directory Structure

```
341_graph_transformer_trading/
├── README.md                    # This file
├── README.ru.md                 # Russian translation
├── readme.simple.md             # Beginner-friendly explanation
├── readme.simple.ru.md          # Russian beginner version
└── rust_graph_transformer/      # Rust implementation
    ├── Cargo.toml
    ├── src/
    │   ├── lib.rs               # Library entry point
    │   ├── api/                 # Bybit API client
    │   ├── graph/               # Graph construction & operations
    │   ├── transformer/         # Graph Transformer implementation
    │   ├── features/            # Feature engineering
    │   ├── strategy/            # Trading strategy
    │   └── backtest/            # Backtesting engine
    └── examples/
        ├── fetch_market_data.rs
        ├── build_market_graph.rs
        ├── train_model.rs
        └── live_trading.rs
```

## References
- Dwivedi & Bresson (2020). A Generalization of Transformer Networks to Graphs.
- Ying et al. (2021). Do Transformers Really Perform Bad for Graph Representation? (Graphormer). https://arxiv.org/abs/2106.05234
- Rampášek et al. (2022). Recipe for a General, Powerful, Scalable Graph Transformer.
- Rossi et al. (2020). Temporal Graph Networks.
- Various authors. Graph Neural Networks for financial market prediction, with applications to stock and crypto markets.
## Difficulty Level

**Expert** - requires understanding of:
- Graph Neural Networks
- Transformer architecture
- Financial market microstructure
- PyTorch/tensor operations
- Distributed training (for large graphs)
## Disclaimer
This chapter is for educational purposes only. Cryptocurrency trading involves substantial risk. The strategies described here have not been validated in live trading and should be thoroughly tested before any real-world application. Past performance does not guarantee future results.