# Chapter 341: Graph Transformer Trading
## Overview
Graph Transformers combine the power of Graph Neural Networks (GNNs) with Transformer attention mechanisms to model complex relational structures in financial markets. Unlike traditional time-series models that treat assets independently, Graph Transformers capture inter-asset dependencies, market correlations, and structural patterns that emerge from the cryptocurrency ecosystem.
## Why Graph Transformers for Trading?

### The Problem with Traditional Approaches
Traditional ML models for trading (LSTMs, GRUs, standard Transformers) treat each asset as an independent time series. However, financial markets are inherently relational:
- Correlations: BTC movements affect ETH, which affects other altcoins
- Sector relationships: DeFi tokens move together, as do Layer-2 solutions
- Market structure: Exchange order flow, whale wallets, on-chain transactions form a graph
- Information propagation: News/events spread through interconnected markets
### Graph Transformer Solution
Graph Transformers model these relationships explicitly:
```
Traditional:        X_t = f(X_{t-1}, X_{t-2}, ...)   for each asset independently
Graph Transformer:  X_t = f(X_{t-1}, A, E)

where:
  A = adjacency matrix (which assets are connected)
  E = edge features (strength/type of connections)
```

## Technical Architecture
### 1. Graph Construction for Crypto Markets
```
Market Graph Structure:
├── Nodes: Individual assets (BTC, ETH, SOL, ...)
│   └── Node Features: Price, volume, volatility, order book metrics
├── Edges: Relationships between assets
│   ├── Correlation edges (rolling correlation > threshold)
│   ├── Sector edges (same category: DeFi, L2, Meme, ...)
│   ├── On-chain edges (token transfers, DEX swaps)
│   └── Order flow edges (cross-exchange arbitrage patterns)
└── Global features: Market-wide sentiment, total volume, dominance
```
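To make the correlation and sector edges concrete, here is a minimal construction sketch. It assumes a `returns` DataFrame (one column per asset) and a hypothetical `sectors` mapping; the 0.5 threshold is illustrative, not the repo's setting.

```python
import numpy as np
import pandas as pd

def build_market_graph(returns: pd.DataFrame, sectors: dict, corr_threshold: float = 0.5):
    """Sketch: build (edge_index, edge_attr) from correlations and sector labels.

    returns: DataFrame of per-asset returns, one column per asset.
    sectors: assumed mapping {asset_symbol: sector_name}.
    """
    assets = list(returns.columns)
    corr = returns.corr().to_numpy()  # [N, N] pairwise correlation matrix

    src, dst, edge_attr = [], [], []
    for i in range(len(assets)):
        for j in range(len(assets)):
            if i == j:
                continue
            si, sj = sectors.get(assets[i]), sectors.get(assets[j])
            same_sector = float(si is not None and si == sj)
            # Keep an edge if the pair is correlated enough or shares a sector
            if abs(corr[i, j]) >= corr_threshold or same_sector:
                src.append(i)
                dst.append(j)
                edge_attr.append([corr[i, j], same_sector])

    edge_index = np.array([src, dst])       # [2, E]
    return edge_index, np.array(edge_attr)  # [E, 2]
```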
### 2. Graph Transformer Layer

The core innovation combines graph structure with self-attention:
```
Standard Transformer:  Attention(Q, K, V)       = softmax(QK^T / √d) V
Graph Transformer:     Attention(Q, K, V, A, E) = softmax((QK^T + bias(A, E)) / √d) V
```
where:

- `bias(A, E)` encodes graph structure into the attention scores
- non-neighbors can receive zero attention (sparse attention)
- edge features `E` modulate attention strength
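To make the biased-attention formula concrete, here is a minimal dense, single-head sketch in PyTorch (an illustration, not the chapter's implementation). The `adj` and `edge_bias` inputs are assumptions, and every node is assumed to have at least one neighbor (e.g., via self-loops) so the softmax is well-defined:

```python
import torch
import torch.nn.functional as F

def graph_attention_dense(Q, K, V, adj, edge_bias):
    """Single-head attention with a structural bias, as in the formula above.

    Q, K, V:   [N, d] projected node features
    adj:       [N, N] boolean adjacency (True where an edge exists; includes self-loops)
    edge_bias: [N, N] bias derived from edge features E
    """
    d = Q.size(-1)
    scores = (Q @ K.T + edge_bias) / d ** 0.5          # softmax((QK^T + bias) / sqrt(d))
    scores = scores.masked_fill(~adj, float('-inf'))   # non-neighbors get zero attention
    attn = F.softmax(scores, dim=-1)
    return attn @ V
```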
### 3. Positional Encoding for Graphs

Unlike sequences, graphs have no natural node ordering, so we use:
- Laplacian Positional Encoding (LPE): Eigenvectors of graph Laplacian
- Random Walk Positional Encoding (RWPE): Landing probabilities from random walks
- Centrality Encoding: Node importance (degree, PageRank, betweenness)
```python
import numpy as np

# Laplacian Positional Encoding
L = D - A                                      # graph Laplacian (D: degree matrix, A: adjacency)
eigenvalues, eigenvectors = np.linalg.eigh(L)  # eigh: L is symmetric, eigenvalues sorted ascending
pos_encoding = eigenvectors[:, 1:k+1]          # first k non-trivial eigenvectors
```
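The Random Walk PE listed above can be sketched the same way. Under the common convention, node *i*'s encoding is the probability of a random walk returning to *i* after 1..k steps (a sketch, assuming a dense adjacency matrix):

```python
import numpy as np

def random_walk_pe(A: np.ndarray, k: int) -> np.ndarray:
    """Random Walk PE: return probabilities after 1..k steps. A: [N, N] adjacency."""
    deg = A.sum(axis=1, keepdims=True)
    RW = A / np.clip(deg, 1, None)   # row-normalized transition matrix
    pe, M = [], np.eye(A.shape[0])
    for _ in range(k):
        M = M @ RW
        pe.append(np.diag(M))        # landing-back probability at this step
    return np.stack(pe, axis=1)      # [N, k]
```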
## Model Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                     GRAPH TRANSFORMER MODEL                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  INPUT LAYER                                                    │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ Node Features (per asset):                               │  │
│  │  - Price returns (1m, 5m, 15m, 1h, 4h)                   │  │
│  │  - Volume profile (buy/sell ratio, VWAP deviation)       │  │
│  │  - Order book (bid-ask spread, depth imbalance)          │  │
│  │  - Technical indicators (RSI, MACD, Bollinger)           │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              ↓                                  │
│  POSITIONAL ENCODING                                            │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ Laplacian PE + Random Walk PE + Centrality Encoding      │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              ↓                                  │
│  GRAPH TRANSFORMER BLOCKS (×N)                                  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ ┌────────────────────────────────────────────────────┐   │  │
│  │ │ Multi-Head Graph Attention                         │   │  │
│  │ │  - Query/Key/Value projections                     │   │  │
│  │ │  - Edge-aware attention bias                       │   │  │
│  │ │  - Sparse attention (graph-guided)                 │   │  │
│  │ └────────────────────────────────────────────────────┘   │  │
│  │                          ↓                                │  │
│  │ ┌────────────────────────────────────────────────────┐   │  │
│  │ │ Feed-Forward Network                               │   │  │
│  │ │  - Linear → GELU → Linear                          │   │  │
│  │ │  - Residual connections                            │   │  │
│  │ └────────────────────────────────────────────────────┘   │  │
│  │                          ↓                                │  │
│  │ ┌────────────────────────────────────────────────────┐   │  │
│  │ │ Edge Update (optional)                             │   │  │
│  │ │  - Update edge features from node representations  │   │  │
│  │ └────────────────────────────────────────────────────┘   │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              ↓                                  │
│  OUTPUT HEADS                                                   │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ Per-Node: Direction prediction (up/down/neutral)         │  │
│  │ Per-Node: Return magnitude prediction                    │  │
│  │ Per-Edge: Correlation change prediction                  │  │
│  │ Global:   Market regime classification                   │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

## Trading Strategy
### Signal Generation
```python
def generate_signals(model, graph):
    # Forward pass through Graph Transformer
    node_embeddings = model(graph)

    # Per-asset predictions
    direction_probs = model.direction_head(node_embeddings)  # [N, 3]
    return_preds = model.return_head(node_embeddings)        # [N, 1]

    signals = []
    for i, asset in enumerate(graph.nodes):
        prob_up = direction_probs[i, 0]    # class order: [up, neutral, down]
        prob_down = direction_probs[i, 2]
        expected_return = return_preds[i]

        if prob_up > 0.6 and expected_return > 0.005:
            signals.append(Signal(asset, "LONG", confidence=prob_up))
        elif prob_down > 0.6 and expected_return < -0.005:
            signals.append(Signal(asset, "SHORT", confidence=prob_down))

    return signals
```
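The `Signal` type used above is not defined in this excerpt; a minimal placeholder consistent with the call sites (an assumed shape, not the repo's actual type) could be:

```python
from dataclasses import dataclass

@dataclass
class Signal:
    asset: str
    side: str          # "LONG" or "SHORT"
    confidence: float  # model probability backing the signal
```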
### Portfolio Construction

Graph Transformers enable graph-aware portfolio construction:
```python
def construct_portfolio(signals, graph, node_embeddings):
    # Use graph structure to diversify
    selected_assets = []

    for signal in sorted(signals, key=lambda s: -s.confidence):
        asset = signal.asset

        # Check if too correlated with already-selected assets
        correlations = get_correlations(asset, selected_assets, graph)
        # Guard the empty case: the first asset has nothing to compare against
        if not correlations or max(correlations) < 0.7:  # diversification constraint
            selected_assets.append(asset)

    # Weight by confidence and graph centrality
    weights = calculate_weights(selected_assets, signals, node_embeddings)
    return Portfolio(selected_assets, weights)
```
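`get_correlations` and `calculate_weights` are left abstract above. One plausible weighting, an assumption rather than the repo's implementation, scores each selected asset by signal confidence scaled by a crude centrality proxy and normalizes; the extra `node_index` argument (asset-to-row mapping) is also assumed:

```python
import torch

def calculate_weights(selected_assets, signals, node_embeddings, node_index):
    """Hypothetical weighting: confidence x embedding-norm centrality, summing to 1."""
    conf = {s.asset: s.confidence for s in signals}
    scores = torch.tensor([
        conf[asset] * node_embeddings[node_index[asset]].norm().item()
        for asset in selected_assets
    ])
    return (scores / scores.sum()).tolist()
```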
## Key Components

### 1. Multi-Head Graph Attention
```python
import math

import torch
import torch.nn as nn


class GraphAttention(nn.Module):
    def __init__(self, d, d_e):
        super().__init__()
        self.W_q = nn.Linear(d, d)
        self.W_k = nn.Linear(d, d)
        self.W_v = nn.Linear(d, d)
        self.edge_proj = nn.Linear(d_e, 1)

    def forward(self, x, edge_index, edge_attr):
        # x:          [N, d]   node features
        # edge_index: [2, E]   edge connectivity (src -> dst)
        # edge_attr:  [E, d_e] edge features
        N, d = x.shape

        Q = self.W_q(x)  # [N, d]
        K = self.W_k(x)  # [N, d]
        V = self.W_v(x)  # [N, d]

        # Compute attention scores only for connected node pairs
        src, dst = edge_index
        scores = (Q[dst] * K[src]).sum(dim=-1) / math.sqrt(d)  # [E]

        # Add edge bias derived from edge features
        edge_bias = self.edge_proj(edge_attr).squeeze(-1)  # [E]
        scores = scores + edge_bias

        # Sparse softmax: normalize over each node's incoming edges only
        attn_weights = sparse_softmax(scores, dst, num_nodes=N)

        # Aggregate weighted neighbor values into destination nodes
        out = scatter_add(attn_weights.unsqueeze(-1) * V[src], dst, dim=0)
        return out
```
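`sparse_softmax` and `scatter_add` correspond to helpers from the PyTorch Geometric ecosystem (`torch_geometric.utils.softmax`, `torch_scatter.scatter_add`). If those packages are unavailable, a plain-PyTorch equivalent of the segment softmax looks roughly like this sketch:

```python
import torch

def sparse_softmax(scores, index, num_nodes):
    """Softmax over edge scores, normalized per destination node.

    scores: [E] raw attention scores
    index:  [E] destination node id of each edge
    """
    # Subtract the per-node max for numerical stability
    max_per_node = torch.full((num_nodes,), float('-inf'))
    max_per_node = max_per_node.scatter_reduce(0, index, scores, reduce='amax')
    scores = (scores - max_per_node[index]).exp()

    # Normalize by the per-node sum of exponentials
    denom = torch.zeros(num_nodes).scatter_add(0, index, scores)
    return scores / (denom[index] + 1e-16)
```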
### 2. Edge Feature Updates

```python
class EdgeUpdate(nn.Module):
    def forward(self, x, edge_index, edge_attr):
        src, dst = edge_index

        # Combine source and destination node features with the current edge state
        edge_features = torch.cat([
            x[src],
            x[dst],
            edge_attr,
            x[src] - x[dst],   # difference
            x[src] * x[dst],   # interaction
        ], dim=-1)

        # Update edge features
        new_edge_attr = self.mlp(edge_features)
        return new_edge_attr
```
### 3. Graph Pooling for Global Predictions

```python
class GraphPooling(nn.Module):
    def forward(self, x, batch):
        # Attention-based pooling
        attn_scores = self.attention(x)             # [N, 1]
        attn_weights = softmax(attn_scores, batch)  # normalized per graph

        # Weighted sum per graph
        global_repr = scatter_add(attn_weights * x, batch, dim=0)
        return global_repr
```

## Implementation Details
### Data Requirements
```
Cryptocurrency Market Data:
├── OHLCV data (1-minute resolution minimum)
│   └── Multiple assets (BTC, ETH, SOL, AVAX, ...)
├── Order book snapshots (L2 data)
│   └── Bid/Ask levels with sizes
├── Trade flow data
│   └── Individual trades with timestamps
└── On-chain data (optional but valuable)
    ├── Whale wallet movements
    ├── Exchange inflows/outflows
    └── DEX trading volumes
```
```
Graph Construction Data:
├── Rolling correlations (30-day window)
├── Sector classifications
├── Market cap rankings
└── Trading pair relationships
```
### Feature Engineering

```python
features = {
    # Price features (per node)
    'returns_1m':    log_return(close, 1),
    'returns_5m':    log_return(close, 5),
    'returns_15m':   log_return(close, 15),
    'returns_1h':    log_return(close, 60),
    'volatility_1h': rolling_std(returns, 60),

    # Volume features
    'volume_ratio':   volume / volume_ma_20,
    'buy_sell_ratio': buy_volume / (buy_volume + sell_volume),
    'vwap_deviation': (close - vwap) / vwap,

    # Order book features
    'spread_bps':      (ask - bid) / mid * 10000,
    'depth_imbalance': (bid_depth - ask_depth) / (bid_depth + ask_depth),
    'ofi':             order_flow_imbalance(book_changes),

    # Technical indicators
    'rsi_14':      rsi(close, 14),
    'macd_signal': macd(close) - macd_signal(close),
    'bb_position': (close - bb_lower) / (bb_upper - bb_lower),

    # Graph-specific features
    'degree_centrality': graph.degree(node),
    'pagerank':          graph.pagerank(node),
    'clustering_coef':   graph.clustering(node),
}
```
### Training Configuration

```yaml
model:
  num_layers: 6
  hidden_dim: 256
  num_heads: 8
  dropout: 0.1
  edge_dim: 64
  use_edge_features: true
  positional_encoding: "laplacian"
  num_pe_dims: 16

training:
  batch_size: 32
  learning_rate: 0.0001
  weight_decay: 0.01
  warmup_steps: 1000
  max_epochs: 100
  early_stopping_patience: 10

data:
  train_split: 0.7
  val_split: 0.15
  test_split: 0.15
  sequence_length: 60     # 1 hour of 1-minute data
  prediction_horizon: 5   # 5 minutes ahead
```
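A skeletal training step consistent with this config might use AdamW with a linear warmup (a sketch for orientation only; the actual trainer lives in the Rust crate, and `batch.y` is an assumed label field):

```python
import torch
import torch.nn.functional as F

def train_epoch(model, loader, optimizer, scheduler):
    """One epoch of the per-node direction objective (3-class cross-entropy)."""
    model.train()
    for batch in loader:
        optimizer.zero_grad()
        direction_logits = model(batch)                    # [N, 3]
        loss = F.cross_entropy(direction_logits, batch.y)  # batch.y: [N] class labels
        loss.backward()
        optimizer.step()
        scheduler.step()                                   # per-step warmup schedule

# Optimizer and warmup matching the YAML above (assumed, not the repo's trainer):
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda s: min(1.0, (s + 1) / 1000))
```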
## Key Metrics

### Model Performance
- Node-level Accuracy: Classification accuracy per asset
- Direction Accuracy: % correct up/down predictions
- Information Coefficient (IC): Correlation between predicted and actual returns (computed as sketched after this list)
- Graph-aware IC: IC accounting for asset correlations
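IC is cheap to compute; a minimal rank-IC version using Spearman correlation (a common convention; the plain Pearson variant is analogous):

```python
from scipy.stats import spearmanr

def information_coefficient(predicted_returns, realized_returns):
    """Rank IC: Spearman correlation between predictions and realizations."""
    ic, _ = spearmanr(predicted_returns, realized_returns)
    return ic
```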
### Trading Performance
- Sharpe Ratio: Risk-adjusted returns (target > 2.0)
- Sortino Ratio: Downside risk-adjusted returns
- Maximum Drawdown: Largest peak-to-trough decline
- Win Rate: % of profitable trades
- Profit Factor: Gross profit / Gross loss
## Advantages of Graph Transformers
| Aspect | Traditional Models | Graph Transformers |
|---|---|---|
| Asset relationships | Ignored or manually engineered | Learned automatically |
| Information propagation | None | Natural via message passing |
| Market regime detection | Separate model needed | Built-in via global pooling |
| Correlation changes | Static assumptions | Dynamic, learned |
| Scalability | Linear in assets | Can be sparse (efficient) |
| Interpretability | Limited | Attention weights = explanations |
## Comparison with Other Approaches

### vs. Standard Transformers
- Standard: Treats assets as “tokens” in a sequence
- Graph Transformer: Explicitly encodes asset relationships
### vs. GCN/GAT
- GCN/GAT: Fixed aggregation patterns
- Graph Transformer: Flexible attention over full graph + structure bias
### vs. Temporal Fusion Transformer
- TFT: Temporal attention only
- Graph Transformer: Both temporal and cross-asset attention
## Production Considerations
```
Inference Pipeline:
├── Data Collection (Bybit WebSocket)
│   └── Real-time OHLCV + order book updates
├── Graph Update (every N minutes)
│   └── Recalculate correlations, update edges
├── Feature Computation
│   └── Vectorized feature calculation
├── Model Inference
│   └── GPU-accelerated forward pass
├── Signal Generation
│   └── Threshold-based signal extraction
└── Order Execution
    └── API integration with risk management
```
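A minimal asyncio skeleton of this loop, for illustration only (the repo's pipeline is Rust; `build_graph`, `compute_features`, and `execute` are placeholders for the stages above, and `feed` is an assumed async WebSocket stream):

```python
import time

GRAPH_REFRESH_SEC = 300  # rebuild edges every 5 minutes

async def inference_loop(model, feed):
    graph = await build_graph(feed)            # placeholder: correlations -> edges
    last_rebuild = time.monotonic()

    async for tick in feed:                    # assumed async market-data stream
        if time.monotonic() - last_rebuild > GRAPH_REFRESH_SEC:
            graph = await build_graph(feed)    # periodic graph update
            last_rebuild = time.monotonic()

        graph = compute_features(tick, graph)      # refresh node features
        signals = generate_signals(model, graph)   # from the strategy section above
        await execute(signals)                     # order routing with risk checks
```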
```
Latency Budget:
├── Data collection:     ~10ms (WebSocket)
├── Feature computation:  ~5ms (Rust)
├── Graph construction:  ~20ms (every 5 min)
├── Model inference:     ~15ms (GPU)
├── Signal generation:    ~1ms
└── Total:               ~50ms (excluding execution)
```
## Directory Structure

```
341_graph_transformer_trading/
├── README.md                    # This file
├── README.ru.md                 # Russian translation
├── readme.simple.md             # Beginner-friendly explanation
├── readme.simple.ru.md          # Russian beginner version
└── rust_graph_transformer/      # Rust implementation
    ├── Cargo.toml
    ├── src/
    │   ├── lib.rs               # Library entry point
    │   ├── api/                 # Bybit API client
    │   ├── graph/               # Graph construction & operations
    │   ├── transformer/         # Graph Transformer implementation
    │   ├── features/            # Feature engineering
    │   ├── strategy/            # Trading strategy
    │   └── backtest/            # Backtesting engine
    └── examples/
        ├── fetch_market_data.rs
        ├── build_market_graph.rs
        ├── train_model.rs
        └── live_trading.rs
```

## References
- Dwivedi & Bresson (2020). A Generalization of Transformer Networks to Graphs.
- Ying et al. (2021). Do Transformers Really Perform Bad for Graph Representation? (Graphormer). https://arxiv.org/abs/2106.05234
- Rampášek et al. (2022). Recipe for a General, Powerful, Scalable Graph Transformer.
- Rossi et al. (2020). Temporal Graph Networks.
- Various authors. Graph Neural Networks for financial market prediction, with applications to stock and crypto markets.
## Difficulty Level

**Expert** - requires understanding of:
- Graph Neural Networks
- Transformer architecture
- Financial market microstructure
- PyTorch/tensor operations
- Distributed training (for large graphs)
## Disclaimer
This chapter is for educational purposes only. Cryptocurrency trading involves substantial risk. The strategies described here have not been validated in live trading and should be thoroughly tested before any real-world application. Past performance does not guarantee future results.