Chapter 57: Longformer for Financial Analysis
This chapter explores Longformer, a transformer architecture that uses sliding window attention combined with global attention to efficiently process long financial documents and time series data with O(n) complexity instead of the standard O(n²).
Contents
- Introduction to Longformer
- Mathematical Foundation
- Longformer Architecture
- Financial Applications
- Practical Examples
- Rust Implementation
- Python Implementation
- Best Practices
- Resources
Introduction to Longformer
The Long Document Challenge
Standard transformers compute attention between all pairs of tokens, resulting in O(n²) complexity. For financial applications, this becomes problematic:
Document Processing Challenge:
Financial Document Lengths:
- SEC 10-K Filing: ~50,000 tokens
- Earnings Call Transcript: ~10,000 tokens
- Research Report: ~5,000 tokens
- News Article: ~1,000 tokens

Standard Transformer (BERT):
- Max sequence: 512 tokens
- A 10-K filing would need 100+ chunks!
- Context is lost between chunks

Trading Time Series:
- 1 month of minute data: 43,200 data points
- Standard attention: 43,200² ≈ 1.86 billion operations!

Key Innovation: Hybrid Attention
Longformer introduces a novel combination of two attention patterns:
```
LONGFORMER ATTENTION PATTERNS

1. SLIDING WINDOW ATTENTION (local)
   Each token attends to w tokens on each side.

   Tokens:    [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
   Window=2:          ←──[4]──→
   Token 4 attends to [2, 3, 4, 5, 6]

2. GLOBAL ATTENTION (task-specific)
   Selected tokens attend to ALL positions, e.g. the [CLS] token
   for classification:  [CLS] ←→ [all tokens]

Combined pattern (▓ = attention, ░ = none):
   ▓░░░░░░░░░░░░░░░░░░░░░░░░▓   ← global tokens (CLS, SEP): full rows/columns
   ░▓▓▓░░░░░░░░░░░░░░░░░░░░░░
   ░▓▓▓▓░░░░░░░░░░░░░░░░░░░░░   ← sliding window (diagonal band)
   ░░▓▓▓▓░░░░░░░░░░░░░░░░░░░░
   ░░░▓▓▓▓░░░░░░░░░░░░░░░░░░░
        ...diagonal...
   ▓░░░░░░░░░░░░░░░░░░░░░░▓▓▓

Complexity: O(n × w) instead of O(n²)
```

Why Longformer for Finance
| Use Case | Standard Transformer | Longformer | Benefit |
|---|---|---|---|
| SEC 10-K Filing | Truncate to 512 tokens | Full 50K tokens | Complete document analysis |
| Earnings Calls | Lose context between chunks | Single pass | Better sentiment extraction |
| Minute-level Trading | ~8 hours max | ~30 days | Longer-term pattern recognition |
| Multi-document QA | Separate processing | Joint attention | Cross-document reasoning |
| News + Price Fusion | Limited context | Extended context | Better event correlation |
Mathematical Foundation
Standard Self-Attention
The standard self-attention mechanism computes:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V

Where:
- Q, K, V: Query, Key, Value matrices ∈ ℝ^(n×d)
- d_k: Key dimension (typically d_model / n_heads)
- n: Sequence length
The attention matrix QK^T is n×n, requiring O(n²) time and space.
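To make the quadratic cost concrete, the following sketch (illustrative values, not tied to any specific model in this chapter) materializes a full attention score matrix for a 4,096-token sequence and reports its memory footprint:

```python
import math
import torch

n, d_k = 4096, 64                      # example sequence length and key dimension
Q, K = torch.randn(n, d_k), torch.randn(n, d_k)

# Full self-attention materializes an n × n score matrix: O(n²) time and memory.
scores = Q @ K.T / math.sqrt(d_k)      # shape: [4096, 4096]
print(scores.numel())                  # 16,777,216 entries
print(f"{scores.numel() * 4 / 2**20:.0f} MiB per head (fp32)")  # ~64 MiB
```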
Sliding Window Attention
Sliding window attention restricts each token to attend only to a fixed window of neighboring tokens:
For token at position i with window size w: Attention_i = {j : |i - j| ≤ w/2}
Example with w=4 (two tokens on each side): position 5 attends to [3, 4, 5, 6, 7].

```
Attention pattern for w=4, n=10:

      1  2  3  4  5  6  7  8  9  10
  1  [■  ■  ■  ░  ░  ░  ░  ░  ░  ░]
  2  [■  ■  ■  ■  ░  ░  ░  ░  ░  ░]
  3  [■  ■  ■  ■  ■  ░  ░  ░  ░  ░]
  4  [░  ■  ■  ■  ■  ■  ░  ░  ░  ░]
  5  [░  ░  ■  ■  ■  ■  ■  ░  ░  ░]
  6  [░  ░  ░  ■  ■  ■  ■  ■  ░  ░]
  7  [░  ░  ░  ░  ■  ■  ■  ■  ■  ░]
  8  [░  ░  ░  ░  ░  ■  ■  ■  ■  ■]
  9  [░  ░  ░  ░  ░  ░  ■  ■  ■  ■]
 10  [░  ░  ░  ░  ░  ░  ░  ■  ■  ■]

■ = attention computed, ░ = no attention

Complexity: O(n × w) where w << n
```

The receptive field grows linearly with the number of layers:
- Layer 1: w tokens
- Layer 2: 2w tokens
- Layer L: L×w tokens
For L=12 layers and w=512, the top layer has a receptive field of 6,144 tokens.
Dilated Sliding Window
To increase receptive field without adding computation, Longformer supports dilated windows:
Dilated Window with dilation d=2, w=4:
Standard (d=1): [1, 2, 3, 4, 5] → attends to consecutive tokens
Dilated (d=2):  [1, _, 3, _, 5, _, 7, _, 9] → skips every other token
Position 5 with d=2, w=4 attends to: [1, 3, 5, 7, 9]
This allows reaching position 9 with the same computation as reaching position 5.
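A quick way to see which positions fall inside a (possibly dilated) window is to enumerate them directly; the helper below is a hypothetical illustration, not part of any Longformer API:

```python
def window_positions(position: int, window: int = 4, dilation: int = 1) -> list:
    """Positions attended to by `position` for a symmetric sliding window of
    `window` neighbors (window/2 per side) with the given dilation."""
    half = window // 2
    return [position + dilation * k for k in range(-half, half + 1)]

print(window_positions(5, window=4, dilation=1))  # [3, 4, 5, 6, 7]
print(window_positions(5, window=4, dilation=2))  # [1, 3, 5, 7, 9]
```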
Receptive field with dilation:
- Layer l with dilation d_l: d_l × w tokens
- Multi-scale: lower layers use d=1 (local), higher layers use d=2+ (global)

Global Attention
Some tokens need full sequence attention (task-specific):
```python
# Global attention for specific tokens
global_attention_mask = torch.zeros(batch_size, seq_len)

# For classification: [CLS] token gets global attention
global_attention_mask[:, 0] = 1  # [CLS] at position 0

# For QA: question tokens get global attention
global_attention_mask[:, question_start:question_end] = 1

# Global attention is asymmetric:
# 1. Global tokens attend to ALL tokens (full row)
# 2. ALL tokens attend to global tokens (full column)
```

Global attention pattern visualization:
With global attention on token 1 and 8 (G = global):
```
      1  2  3  4  5  6  7  8  9  10
  1  [G  G  G  G  G  G  G  G  G  G]   ← global token: full attention
  2  [G  ■  ■  ■  ░  ░  ░  G  ░  ░]
  3  [G  ■  ■  ■  ■  ░  ░  G  ░  ░]
  4  [G  ░  ■  ■  ■  ■  ░  G  ░  ░]
  5  [G  ░  ░  ■  ■  ■  ■  G  ░  ░]
  6  [G  ░  ░  ░  ■  ■  ■  G  ■  ░]
  7  [G  ░  ░  ░  ░  ■  ■  G  ■  ■]
  8  [G  G  G  G  G  G  G  G  G  G]   ← global token: full attention
  9  [G  ░  ░  ░  ░  ░  ■  G  ■  ■]
 10  [G  ░  ░  ░  ░  ░  ░  G  ■  ■]
      ↑                    ↑
      └── columns 1 and 8: every token attends to the global tokens
```

Longformer Architecture
Attention Pattern Design
```python
class LongformerAttention(nn.Module):
    """
    Longformer attention combining sliding window and global attention.

    Key features:
    - Sliding window for local context (O(n×w))
    - Global attention for task-specific tokens
    - Configurable attention patterns per layer
    """

    def __init__(
        self,
        d_model: int,
        n_heads: int,
        window_size: int = 512,
        dilation: int = 1,
        dropout: float = 0.1
    ):
        super().__init__()

        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.window_size = window_size
        self.one_sided_window = window_size // 2
        self.dilation = dilation

        # Separate projections for local and global attention
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

        # Global attention uses separate projections
        self.q_global = nn.Linear(d_model, d_model)
        self.k_global = nn.Linear(d_model, d_model)
        self.v_global = nn.Linear(d_model, d_model)

        self.out_proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def sliding_window_attention(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Compute sliding window attention.

        Args:
            query: [batch, n_heads, seq_len, head_dim]
            key: [batch, n_heads, seq_len, head_dim]
            value: [batch, n_heads, seq_len, head_dim]

        Returns:
            output: [batch, n_heads, seq_len, head_dim]
        """
        batch, n_heads, seq_len, head_dim = query.shape

        # For efficiency, attention is computed within overlapping windows
        # extracted with unfold instead of materializing an n×n matrix.

        # Pad the sequence so every position has a full window
        padding = self.one_sided_window
        key_padded = F.pad(key, (0, 0, padding, padding))
        value_padded = F.pad(value, (0, 0, padding, padding))

        # Extract sliding windows (w/2 tokens per side plus the token itself)
        # Shape after unfold + transpose: [batch, n_heads, seq_len, window, head_dim]
        window = 2 * self.one_sided_window + 1
        key_windows = key_padded.unfold(2, window, 1).transpose(-1, -2)
        value_windows = value_padded.unfold(2, window, 1).transpose(-1, -2)

        # Compute attention scores within windows
        # query: [batch, n_heads, seq_len, 1, head_dim]
        query = query.unsqueeze(-2)

        scores = torch.matmul(query, key_windows.transpose(-1, -2))
        scores = scores / math.sqrt(head_dim)
        scores = scores.squeeze(-2)  # [batch, n_heads, seq_len, window]

        if attention_mask is not None:
            scores = scores + attention_mask

        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Apply attention to values
        output = torch.matmul(attn_weights.unsqueeze(-2), value_windows)
        output = output.squeeze(-2)

        return output
```

Linear Complexity Analysis
Memory and Compute Comparison:
```
Standard attention (seq_len = 4096):
  Attention matrix: 4096 × 4096 ≈ 16.7M entries
  Memory: ~64 MB per head (fp32)
  Compute: O(n²) ≈ 16.7M operations

Longformer (seq_len = 4096, window = 512):
  Local attention: 4096 × 512 ≈ 2.1M entries
  + Global (100 tokens): 100 × 4096 × 2 ≈ 0.8M
  Total: ~2.9M operations
  Memory: ~12 MB per head (fp32)
  Compute: O(n × w) ≈ 2.1M operations
```
Speedup: ~5.8× faster, ~5× less memory
For longer sequences (seq_len = 16384, window = 512):
- Standard: 268M operations
- Longformer: 8.4M operations
- Speedup: ~32× faster
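These back-of-the-envelope figures can be reproduced with a small helper (illustrative only; real implementations differ in constant factors and in how global tokens are handled):

```python
def attention_ops(seq_len: int, window: int = 512, n_global: int = 0) -> dict:
    """Rough count of attention score entries computed per head."""
    standard = seq_len ** 2                      # full O(n²) attention
    local = seq_len * window                     # sliding window O(n·w)
    global_extra = 2 * n_global * seq_len        # global rows + columns
    longformer = local + global_extra
    return {"standard": standard, "longformer": longformer,
            "speedup": round(standard / longformer, 1)}

print(attention_ops(4096, window=512, n_global=100))   # speedup ≈ 5.8×
print(attention_ops(16384, window=512))                 # speedup = 32×
```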
Implementation Details

```python
class LongformerLayer(nn.Module):
    """Single Longformer encoder layer."""

    def __init__(
        self,
        d_model: int = 768,
        n_heads: int = 12,
        window_size: int = 512,
        dim_feedforward: int = 3072,
        dropout: float = 0.1,
        layer_id: int = 0
    ):
        super().__init__()

        # Attention with configurable window/dilation per layer
        self.attention = LongformerAttention(
            d_model=d_model,
            n_heads=n_heads,
            window_size=window_size,
            dropout=dropout
        )

        # Feedforward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(dim_feedforward, d_model),
            nn.Dropout(dropout)
        )

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        global_attention_mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Forward pass.

        Args:
            hidden_states: [batch, seq_len, d_model]
            attention_mask: Padding mask
            global_attention_mask: Which tokens have global attention

        Returns:
            output: [batch, seq_len, d_model]
        """
        # Self-attention with residual
        residual = hidden_states
        hidden_states = self.norm1(hidden_states)
        attn_output = self.attention(
            hidden_states,
            attention_mask=attention_mask,
            global_attention_mask=global_attention_mask
        )
        hidden_states = residual + attn_output

        # FFN with residual
        residual = hidden_states
        hidden_states = self.norm2(hidden_states)
        hidden_states = residual + self.ffn(hidden_states)

        return hidden_states
```

Financial Applications
Long Financial Document Analysis
Process entire SEC filings, research reports, and earnings transcripts:
class LongformerFinancialNLP(nn.Module): """ Longformer for financial document analysis.
Use cases: - SEC 10-K/10-Q sentiment analysis - Earnings call transcription analysis - Research report summarization - Multi-document question answering """
def __init__( self, vocab_size: int = 50000, d_model: int = 768, n_heads: int = 12, n_layers: int = 12, window_size: int = 512, max_position_embeddings: int = 16384, num_labels: int = 3, # negative, neutral, positive dropout: float = 0.1 ): super().__init__()
self.embeddings = nn.Embedding(vocab_size, d_model) self.position_embeddings = nn.Embedding(max_position_embeddings, d_model)
self.layers = nn.ModuleList([ LongformerLayer( d_model=d_model, n_heads=n_heads, window_size=window_size, dropout=dropout, layer_id=i ) for i in range(n_layers) ])
self.norm = nn.LayerNorm(d_model)
# Task-specific heads self.classifier = nn.Linear(d_model, num_labels)
def forward( self, input_ids: torch.Tensor, attention_mask: Optional[torch.Tensor] = None, global_attention_mask: Optional[torch.Tensor] = None, labels: Optional[torch.Tensor] = None ) -> Dict[str, torch.Tensor]: """ Forward pass for document classification.
Args: input_ids: [batch, seq_len] token IDs attention_mask: [batch, seq_len] padding mask global_attention_mask: [batch, seq_len] global attention mask labels: [batch] classification labels
Returns: Dictionary with loss, logits, etc. """ batch_size, seq_len = input_ids.shape
# Embeddings position_ids = torch.arange(seq_len, device=input_ids.device) hidden_states = self.embeddings(input_ids) hidden_states = hidden_states + self.position_embeddings(position_ids)
# Set global attention on [CLS] token if global_attention_mask is None: global_attention_mask = torch.zeros_like(input_ids) global_attention_mask[:, 0] = 1 # [CLS] token
# Encode for layer in self.layers: hidden_states = layer( hidden_states, attention_mask=attention_mask, global_attention_mask=global_attention_mask )
hidden_states = self.norm(hidden_states)
# Use [CLS] token for classification cls_output = hidden_states[:, 0] logits = self.classifier(cls_output)
outputs = {'logits': logits}
if labels is not None: loss = F.cross_entropy(logits, labels) outputs['loss'] = loss
return outputs

Extended Time Series Processing
Apply Longformer architecture to financial time series:
class LongformerTimeSeries(nn.Module): """ Longformer adapted for financial time series.
Key adaptations: - Continuous input embeddings instead of token embeddings - Global attention on recent data and periodic anchors - Multi-scale temporal patterns via dilated windows """
def __init__( self, input_dim: int = 6, # OHLCV + features d_model: int = 256, n_heads: int = 8, n_layers: int = 6, window_size: int = 256, # Local attention window max_seq_len: int = 8192, pred_horizon: int = 24, global_token_freq: int = 256, # Global token every N positions dropout: float = 0.1 ): super().__init__()
self.d_model = d_model self.window_size = window_size self.global_token_freq = global_token_freq
# Input projection self.input_proj = nn.Linear(input_dim, d_model)
# Learnable positional encoding for long sequences self.pos_encoding = nn.Parameter( torch.randn(1, max_seq_len, d_model) * 0.02 )
# Longformer layers with increasing dilation self.layers = nn.ModuleList([ LongformerLayer( d_model=d_model, n_heads=n_heads, window_size=window_size, dilation=min(2 ** (i // 2), 8), # Increase dilation in deeper layers dropout=dropout, layer_id=i ) for i in range(n_layers) ])
self.norm = nn.LayerNorm(d_model)
# Prediction head self.pred_head = nn.Sequential( nn.Linear(d_model, d_model), nn.GELU(), nn.Dropout(dropout), nn.Linear(d_model, pred_horizon) )
def create_global_attention_mask( self, seq_len: int, batch_size: int, device: torch.device ) -> torch.Tensor: """ Create global attention mask for time series.
Strategy: - Last position (most recent) gets global attention - Every global_token_freq positions gets global attention - This creates periodic "anchor" points """ mask = torch.zeros(batch_size, seq_len, device=device)
# Global attention on last position (current time) mask[:, -1] = 1
# Global attention on periodic anchors anchor_positions = torch.arange(0, seq_len, self.global_token_freq) mask[:, anchor_positions] = 1
return mask
def forward(self, x: torch.Tensor) -> torch.Tensor: """ Forward pass.
Args: x: [batch, seq_len, input_dim] time series features
Returns: predictions: [batch, pred_horizon] """ batch_size, seq_len, _ = x.shape
# Project input hidden_states = self.input_proj(x)
# Add positional encoding hidden_states = hidden_states + self.pos_encoding[:, :seq_len]
# Create global attention mask global_attention_mask = self.create_global_attention_mask( seq_len, batch_size, x.device )
# Encode with Longformer layers for layer in self.layers: hidden_states = layer( hidden_states, global_attention_mask=global_attention_mask )
hidden_states = self.norm(hidden_states)
# Use last position for prediction predictions = self.pred_head(hidden_states[:, -1])
return predictions

Multi-Asset Correlation Analysis
Analyze correlations across multiple assets with long history:
class LongformerMultiAsset(nn.Module): """ Multi-asset analysis with Longformer.
Process multiple asset time series jointly with: - Per-asset sliding window attention (temporal patterns) - Cross-asset global attention (correlations) """
def __init__( self, n_assets: int = 10, features_per_asset: int = 6, d_model: int = 256, n_heads: int = 8, n_layers: int = 4, window_size: int = 128, seq_len: int = 2048, dropout: float = 0.1 ): super().__init__()
self.n_assets = n_assets self.seq_len = seq_len self.d_model = d_model
# Per-asset embedding self.asset_embeddings = nn.Embedding(n_assets, d_model) self.input_proj = nn.Linear(features_per_asset, d_model)
# Positional encoding self.pos_encoding = nn.Parameter( torch.randn(1, seq_len, d_model) * 0.02 )
# Longformer layers self.layers = nn.ModuleList([ LongformerLayer( d_model=d_model, n_heads=n_heads, window_size=window_size, dropout=dropout, layer_id=i ) for i in range(n_layers) ])
self.norm = nn.LayerNorm(d_model)
# Output: allocation weights for each asset self.allocation_head = nn.Sequential( nn.Linear(d_model * n_assets, d_model), nn.GELU(), nn.Linear(d_model, n_assets), nn.Softmax(dim=-1) # Portfolio weights sum to 1 )
def forward(self, x: torch.Tensor) -> torch.Tensor: """ Forward pass.
Args: x: [batch, n_assets, seq_len, features_per_asset]
Returns: allocations: [batch, n_assets] portfolio weights """ batch_size = x.shape[0]
# Process each asset asset_outputs = []
for asset_idx in range(self.n_assets): # Get asset data: [batch, seq_len, features] asset_data = x[:, asset_idx]
# Project and add embeddings hidden = self.input_proj(asset_data) hidden = hidden + self.asset_embeddings.weight[asset_idx] hidden = hidden + self.pos_encoding[:, :self.seq_len]
# Create global attention for last timestamp global_mask = torch.zeros(batch_size, self.seq_len, device=x.device) global_mask[:, -1] = 1 # Last position is global
# Encode for layer in self.layers: hidden = layer(hidden, global_attention_mask=global_mask)
hidden = self.norm(hidden)
# Take last position output asset_outputs.append(hidden[:, -1])
# Concatenate all asset representations combined = torch.cat(asset_outputs, dim=-1)
# Predict allocation weights allocations = self.allocation_head(combined)
return allocations

Practical Examples
01: Data Preparation
import numpy as np
import pandas as pd
import requests
from typing import List, Tuple, Dict, Optional
from transformers import AutoTokenizer
def prepare_financial_documents( documents: List[str], tokenizer_name: str = 'allenai/longformer-base-4096', max_length: int = 4096) -> Dict[str, np.ndarray]: """ Prepare financial documents for Longformer processing.
Args: documents: List of document texts tokenizer_name: HuggingFace tokenizer to use max_length: Maximum sequence length
Returns: Dictionary with input_ids, attention_mask, global_attention_mask """ tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
encodings = tokenizer( documents, max_length=max_length, padding='max_length', truncation=True, return_tensors='np' )
# Create global attention mask (1 for [CLS], 0 elsewhere) global_attention_mask = np.zeros_like(encodings['input_ids']) global_attention_mask[:, 0] = 1
return { 'input_ids': encodings['input_ids'], 'attention_mask': encodings['attention_mask'], 'global_attention_mask': global_attention_mask }
def prepare_timeseries_data( symbols: List[str], lookback: int = 4096, horizon: int = 24, source: str = 'bybit') -> Tuple[np.ndarray, np.ndarray]: """ Prepare long time series data for Longformer.
Args: symbols: Trading pairs (e.g., ['BTCUSDT', 'ETHUSDT']) lookback: Historical sequence length horizon: Prediction horizon source: Data source ('bybit', 'yahoo')
Returns: X: Features [n_samples, lookback, n_features] y: Targets [n_samples, horizon] """ all_features = []
for symbol in symbols: if source == 'bybit': df = load_bybit_data(symbol, interval='1m') else: df = load_yahoo_data(symbol)
# Calculate features df['log_return'] = np.log(df['close'] / df['close'].shift(1)) df['volatility'] = df['log_return'].rolling(20).std() df['volume_ma_ratio'] = df['volume'] / df['volume'].rolling(50).mean() df['price_ma_ratio'] = df['close'] / df['close'].rolling(200).mean() df['rsi'] = calculate_rsi(df['close'], period=14) df['macd'] = calculate_macd(df['close'])
all_features.append(df)
# Combine and create sequences features = pd.concat(all_features, axis=1, keys=symbols) features = features.dropna()
X, y = [], [] feature_cols = ['log_return', 'volatility', 'volume_ma_ratio', 'price_ma_ratio', 'rsi', 'macd']
for i in range(lookback, len(features) - horizon): x_seq = features.iloc[i-lookback:i][ [(s, f) for s in symbols for f in feature_cols] ].values X.append(x_seq)
y_seq = features.iloc[i:i+horizon][(symbols[0], 'log_return')].values y.append(y_seq)
return np.array(X, dtype=np.float32), np.array(y, dtype=np.float32)
def load_bybit_data(symbol: str, interval: str = '1m') -> pd.DataFrame: """Load historical data from Bybit.""" url = "https://api.bybit.com/v5/market/kline"
params = { 'category': 'spot', 'symbol': symbol, 'interval': '1' if interval == '1m' else interval, 'limit': 10000 }
response = requests.get(url, params=params) data = response.json()['result']['list']
df = pd.DataFrame(data, columns=[ 'timestamp', 'open', 'high', 'low', 'close', 'volume', 'turnover' ])
for col in ['open', 'high', 'low', 'close', 'volume', 'turnover']: df[col] = pd.to_numeric(df[col])
df['timestamp'] = pd.to_datetime(df['timestamp'].astype(int), unit='ms') df = df.sort_values('timestamp').reset_index(drop=True)
return df
def calculate_rsi(prices: pd.Series, period: int = 14) -> pd.Series: """Calculate RSI indicator.""" delta = prices.diff() gain = (delta.where(delta > 0, 0)).rolling(window=period).mean() loss = (-delta.where(delta < 0, 0)).rolling(window=period).mean() rs = gain / (loss + 1e-10) return 100 - (100 / (1 + rs))
def calculate_macd(prices: pd.Series, fast: int = 12, slow: int = 26) -> pd.Series:
    """Calculate MACD indicator."""
    ema_fast = prices.ewm(span=fast, adjust=False).mean()
    ema_slow = prices.ewm(span=slow, adjust=False).mean()
    return ema_fast - ema_slow

02: Longformer Model
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from typing import Optional, Tuple, Dict
class LongformerSlidingWindowAttention(nn.Module): """ Efficient sliding window attention implementation. """
def __init__( self, d_model: int, n_heads: int, window_size: int = 512, dropout: float = 0.1 ): super().__init__()
assert d_model % n_heads == 0
self.d_model = d_model self.n_heads = n_heads self.head_dim = d_model // n_heads self.window_size = window_size self.one_sided_window = window_size // 2 self.scale = 1.0 / math.sqrt(self.head_dim)
# Local attention projections self.q_proj = nn.Linear(d_model, d_model) self.k_proj = nn.Linear(d_model, d_model) self.v_proj = nn.Linear(d_model, d_model)
# Global attention projections self.q_global = nn.Linear(d_model, d_model) self.k_global = nn.Linear(d_model, d_model) self.v_global = nn.Linear(d_model, d_model)
self.out_proj = nn.Linear(d_model, d_model) self.dropout = nn.Dropout(dropout)
def _compute_local_attention( self, query: torch.Tensor, key: torch.Tensor, value: torch.Tensor ) -> torch.Tensor: """ Compute sliding window attention efficiently.
Uses chunked computation for memory efficiency. """ batch, n_heads, seq_len, head_dim = query.shape
# Pad for windowing pad = self.one_sided_window key_padded = F.pad(key, (0, 0, pad, pad), value=0) value_padded = F.pad(value, (0, 0, pad, pad), value=0)
# Create sliding windows using unfold # key_windows: [batch, n_heads, seq_len, window_size, head_dim] key_windows = key_padded.unfold(dimension=2, size=self.window_size, step=1) key_windows = key_windows.transpose(-1, -2)
value_windows = value_padded.unfold(dimension=2, size=self.window_size, step=1) value_windows = value_windows.transpose(-1, -2)
# Compute attention scores # query: [batch, n_heads, seq_len, 1, head_dim] query = query.unsqueeze(-2)
# scores: [batch, n_heads, seq_len, 1, window_size] scores = torch.matmul(query, key_windows.transpose(-1, -2)) * self.scale scores = scores.squeeze(-2) # [batch, n_heads, seq_len, window_size]
attn_weights = F.softmax(scores, dim=-1) attn_weights = self.dropout(attn_weights)
# Apply attention # [batch, n_heads, seq_len, window_size] @ [batch, n_heads, seq_len, window_size, head_dim] attn_weights = attn_weights.unsqueeze(-2) output = torch.matmul(attn_weights, value_windows) output = output.squeeze(-2) # [batch, n_heads, seq_len, head_dim]
return output
def _compute_global_attention( self, query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, global_mask: torch.Tensor ) -> Tuple[torch.Tensor, torch.Tensor]: """ Compute global attention for marked tokens.
Returns: global_output: Output for global tokens extra_attention: Attention from all tokens to global tokens """ batch, n_heads, seq_len, head_dim = query.shape
# Get global token indices global_indices = global_mask.nonzero(as_tuple=True) n_global = global_mask.sum().item() // batch
if n_global == 0: return torch.zeros_like(query), torch.zeros_like(query)
# Extract global queries, keys, values # For simplicity, we compute full attention for global tokens global_query = query[global_mask.bool().unsqueeze(1).unsqueeze(-1).expand_as(query)] global_query = global_query.view(batch, n_heads, -1, head_dim)
# Global tokens attend to all tokens global_scores = torch.matmul(global_query, key.transpose(-1, -2)) * self.scale global_attn = F.softmax(global_scores, dim=-1) global_attn = self.dropout(global_attn) global_output = torch.matmul(global_attn, value)
# All tokens attend to global tokens global_key = key[global_mask.bool().unsqueeze(1).unsqueeze(-1).expand_as(key)] global_key = global_key.view(batch, n_heads, -1, head_dim)
global_value = value[global_mask.bool().unsqueeze(1).unsqueeze(-1).expand_as(value)] global_value = global_value.view(batch, n_heads, -1, head_dim)
extra_scores = torch.matmul(query, global_key.transpose(-1, -2)) * self.scale extra_attn = F.softmax(extra_scores, dim=-1) extra_attn = self.dropout(extra_attn) extra_output = torch.matmul(extra_attn, global_value)
return global_output, extra_output
def forward( self, hidden_states: torch.Tensor, attention_mask: Optional[torch.Tensor] = None, global_attention_mask: Optional[torch.Tensor] = None ) -> torch.Tensor: """ Forward pass combining local and global attention. """ batch, seq_len, _ = hidden_states.shape
# Project to Q, K, V Q = self.q_proj(hidden_states) K = self.k_proj(hidden_states) V = self.v_proj(hidden_states)
# Reshape for multi-head attention Q = Q.view(batch, seq_len, self.n_heads, self.head_dim).transpose(1, 2) K = K.view(batch, seq_len, self.n_heads, self.head_dim).transpose(1, 2) V = V.view(batch, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
# Compute local sliding window attention local_output = self._compute_local_attention(Q, K, V)
# Add global attention if specified if global_attention_mask is not None and global_attention_mask.sum() > 0: Q_global = self.q_global(hidden_states) K_global = self.k_global(hidden_states) V_global = self.v_global(hidden_states)
Q_global = Q_global.view(batch, seq_len, self.n_heads, self.head_dim).transpose(1, 2) K_global = K_global.view(batch, seq_len, self.n_heads, self.head_dim).transpose(1, 2) V_global = V_global.view(batch, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
global_output, extra_output = self._compute_global_attention( Q_global, K_global, V_global, global_attention_mask )
# Combine local and global attention output = local_output + extra_output
# Replace global token outputs if global_output.shape[2] > 0: global_indices = global_attention_mask.nonzero(as_tuple=True) output[:, :, global_indices[1]] = global_output else: output = local_output
# Reshape and project output = output.transpose(1, 2).reshape(batch, seq_len, self.d_model) output = self.out_proj(output) output = self.dropout(output)
return output
class LongformerEncoderLayer(nn.Module): """Single Longformer encoder layer."""
def __init__( self, d_model: int = 768, n_heads: int = 12, window_size: int = 512, dim_feedforward: int = 3072, dropout: float = 0.1 ): super().__init__()
self.attention = LongformerSlidingWindowAttention( d_model=d_model, n_heads=n_heads, window_size=window_size, dropout=dropout )
self.ffn = nn.Sequential( nn.Linear(d_model, dim_feedforward), nn.GELU(), nn.Dropout(dropout), nn.Linear(dim_feedforward, d_model), nn.Dropout(dropout) )
self.norm1 = nn.LayerNorm(d_model) self.norm2 = nn.LayerNorm(d_model)
def forward( self, hidden_states: torch.Tensor, attention_mask: Optional[torch.Tensor] = None, global_attention_mask: Optional[torch.Tensor] = None ) -> torch.Tensor: # Pre-norm architecture residual = hidden_states hidden_states = self.norm1(hidden_states) hidden_states = self.attention( hidden_states, attention_mask=attention_mask, global_attention_mask=global_attention_mask ) hidden_states = residual + hidden_states
residual = hidden_states hidden_states = self.norm2(hidden_states) hidden_states = self.ffn(hidden_states) hidden_states = residual + hidden_states
return hidden_states
class LongformerForTrading(nn.Module): """ Longformer model for trading applications.
Supports both NLP (document) and time series inputs. """
def __init__( self, input_type: str = 'timeseries', # 'timeseries' or 'text' # Time series parameters input_dim: int = 6, # Text parameters vocab_size: int = 50000, # Shared parameters d_model: int = 256, n_heads: int = 8, n_layers: int = 6, window_size: int = 256, max_seq_len: int = 4096, # Output parameters output_type: str = 'regression', pred_horizon: int = 24, n_classes: int = 3, dropout: float = 0.1 ): super().__init__()
self.input_type = input_type self.output_type = output_type self.d_model = d_model
# Input embedding if input_type == 'timeseries': self.input_proj = nn.Linear(input_dim, d_model) else: self.input_proj = nn.Embedding(vocab_size, d_model)
# Positional encoding self.pos_encoding = nn.Parameter( torch.randn(1, max_seq_len, d_model) * 0.02 )
# Longformer layers self.layers = nn.ModuleList([ LongformerEncoderLayer( d_model=d_model, n_heads=n_heads, window_size=window_size, dim_feedforward=d_model * 4, dropout=dropout ) for _ in range(n_layers) ])
self.norm = nn.LayerNorm(d_model)
# Output head if output_type == 'regression': self.head = nn.Linear(d_model, pred_horizon) elif output_type == 'classification': self.head = nn.Linear(d_model, n_classes) elif output_type == 'allocation': self.head = nn.Sequential( nn.Linear(d_model, pred_horizon), nn.Tanh() )
def forward( self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None, global_attention_mask: Optional[torch.Tensor] = None ) -> torch.Tensor: """ Forward pass.
Args: x: Input tensor - Time series: [batch, seq_len, input_dim] - Text: [batch, seq_len] (token IDs) attention_mask: Padding mask global_attention_mask: Global attention positions
Returns: Predictions based on output_type """ batch_size = x.shape[0] seq_len = x.shape[1]
# Input embedding if self.input_type == 'timeseries': hidden_states = self.input_proj(x) else: hidden_states = self.input_proj(x)
# Add positional encoding hidden_states = hidden_states + self.pos_encoding[:, :seq_len]
# Default global attention on last position if global_attention_mask is None: global_attention_mask = torch.zeros(batch_size, seq_len, device=x.device) global_attention_mask[:, -1] = 1 # Last position if self.input_type == 'text': global_attention_mask[:, 0] = 1 # [CLS] token
# Encode for layer in self.layers: hidden_states = layer( hidden_states, attention_mask=attention_mask, global_attention_mask=global_attention_mask )
hidden_states = self.norm(hidden_states)
# Use last position for prediction output = self.head(hidden_states[:, -1])
return output

03: Training Pipeline
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
from typing import Dict, Optional, List
import logging
from pathlib import Path
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class LongformerTrainer: """Training pipeline for Longformer trading model."""
def __init__( self, model: nn.Module, learning_rate: float = 1e-4, weight_decay: float = 1e-5, warmup_steps: int = 1000, device: str = 'cuda' if torch.cuda.is_available() else 'cpu' ): self.model = model.to(device) self.device = device
self.optimizer = torch.optim.AdamW( model.parameters(), lr=learning_rate, weight_decay=weight_decay, betas=(0.9, 0.999) )
self.warmup_steps = warmup_steps self.current_step = 0
# Loss based on output type if model.output_type == 'regression': self.criterion = nn.MSELoss() elif model.output_type == 'classification': self.criterion = nn.CrossEntropyLoss() elif model.output_type == 'allocation': self.criterion = self._sharpe_loss
def _get_lr_multiplier(self) -> float: """Linear warmup then constant.""" if self.current_step < self.warmup_steps: return self.current_step / self.warmup_steps return 1.0
def _sharpe_loss( self, allocations: torch.Tensor, returns: torch.Tensor ) -> torch.Tensor: """Differentiable Sharpe ratio loss.""" portfolio_returns = allocations * returns mean_ret = portfolio_returns.mean() std_ret = portfolio_returns.std() + 1e-8 return -mean_ret / std_ret
def train_epoch( self, dataloader: DataLoader, global_attention_mode: str = 'last' ) -> float: """Train for one epoch.""" self.model.train() total_loss = 0.0
for batch_x, batch_y in dataloader: batch_x = batch_x.to(self.device) batch_y = batch_y.to(self.device)
# Adjust learning rate lr_mult = self._get_lr_multiplier() for param_group in self.optimizer.param_groups: param_group['lr'] = param_group['lr'] * lr_mult
self.optimizer.zero_grad()
# Create global attention mask batch_size, seq_len = batch_x.shape[0], batch_x.shape[1] global_mask = self._create_global_mask( batch_size, seq_len, global_attention_mode )
predictions = self.model(batch_x, global_attention_mask=global_mask) loss = self.criterion(predictions, batch_y)
loss.backward() torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0) self.optimizer.step()
total_loss += loss.item() self.current_step += 1
return total_loss / len(dataloader)
def _create_global_mask( self, batch_size: int, seq_len: int, mode: str = 'last' ) -> torch.Tensor: """Create global attention mask.""" mask = torch.zeros(batch_size, seq_len, device=self.device)
if mode == 'last': mask[:, -1] = 1 elif mode == 'periodic': # Every 256 positions positions = torch.arange(0, seq_len, 256) mask[:, positions] = 1 mask[:, -1] = 1 elif mode == 'first_last': mask[:, 0] = 1 mask[:, -1] = 1
return mask
@torch.no_grad() def evaluate(self, dataloader: DataLoader) -> Dict[str, float]: """Evaluate model performance.""" self.model.eval()
all_preds = [] all_targets = [] total_loss = 0.0
for batch_x, batch_y in dataloader: batch_x = batch_x.to(self.device) batch_y = batch_y.to(self.device)
predictions = self.model(batch_x) loss = self.criterion(predictions, batch_y)
total_loss += loss.item() all_preds.append(predictions.cpu()) all_targets.append(batch_y.cpu())
all_preds = torch.cat(all_preds) all_targets = torch.cat(all_targets)
metrics = {'loss': total_loss / len(dataloader)}
if self.model.output_type == 'regression': mse = ((all_preds - all_targets) ** 2).mean().item() mae = (all_preds - all_targets).abs().mean().item() metrics['mse'] = mse metrics['mae'] = mae
# Directional accuracy pred_dir = (all_preds[:, 0] > 0).float() true_dir = (all_targets[:, 0] > 0).float() metrics['direction_accuracy'] = (pred_dir == true_dir).float().mean().item()
elif self.model.output_type == 'classification': pred_classes = all_preds.argmax(dim=1) metrics['accuracy'] = (pred_classes == all_targets).float().mean().item()
return metrics
def train( self, train_loader: DataLoader, val_loader: DataLoader, num_epochs: int = 100, patience: int = 10, save_dir: str = 'checkpoints' ) -> Dict[str, List]: """Full training loop.""" save_path = Path(save_dir) save_path.mkdir(exist_ok=True)
best_val_loss = float('inf') patience_counter = 0 history = {'train_loss': [], 'val_loss': [], 'val_metrics': []}
for epoch in range(num_epochs): train_loss = self.train_epoch(train_loader) val_metrics = self.evaluate(val_loader)
history['train_loss'].append(train_loss) history['val_loss'].append(val_metrics['loss']) history['val_metrics'].append(val_metrics)
logger.info( f"Epoch {epoch+1}/{num_epochs} | " f"Train Loss: {train_loss:.6f} | " f"Val Loss: {val_metrics['loss']:.6f} | " f"Val Metrics: {val_metrics}" )
if val_metrics['loss'] < best_val_loss: best_val_loss = val_metrics['loss'] patience_counter = 0 torch.save( self.model.state_dict(), save_path / 'longformer_best.pt' ) logger.info(f"Saved best model") else: patience_counter += 1
if patience_counter >= patience: logger.info(f"Early stopping at epoch {epoch+1}") break
return history
def main(): """Example training script.""" from data import prepare_timeseries_data from model import LongformerForTrading
# Load data logger.info("Loading data from Bybit...") X, y = prepare_timeseries_data( symbols=['BTCUSDT', 'ETHUSDT'], lookback=4096, horizon=24, source='bybit' )
# Split train_size = int(0.7 * len(X)) val_size = int(0.15 * len(X))
X_train, y_train = X[:train_size], y[:train_size] X_val, y_val = X[train_size:train_size+val_size], y[train_size:train_size+val_size]
# Dataloaders train_loader = DataLoader( TensorDataset(torch.tensor(X_train), torch.tensor(y_train)), batch_size=16, shuffle=True ) val_loader = DataLoader( TensorDataset(torch.tensor(X_val), torch.tensor(y_val)), batch_size=16 )
# Model model = LongformerForTrading( input_type='timeseries', input_dim=X.shape[-1], d_model=256, n_heads=8, n_layers=6, window_size=256, max_seq_len=4096, output_type='regression', pred_horizon=24 )
logger.info(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
# Train trainer = LongformerTrainer(model, learning_rate=1e-4) history = trainer.train( train_loader, val_loader, num_epochs=100, patience=15 )
logger.info("Training completed!")
if __name__ == '__main__':
    main()

04: Backtesting Strategy
import numpy as np
import pandas as pd
import torch
from typing import Dict, List, Optional
from dataclasses import dataclass
import matplotlib.pyplot as plt
@dataclass
class BacktestConfig:
    """Backtest configuration."""
    initial_capital: float = 100000.0
    transaction_cost: float = 0.001
    slippage: float = 0.0005
    max_position: float = 1.0
    confidence_threshold: float = 0.6
class LongformerBacktester: """Backtesting for Longformer trading strategy."""
def __init__( self, model: torch.nn.Module, config: BacktestConfig = BacktestConfig() ): self.model = model self.config = config self.model.eval()
@torch.no_grad() def generate_signals( self, data: np.ndarray, threshold: float = 0.001 ) -> np.ndarray: """Generate trading signals.""" x = torch.tensor(data, dtype=torch.float32) predictions = self.model(x).numpy()
pred_returns = predictions[:, 0] if len(predictions.shape) > 1 else predictions
signals = np.zeros_like(pred_returns) signals[pred_returns > threshold] = 1 signals[pred_returns < -threshold] = -1
return signals
def run_backtest( self, data: np.ndarray, prices: np.ndarray, timestamps: pd.DatetimeIndex ) -> Dict: """Run backtest simulation.""" signals = self.generate_signals(data)
capital = self.config.initial_capital position = 0.0
equity_curve = [capital] positions = [0.0] returns = [] trades = []
for i in range(len(signals)): current_price = prices[i] signal = signals[i]
target_position = signal * self.config.max_position position_change = target_position - position
if abs(position_change) > 0.01: trade_value = abs(position_change) * capital costs = trade_value * (self.config.transaction_cost + self.config.slippage) capital -= costs
trades.append({ 'timestamp': timestamps[i], 'price': current_price, 'signal': signal, 'position_change': position_change, 'costs': costs })
position = target_position
if i > 0 and position != 0: price_return = (current_price - prices[i-1]) / prices[i-1] pnl = position * capital * price_return capital += pnl returns.append(pnl / equity_curve[-1]) else: returns.append(0.0)
equity_curve.append(capital) positions.append(position)
returns = np.array(returns) equity_curve = np.array(equity_curve)
metrics = self._calculate_metrics(returns, equity_curve, trades)
return { 'equity_curve': equity_curve, 'positions': positions, 'returns': returns, 'trades': trades, 'timestamps': timestamps, 'metrics': metrics }
def _calculate_metrics( self, returns: np.ndarray, equity_curve: np.ndarray, trades: List[Dict] ) -> Dict: """Calculate performance metrics.""" total_return = (equity_curve[-1] - equity_curve[0]) / equity_curve[0]
if len(returns) > 0 and returns.std() > 0: sharpe_ratio = np.sqrt(252) * returns.mean() / returns.std() else: sharpe_ratio = 0.0
downside_returns = returns[returns < 0] if len(downside_returns) > 0: sortino_ratio = np.sqrt(252) * returns.mean() / (downside_returns.std() + 1e-8) else: sortino_ratio = sharpe_ratio
peak = np.maximum.accumulate(equity_curve) drawdown = (equity_curve - peak) / peak max_drawdown = drawdown.min()
calmar_ratio = total_return / abs(max_drawdown) if max_drawdown < 0 else float('inf')
return { 'total_return': total_return, 'sharpe_ratio': sharpe_ratio, 'sortino_ratio': sortino_ratio, 'max_drawdown': max_drawdown, 'calmar_ratio': calmar_ratio, 'num_trades': len(trades), 'final_capital': equity_curve[-1] }
def plot_results(self, results: Dict, save_path: Optional[str] = None): """Visualize backtest results.""" fig, axes = plt.subplots(3, 1, figsize=(14, 10))
# Equity curve axes[0].plot(results['timestamps'], results['equity_curve'][1:], 'b-', linewidth=1) axes[0].set_title('Equity Curve (Longformer Strategy)') axes[0].set_ylabel('Capital ($)') axes[0].grid(True, alpha=0.3)
# Positions axes[1].fill_between( results['timestamps'], results['positions'][1:], 0, where=np.array(results['positions'][1:]) > 0, color='green', alpha=0.5, label='Long' ) axes[1].fill_between( results['timestamps'], results['positions'][1:], 0, where=np.array(results['positions'][1:]) < 0, color='red', alpha=0.5, label='Short' ) axes[1].set_title('Position Size') axes[1].set_ylabel('Position') axes[1].legend() axes[1].grid(True, alpha=0.3)
# Drawdown equity = np.array(results['equity_curve'][1:]) peak = np.maximum.accumulate(equity) drawdown = (equity - peak) / peak * 100 axes[2].fill_between(results['timestamps'], drawdown, 0, color='red', alpha=0.5) axes[2].set_title('Drawdown') axes[2].set_ylabel('Drawdown (%)') axes[2].grid(True, alpha=0.3)
plt.tight_layout()
if save_path: plt.savefig(save_path, dpi=150, bbox_inches='tight')
plt.show()
# Print metrics print("\n" + "="*50) print("LONGFORMER BACKTEST RESULTS") print("="*50) for metric, value in results['metrics'].items(): if isinstance(value, float): print(f"{metric:20s}: {value:>12.4f}") else: print(f"{metric:20s}: {value:>12}") print("="*50)
def main(): """Run backtest example.""" from model import LongformerForTrading from data import prepare_timeseries_data, load_bybit_data
# Load model model = LongformerForTrading( input_type='timeseries', input_dim=12, d_model=256, n_heads=8, n_layers=6, window_size=256, max_seq_len=4096, output_type='regression', pred_horizon=24 ) model.load_state_dict(torch.load('checkpoints/longformer_best.pt'))
# Load test data X, y = prepare_timeseries_data( symbols=['BTCUSDT', 'ETHUSDT'], lookback=4096, horizon=24, source='bybit' )
price_data = load_bybit_data('BTCUSDT')
test_start = int(0.8 * len(X)) X_test = X[test_start:] prices = price_data['close'].values[4096+test_start:4096+test_start+len(X_test)] timestamps = pd.DatetimeIndex( price_data['timestamp'].values[4096+test_start:4096+test_start+len(X_test)] )
# Run backtest backtester = LongformerBacktester(model) results = backtester.run_backtest(X_test, prices, timestamps) backtester.plot_results(results, save_path='longformer_backtest.png')
if __name__ == '__main__':
    main()

Rust Implementation
See rust/ for complete Rust implementation.
```
rust/
├── Cargo.toml
├── README.md
├── src/
│   ├── lib.rs
│   ├── api/
│   │   ├── mod.rs
│   │   ├── bybit.rs
│   │   └── types.rs
│   ├── attention/
│   │   ├── mod.rs
│   │   ├── sliding_window.rs
│   │   └── global.rs
│   ├── model/
│   │   ├── mod.rs
│   │   ├── longformer.rs
│   │   └── encoder.rs
│   └── strategy/
│       ├── mod.rs
│       ├── signals.rs
│       └── backtest.rs
└── examples/
    ├── fetch_data.rs
    ├── train.rs
    └── backtest.rs
```

Quick Start (Rust)
```bash
cd rust

# Fetch data from Bybit
cargo run --example fetch_data -- --symbols BTCUSDT,ETHUSDT

# Train model
cargo run --example train -- --epochs 50 --batch-size 16

# Run backtest
cargo run --example backtest -- --start 2024-01-01 --end 2024-12-31
```

Python Implementation
See python/ for Python implementation.
```
python/
├── __init__.py
├── model.py          # Longformer implementation
├── data.py           # Data loading (Bybit, Yahoo)
├── train.py          # Training pipeline
├── strategy.py       # Backtesting
├── requirements.txt
└── examples/
    └── longformer_trading.ipynb
```

Quick Start (Python)
```bash
cd python
pip install -r requirements.txt

# Fetch and train
python train.py --symbols BTCUSDT,ETHUSDT --epochs 100

# Backtest
python strategy.py --model checkpoints/longformer_best.pt
```

Financial NLP Applications
Document-Level Sentiment Classification
The primary financial NLP task for Longformer is classifying the sentiment of entire documents — earnings call transcripts, analyst reports, or news articles — without truncation. Given a document tokenized into $\{x_1, x_2, \ldots, x_n\}$ with $n \leq 4096$, the Longformer encodes the sequence using its mixed attention pattern. The [CLS] token representation $\mathbf{h}_{\text{CLS}} \in \mathbb{R}^d$ is fed through a classification head:
$$\hat{y} = \text{softmax}(\mathbf{W}\mathbf{h}_{\text{CLS}} + \mathbf{b})$$
Classes typically include positive, negative, and neutral sentiment.
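If the HuggingFace implementation is used instead of the from-scratch model above, a minimal document-sentiment setup might look like the sketch below; the checkpoint still needs fine-tuning on labeled financial text before the three sentiment classes are meaningful:

```python
import torch
from transformers import LongformerTokenizer, LongformerForSequenceClassification

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=3  # negative / neutral / positive
)

text = "Revenue exceeded expectations despite persistent macro headwinds..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

# Global attention on the [CLS]-equivalent token at position 0.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    logits = model(**inputs, global_attention_mask=global_attention_mask).logits
probs = torch.softmax(logits, dim=-1)  # P(negative), P(neutral), P(positive)
```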
Named Entity Recognition in Financial Documents
Longformer excels at financial NER because entity context often spans long distances. For example, a company name mentioned on page one of a 10-K may be relevant to a risk factor discussed on page ten. The model uses sliding window attention for local entity detection and global attention on section headers to maintain document-level context.
Each token $x_i$ is classified into BIO tags:
$$\hat{y}_i = \text{softmax}(\mathbf{W}_{\text{ner}}\mathbf{h}_i + \mathbf{b}_{\text{ner}})$$
Entity types relevant to finance include: ORG (organizations), MONEY (monetary amounts), PERCENT (percentages), DATE (dates), PRODUCT (financial instruments), and EVENT (market events).
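A hedged sketch of token-level tagging with the pretrained checkpoint follows; the BIO label list is illustrative and the classification head is untrained until fine-tuned on annotated financial text:

```python
from transformers import LongformerTokenizerFast, LongformerForTokenClassification

labels = ["O", "B-ORG", "I-ORG", "B-MONEY", "I-MONEY", "B-PERCENT", "I-PERCENT",
          "B-DATE", "I-DATE", "B-PRODUCT", "I-PRODUCT", "B-EVENT", "I-EVENT"]

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerForTokenClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=len(labels)
)

text = "Acme Corp reported revenue of $1.2 billion, up 8% year over year."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits              # [1, seq_len, num_labels]
tag_ids = logits.argmax(dim=-1)[0].tolist()  # one BIO tag id per subword token
tags = [labels[i] for i in tag_ids]
```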
Risk Factor Detection
Regulatory filings contain risk disclosures that are material to investment decisions. Longformer can classify each paragraph or section as containing a risk factor or not:
$$P(\text{risk} | \text{paragraph}_i) = \sigma(\mathbf{w}^T \mathbf{h}_i + b)$$
where $\mathbf{h}_i$ is the Longformer representation of the paragraph.
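The paragraph-level risk classifier reduces to a single sigmoid head over the encoder output; a minimal sketch (pooling the paragraph into one vector, e.g. its [CLS] state, is an assumption):

```python
import torch
import torch.nn as nn

class RiskFactorHead(nn.Module):
    """Binary head over a paragraph representation h_i from a Longformer encoder."""

    def __init__(self, d_model: int = 768):
        super().__init__()
        self.w = nn.Linear(d_model, 1)

    def forward(self, h_paragraph: torch.Tensor) -> torch.Tensor:
        # h_paragraph: [batch, d_model], e.g. the [CLS] vector of each paragraph
        return torch.sigmoid(self.w(h_paragraph)).squeeze(-1)  # P(risk | paragraph)
```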
Pre-training and Fine-tuning Pipeline
Longformer is initialized from RoBERTa checkpoints and continues pre-training on long documents:
- Pre-training: Start from a pre-trained Longformer checkpoint (e.g., allenai/longformer-base-4096).
- Domain adaptation: Continue pre-training on a financial corpus (SEC filings, financial news, analyst reports) using masked language modeling.
- Fine-tuning: Train on the downstream task (classification, NER, sentiment) with task-specific heads.
Position embeddings are extended from 512 to 4,096 by replicating the existing embeddings.
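Done by hand, the replication step might look roughly like the sketch below (the released allenai checkpoints already ship with extended positions, so this is only needed when adapting a custom RoBERTa-style model):

```python
import torch

def extend_position_embeddings(pos_emb: torch.Tensor, new_len: int = 4096) -> torch.Tensor:
    """Tile existing position embeddings [old_len, d] up to new_len rows by repetition."""
    old_len, d = pos_emb.shape
    repeats = -(-new_len // old_len)           # ceiling division
    return pos_emb.repeat(repeats, 1)[:new_len].clone()

roberta_positions = torch.randn(512, 768)      # stand-in for RoBERTa's position table
longformer_positions = extend_position_embeddings(roberta_positions, new_len=4096)
print(longformer_positions.shape)              # torch.Size([4096, 768])
```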
Feature Engineering for Financial NLP
Token-Level Features:
- Financial numbers: Monetary values, percentages, ratios should be normalized
- Temporal expressions: Fiscal quarters, year-over-year references
- Legal/regulatory terms: “Material adverse effect”, “going concern”
- Sentiment indicators: “Exceeded expectations”, “headwinds”, “robust growth”
Document Structure Features:
- Section headers: Map to global attention tokens for cross-section information routing
- Table markers: Financial tables contain critical quantitative data
- Footnote references: Connect footnote content to main text across long distances
Aggregated Sentiment Features for Trading:
- Sentiment score: Continuous value in $[-1, 1]$ from the classification head
- Sentiment momentum: Change in sentiment across consecutive documents $\Delta S_t = S_t - S_{t-1}$
- Cross-document consensus: Average sentiment across multiple sources
- Sentiment surprise: Deviation from expected sentiment
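As a sketch, the four features above can be derived from a rolling list of per-document scores; the `expected` baseline for the surprise term is a placeholder for whatever consensus estimate is available:

```python
import numpy as np

def sentiment_trading_features(scores, expected: float = 0.0) -> dict:
    """Aggregate per-document sentiment scores in [-1, 1] into trading features."""
    s = np.asarray(scores, dtype=float)
    return {
        "sentiment": s[-1],                                # latest score
        "momentum": s[-1] - s[-2] if len(s) > 1 else 0.0,  # ΔS_t = S_t - S_{t-1}
        "consensus": float(s.mean()),                      # cross-source average
        "surprise": s[-1] - expected,                      # deviation from expectation
    }

print(sentiment_trading_features([0.10, 0.35, 0.60], expected=0.20))
```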
Specific NLP Application Areas
Earnings Call Analysis: Transcripts are 5,000-15,000 words. Longformer can perform full-transcript sentiment analysis, section-level tone comparison (CEO vs CFO remarks vs Q&A), and forward-looking statement detection.
Regulatory Filing Analysis: 10-K and 10-Q filings contain critical information: risk factor extraction, Management Discussion & Analysis interpretation, and material event detection.
Crypto Market News Analysis: For cryptocurrency markets: whitepaper technical evaluation, governance proposal impact assessment, and multi-source thread aggregation for long-context analysis.
Evaluation Metrics for NLP Tasks
- Classification: Accuracy, F1-score (macro and weighted), AUC-ROC
- NER: Entity-level F1, precision, recall
- Sentiment: Cohen’s Kappa, directional accuracy for trading signals
- Trading performance: Sharpe ratio, Sortino ratio, maximum drawdown
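The classification-style metrics listed above map directly onto scikit-learn calls; a minimal example with dummy three-class sentiment predictions:

```python
from sklearn.metrics import (accuracy_score, f1_score, cohen_kappa_score,
                             roc_auc_score)

# Dummy labels: 0 = negative, 1 = neutral, 2 = positive
y_true = [0, 2, 1, 2, 0, 1]
y_pred = [0, 2, 2, 2, 0, 1]
y_prob = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.2, 0.3, 0.5],
          [0.1, 0.1, 0.8], [0.6, 0.3, 0.1], [0.2, 0.6, 0.2]]

print("accuracy     :", accuracy_score(y_true, y_pred))
print("macro F1     :", f1_score(y_true, y_pred, average="macro"))
print("weighted F1  :", f1_score(y_true, y_pred, average="weighted"))
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))
print("AUC-ROC (ovr):", roc_auc_score(y_true, y_prob, multi_class="ovr"))
```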
Best Practices
When to Use Longformer
Ideal use cases:
- Long documents: SEC filings, research reports, earnings transcripts
- Extended time series: 4K-16K timesteps
- Tasks requiring both local and global context
- Document QA with long source texts
Consider alternatives when:
- Short sequences (<512): Standard transformer is fine
- Pure local patterns: Sliding window alone may suffice
- Very sparse global dependencies: Use sparse attention
Hyperparameter Guidelines
| Parameter | Recommended | Notes |
|---|---|---|
| window_size | 256-512 | Larger for more local context |
| max_seq_len | 4096-16384 | Based on document/series length |
| n_layers | 6-12 | More for complex patterns |
| global_attention | Task-specific | [CLS] for classification, recent for time series |
Window Size Selection
Rule of thumb for window size:
Document length → window size:
- 4,096 tokens → 256-512
- 8,192 tokens → 512
- 16,384 tokens → 512-1024

For time series (minute data):
- 1 day (1,440 min) → 128-256
- 1 week (10,080 min) → 256-512
- 1 month (43,200 min) → 512-1024
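A tiny helper encoding these heuristics as a starting point (the breakpoints are taken from the lists above, not tuned values):

```python
def suggest_window_size(seq_len: int) -> int:
    """Heuristic window size for a given sequence length (tokens or timesteps)."""
    if seq_len <= 2048:
        return 128
    if seq_len <= 4096:
        return 256
    if seq_len <= 12288:
        return 512
    return 1024

for n in (1440, 4096, 10080, 16384, 43200):
    print(f"seq_len={n:6d} -> window={suggest_window_size(n)}")
```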
Resources

Papers
- Longformer: The Long-Document Transformer — Original paper (2020)
- Big Bird: Transformers for Longer Sequences — Related architecture
- Efficient Transformers: A Survey — Comprehensive overview
Implementations
- HuggingFace Transformers — Official implementation
- AllenAI Longformer — Original repository
Additional NLP References
- Araci, D. (2019). FinBERT: Financial Sentiment Analysis with Pre-trained Language Models. arXiv:1908.10063.
- Huang, A. H., Wang, H., & Yang, Y. (2023). FinBERT: A Large Language Model for Extracting Information from Financial Text. Contemporary Accounting Research, 40(2), 806-841.
- Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66(1), 35-65.
- Liu, Y., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
Related Chapters
- Chapter 51: Linformer Long Sequences — Linear attention
- Chapter 52: Performer Efficient Attention — Random features
- Chapter 53: BigBird Sparse Attention — Similar sparse approach
- Chapter 56: Nystromformer Trading — Nyström approximation
Difficulty Level
Intermediate
Prerequisites:
- Understanding of transformer self-attention
- Basic knowledge of NLP and tokenization
- Familiarity with time series forecasting
- PyTorch or Rust ML library experience