# Chapter 48: Positional Encoding for Time Series
This chapter explores Positional Encoding techniques specifically designed for time series and financial data. Unlike standard NLP transformers, time series require specialized encodings that capture temporal patterns, periodicity, and financial market dynamics.
## Contents
- Introduction to Positional Encoding
- Sinusoidal Positional Encoding
- Learned Positional Encoding
- Relative Positional Encoding
- Rotary Position Embedding (RoPE)
- Temporal Encodings for Finance
- Practical Examples
- Rust Implementation
- Python Implementation
- Best Practices
- Resources
## Introduction to Positional Encoding
Transformers process sequences without inherent order awareness. Unlike RNNs that process tokens sequentially, self-attention treats all positions equally. Positional encoding injects position information into the model.
### Why Position Matters
For time series, position carries critical information:
```
Without position: [100, 105, 103, 108, 102]  = price values (unordered set)
With position:    t=1: 100 → t=2: 105 → t=3: 103 → t=4: 108 → t=5: 102
```

The sequence tells a story:

- Price went UP from 100 to 105 (+5%)
- Then DOWN to 103 (-2%)
- Then UP to 108 (+5%)
- Then DOWN to 102 (-6%)

Position determines meaning:

- `[100, 105]` = bullish trend (price rising)
- `[105, 100]` = bearish trend (price falling)
### Time Series Challenges
Financial time series have unique characteristics:
| Challenge | Description | Solution |
|---|---|---|
| Variable lengths | Different prediction horizons | Relative encodings |
| Multiple time scales | Minutes, hours, days, weeks | Multi-scale encoding |
| Periodicity | Daily/weekly/monthly patterns | Sinusoidal encoding |
| Non-stationarity | Market regime changes | Learned + contextual encoding |
| Missing data | Market holidays, gaps | Masked position encoding |
### Types of Positional Encoding
```
POSITIONAL ENCODING TYPES

1. SINUSOIDAL (Fixed)
   ├── No learnable parameters
   ├── Extrapolates to unseen lengths
   └── PE(pos, 2i) = sin(pos / 10000^(2i/d))

2. LEARNED (Trainable)
   ├── Embedding table for each position
   ├── Adapts to data patterns
   └── Limited to training sequence length

3. RELATIVE (Shaw, T5)
   ├── Encodes distance between tokens
   ├── Good for varying sequence lengths
   └── att(i,j) depends on (i-j)

4. ROTARY (RoPE)
   ├── Rotates query/key vectors
   ├── Relative position via rotation
   └── Used in LLaMA, GPT-NeoX

5. TEMPORAL (Time Series Specific)
   ├── Calendar features (day, month, year)
   ├── Trading session indicators
   └── Multi-frequency components
```

## Sinusoidal Positional Encoding
The original Transformer uses sinusoidal functions to encode position:
### Mathematical Foundation
```
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
```
Where:

- `pos` = position in the sequence (0, 1, 2, ...)
- `i` = dimension index (0, 1, ..., d_model/2 - 1)
- `d_model` = embedding dimension

**Why sine/cosine?**

- Bounded values: always within [-1, 1]
- Uniqueness: each position has a unique encoding
- Relative position: `PE(pos+k)` can be represented as a linear function of `PE(pos)` (checked numerically below)
- Extrapolation: works for sequences longer than those seen in training
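These properties are easy to verify numerically. The following standalone sketch (assuming only `torch`) builds the encoding matrix and checks boundedness plus the relative-position property: each sin/cos pair of `PE(pos)` maps to the corresponding pair of `PE(pos+k)` by a fixed 2x2 rotation.

```python
import math
import torch

d_model, max_len, k = 8, 64, 5
pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(pos * div)
pe[:, 1::2] = torch.cos(pos * div)

# Bounded: every entry lies in [-1, 1]
assert pe.abs().max() <= 1.0

# Relative position: PE(pos + k) equals a fixed linear map of PE(pos).
# For each frequency w, the 2x2 block is a rotation by w*k.
shifted = torch.zeros_like(pe[:-k])
for j, w in enumerate(div):
    s, c = math.sin(w * k), math.cos(w * k)
    sin_col, cos_col = pe[:-k, 2 * j], pe[:-k, 2 * j + 1]
    shifted[:, 2 * j] = sin_col * c + cos_col * s      # sin(w(t+k))
    shifted[:, 2 * j + 1] = cos_col * c - sin_col * s  # cos(w(t+k))
assert torch.allclose(shifted, pe[k:], atol=1e-5)
print("PE(pos+k) is a fixed linear map of PE(pos): verified")
```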
### Sinusoidal Implementation
```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F  # used by the attention modules below


class SinusoidalPositionalEncoding(nn.Module):
    """
    Standard sinusoidal positional encoding from 'Attention Is All You Need'.

    For time series, we extend this with:
    - Optional scaling for different time scales
    - Temperature parameter for frequency control
    """

    def __init__(
        self,
        d_model: int,
        max_len: int = 5000,
        dropout: float = 0.1,
        temperature: float = 10000.0
    ):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        self.d_model = d_model

        # Create position encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)

        # Frequency terms
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float()
            * (-math.log(temperature) / d_model)
        )

        # Interleave sin and cos
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # Register as buffer (not a parameter)
        pe = pe.unsqueeze(0)  # [1, max_len, d_model]
        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Input tensor [batch, seq_len, d_model]
        Returns:
            Position-encoded tensor [batch, seq_len, d_model]
        """
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)
```
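A quick smoke test of the module above (a sketch; the shapes are illustrative):

```python
# Hypothetical smoke test: position is added without changing tensor shape.
enc = SinusoidalPositionalEncoding(d_model=64, max_len=512)
x = torch.randn(8, 168, 64)        # batch of 8, one week of hourly bars
out = enc(x)
assert out.shape == (8, 168, 64)
```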
### Time Series Adaptations

For financial time series, we can adapt sinusoidal encoding:
```python
class TimeSeriesSinusoidalEncoding(nn.Module):
    """
    Sinusoidal encoding adapted for time series with multiple frequencies.

    Captures:
    - Intraday patterns (hourly cycles)
    - Daily patterns (market open/close)
    - Weekly patterns (Monday effect)
    - Monthly patterns (month-end rebalancing)
    """

    def __init__(
        self,
        d_model: int,
        frequencies: list = [24, 24*7, 24*30, 24*365],  # Periods for hourly data
        dropout: float = 0.1
    ):
        super().__init__()
        self.d_model = d_model
        self.frequencies = frequencies
        self.dropout = nn.Dropout(p=dropout)

        # Allocate dimensions to each frequency (one sin/cos pair per harmonic)
        dims_per_freq = d_model // (len(frequencies) * 2)
        self.dims_per_freq = dims_per_freq

    def forward(self, x: torch.Tensor, timestamps: torch.Tensor = None) -> torch.Tensor:
        """
        Args:
            x: Input tensor [batch, seq_len, d_model]
            timestamps: Optional absolute timestamps [batch, seq_len]
        Returns:
            Position-encoded tensor [batch, seq_len, d_model]
        """
        batch, seq_len, d_model = x.shape
        device = x.device

        if timestamps is None:
            # Sequential positions, shared across the batch
            positions = torch.arange(seq_len, device=device).float()  # [seq_len]
        else:
            positions = timestamps.float()  # [batch, seq_len]

        # pe matches positions: [seq_len, d_model] or [batch, seq_len, d_model]
        pe = torch.zeros(*positions.shape, d_model, device=device)

        dim_idx = 0
        for freq in self.frequencies:
            # Normalize position to frequency period
            pos_normalized = positions / freq

            for i in range(self.dims_per_freq):
                # Multiple harmonics per frequency
                harmonic = 2 ** i
                pe[..., dim_idx] = torch.sin(2 * math.pi * harmonic * pos_normalized)
                pe[..., dim_idx + 1] = torch.cos(2 * math.pi * harmonic * pos_normalized)
                dim_idx += 2

        # Fill remaining dimensions with standard sinusoidal encoding
        if dim_idx < d_model:
            position = torch.arange(seq_len, device=device).unsqueeze(1)
            remaining_dims = d_model - dim_idx
            div_term = torch.exp(
                torch.arange(0, remaining_dims, 2, device=device).float()
                * (-math.log(10000.0) / remaining_dims)
            )
            pe[..., dim_idx::2] = torch.sin(position * div_term)
            pe[..., dim_idx+1::2] = torch.cos(position * div_term)

        x = x + (pe if pe.dim() == 3 else pe.unsqueeze(0))
        return self.dropout(x)
```
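Usage sketch for the multi-frequency variant, feeding hour indices as timestamps (values here are illustrative assumptions):

```python
# Hypothetical usage: a week of hourly bars with explicit hour-since-start timestamps.
enc = TimeSeriesSinusoidalEncoding(d_model=64)
x = torch.randn(4, 168, 64)              # [batch, seq_len, d_model]
hours = torch.arange(168).expand(4, -1)  # absolute hour index per step
out = enc(x, timestamps=hours)
assert out.shape == x.shape
```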
## Learned Positional Encoding

Instead of fixed functions, learn position embeddings from data:
### Trainable Embeddings
```python
class LearnedPositionalEncoding(nn.Module):
    """
    Learned positional encoding using an embedding table.

    Advantages:
    - Adapts to dataset-specific patterns
    - Can learn asymmetric dependencies
    - Works well with fixed-length sequences

    Disadvantages:
    - Cannot extrapolate beyond training length
    - More parameters to optimize
    """

    def __init__(
        self,
        d_model: int,
        max_len: int = 512,
        dropout: float = 0.1,
        init_std: float = 0.02
    ):
        super().__init__()
        self.embedding = nn.Embedding(max_len, d_model)
        self.dropout = nn.Dropout(p=dropout)

        # Initialize with small values
        nn.init.normal_(self.embedding.weight, std=init_std)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Input tensor [batch, seq_len, d_model]
        Returns:
            Position-encoded tensor [batch, seq_len, d_model]
        """
        batch, seq_len, _ = x.shape
        positions = torch.arange(seq_len, device=x.device)
        pos_emb = self.embedding(positions)  # [seq_len, d_model]

        x = x + pos_emb.unsqueeze(0)
        return self.dropout(x)
```
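The main limitation is visible directly: positions beyond `max_len` have no embedding row (a sketch, assuming the class above):

```python
enc = LearnedPositionalEncoding(d_model=32, max_len=128)
out = enc(torch.randn(2, 128, 32))   # OK: all positions are in the table
try:
    enc(torch.randn(2, 256, 32))     # positions 128..255 do not exist
except IndexError as err:
    print("cannot extrapolate past max_len:", err)
```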
### Advantages for Financial Data

Learned encodings can capture:
- Recency bias: Recent prices matter more
- Asymmetric lookback: Different weights for different lags
- Non-linear decay: Custom attention patterns over time
```python
class FinancialLearnedEncoding(nn.Module):
    """
    Learned encoding with financial market priors.

    Incorporates:
    - Recency weighting (recent data more important)
    - Multi-scale time embeddings
    - Weekday/hour embeddings for market patterns
    """

    def __init__(
        self,
        d_model: int,
        max_len: int = 512,
        n_weekdays: int = 7,
        n_hours: int = 24,
        dropout: float = 0.1
    ):
        super().__init__()

        # Positional embedding
        self.pos_embedding = nn.Embedding(max_len, d_model // 2)

        # Temporal embeddings
        self.weekday_embedding = nn.Embedding(n_weekdays, d_model // 4)
        self.hour_embedding = nn.Embedding(n_hours, d_model // 4)

        # Recency decay (learnable)
        self.decay = nn.Parameter(torch.ones(1))

        self.dropout = nn.Dropout(p=dropout)
        self.d_model = d_model

    def forward(
        self,
        x: torch.Tensor,
        weekdays: torch.Tensor = None,
        hours: torch.Tensor = None
    ) -> torch.Tensor:
        """
        Args:
            x: Input [batch, seq_len, d_model]
            weekdays: Weekday indices [batch, seq_len] (0=Monday, 6=Sunday)
            hours: Hour indices [batch, seq_len] (0-23)
        """
        batch, seq_len, d_model = x.shape
        device = x.device

        # Position embedding
        positions = torch.arange(seq_len, device=device)
        pos_emb = self.pos_embedding(positions)  # [seq_len, d_model//2]

        # Apply recency decay: weight 1 at the newest step, decaying backwards
        decay_weights = torch.exp(-self.decay * (seq_len - 1 - positions.float()) / seq_len)
        pos_emb = pos_emb * decay_weights.unsqueeze(-1)

        # Temporal embeddings (if provided)
        if weekdays is not None and hours is not None:
            weekday_emb = self.weekday_embedding(weekdays)  # [batch, seq_len, d_model//4]
            hour_emb = self.hour_embedding(hours)           # [batch, seq_len, d_model//4]

            # Combine all embeddings
            combined = torch.cat([
                pos_emb.unsqueeze(0).expand(batch, -1, -1),
                weekday_emb,
                hour_emb
            ], dim=-1)
        else:
            # Pad with zeros if temporal info not available
            combined = torch.cat([
                pos_emb.unsqueeze(0).expand(batch, -1, -1),
                torch.zeros(batch, seq_len, d_model // 2, device=device)
            ], dim=-1)

        x = x + combined
        return self.dropout(x)
```
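A usage sketch with synthetic weekday/hour features (shapes and values are illustrative):

```python
enc = FinancialLearnedEncoding(d_model=64, max_len=256)
x = torch.randn(2, 120, 64)
weekdays = torch.randint(0, 7, (2, 120))   # 0 = Monday ... 6 = Sunday
hours = torch.randint(0, 24, (2, 120))     # hour of day
out = enc(x, weekdays=weekdays, hours=hours)
assert out.shape == x.shape
```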
## Relative Positional Encoding

Encodes the distance between positions rather than absolute positions:
### Shaw’s Relative Attention
```python
class RelativePositionalEncoding(nn.Module):
    """
    Relative positional encoding from Shaw et al.

    Instead of adding position to the input, modify attention scores:
        attention(Q, K) = softmax((Q @ K^T + Q @ R^T) / sqrt(d))

    where R is the relative position embedding.
    """

    def __init__(
        self,
        d_model: int,
        n_heads: int,
        max_relative_position: int = 128,
        dropout: float = 0.1
    ):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.max_relative_position = max_relative_position

        # Relative position embeddings: [-max_pos, ..., 0, ..., max_pos]
        self.relative_embedding = nn.Embedding(
            2 * max_relative_position + 1,
            self.head_dim
        )

        self.dropout = nn.Dropout(p=dropout)

    def forward(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor
    ) -> torch.Tensor:
        """
        Args:
            q: Query tensor [batch, n_heads, seq_len, head_dim]
            k: Key tensor [batch, n_heads, seq_len, head_dim]
            v: Value tensor [batch, n_heads, seq_len, head_dim]
        Returns:
            Attention output [batch, n_heads, seq_len, head_dim]
        """
        batch, n_heads, seq_len, head_dim = q.shape
        device = q.device

        # Standard QK attention scores
        qk_scores = torch.matmul(q, k.transpose(-2, -1))  # [batch, heads, seq, seq]

        # Compute relative position indices
        positions = torch.arange(seq_len, device=device)
        relative_positions = positions.unsqueeze(0) - positions.unsqueeze(1)  # [seq, seq]
        relative_positions = relative_positions.clamp(
            -self.max_relative_position, self.max_relative_position
        )
        relative_positions = relative_positions + self.max_relative_position  # Shift to non-negative

        # Get relative embeddings
        rel_emb = self.relative_embedding(relative_positions)  # [seq, seq, head_dim]

        # Compute relative attention scores: Q @ R^T
        q_expanded = q.unsqueeze(3)                           # [batch, heads, seq, 1, head_dim]
        rel_emb_expanded = rel_emb.unsqueeze(0).unsqueeze(0)  # [1, 1, seq, seq, head_dim]

        relative_scores = torch.matmul(
            q_expanded, rel_emb_expanded.transpose(-2, -1)
        ).squeeze(-2)  # [batch, heads, seq, seq]

        # Combine scores
        scores = (qk_scores + relative_scores) / math.sqrt(head_dim)
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Apply attention to values
        output = torch.matmul(attn_weights, v)

        return output
```

### XLNet Style Encoding
XLNet uses a more sophisticated relative encoding:
```python
class XLNetRelativeEncoding(nn.Module):
    """
    XLNet-style relative positional encoding.

    Uses learnable parameters for:
    - Content-based attention (bias u)
    - Position-based attention (bias v)
    - Segment-based attention (optional)
    """

    def __init__(self, d_model: int, n_heads: int, max_len: int = 512):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads

        # Relative position embedding
        self.r_embedding = nn.Embedding(max_len, d_model)

        # Content and position biases
        self.u = nn.Parameter(torch.zeros(n_heads, self.head_dim))
        self.v = nn.Parameter(torch.zeros(n_heads, self.head_dim))

        # Initialize
        nn.init.normal_(self.u, std=0.02)
        nn.init.normal_(self.v, std=0.02)

    def forward(self, q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        """
        Compute relative attention bias.

        Args:
            q: Query [batch, n_heads, seq_len, head_dim]
            k: Key [batch, n_heads, seq_len, head_dim]
        Returns:
            Attention bias to add to scores
        """
        batch, n_heads, seq_len, head_dim = q.shape
        device = q.device

        # Get relative positions (symmetric: distance only)
        positions = torch.arange(seq_len, device=device)
        rel_pos = positions.unsqueeze(0) - positions.unsqueeze(1)  # [seq, seq]
        rel_pos = rel_pos.abs().clamp(max=seq_len - 1)

        # Get relative embeddings
        r = self.r_embedding(rel_pos)  # [seq, seq, d_model]
        r = r.view(seq_len, seq_len, n_heads, head_dim)
        r = r.permute(2, 0, 1, 3)  # [n_heads, seq(i), seq(j), head_dim]

        # Content attention: (q + u) @ k^T
        q_u = q + self.u.unsqueeze(0).unsqueeze(2)
        content_attn = torch.matmul(q_u, k.transpose(-2, -1))

        # Position attention: sum_d (q + v)[b,h,i,d] * r[h,i,j,d]
        q_v = q + self.v.unsqueeze(0).unsqueeze(2)
        position_attn = torch.einsum('bhid,hijd->bhij', q_v, r)

        return content_attn + position_attn
```

## Rotary Position Embedding (RoPE)
RoPE encodes position by rotating query and key vectors:
### RoPE Mathematical Formulation
For a query `q` and key `k` at positions `m` and `n`:

```
RoPE(q, m) = R_m @ q
RoPE(k, n) = R_n @ k
```

where `R_m` is the block-diagonal rotation matrix

```
R_m = [ cos(mθ₁)  -sin(mθ₁)     0          0        ...
        sin(mθ₁)   cos(mθ₁)     0          0        ...
           0          0      cos(mθ₂)  -sin(mθ₂)    ...
           0          0      sin(mθ₂)   cos(mθ₂)    ...
          ...        ...       ...        ...       ... ]
```

Result:

```
(R_m @ q)^T @ (R_n @ k) = q^T @ R_{n-m} @ k
```
The attention score therefore depends only on the relative position (n - m).
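This identity can be checked numerically with plain 2-D rotations (a standalone sketch, independent of the module below):

```python
import math
import torch

def rot(theta: float) -> torch.Tensor:
    """2x2 rotation matrix R(theta)."""
    c, s = math.cos(theta), math.sin(theta)
    return torch.tensor([[c, -s], [s, c]])

q = torch.randn(2)
k = torch.randn(2)
theta, m, n = 0.37, 5, 12

# (R_m q) . (R_n k) should equal q . (R_{n-m} k)
lhs = (rot(m * theta) @ q) @ (rot(n * theta) @ k)
rhs = q @ (rot((n - m) * theta) @ k)
assert torch.allclose(lhs, rhs, atol=1e-6)
print("attention score depends only on n - m")
```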
### RoPE for Time Series

```python
class RotaryPositionalEncoding(nn.Module):
    """
    Rotary Position Embedding (RoPE) for time series.

    Key insight: rotate query/key vectors by a position-dependent angle.
    Result: attention scores naturally encode relative position.

    Benefits for time series:
    - Handles long sequences efficiently
    - Natural decay for distant positions
    - Works with variable-length sequences
    """

    def __init__(
        self,
        d_model: int,
        n_heads: int,
        max_len: int = 8192,
        base: float = 10000.0
    ):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.max_len = max_len
        self.base = base

        # Precompute frequencies
        inv_freq = 1.0 / (
            base ** (torch.arange(0, self.head_dim, 2).float() / self.head_dim)
        )
        self.register_buffer('inv_freq', inv_freq)

        # Precompute sin/cos cache
        self._set_cos_sin_cache(max_len)

    def _set_cos_sin_cache(self, seq_len: int):
        """Precompute cos and sin values"""
        t = torch.arange(seq_len, dtype=torch.float)
        freqs = torch.outer(t, self.inv_freq)

        # Duplicate frequencies so cos/sin span the full head dimension
        emb = torch.cat([freqs, freqs], dim=-1)
        self.register_buffer('cos_cached', emb.cos())
        self.register_buffer('sin_cached', emb.sin())

    def _rotate_half(self, x: torch.Tensor) -> torch.Tensor:
        """Rotate half the hidden dims"""
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([-x2, x1], dim=-1)

    def forward(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        positions: torch.Tensor = None
    ) -> tuple:
        """
        Apply rotary embeddings to queries and keys.

        Args:
            q: Query tensor [batch, n_heads, seq_len, head_dim]
            k: Key tensor [batch, n_heads, seq_len, head_dim]
            positions: Optional position indices [batch, seq_len]

        Returns:
            Tuple of rotated (query, key) tensors
        """
        batch, n_heads, seq_len, head_dim = q.shape

        if positions is None:
            cos = self.cos_cached[:seq_len].unsqueeze(0).unsqueeze(0)
            sin = self.sin_cached[:seq_len].unsqueeze(0).unsqueeze(0)
        else:
            cos = self.cos_cached[positions].unsqueeze(1)
            sin = self.sin_cached[positions].unsqueeze(1)

        # Apply rotation
        q_rot = (q * cos) + (self._rotate_half(q) * sin)
        k_rot = (k * cos) + (self._rotate_half(k) * sin)

        return q_rot, k_rot
```
```python
class RoPETimeSeriesAttention(nn.Module):
    """
    Self-attention with RoPE for time series data.
    """

    def __init__(
        self,
        d_model: int,
        n_heads: int,
        dropout: float = 0.1,
        max_len: int = 8192
    ):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads

        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

        self.rope = RotaryPositionalEncoding(d_model, n_heads, max_len)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        """
        Args:
            x: Input [batch, seq_len, d_model]
            mask: Optional attention mask
        Returns:
            Output [batch, seq_len, d_model]
        """
        batch, seq_len, d_model = x.shape

        # Project to Q, K, V
        q = self.q_proj(x).view(batch, seq_len, self.n_heads, self.head_dim)
        k = self.k_proj(x).view(batch, seq_len, self.n_heads, self.head_dim)
        v = self.v_proj(x).view(batch, seq_len, self.n_heads, self.head_dim)

        # Transpose for attention
        q = q.transpose(1, 2)  # [batch, n_heads, seq_len, head_dim]
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)

        # Apply RoPE
        q, k = self.rope(q, k)

        # Compute attention
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Apply attention to values
        out = torch.matmul(attn_weights, v)

        # Reshape and project output
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        out = self.out_proj(out)

        return out
```
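Usage sketch with a causal mask, the usual setup for forecasting (the mask construction is an assumption, not part of the module):

```python
attn = RoPETimeSeriesAttention(d_model=64, n_heads=4)
x = torch.randn(8, 256, 64)

# Causal mask: position i may attend only to positions j <= i
mask = torch.tril(torch.ones(256, 256)).bool()
out = attn(x, mask=mask)
assert out.shape == x.shape
```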
## Temporal Encodings for Finance

Specialized encodings that capture financial market patterns:
### Calendar Features
```python
class CalendarEncoding(nn.Module):
    """
    Encode calendar features important for financial markets.

    Features:
    - Day of week (Monday effect)
    - Month (January effect, month-end rebalancing)
    - Quarter (earnings seasons)
    - Year (for regime awareness)
    - Trading session (market hours)
    """

    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()

        # Dimension allocation
        self.d_dayofweek = d_model // 8
        self.d_month = d_model // 8
        self.d_quarter = d_model // 16
        self.d_hour = d_model // 8
        self.d_session = d_model // 16

        # Embeddings
        self.dayofweek_emb = nn.Embedding(7, self.d_dayofweek)  # Mon-Sun
        self.month_emb = nn.Embedding(12, self.d_month)         # Jan-Dec
        self.quarter_emb = nn.Embedding(4, self.d_quarter)      # Q1-Q4
        self.hour_emb = nn.Embedding(24, self.d_hour)           # 0-23
        self.session_emb = nn.Embedding(4, self.d_session)      # Pre/Regular/After/Closed

        # Project to model dimension
        total_dim = (self.d_dayofweek + self.d_month + self.d_quarter
                     + self.d_hour + self.d_session)
        self.proj = nn.Linear(total_dim, d_model)

        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        dayofweek: torch.Tensor,
        month: torch.Tensor,
        quarter: torch.Tensor,
        hour: torch.Tensor,
        session: torch.Tensor
    ) -> torch.Tensor:
        """
        Args:
            All inputs: [batch, seq_len] integer indices
        Returns:
            Calendar encoding [batch, seq_len, d_model]
        """
        dow = self.dayofweek_emb(dayofweek)
        mon = self.month_emb(month)
        qtr = self.quarter_emb(quarter)
        hr = self.hour_emb(hour)
        ses = self.session_emb(session)

        combined = torch.cat([dow, mon, qtr, hr, ses], dim=-1)
        out = self.proj(combined)

        return self.dropout(out)
```
### Market Session Encoding

```python
class MarketSessionEncoding(nn.Module):
    """
    Encode market session information.

    For crypto (24/7): time-of-day patterns.
    For stocks: pre-market, regular, after-hours, closed.
    """

    def __init__(
        self,
        d_model: int,
        market_type: str = 'crypto',  # 'crypto' or 'stock'
        dropout: float = 0.1
    ):
        super().__init__()
        self.market_type = market_type

        if market_type == 'crypto':
            # 24-hour cycle with Asian/European/American sessions
            self.session_emb = nn.Embedding(3, d_model // 3)  # Asia/Europe/US
            self.hour_emb = nn.Embedding(24, d_model // 3)
            self.proj = nn.Linear(2 * (d_model // 3), d_model)
        else:
            # Stock market sessions
            self.session_emb = nn.Embedding(4, d_model // 2)        # Pre/Regular/After/Closed
            self.time_in_session = nn.Embedding(100, d_model // 2)  # Minutes into session
            self.proj = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)

    def _get_crypto_session(self, hour: torch.Tensor) -> torch.Tensor:
        """Map hour to crypto trading session"""
        # Asia: 0-8 UTC, Europe: 8-16 UTC, US: 16-24 UTC
        session = torch.zeros_like(hour)
        session[(hour >= 0) & (hour < 8)] = 0    # Asia
        session[(hour >= 8) & (hour < 16)] = 1   # Europe
        session[(hour >= 16) & (hour < 24)] = 2  # US
        return session

    def forward(
        self,
        hour: torch.Tensor,
        session: torch.Tensor = None,
        time_in_session: torch.Tensor = None
    ) -> torch.Tensor:
        """
        Args:
            hour: Hour of day [batch, seq_len]
            session: Market session (for stocks) [batch, seq_len]
            time_in_session: Minutes into session [batch, seq_len]
        """
        if self.market_type == 'crypto':
            crypto_session = self._get_crypto_session(hour)
            ses_emb = self.session_emb(crypto_session)
            hr_emb = self.hour_emb(hour)
            combined = torch.cat([ses_emb, hr_emb], dim=-1)
        else:
            ses_emb = self.session_emb(session)
            time_emb = self.time_in_session(time_in_session.clamp(0, 99))
            combined = torch.cat([ses_emb, time_emb], dim=-1)

        out = self.proj(combined)
        return self.dropout(out)
```

### Multi-Scale Temporal Encoding
```python
class MultiScaleTemporalEncoding(nn.Module):
    """
    Encode time at multiple scales for a comprehensive temporal representation.

    Scales:
    - Micro: within trading session (minutes)
    - Intraday: hours within day
    - Daily: day patterns
    - Weekly: week patterns
    - Monthly: month patterns
    """

    def __init__(
        self,
        d_model: int,
        time_scales: list = ['minute', 'hour', 'day', 'week', 'month'],
        dropout: float = 0.1
    ):
        super().__init__()
        self.time_scales = time_scales

        # Dimension per scale
        d_per_scale = d_model // len(time_scales)
        self.d_per_scale = d_per_scale

        # Embeddings for each scale
        self.scale_embeddings = nn.ModuleDict()
        scale_sizes = {
            'minute': 60, 'hour': 24, 'day': 31, 'week': 7, 'month': 12
        }

        for scale in time_scales:
            self.scale_embeddings[scale] = nn.Embedding(
                scale_sizes[scale], d_per_scale
            )

        # Final projection
        self.proj = nn.Linear(d_per_scale * len(time_scales), d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, timestamps: dict) -> torch.Tensor:
        """
        Args:
            timestamps: Dict with keys for each scale,
                e.g. {'minute': [batch, seq], 'hour': [batch, seq], ...}
        Returns:
            Multi-scale temporal encoding [batch, seq_len, d_model]
        """
        embeddings = []

        for scale in self.time_scales:
            if scale in timestamps:
                emb = self.scale_embeddings[scale](timestamps[scale])
                embeddings.append(emb)
            else:
                # Zero embedding if scale not provided
                ref = next(iter(timestamps.values()))
                batch, seq_len = ref.shape
                embeddings.append(
                    torch.zeros(batch, seq_len, self.d_per_scale, device=ref.device)
                )

        combined = torch.cat(embeddings, dim=-1)
        out = self.proj(combined)

        return self.dropout(out)
```
## Practical Examples

### 01: Comparing Encoding Methods
See `python/01_compare_encodings.py` for a comparison of all encoding methods.
```python
# Example: Compare encoding methods on crypto data
import torch

from positional_encoding import (
    SinusoidalPositionalEncoding,
    LearnedPositionalEncoding,
    RotaryPositionalEncoding,
    MultiScaleTemporalEncoding,
)

# Create sample data
batch_size = 32
seq_len = 168  # 7 days of hourly data
d_model = 64

x = torch.randn(batch_size, seq_len, d_model)

# Initialize encoders
sinusoidal = SinusoidalPositionalEncoding(d_model)
learned = LearnedPositionalEncoding(d_model, max_len=512)
rope = RotaryPositionalEncoding(d_model, n_heads=4)
multiscale = MultiScaleTemporalEncoding(d_model)

# Apply encodings
x_sin = sinusoidal(x)
x_learned = learned(x)

# RoPE operates on Q/K matrices rather than the input embedding
q = torch.randn(batch_size, 4, seq_len, d_model // 4)
k = torch.randn(batch_size, 4, seq_len, d_model // 4)
q_rope, k_rope = rope(q, k)

print(f"Sinusoidal output shape: {x_sin.shape}")
print(f"Learned output shape: {x_learned.shape}")
print(f"RoPE Q shape: {q_rope.shape}")
```

### 02: Time Series Transformer
See `python/model.py` for the complete transformer with positional encoding.
### 03: Crypto Price Prediction
```python
# Example: BTCUSDT prediction with RoPE
from data import BybitDataLoader
from model import TimeSeriesTransformer

# Load Bybit data
loader = BybitDataLoader()
data = loader.load_klines('BTCUSDT', interval='1h', limit=2000)

# Create model with RoPE encoding
config = {
    'd_model': 64,
    'n_heads': 4,
    'n_layers': 2,
    'encoding_type': 'rope',
    'dropout': 0.1
}

model = TimeSeriesTransformer(config)

# Train and predict
# ... (see examples/)
```

### 04: Stock Market Forecasting
```python
# Example: S&P 500 with calendar encoding
from model import StockTransformer

# Create model with calendar features
config = {
    'd_model': 128,
    'n_heads': 8,
    'use_calendar_encoding': True,
    'use_market_session': True
}

model = StockTransformer(config)

# The model will use both position and calendar information
# for improved predictions around market events
```

### 05: Backtesting Strategies
See `python/strategy.py` for the backtesting framework.
## Rust Implementation
See `rust_positional_encoding/` for the complete Rust implementation.
```
rust_positional_encoding/
├── Cargo.toml
├── README.md
├── src/
│   ├── lib.rs              # Main library exports
│   ├── api/                # Bybit API client
│   │   ├── mod.rs
│   │   ├── client.rs       # HTTP client for Bybit
│   │   └── types.rs        # API response types
│   ├── data/               # Data processing
│   │   ├── mod.rs
│   │   ├── loader.rs       # Data loading utilities
│   │   └── features.rs     # Feature engineering
│   ├── encoding/           # Positional encoding implementations
│   │   ├── mod.rs
│   │   ├── sinusoidal.rs   # Sinusoidal encoding
│   │   ├── learned.rs      # Learned encoding
│   │   ├── relative.rs     # Relative encoding
│   │   ├── rope.rs         # Rotary position embedding
│   │   └── temporal.rs     # Calendar/temporal encoding
│   ├── model/              # Transformer model
│   │   ├── mod.rs
│   │   ├── attention.rs    # Self-attention with encoding
│   │   └── transformer.rs  # Complete model
│   └── strategy/           # Trading strategy
│       ├── mod.rs
│       ├── signals.rs      # Signal generation
│       └── backtest.rs     # Backtesting engine
└── examples/
    ├── compare_encodings.rs
    ├── fetch_data.rs
    ├── train.rs
    └── backtest.rs
```

### Quick Start (Rust)
```bash
# Navigate to Rust project
cd rust_positional_encoding

# Fetch data from Bybit
cargo run --example fetch_data -- --symbol BTCUSDT --interval 1h

# Compare encoding methods
cargo run --example compare_encodings

# Train model
cargo run --example train -- --epochs 100 --encoding rope

# Run backtest
cargo run --example backtest -- --start 2024-01-01 --end 2024-12-31
```

## Python Implementation
See `python/` for the Python implementation.

```
python/
├── positional_encoding.py   # All encoding implementations
├── model.py                 # Transformer model
├── data.py                  # Bybit data loading
├── strategy.py              # Trading strategy
├── train.py                 # Training script
├── requirements.txt         # Dependencies
└── examples/
    ├── 01_compare_encodings.py
    ├── 02_crypto_prediction.py
    ├── 03_stock_prediction.py
    └── 04_backtesting.py
```

### Quick Start (Python)
```bash
# Install dependencies
pip install -r requirements.txt

# Compare encoding methods
python examples/01_compare_encodings.py

# Train crypto prediction model
python train.py --symbol BTCUSDT --encoding rope

# Run backtest
python examples/04_backtesting.py
```

## Best Practices
### Choosing an Encoding Method
| Use Case | Recommended Encoding | Reason |
|---|---|---|
| Fixed-length sequences | Sinusoidal or Learned | Simple, effective |
| Variable-length sequences | RoPE or Relative | Handles different lengths |
| Long sequences (>512) | RoPE | Better extrapolation |
| Calendar-dependent patterns | Calendar + Sinusoidal | Captures market effects |
| Crypto 24/7 markets | RoPE + Session | Continuous time awareness |
| Stock markets | Calendar + Market Session | Trading hour patterns |
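These rules of thumb can be condensed into a small selection helper (a hypothetical sketch built on the classes from this chapter; note that RoPE wraps query/key tensors inside attention while the additive encodings act on input embeddings, so callers must handle the two kinds differently):

```python
def make_encoding(d_model: int, n_heads: int, train_len: int,
                  variable_length: bool = False, market: str = 'crypto'):
    """Hypothetical helper mapping the table above to concrete modules."""
    if variable_length or train_len > 512:
        # RoPE: applied to Q/K inside attention, extrapolates well
        return RotaryPositionalEncoding(d_model, n_heads, max_len=2 * train_len)
    if market == 'stock':
        # Additive calendar features for session-bound markets
        return CalendarEncoding(d_model)
    # Default: simple additive sinusoidal encoding
    return SinusoidalPositionalEncoding(d_model, max_len=2 * train_len)
```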
### Hyperparameter Recommendations
| Parameter | Recommended | Notes |
|---|---|---|
| `d_model` | 64-256 | Larger for complex patterns |
| `dropout` | 0.1-0.2 | Higher for small datasets |
| `max_len` | 2x training length | Allow extrapolation |
| `temperature` (sinusoidal) | 10000 | Standard value |
| `base` (RoPE) | 10000 | Can tune for longer sequences |
### Common Pitfalls
- Fixed-length learned embeddings: Cannot extrapolate
- Ignoring calendar features: Missing market patterns
- Over-engineering: Simple sinusoidal often sufficient
- Not scaling positions: Normalize for long sequences (see the sketch below)
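For the last pitfall, a minimal sketch of one way to normalize positions before encoding very long sequences (the scaling scheme is an assumption, not a fixed recipe):

```python
import torch

# Raw integer positions for long histories can reach tens of thousands,
# pushing the low-frequency sinusoids far outside the range seen in training.
positions = torch.arange(20_000, dtype=torch.float)

# Option 1: map to [0, 1]
positions_unit = positions / (len(positions) - 1)

# Option 2: rescale so the span matches the training length
train_len = 512
positions_anchored = positions * (train_len / len(positions))
```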
## Resources
### Papers
- Attention Is All You Need — Original Transformer with sinusoidal encoding
- RoFormer: Enhanced Transformer with Rotary Position Embedding — RoPE paper
- Self-Attention with Relative Position Representations — Shaw’s relative encoding
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context — Segment-level recurrence
- Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting — Time series transformers
### Implementations
- PyTorch Transformers — Official PyTorch
- HuggingFace Transformers — Pre-trained models
- x-transformers — Various position encodings
### Related Chapters
- Chapter 26: Temporal Fusion Transformers — Multi-horizon forecasting
- Chapter 43: Stockformer Multivariate — Cross-asset attention
- Chapter 47: Cross-Attention Multi-Asset — Cross-asset modeling
- Chapter 51: Linformer Long Sequences — Efficient attention
## Difficulty Level
Intermediate to Advanced
Prerequisites:
- Transformer architecture fundamentals
- Self-attention mechanism
- Time series basics
- PyTorch/Rust ML libraries