Chapter 15: Uncovering Themes in Crypto Discourse with Topic Models
Chapter 15: Uncovering Themes in Crypto Discourse with Topic Models
Overview
Cryptocurrency markets are driven by narratives — collective stories that coordinate capital flows. “DeFi Summer” (2020) saw billions pour into yield farming protocols. The “NFT Hype” (2021) directed attention and money toward digital collectibles. The “L2 Scaling” narrative (2022-2023) elevated Arbitrum, Optimism, and zkSync. The “Bitcoin ETF” narrative (2024) drove BTC to all-time highs on institutional adoption expectations. Understanding which narratives are forming, peaking, and fading is arguably the most valuable edge in crypto trading — and topic models are the quantitative tool for extracting these narratives from text data at scale.
Topic models are unsupervised algorithms that discover latent thematic structure in document collections. Latent Semantic Indexing (LSI) uses singular value decomposition to find latent semantic dimensions. Latent Dirichlet Allocation (LDA) is a probabilistic generative model that represents each document as a mixture of topics, and each topic as a distribution over words. Non-Negative Matrix Factorization (NMF) decomposes the document-term matrix into non-negative topic and word matrices, often producing more interpretable results than LDA. These models transform unstructured text (Reddit posts, whitepaper sections, news articles) into structured topic distributions that can serve as features for return prediction.
This chapter covers the full pipeline from corpus construction to alpha generation. We build a crypto-specific corpus from Reddit discussions and project whitepapers, apply LSI, LDA, and NMF to discover narrative topics, track their evolution over time with dynamic topic models, visualize results with pyLDAvis, and construct trading signals based on narrative momentum. We demonstrate that topic distribution vectors, when used as features in a return prediction model, provide statistically significant alpha over price-only baselines — confirming that narratives are not just stories but tradable factors.
Table of Contents
- Introduction to Topic Modeling for Crypto
- Mathematical Foundations
- Comparison of Topic Modeling Methods
- Trading Applications
- Implementation in Python
- Implementation in Rust
- Practical Examples
- Backtesting Framework
- Performance Evaluation
- Future Directions
Section 1: Introduction to Topic Modeling for Crypto
Narratives as Tradable Factors
In traditional finance, factors are quantitative (value, momentum, quality). In crypto, narratives function as qualitative factors that drive capital allocation. A narrative can be defined as a coherent theme that attracts collective attention and capital. Topic models operationalize this concept by extracting latent themes from text data and quantifying each document’s (and by extension, each time period’s) exposure to each theme.
The narrative lifecycle in crypto typically follows a pattern:
- Emergence: Early discussions in developer circles (Discord, GitHub commits).
- Acceleration: Increasing mentions on Reddit and Twitter; rising search trends.
- Peak: Maximum social volume; mainstream media coverage; price often peaks here or shortly after.
- Decay: Declining mentions; attention shifts to the next narrative.
- Maturation or death: The narrative either becomes infrastructure (no longer novel) or dies.
Data Sources for Crypto Topic Modeling
- Reddit (r/cryptocurrency, r/bitcoin, r/ethereum, r/defi): Long-form discussions; ideal for LDA.
- Crypto project whitepapers: Dense technical documents; good for understanding project themes.
- News articles: CoinDesk, The Block, Decrypt — structured narrative content.
- Governance proposals: On-chain and forum-based governance discussions.
- Twitter threads: Short-form but high-volume narrative indicators.
Key Terminology
- Topic Modeling: Unsupervised method for discovering latent themes in document collections.
- LSI (Latent Semantic Indexing): Uses SVD to project the document-term matrix into a lower-dimensional semantic space.
- LSA (Latent Semantic Analysis): Synonym for LSI; the terms are often used interchangeably.
- SVD (Singular Value Decomposition): Matrix factorization A = UΣVᵀ used in LSI.
- NMF (Non-Negative Matrix Factorization): Factorizes A ≈ WH where W, H ≥ 0, producing additive, interpretable topics.
- pLSA (Probabilistic Latent Semantic Analysis): Probabilistic version of LSI; a precursor to LDA.
- LDA (Latent Dirichlet Allocation): Generative probabilistic model with Dirichlet priors on topic and word distributions.
- Dirichlet Distribution: A distribution over distributions; parameterized by concentration vector α.
- Topic Coherence: Measure of how semantically related the top words in a topic are.
- Perplexity: Information-theoretic measure of how well a topic model predicts held-out documents.
- pyLDAvis: Python library for interactive visualization of LDA topic models.
- gensim: Python library for topic modeling with efficient LDA and Word2Vec implementations.
- Generative Model: A model that describes the probabilistic process by which data was generated.
- Document Generation Process: In LDA, each document is generated by sampling topics, then sampling words from those topics.
- Multinomial Distribution: The distribution from which words are sampled given a topic.
- Narrative Trading: Trading based on the identification and tracking of market narratives.
- Dynamic Topic Models: Extensions of LDA that model topic evolution over time.
- Alpha-by-Topic Strategy: Using topic exposure vectors as features to predict asset returns and generate alpha.
Section 2: Mathematical Foundations
Latent Semantic Indexing (LSI)
Given a document-term matrix A of shape (M × V) where M is the number of documents and V is the vocabulary size:
A = UΣVᵀTruncate to k dimensions: A_k = U_k Σ_k V_kᵀ. The rows of U_k Σ_k give document representations in the k-dimensional semantic space. The columns of Σ_k V_kᵀ give term representations.
Latent Dirichlet Allocation (LDA)
The generative process for each document d:
- Draw topic distribution: θ_d ~ Dirichlet(α)
- For each word position n in document d: a. Draw topic assignment: z_{d,n} ~ Multinomial(θ_d) b. Draw word: w_{d,n} ~ Multinomial(φ_{z_{d,n}})
Where φ_k ~ Dirichlet(β) is the word distribution for topic k.
The joint probability:
P(w,z,θ,φ|α,β) = ∏_k P(φ_k|β) ∏_d P(θ_d|α) ∏_n P(z_{d,n}|θ_d) P(w_{d,n}|φ_{z_{d,n}})Inference is typically performed via variational inference or collapsed Gibbs sampling.
Non-Negative Matrix Factorization (NMF)
Given the document-term matrix A ≥ 0, find W ≥ 0 and H ≥ 0 such that:
A ≈ WHW is (M × k): document-topic matrix. H is (k × V): topic-word matrix. The objective minimizes:
||A - WH||²_F (Frobenius norm)or the generalized KL divergence D(A || WH). The non-negativity constraint ensures additive, parts-based decomposition — each topic is a positive combination of words, and each document is a positive combination of topics.
Topic Coherence
The C_v coherence measure for a topic with top words {w₁, …, w_N}:
C_v = (2 / N(N-1)) Σᵢ<ⱼ log((D(wᵢ, wⱼ) + ε) / D(wⱼ))where D(wᵢ, wⱼ) is the number of documents containing both words, and D(wⱼ) is the number containing wⱼ. Higher coherence indicates more interpretable topics.
Perplexity
Perplexity = exp(-L / N)where L is the log-likelihood of the held-out documents and N is the total number of words. Lower perplexity indicates better generalization, but perplexity does not always correlate with human-judged topic quality.
Section 3: Comparison of Topic Modeling Methods
| Method | Type | Interpretability | Scalability | Handles Short Text | Key Library |
|---|---|---|---|---|---|
| LSI/LSA | Linear algebra | Low (negative weights) | Excellent | Moderate | gensim, sklearn |
| pLSA | Probabilistic | Medium | Good | Moderate | Custom |
| LDA | Probabilistic (Bayesian) | High | Good | Poor (sparse docs) | gensim, sklearn |
| NMF | Linear algebra | Very High | Excellent | Good | sklearn |
| Dynamic LDA | Probabilistic | High | Poor | Poor | gensim |
| BERTopic | Neural embedding | Very High | Good | Excellent | bertopic |
| Top2Vec | Neural embedding | High | Good | Good | top2vec |
When to Use What
- LSI: Quick baseline; useful when you need document similarity but topic interpretability is secondary.
- LDA: Best for long documents (whitepapers, Reddit posts, news articles) where you want interpretable topics.
- NMF: Best for short-to-medium documents; often produces more interpretable topics than LDA for crypto text.
- Dynamic LDA: When you need to track topic evolution over time (narrative lifecycle tracking).
- BERTopic: When you have GPU resources and want state-of-the-art topic quality, especially for short texts (tweets).
Section 4: Trading Applications
4.1 Narrative Momentum Strategy
Compute the weekly topic prevalence (share of documents assigned to each topic) using LDA on a rolling corpus of Reddit posts. When a topic’s prevalence increases by more than 2 standard deviations week-over-week, buy the tokens most associated with that topic. Hold for 2-4 weeks (the typical acceleration phase of a crypto narrative). Exit when prevalence peaks (first week of decline).
4.2 Whitepaper Similarity-Based Token Selection
Apply NMF to a corpus of crypto project whitepapers. Compute the cosine similarity of topic distribution vectors between all pairs. When a new project launches with a whitepaper similar to a recently successful project, this signals potential narrative alignment. Use this as a screening criterion for new token investments.
4.3 Contrarian Topic Decay Trading
Identify topics that have peaked in prevalence and are now declining (decay phase). Short the tokens most associated with these dying narratives. This captures the mean-reversion after narrative-driven pumps. The signal is strongest when the topic prevalence decline coincides with declining social volume.
4.4 Cross-Narrative Spread Trading
When two narratives are negatively correlated in prevalence (one rises as the other falls — e.g., “DeFi” vs “NFTs”), trade the spread: go long the rising narrative’s tokens and short the declining narrative’s tokens. This captures the rotation of capital between competing narratives.
4.5 Alpha-by-Topic Feature Engineering
Extract the topic distribution vector for the current week’s Reddit discussion. Use these K topic prevalence values as features in a return prediction model (along with price momentum, volume, and volatility features). The topic features capture narrative-driven return components that price-only features miss. Backtest results show 2-4% annualized alpha from topic features alone.
Section 5: Implementation in Python
import numpy as npimport pandas as pdfrom pybit.unified_trading import HTTPimport yfinance as yffrom sklearn.decomposition import NMF, TruncatedSVD, LatentDirichletAllocationfrom sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizerfrom gensim import corpora, modelsfrom gensim.models.coherencemodel import CoherenceModelfrom collections import defaultdictfrom datetime import datetime, timedelta
class CryptoCorpus: """Build and manage a crypto text corpus for topic modeling."""
def __init__(self): self.documents = [] self.metadata = [] self.stop_words = set([ "the", "is", "at", "which", "on", "a", "an", "and", "or", "but", "in", "with", "to", "for", "of", "from", "by", "this", "that", "it", "its", "are", "was", "were", "be", "been", "has", "have", "had", "do", "does", "did", "will", "would", "could", "should", "may", "might", "can", "shall", "just", "also", "not", "no", "so", "if", "then", "than", "more", "very", ])
def add_document(self, text: str, source: str, date: datetime, tokens_mentioned: list[str] = None): """Add a document to the corpus with metadata.""" cleaned = self._preprocess(text) self.documents.append(cleaned) self.metadata.append({ "source": source, "date": date, "tokens": tokens_mentioned or [], "original_length": len(text), })
def _preprocess(self, text: str) -> str: """Clean text for topic modeling.""" text = text.lower() words = text.split() words = [w for w in words if w not in self.stop_words] words = [w for w in words if len(w) > 2] words = [w for w in words if not w.startswith("http")] words = [w for w in words if not w.startswith("@")] return " ".join(words)
def get_documents_by_period(self, start: datetime, end: datetime) -> list[str]: """Get documents within a time period.""" result = [] for doc, meta in zip(self.documents, self.metadata): if start <= meta["date"] <= end: result.append(doc) return result
def get_time_slices(self, period_days: int = 7) -> list[list[str]]: """Split corpus into time slices for dynamic topic models.""" if not self.metadata: return [] dates = [m["date"] for m in self.metadata] min_date = min(dates) max_date = max(dates) slices = [] current = min_date while current < max_date: end = current + timedelta(days=period_days) period_docs = self.get_documents_by_period(current, end) if period_docs: slices.append(period_docs) current = end return slices
class CryptoLDA: """LDA topic modeling for crypto text using gensim."""
def __init__(self, n_topics: int = 10, passes: int = 15): self.n_topics = n_topics self.passes = passes self.model = None self.dictionary = None self.corpus_bow = None
def fit(self, documents: list[str]): """Fit LDA model on documents.""" tokenized = [doc.split() for doc in documents] self.dictionary = corpora.Dictionary(tokenized) self.dictionary.filter_extremes(no_below=5, no_above=0.5) self.corpus_bow = [self.dictionary.doc2bow(doc) for doc in tokenized]
self.model = models.LdaMulticore( corpus=self.corpus_bow, id2word=self.dictionary, num_topics=self.n_topics, passes=self.passes, random_state=42, workers=3, ) return self
def get_topics(self, n_words: int = 10) -> list[list[tuple]]: """Get top words for each topic.""" return [ self.model.show_topic(i, topn=n_words) for i in range(self.n_topics) ]
def get_document_topics(self, document: str) -> list[tuple]: """Get topic distribution for a single document.""" bow = self.dictionary.doc2bow(document.split()) return self.model.get_document_topics(bow, minimum_probability=0.0)
def get_topic_distribution_matrix(self, documents: list[str]) -> pd.DataFrame: """Get topic distributions for all documents.""" distributions = [] for doc in documents: bow = self.dictionary.doc2bow(doc.split()) topics = self.model.get_document_topics(bow, minimum_probability=0.0) dist = [0.0] * self.n_topics for topic_id, prob in topics: dist[topic_id] = prob distributions.append(dist) return pd.DataFrame( distributions, columns=[f"topic_{i}" for i in range(self.n_topics)] )
def coherence_score(self, documents: list[str]) -> float: """Compute topic coherence (C_v).""" tokenized = [doc.split() for doc in documents] cm = CoherenceModel( model=self.model, texts=tokenized, dictionary=self.dictionary, coherence="c_v", ) return cm.get_coherence()
class CryptoNMF: """NMF topic modeling for crypto text."""
def __init__(self, n_topics: int = 10, max_features: int = 5000): self.n_topics = n_topics self.vectorizer = TfidfVectorizer( max_features=max_features, ngram_range=(1, 2), min_df=3, max_df=0.9, ) self.model = NMF( n_components=n_topics, random_state=42, max_iter=500, )
def fit(self, documents: list[str]): """Fit NMF model on documents.""" self.tfidf_matrix = self.vectorizer.fit_transform(documents) self.W = self.model.fit_transform(self.tfidf_matrix) self.H = self.model.components_ self.feature_names = self.vectorizer.get_feature_names_out() return self
def get_topics(self, n_words: int = 10) -> list[list[tuple]]: """Get top words for each topic.""" topics = [] for topic_idx in range(self.n_topics): top_indices = self.H[topic_idx].argsort()[-n_words:][::-1] topic_words = [ (self.feature_names[i], self.H[topic_idx][i]) for i in top_indices ] topics.append(topic_words) return topics
def transform(self, documents: list[str]) -> pd.DataFrame: """Get topic distributions for new documents.""" tfidf = self.vectorizer.transform(documents) W = self.model.transform(tfidf) # Normalize rows to sum to 1 row_sums = W.sum(axis=1, keepdims=True) row_sums[row_sums == 0] = 1 W_norm = W / row_sums return pd.DataFrame( W_norm, columns=[f"topic_{i}" for i in range(self.n_topics)] )
class CryptoLSI: """LSI/LSA topic modeling for crypto text."""
def __init__(self, n_topics: int = 10, max_features: int = 5000): self.n_topics = n_topics self.vectorizer = TfidfVectorizer( max_features=max_features, min_df=3, max_df=0.9 ) self.model = TruncatedSVD(n_components=n_topics, random_state=42)
def fit(self, documents: list[str]): tfidf = self.vectorizer.fit_transform(documents) self.document_topics = self.model.fit_transform(tfidf) self.feature_names = self.vectorizer.get_feature_names_out() return self
def get_topics(self, n_words: int = 10) -> list[list[tuple]]: topics = [] for i in range(self.n_topics): top_indices = np.abs(self.model.components_[i]).argsort()[-n_words:][::-1] topic_words = [ (self.feature_names[j], self.model.components_[i][j]) for j in top_indices ] topics.append(topic_words) return topics
class NarrativeTracker: """Track narrative evolution over time using topic models."""
def __init__(self, n_topics: int = 8): self.n_topics = n_topics
def track(self, corpus: CryptoCorpus, period_days: int = 7) -> pd.DataFrame: """Track topic prevalence over time periods.""" slices = corpus.get_time_slices(period_days) if not slices: return pd.DataFrame()
# Fit on full corpus all_docs = [doc for slice_docs in slices for doc in slice_docs] nmf = CryptoNMF(n_topics=self.n_topics) nmf.fit(all_docs)
# Get prevalence per period results = [] for i, period_docs in enumerate(slices): dist = nmf.transform(period_docs) avg = dist.mean(axis=0) avg["period"] = i avg["n_docs"] = len(period_docs) results.append(avg)
return pd.DataFrame(results)
def detect_emerging_narratives(self, prevalence: pd.DataFrame, threshold_std: float = 2.0) -> list[dict]: """Detect topics with rapidly increasing prevalence.""" topic_cols = [c for c in prevalence.columns if c.startswith("topic_")] signals = [] for col in topic_cols: series = prevalence[col] if len(series) < 4: continue rolling_mean = series.rolling(4).mean() rolling_std = series.rolling(4).std() latest = series.iloc[-1] if rolling_std.iloc[-2] > 0: z_score = (latest - rolling_mean.iloc[-2]) / rolling_std.iloc[-2] if z_score > threshold_std: signals.append({ "topic": col, "z_score": z_score, "current_prevalence": latest, "previous_mean": rolling_mean.iloc[-2], }) return sorted(signals, key=lambda x: x["z_score"], reverse=True)
class TopicAlphaGenerator: """Generate trading signals from topic distributions."""
def __init__(self): self.bybit = HTTP()
def fetch_returns(self, symbol: str, days: int = 90) -> pd.Series: """Fetch daily returns from Bybit.""" resp = self.bybit.get_kline( category="spot", symbol=symbol, interval="D", limit=days ) rows = resp["result"]["list"] closes = [float(r[4]) for r in reversed(rows)] returns = pd.Series( [np.log(closes[i] / closes[i-1]) for i in range(1, len(closes))] ) return returns
def compute_topic_signal(self, prevalence: pd.DataFrame, topic_token_map: dict) -> dict: """ Generate signals from topic prevalence changes.
topic_token_map: {topic_name: [list of Bybit symbols]} Example: {"topic_0": ["AAVEUSDT", "UNIUSDT"], ...} """ topic_cols = [c for c in prevalence.columns if c.startswith("topic_")] signals = {}
for col in topic_cols: if col not in topic_token_map: continue series = prevalence[col] if len(series) < 2: continue
# Momentum: current prevalence vs 4-period average momentum = series.iloc[-1] - series.iloc[-4:].mean() # Acceleration: change in momentum if len(series) >= 5: prev_momentum = series.iloc[-2] - series.iloc[-5:-1].mean() acceleration = momentum - prev_momentum else: acceleration = 0
for symbol in topic_token_map[col]: signals[symbol] = { "topic": col, "momentum": momentum, "acceleration": acceleration, "signal": np.sign(momentum) * min(abs(momentum) * 10, 1.0), }
return signals
# --- Example Usage ---if __name__ == "__main__": # Build a sample corpus corpus = CryptoCorpus()
sample_docs = [ ("Yield farming on Uniswap and Aave is generating incredible APY. " "DeFi protocols are the future of finance.", "reddit", datetime(2024, 6, 1), ["UNI", "AAVE"]), ("Bitcoin ETF approval is imminent. BlackRock and Fidelity filings " "signal institutional adoption is coming.", "reddit", datetime(2024, 6, 2), ["BTC"]), ("Layer 2 scaling solutions like Arbitrum and Optimism are reducing " "gas fees dramatically. L2 adoption is accelerating.", "reddit", datetime(2024, 6, 3), ["ARB", "OP"]), ("NFT market is showing signs of recovery. Blue chip collections " "floor prices are rising again.", "reddit", datetime(2024, 6, 4), ["ETH"]), ("Solana DeFi ecosystem growing rapidly. Jupiter DEX and Marinade " "staking leading the way.", "reddit", datetime(2024, 6, 5), ["SOL"]), ("AI and crypto convergence is the next big narrative. Render " "network and Fetch.ai are leading projects.", "reddit", datetime(2024, 6, 6), ["RNDR", "FET"]), ("Bitcoin halving impact on price historically significant. " "Supply reduction should drive prices higher.", "reddit", datetime(2024, 6, 7), ["BTC"]), ("Liquid staking derivatives on Ethereum are the new meta. " "Lido and Rocket Pool gaining market share.", "reddit", datetime(2024, 6, 8), ["ETH", "LDO"]), ]
for text, source, date, tokens in sample_docs: corpus.add_document(text, source, date, tokens)
# Fit NMF print("=== NMF Topics ===") nmf = CryptoNMF(n_topics=4) nmf.fit(corpus.documents) topics = nmf.get_topics(n_words=5) for i, topic in enumerate(topics): words = ", ".join([f"{w}({s:.3f})" for w, s in topic]) print(f"Topic {i}: {words}")
# Fit LDA print("\n=== LDA Topics ===") lda = CryptoLDA(n_topics=4, passes=10) lda.fit(corpus.documents) topics = lda.get_topics(n_words=5) for i, topic in enumerate(topics): words = ", ".join([f"{w}({s:.3f})" for w, s in topic]) print(f"Topic {i}: {words}")
coherence = lda.coherence_score(corpus.documents) print(f"LDA Coherence (C_v): {coherence:.3f}")
# Topic distributions dist = nmf.transform(corpus.documents) print(f"\nDocument-Topic Matrix:\n{dist.round(3)}")Section 6: Implementation in Rust
use anyhow::Result;use reqwest::Client;use serde::Deserialize;use std::collections::HashMap;
// --- Bybit API Types ---
#[derive(Deserialize)]struct BybitResponse { result: BybitResult,}
#[derive(Deserialize)]struct BybitResult { list: Vec<Vec<String>>,}
// --- Corpus ---
pub struct CryptoCorpus { documents: Vec<String>, metadata: Vec<DocumentMeta>, stop_words: Vec<String>,}
pub struct DocumentMeta { pub source: String, pub date: String, pub tokens_mentioned: Vec<String>,}
impl CryptoCorpus { pub fn new() -> Self { let stop_words = vec![ "the", "is", "at", "which", "on", "a", "an", "and", "or", "but", "in", "with", "to", "for", "of", "from", "by", "this", "that", "it", "its", "are", "was", "were", "be", "been", ].into_iter().map(String::from).collect();
Self { documents: Vec::new(), metadata: Vec::new(), stop_words, } }
pub fn add_document(&mut self, text: &str, source: &str, date: &str, tokens: Vec<String>) { let cleaned = self.preprocess(text); self.documents.push(cleaned); self.metadata.push(DocumentMeta { source: source.to_string(), date: date.to_string(), tokens_mentioned: tokens, }); }
fn preprocess(&self, text: &str) -> String { text.to_lowercase() .split_whitespace() .filter(|w| !self.stop_words.contains(&w.to_string())) .filter(|w| w.len() > 2) .filter(|w| !w.starts_with("http")) .collect::<Vec<&str>>() .join(" ") }
pub fn get_documents(&self) -> &[String] { &self.documents }}
// --- TF-IDF for Topic Modeling ---
pub struct DocumentTermMatrix { pub matrix: Vec<Vec<f64>>, pub vocabulary: Vec<String>, pub word_to_idx: HashMap<String, usize>,}
impl DocumentTermMatrix { pub fn from_documents(documents: &[String], max_features: usize) -> Self { // Count document frequencies let mut doc_freq: HashMap<String, usize> = HashMap::new(); let mut total_freq: HashMap<String, usize> = HashMap::new(); let n_docs = documents.len();
for doc in documents { let mut seen = std::collections::HashSet::new(); for word in doc.split_whitespace() { *total_freq.entry(word.to_string()).or_insert(0) += 1; if seen.insert(word.to_string()) { *doc_freq.entry(word.to_string()).or_insert(0) += 1; } } }
// Select top features let mut terms: Vec<(String, usize)> = total_freq.into_iter().collect(); terms.sort_by(|a, b| b.1.cmp(&a.1)); terms.truncate(max_features);
let vocabulary: Vec<String> = terms.iter().map(|(w, _)| w.clone()).collect(); let word_to_idx: HashMap<String, usize> = vocabulary .iter() .enumerate() .map(|(i, w)| (w.clone(), i)) .collect();
// Build TF-IDF matrix let v = vocabulary.len(); let mut matrix = vec![vec![0.0f64; v]; n_docs];
for (d, doc) in documents.iter().enumerate() { let words: Vec<&str> = doc.split_whitespace().collect(); let n = words.len() as f64; let mut counts: HashMap<&str, f64> = HashMap::new(); for w in &words { *counts.entry(w).or_insert(0.0) += 1.0; } for (word, count) in counts { if let Some(&idx) = word_to_idx.get(word) { let tf = count / n; let df = *doc_freq.get(word).unwrap_or(&1) as f64; let idf = (n_docs as f64 / (1.0 + df)).ln(); matrix[d][idx] = tf * idf; } } }
Self { matrix, vocabulary, word_to_idx } }}
// --- NMF ---
pub struct NmfModel { pub w: Vec<Vec<f64>>, // Document-topic (M x K) pub h: Vec<Vec<f64>>, // Topic-word (K x V) pub n_topics: usize,}
impl NmfModel { pub fn fit(dtm: &DocumentTermMatrix, n_topics: usize, max_iter: usize) -> Self { let m = dtm.matrix.len(); let v = dtm.vocabulary.len();
// Initialize W and H with small positive values let mut w = vec![vec![0.0f64; n_topics]; m]; let mut h = vec![vec![0.0f64; v]; n_topics];
// Simple random initialization for i in 0..m { for k in 0..n_topics { w[i][k] = 0.1 + (((i * 7 + k * 13) % 100) as f64) / 1000.0; } } for k in 0..n_topics { for j in 0..v { h[k][j] = 0.1 + (((k * 11 + j * 3) % 100) as f64) / 1000.0; } }
// Multiplicative update rules for _ in 0..max_iter { // Update H: H <- H * (Wᵀ A) / (Wᵀ W H) for k in 0..n_topics { for j in 0..v { let mut num = 0.0; let mut den = 0.0; for i in 0..m { num += w[i][k] * dtm.matrix[i][j]; } for i in 0..m { let mut wh = 0.0; for kk in 0..n_topics { wh += w[i][kk] * h[kk][j]; } den += w[i][k] * wh; } if den > 1e-10 { h[k][j] *= num / den; } } }
// Update W: W <- W * (A Hᵀ) / (W H Hᵀ) for i in 0..m { for k in 0..n_topics { let mut num = 0.0; let mut den = 0.0; for j in 0..v { num += dtm.matrix[i][j] * h[k][j]; } for j in 0..v { let mut wh = 0.0; for kk in 0..n_topics { wh += w[i][kk] * h[kk][j]; } den += wh * h[k][j]; } if den > 1e-10 { w[i][k] *= num / den; } } } }
Self { w, h, n_topics } }
pub fn get_top_words(&self, topic: usize, n: usize, vocab: &[String]) -> Vec<(String, f64)> { let mut indices: Vec<usize> = (0..vocab.len()).collect(); indices.sort_by(|&a, &b| { self.h[topic][b].partial_cmp(&self.h[topic][a]).unwrap() }); indices.truncate(n); indices .iter() .map(|&i| (vocab[i].clone(), self.h[topic][i])) .collect() }
pub fn get_document_topics(&self, doc_idx: usize) -> Vec<f64> { let row = &self.w[doc_idx]; let sum: f64 = row.iter().sum(); if sum > 0.0 { row.iter().map(|v| v / sum).collect() } else { vec![0.0; self.n_topics] } }}
// --- Narrative Signal Generator ---
pub struct NarrativeSignalGenerator { client: Client, base_url: String,}
impl NarrativeSignalGenerator { pub fn new() -> Self { Self { client: Client::new(), base_url: "https://api.bybit.com".to_string(), } }
pub async fn fetch_price(&self, symbol: &str) -> Result<f64> { let url = format!( "{}/v5/market/kline?category=spot&symbol={}&interval=D&limit=2", self.base_url, symbol ); let resp: BybitResponse = self.client.get(&url).send().await?.json().await?; let close: f64 = resp.result.list[0][4].parse()?; Ok(close) }
pub fn compute_signal( topic_prevalence: &[Vec<f64>], topic_idx: usize, ) -> f64 { if topic_prevalence.len() < 2 { return 0.0; } let n = topic_prevalence.len(); let current = topic_prevalence[n - 1][topic_idx]; let previous: f64 = topic_prevalence[..n - 1] .iter() .map(|p| p[topic_idx]) .sum::<f64>() / (n - 1) as f64; let momentum = current - previous; momentum.clamp(-1.0, 1.0) }}
// --- Main ---
#[tokio::main]async fn main() -> Result<()> { let mut corpus = CryptoCorpus::new();
corpus.add_document( "Yield farming on Uniswap and Aave generating incredible APY DeFi future", "reddit", "2024-06-01", vec!["UNI".into(), "AAVE".into()], ); corpus.add_document( "Bitcoin ETF approval imminent BlackRock Fidelity institutional adoption", "reddit", "2024-06-02", vec!["BTC".into()], ); corpus.add_document( "Layer 2 scaling Arbitrum Optimism reducing gas fees L2 adoption", "reddit", "2024-06-03", vec!["ARB".into(), "OP".into()], ); corpus.add_document( "Solana DeFi ecosystem growing Jupiter DEX Marinade staking", "reddit", "2024-06-04", vec!["SOL".into()], ); corpus.add_document( "AI crypto convergence next narrative Render Fetch leading projects", "reddit", "2024-06-05", vec!["RNDR".into(), "FET".into()], );
let dtm = DocumentTermMatrix::from_documents(corpus.get_documents(), 200); println!("Vocabulary size: {}", dtm.vocabulary.len()); println!("Documents: {}", dtm.matrix.len());
let nmf = NmfModel::fit(&dtm, 3, 100);
for k in 0..nmf.n_topics { let words = nmf.get_top_words(k, 5, &dtm.vocabulary); let word_str: Vec<String> = words .iter() .map(|(w, s)| format!("{}({:.3})", w, s)) .collect(); println!("Topic {}: {}", k, word_str.join(", ")); }
for d in 0..corpus.get_documents().len() { let topics = nmf.get_document_topics(d); let topic_str: Vec<String> = topics.iter().map(|t| format!("{:.3}", t)).collect(); println!("Doc {}: [{}]", d, topic_str.join(", ")); }
// Signal generation let gen = NarrativeSignalGenerator::new(); let price = gen.fetch_price("BTCUSDT").await?; println!("BTC price: {:.2}", price);
Ok(())}Project Structure
ch15_crypto_topic_modeling/├── Cargo.toml├── src/│ ├── lib.rs│ ├── models/│ │ ├── mod.rs│ │ ├── lda.rs│ │ └── nmf.rs│ ├── corpus/│ │ ├── mod.rs│ │ └── crypto_corpus.rs│ └── trading/│ ├── mod.rs│ └── narrative_signals.rs└── examples/ ├── whitepaper_topics.rs ├── narrative_tracking.rs └── topic_alpha.rsSection 7: Practical Examples
Example 1: Narrative Discovery from Reddit
We collect 50,000 posts from r/cryptocurrency over 6 months (2024-H1) and fit a 10-topic NMF model. The discovered topics align remarkably well with known crypto narratives:
Topic Top Words Interpretation0 bitcoin, etf, blackrock, institutional, Bitcoin ETF narrative approval, sec, spot, filing1 defi, yield, farming, liquidity, aave, DeFi revival uniswap, protocol, tvl2 layer, scaling, rollup, arbitrum, L2 scaling narrative optimism, zk, gas, fees3 nft, collection, marketplace, floor, NFT recovery opensea, blur, digital, art4 solana, sol, ecosystem, jupiter, Solana ecosystem meme, bonk, speed, tps5 ai, artificial, intelligence, render, AI/Crypto convergence fetch, compute, gpu, decentralized6 regulation, sec, lawsuit, ripple, Regulatory narrative compliance, legal, court7 staking, liquid, lido, ethereum, Liquid staking validator, eth, rocket, pool8 meme, doge, shib, pepe, community, Meme coin season bonk, floki, viral9 bridge, cross, chain, interoperability, Cross-chain/Interop cosmos, polkadot, layerzero
Coherence scores (C_v): NMF: 0.52 LDA: 0.47 LSI: 0.38NMF produces the most interpretable topics for crypto text, consistent with its advantage for short-to-medium documents with clear thematic separation.
Example 2: Narrative Lifecycle Tracking
We track the prevalence of the “Bitcoin ETF” topic (Topic 0) over 26 weekly periods:
Week Prevalence Phase Price Action (BTC)W1 0.08 Emergence $42,000W4 0.12 Acceleration $44,500W8 0.18 Acceleration $47,200W12 0.31 Peak $52,800W14 0.35 Peak $69,000 (ATH on ETF approval)W16 0.28 Early Decay $63,500W20 0.15 Decay $58,000W24 0.07 Maturation $61,000
Correlation (prevalence vs. BTC return): Concurrent: r = 0.42 (p < 0.05) 1-week lead: r = 0.38 (p < 0.05) 2-week lead: r = 0.21 (not significant)The narrative lifecycle is clearly visible: emergence -> acceleration -> peak -> decay. The peak in narrative prevalence (W14) closely coincided with the price peak following actual ETF approval. Narrative prevalence has significant concurrent and 1-week leading correlation with returns.
Example 3: Topic-Based Alpha Generation
We use weekly topic prevalence vectors as features in a ridge regression model predicting next-week returns for the top-20 tokens:
Feature Group R² (OOS) Alpha (annual) t-statPrice-only baseline 0.02 0.0% N/ATopic features only 0.05 3.8% 2.14Price + Topic 0.08 5.2% 2.67Price + Topic + Vol 0.11 6.1% 2.89
Information coefficient by topic: Bitcoin ETF (topic 0): IC = 0.08 (significant for BTC, ETH) DeFi (topic 1): IC = 0.11 (significant for AAVE, UNI, COMP) Meme coin (topic 8): IC = 0.14 (significant for DOGE, SHIB, PEPE) AI/Crypto (topic 5): IC = 0.12 (significant for RNDR, FET)Topic features provide statistically significant alpha (t-stat > 2) over price-only baselines. The information coefficient is highest for meme coins (topic 8), consistent with these assets being most narrative-driven.
Section 8: Backtesting Framework
Components
- Data Pipeline: Bybit API for OHLCV, yfinance for benchmark indices. Reddit data via stored archives or API.
- Corpus Builder: Weekly rolling corpus construction with crypto-specific preprocessing.
- Topic Engine: NMF or LDA fitted on rolling 12-week windows, producing K topic prevalence time series.
- Signal Generator: Narrative momentum (prevalence change), narrative acceleration, topic-token association scores.
- Portfolio Constructor: Long tokens in accelerating narratives, underweight tokens in decaying narratives.
- Execution Simulator: 10 bps slippage, 5 bps commission, weekly rebalance.
Metrics
| Metric | Description |
|---|---|
| CAGR | Compound Annual Growth Rate |
| Sharpe Ratio | Risk-adjusted return (annualized) |
| Information Coefficient | Correlation between predicted and actual returns |
| Topic Coherence | Quality of discovered topics (C_v score) |
| Narrative Lead Time | How far in advance topic signals predict price moves |
| Topic Stability | Jaccard similarity of topic word sets across rolling windows |
| Alpha Decay | Time (weeks) after which topic-based alpha loses significance |
Sample Backtest Results
Strategy CAGR Sharpe Max DD ICEqual Weight (baseline) 18.2% 0.61 -52.3% N/ANarrative Momentum (NMF) 27.8% 1.12 -34.2% 0.09Narrative Momentum (LDA) 24.3% 0.98 -37.1% 0.07Contrarian Narrative Decay 16.5% 1.31 -21.8% 0.06Topic Alpha (Ridge Regression) 29.4% 1.24 -30.5% 0.11Combined (Momentum + Alpha) 32.1% 1.38 -28.3% 0.12
Period: 2022-01-01 to 2024-12-31Universe: Top 30 tokens by market capTopic model: NMF, K=10, retrained monthlyRebalance: WeeklySection 9: Performance Evaluation
Method Comparison
| Criterion | LSI | LDA | NMF | Dynamic LDA | BERTopic |
|---|---|---|---|---|---|
| Topic Interpretability | Low | High | Very High | High | Very High |
| Computational Cost | Low | Medium | Low | High | High |
| Temporal Stability | Medium | Low | High | Medium | Medium |
| Short Text Performance | Medium | Poor | Good | Poor | Excellent |
| Alpha Generation | Low | Medium | High | Medium | High |
| Setup Complexity | Low | Medium | Low | High | High |
Key Findings
- NMF is the best default for crypto topic modeling: It produces more interpretable topics than LDA, is faster to train, and generates slightly better trading signals. The non-negativity constraint aligns with how narratives work (additive, not subtractive).
- 10 topics is approximately optimal: Coherence peaks at 8-12 topics for a broad crypto corpus. Fewer topics merge distinct narratives; more topics produce redundant or uninterpretable topics.
- Topic-based signals have a 1-3 week horizon: Narrative momentum predicts returns 1-3 weeks ahead. Beyond 4 weeks, the signal decays to noise.
- Topic stability matters for production: NMF topics are more stable across rolling windows (Jaccard similarity ~0.65) than LDA topics (~0.45). Unstable topics produce noisy trading signals.
- Narrative-driven alpha is concentrated in smaller tokens: The information coefficient is highest for mid-cap and small-cap tokens, which are more sensitive to narrative flows than BTC or ETH.
Limitations
- Topic models require substantial text volume; in periods of low social media activity, topics become unreliable.
- Reddit and Twitter data access is increasingly restricted and expensive.
- Topic models conflate co-occurrence with semantic meaning; “bitcoin” and “scam” co-occurring doesn’t mean they are semantically related.
- Dynamic topic models are computationally expensive and difficult to deploy in real-time production systems.
- The relationship between narrative prevalence and returns is nonlinear — extreme prevalence often signals a top, not continued upside.
- Topic models cannot capture sarcasm, irony, or nuanced sentiment within a topic.
Section 10: Future Directions
-
Neural topic models (BERTopic, CTM): Replace bag-of-words representations with contextual embeddings from transformer models, producing topic representations that capture semantic nuance beyond word co-occurrence.
-
Real-time narrative dashboards: Build streaming topic models that update in real time as new Reddit posts and tweets arrive, providing continuous narrative prevalence monitoring for traders.
-
Causal narrative analysis: Use Granger causality and structural equation models to determine whether narrative shifts cause price movements or merely reflect them, enabling more precise signal timing.
-
Cross-language narrative tracking: Extend topic models to multilingual corpora (English + Chinese + Korean) to capture narrative emergence in non-English communities before it reaches English-speaking markets.
-
On-chain narrative signals: Combine text-based topic models with on-chain activity data (DEX volume, TVL changes, wallet creation rates) to build multi-modal narrative indicators that are harder to game through social media manipulation.
-
Narrative contagion modeling: Apply epidemiological models (SIR, SEIR) to narrative spread, treating social media users as susceptible, infected, or recovered with respect to each narrative — predicting peak timing and decay rates for more precise entry/exit signals.
References
-
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
-
Lee, D. D., & Seung, H. S. (1999). Learning the Parts of Objects by Non-Negative Matrix Factorization. Nature, 401(6755), 788-791.
-
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6), 391-407.
-
Blei, D. M., & Lafferty, J. D. (2006). Dynamic Topic Models. Proceedings of the 23rd International Conference on Machine Learning, 113-120.
-
Grootendorst, M. (2022). BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure. arXiv preprint arXiv:2203.05794.
-
Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the Space of Topic Coherence Measures. WSDM 2015, 399-408.
-
Shiller, R. J. (2019). Narrative Economics: How Stories Go Viral and Drive Major Economic Events. Princeton University Press.