Chapter 211: Neural Architecture Search (NAS) for Trading
1. Introduction
Designing neural network architectures for financial trading has traditionally been an art as much as a science. Practitioners spend weeks or months manually experimenting with different layer configurations, activation functions, and connectivity patterns to find models that capture the subtle statistical regularities in market data. Neural Architecture Search (NAS) offers a principled, automated alternative: instead of relying on human intuition and trial-and-error, NAS algorithms systematically explore the space of possible architectures to discover designs that optimize trading-relevant objectives such as prediction accuracy, Sharpe ratio, or risk-adjusted return.
The core idea behind NAS is deceptively simple. We define a search space of candidate architectures, a search strategy for navigating that space, and a performance estimation method for evaluating each candidate. The search algorithm proposes architectures, evaluates them on trading data, and iteratively refines its proposals based on observed performance. Over many iterations, the process converges on architectures that are specifically tailored to the structure of financial time series.
What makes NAS particularly compelling for trading is that financial data has unique characteristics — non-stationarity, heavy tails, regime changes, low signal-to-noise ratios — that may favor architectural patterns quite different from those discovered on standard benchmarks like image classification. A NAS system can discover these domain-specific patterns automatically, potentially finding architectures that a human designer would never consider.
In this chapter, we develop a complete NAS framework in Rust, designed specifically for trading applications. We implement evolutionary search over a flexible architecture space that includes dense layers, convolutional layers, recurrent layers, and attention mechanisms. We integrate real market data from the Bybit exchange and demonstrate how to discover architectures optimized for price prediction.
2. Mathematical Foundation
2.1 Search Space Definition
The search space $\mathcal{A}$ defines the set of all architectures that the NAS algorithm can consider. We represent each architecture as a directed acyclic graph (DAG) where nodes are computational operations and edges represent data flow. Formally, an architecture $a \in \mathcal{A}$ is encoded as a sequence of genes:
$$a = (g_1, g_2, \ldots, g_L)$$
where $L$ is the maximum number of layers and each gene $g_i$ specifies:
- Layer type $t_i \in \{\text{Dense}, \text{Conv1D}, \text{LSTM}, \text{Attention}, \text{Skip}, \text{Identity}\}$
- Hidden dimension $h_i \in \{16, 32, 64, 128, 256\}$
- Activation function $\sigma_i \in \{\text{ReLU}, \text{Tanh}, \text{Sigmoid}, \text{GELU}, \text{LeakyReLU}\}$
- Skip connection target $s_i \in \{0, 1, \ldots, i-1\}$ (connect to a previous layer or none)
- Dropout rate $d_i \in \{0.0, 0.1, 0.2, 0.3, 0.5\}$
The total size of the search space is:
$$|\mathcal{A}| = \prod_{i=1}^{L} |T| \cdot |H| \cdot |\Sigma| \cdot i \cdot |D|$$
For $L=6$ layers, this product exceeds $10^{20}$ possible architectures, making exhaustive search infeasible.
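As a sanity check, the size formula can be evaluated directly. A minimal sketch using the cardinalities defined above ($|T| = 6$, $|H| = 5$, $|\Sigma| = 5$, $|D| = 5$, and $i$ skip-target choices at layer $i$):

```rust
// Evaluate |A| = prod_{i=1..L} |T| * |H| * |Sigma| * i * |D| for the gene
// space defined above. f64 is used because the count overflows u64.
fn search_space_size(max_layers: u32) -> f64 {
    let (types, hidden, acts, dropouts) = (6.0_f64, 5.0, 5.0, 5.0);
    (1..=max_layers).fold(1.0, |acc, i| {
        acc * types * hidden * acts * (i as f64) * dropouts
    })
}

fn main() {
    let size = search_space_size(6);
    // 750^6 * 6!, on the order of 1e20 -- far beyond exhaustive search.
    println!("|A| for L=6: {:.3e}", size);
    assert!(size > 1e10);
}
```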
2.2 Search Strategies
Reinforcement Learning (RL)-based Search. A controller network (typically an RNN) generates architecture descriptions as sequences of tokens. The controller is trained with REINFORCE, using the validation performance of each generated architecture as the reward signal:
$$\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a) \cdot (R(a) - b)]$$
where $\pi_\theta$ is the controller policy, $R(a)$ is the reward (e.g., validation Sharpe ratio), and $b$ is a baseline for variance reduction.
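The advantage term $R(a) - b$ is what actually scales each architecture's log-probability gradient. A small sketch, assuming the batch-mean reward as the baseline $b$ (a common variance-reduction choice; the text leaves $b$ abstract):

```rust
// Compute REINFORCE advantages R(a) - b for a batch of sampled
// architectures, using the batch mean reward as the baseline b.
fn advantages(rewards: &[f64]) -> Vec<f64> {
    let b = rewards.iter().sum::<f64>() / rewards.len() as f64;
    rewards.iter().map(|r| r - b).collect()
}

fn main() {
    // Validation Sharpe ratios of three sampled architectures (made-up values).
    let adv = advantages(&[1.2, 0.4, 0.8]);
    // Architectures above the baseline get a positive scale, so the
    // controller is pushed toward generating them more often.
    assert!(adv[0] > 0.0 && adv[1] < 0.0);
}
```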
Evolutionary Search. A population of architectures is maintained and evolved through mutation and crossover operators. At each generation:
- Tournament selection: select $k$ individuals, keep the best as parent
- Mutation: randomly modify one or more genes of the parent
- Crossover: combine genes from two parents to produce offspring
- Evaluation: train and evaluate the offspring on validation data
- Replacement: insert offspring into the population, removing the weakest
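The selection step above can be sketched as a simple tournament over fitness-scored individuals. Individuals are reduced to `(id, fitness)` pairs here, and the tournament entrants are passed in explicitly rather than drawn from a real RNG, to keep the sketch deterministic:

```rust
// Tournament selection: among k sampled entrants, the individual with the
// highest fitness becomes the parent for the next mutation/crossover step.
fn tournament<'a>(population: &'a [(usize, f64)], entrants: &[usize]) -> &'a (usize, f64) {
    entrants
        .iter()
        .map(|&i| &population[i])
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .unwrap()
}

fn main() {
    // (architecture id, validation fitness)
    let pop = [(0, 0.51), (1, 0.55), (2, 0.62), (3, 0.49), (4, 0.58)];
    // A tournament of k = 3 entrants; indices fixed here for reproducibility.
    let parent = tournament(&pop, &[0, 2, 4]);
    assert_eq!(parent.0, 2); // id 2 has the best fitness among the entrants
}
```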
The fitness function $f(a)$ can incorporate multiple objectives:
$$f(a) = \alpha \cdot \text{Accuracy}(a) - \beta \cdot \text{ModelSize}(a) - \gamma \cdot \text{Latency}(a)$$
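The weighted fitness above translates directly into code. The weights $\alpha$, $\beta$, $\gamma$ are hyperparameters the practitioner must choose; the values used below are purely illustrative:

```rust
// Multi-objective fitness f(a) = alpha*Accuracy - beta*ModelSize - gamma*Latency.
fn fitness(accuracy: f64, model_size_mb: f64, latency_ms: f64,
           alpha: f64, beta: f64, gamma: f64) -> f64 {
    alpha * accuracy - beta * model_size_mb - gamma * latency_ms
}

fn main() {
    // With these weights, a smaller, faster model can outrank a slightly
    // more accurate but heavier one.
    let a = fitness(0.58, 12.0, 2.0, 1.0, 0.01, 0.05);
    let b = fitness(0.60, 40.0, 9.0, 1.0, 0.01, 0.05);
    assert!(a > b);
}
```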
Gradient-based Search (DARTS). The discrete search space is relaxed to a continuous one by placing a softmax over operation choices:
$$\bar{o}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o)}{\sum_{o'} \exp(\alpha_{o'})} \cdot o(x)$$
Architecture parameters $\alpha$ and network weights $w$ are optimized jointly via gradient descent.
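The mixed operation $\bar{o}(x)$ can be sketched concretely. Candidate operations are reduced to scalar functions here for brevity; a real DARTS implementation would apply this per edge of the cell DAG and backpropagate through both $\alpha$ and $w$:

```rust
// DARTS continuous relaxation: blend candidate ops with softmax(alpha) weights.
fn mixed_op(x: f64, alphas: &[f64], ops: &[fn(f64) -> f64]) -> f64 {
    // Numerically stable softmax over the architecture parameters alpha.
    let max = alphas.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = alphas.iter().map(|a| (a - max).exp()).collect();
    let z: f64 = exps.iter().sum();
    ops.iter()
        .zip(&exps)
        .map(|(op, e)| (e / z) * op(x))
        .sum()
}

fn main() {
    let ops: [fn(f64) -> f64; 3] = [
        |x| x.max(0.0), // ReLU
        |x| x.tanh(),   // Tanh
        |x| x,          // Identity
    ];
    // Equal alphas give each operation weight 1/3.
    let y = mixed_op(2.0, &[0.0, 0.0, 0.0], &ops);
    let expected = (2.0 + 2.0_f64.tanh() + 2.0) / 3.0;
    assert!((y - expected).abs() < 1e-12);
}
```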
2.3 Performance Estimation
Fully training each candidate architecture is expensive. Several strategies reduce cost:
- Weight sharing: all architectures share a single set of weights in a supernet
- Early stopping: terminate training early for unpromising architectures
- Proxy tasks: evaluate on a subset of data or a simplified version of the task
- Performance predictors: train a surrogate model to predict architecture performance from its encoding
The predicted performance $\hat{R}(a)$ approximates the true performance $R(a)$:
$$\hat{R}(a) = f_\phi(\text{encode}(a))$$
where $f_\phi$ is a learned predictor (e.g., a small neural network or Gaussian process).
3. NAS Search Spaces for Trading
3.1 Layer Types
Financial time series require architectures that can capture both local patterns and long-range dependencies:
- Dense (Fully Connected): capture arbitrary nonlinear relationships between features at a given time step
- Conv1D: detect local temporal patterns such as candlestick formations, momentum signals, and short-term mean reversion
- LSTM/GRU: model sequential dependencies and regime persistence across time steps
- Multi-Head Attention: learn which historical time steps are most relevant for current predictions, regardless of temporal distance
- Identity/Skip: allow information to bypass layers, enabling both shallow and deep effective architectures within the same search space
3.2 Skip Connections
Skip connections are especially important for trading models because:
- Financial signals are often weak and can be destroyed by too many nonlinear transformations
- Different features may require different amounts of processing
- Residual connections improve gradient flow during training, which is critical when the target signal has low signal-to-noise ratio
Our search space allows each layer to optionally connect to any earlier layer, enabling the discovery of complex multi-scale architectures.
3.3 Activation Functions
Different activation functions serve different roles:
- ReLU/LeakyReLU: efficient for general-purpose feature extraction, with LeakyReLU avoiding dead neuron problems
- Tanh: bounded output suitable for layers that feed into recurrent units
- Sigmoid: useful for gating mechanisms and probability outputs
- GELU: smooth approximation to ReLU, often effective in attention-based architectures
The NAS system discovers which activations work best at each layer of the network for trading data.
4. Trading Applications
4.1 Price Prediction
For price direction or return prediction, the NAS objective is typically:
$$\max_{a \in \mathcal{A}} \text{Accuracy}(a, \mathcal{D}_{\text{val}}) \quad \text{s.t.} \quad \text{Params}(a) \leq B$$
where $\mathcal{D}_{\text{val}}$ is out-of-sample validation data and $B$ is a parameter budget. The search discovers architectures that balance model capacity with the risk of overfitting to noise in financial data.
4.2 Volatility Forecasting
Volatility prediction requires architectures sensitive to the clustering and persistence of variance. NAS can discover models that naturally capture GARCH-like dynamics through appropriate combinations of recurrent and attention layers. The fitness function may use mean squared error on realized volatility or a likelihood-based metric.
4.3 Portfolio Optimization
For portfolio construction, the NAS objective becomes:
$$\max_{a \in \mathcal{A}} \text{SharpeRatio}\left(\text{Portfolio}(a, \mathcal{D}_{\text{val}})\right)$$
Architectures that output portfolio weights must satisfy constraints (e.g., weights sum to one, no excessive leverage), which can be enforced through appropriate output layers that are fixed across the search space.
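One such fixed output layer is a softmax over the network's raw asset scores, which guarantees long-only weights that sum to one regardless of which architecture the search selects upstream. A minimal sketch:

```rust
// Fixed softmax output layer: raw scores -> long-only portfolio weights
// summing to one. This layer is constant across the search space; only the
// layers feeding into it are searched.
fn softmax_weights(scores: &[f64]) -> Vec<f64> {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let z: f64 = exps.iter().sum();
    exps.iter().map(|e| e / z).collect()
}

fn main() {
    let w = softmax_weights(&[1.5, 0.2, -0.7]);
    let total: f64 = w.iter().sum();
    assert!((total - 1.0).abs() < 1e-12); // fully invested
    assert!(w.iter().all(|&x| x > 0.0));  // long-only, no leverage
}
```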
4.4 Multi-Objective Search
In practice, trading models must balance multiple objectives: prediction accuracy, computational cost (for real-time execution), model size (for deployment), and robustness (for regime changes). We track the Pareto front of non-dominated architectures:
An architecture $a$ dominates $a'$ if $a$ is at least as good on all objectives and strictly better on at least one. The Pareto front contains all non-dominated architectures, giving the trader a menu of optimal trade-offs.
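The dominance test is a few lines of code once all objectives are oriented so that larger is better (e.g. accuracy as-is, model size and latency negated):

```rust
// Pareto dominance for objective vectors where larger is better:
// a dominates b iff a >= b everywhere and a > b somewhere.
fn dominates(a: &[f64], b: &[f64]) -> bool {
    a.iter().zip(b).all(|(x, y)| x >= y) && a.iter().zip(b).any(|(x, y)| x > y)
}

fn main() {
    // Objectives: (accuracy, -model size MB, -latency ms)
    let a = [0.61, -8.0, -2.0];
    let b = [0.60, -8.0, -3.0];
    let c = [0.63, -20.0, -1.0];
    assert!(dominates(&a, &b)); // a is at least as good everywhere, better on two
    // a and c trade off accuracy against size: neither dominates, so both
    // remain on the Pareto front.
    assert!(!dominates(&a, &c) && !dominates(&c, &a));
}
```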
5. Efficient NAS
5.1 Weight Sharing
Weight sharing dramatically reduces the cost of NAS by training a single supernet that contains all possible architectures as sub-networks. Each candidate architecture is evaluated by extracting the corresponding subset of weights from the supernet, avoiding the need to train each architecture from scratch.
The supernet weights $W$ are trained by sampling random architectures at each training step:
$$W^* = \arg\min_W \mathbb{E}_{a \sim \mathcal{U}(\mathcal{A})} [\mathcal{L}(a, W, \mathcal{D}_{\text{train}})]$$
5.2 One-Shot Methods
One-shot NAS trains the supernet once and then searches for the best sub-network. This reduces search cost from thousands of GPU-hours to a single training run plus a cheap search phase. The key challenge is ensuring that sub-network performance in the supernet correlates with standalone performance.
5.3 Proxy Tasks for Trading
To further reduce search cost, we use proxy tasks:
- Data subsampling: search on a random 10-20% subset of the full trading history
- Reduced training epochs: train each architecture for 5-10 epochs instead of 50-100
- Simplified features: use only price and volume instead of the full feature set during search
- Shorter lookback windows: use 30-day windows instead of 90-day windows during search
After the search identifies promising architectures, the top candidates are fully trained and evaluated on the complete dataset.
6. Implementation Walkthrough
Our Rust implementation provides a complete NAS framework for trading. The key components are:
6.1 Architecture Encoding
Each architecture is encoded as a vector of LayerGene structs, where each gene specifies the layer type, hidden size, activation function, skip connection target, and dropout rate. This encoding supports efficient mutation and crossover operations.
```rust
pub struct LayerGene {
    pub layer_type: LayerType,
    pub hidden_size: usize,
    pub activation: Activation,
    pub skip_connection: Option<usize>,
    pub dropout: f64,
}
```
6.2 Evolutionary Search
The search proceeds through generations. At each generation:
- Tournament selection picks parents from the current population
- Mutation randomly modifies one gene in a parent architecture
- Each offspring is evaluated by simulating training on historical data
- The population is updated, keeping the best individuals
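The single-gene mutation step can be sketched as follows. The struct here is a reduced version of the `LayerGene` shown earlier (skip connections and layer type omitted for brevity), and the "random" choices are passed as parameters so the sketch stays deterministic:

```rust
// One mutation step: resample a single field of a single gene, leaving the
// rest of the architecture unchanged.
#[derive(Clone, Debug, PartialEq)]
enum Activation { ReLU, Tanh, Sigmoid, Gelu, LeakyReLU }

#[derive(Clone, Debug)]
struct LayerGene {
    hidden_size: usize,
    activation: Activation,
    dropout: f64,
}

fn mutate_hidden(genome: &mut Vec<LayerGene>, gene_idx: usize, new_size: usize) {
    // In a full implementation, gene_idx and new_size would be drawn from an
    // RNG over the allowed hidden sizes {16, 32, 64, 128, 256}.
    genome[gene_idx].hidden_size = new_size;
}

fn main() {
    let mut genome = vec![
        LayerGene { hidden_size: 64, activation: Activation::ReLU, dropout: 0.1 },
        LayerGene { hidden_size: 128, activation: Activation::Gelu, dropout: 0.2 },
    ];
    mutate_hidden(&mut genome, 1, 256);
    assert_eq!(genome[1].hidden_size, 256);
    assert_eq!(genome[0].hidden_size, 64); // untouched gene is preserved
}
```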
The evolutionary approach is particularly well-suited for trading NAS because:
- It naturally supports multi-objective optimization via Pareto ranking
- It maintains population diversity, reducing the risk of premature convergence
- It requires no differentiable relaxation of the search space
6.3 Bybit Data Integration
We fetch real OHLCV data from the Bybit API for architecture evaluation. The data is preprocessed into features (returns, moving averages, volatility) and binary labels (price direction), providing a realistic evaluation environment for discovered architectures.
6.4 Architecture Evaluation
Each candidate architecture is evaluated through a simplified forward pass simulation. In production, this would involve full backpropagation training, but our implementation demonstrates the NAS framework structure using a proxy evaluation that estimates performance based on architecture properties and a lightweight computation on the data.
7. Bybit Data Integration
The integration with the Bybit exchange provides real market data for architecture evaluation. We use the public klines endpoint:
GET https://api.bybit.com/v5/market/kline?category=linear&symbol=BTCUSDT&interval=60&limit=200

The returned OHLCV candles are processed into features:
- Log returns: $r_t = \ln(P_t / P_{t-1})$
- Moving averages: SMA and EMA over various windows
- Volatility: rolling standard deviation of returns
- Volume ratios: current volume relative to moving average
These features form the input to candidate architectures during the NAS evaluation phase. Using real market data ensures that discovered architectures are adapted to the actual statistical properties of cryptocurrency markets, including their characteristic volatility clustering, fat-tailed returns, and microstructure patterns.
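Two of the features above, log returns and rolling volatility, can be sketched directly from a closing-price series. The window size below is illustrative:

```rust
// Log returns r_t = ln(P_t / P_{t-1}) from a closing-price series.
fn log_returns(prices: &[f64]) -> Vec<f64> {
    prices.windows(2).map(|w| (w[1] / w[0]).ln()).collect()
}

// Rolling volatility: standard deviation of returns over a sliding window.
fn rolling_vol(returns: &[f64], window: usize) -> Vec<f64> {
    returns
        .windows(window)
        .map(|w| {
            let mean = w.iter().sum::<f64>() / w.len() as f64;
            let var = w.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / w.len() as f64;
            var.sqrt()
        })
        .collect()
}

fn main() {
    let prices = [100.0, 101.0, 99.5, 100.5, 102.0];
    let rets = log_returns(&prices);
    assert_eq!(rets.len(), prices.len() - 1);
    let vol = rolling_vol(&rets, 3);
    assert_eq!(vol.len(), rets.len() - 3 + 1);
    assert!(vol.iter().all(|&v| v >= 0.0));
}
```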
8. Key Takeaways
- NAS automates architecture design: instead of manually experimenting with network configurations, NAS systematically searches through the space of possible architectures to find designs optimized for trading tasks.
- Trading requires specialized search spaces: financial time series have unique properties (non-stationarity, low SNR, regime changes) that favor different architectural patterns than those found on standard benchmarks. Include Conv1D for local patterns, LSTM for sequential dependencies, and attention for long-range relationships.
- Evolutionary NAS is practical for trading: evolutionary algorithms naturally support multi-objective optimization, maintain diversity, and scale well to complex search spaces without requiring differentiable relaxations.
- Efficiency techniques are essential: weight sharing, proxy tasks, and early stopping reduce the computational cost of NAS from thousands of GPU-hours to practical levels. For trading, data subsampling and reduced training epochs are effective proxies.
- Multi-objective optimization matters: trading models must balance prediction accuracy against model size, inference latency, and robustness. Pareto front tracking provides a principled way to navigate these trade-offs.
- Rust provides performance advantages: the computational intensity of NAS makes Rust's performance characteristics valuable, enabling faster architecture evaluation and shorter search cycles.
- Real data evaluation is critical: architectures should be evaluated on actual market data (e.g., from Bybit) rather than synthetic data, to ensure they capture the true statistical properties of financial markets.
- NAS is a complement, not a replacement: NAS discovers architecture topology, but feature engineering, data preprocessing, risk management, and execution logic still require domain expertise. NAS is most effective when combined with strong financial domain knowledge.