Evolutionary search over half a million strategy configs, and the overfitting trap

A search space of half a million-plus configurations sounds like the kind of problem you throw compute at. It is not, and that is the whole point. The danger in a search this large is not that it fails to find a good-looking strategy. It is that it finds thousands of them, all of which are good-looking because they memorized the noise in your backtest. Everything I care about in this project is a defense against that specific failure.

I will say it plainly first: this is a research and backtesting harness. Past backtest performance does not predict future returns, the repo’s expected-results table is a design target for the search, not a track record, and nothing here is financial advice. With that out of the way, the engineering is genuinely interesting.

Representation: a strategy as a genome

For a genetic algorithm to work, a strategy has to be encoded as something you can mutate and recombine. Here the genome spans three categories of parameters: the model architecture (hidden size, layer count, attention heads, sequence length), the training recipe (learning rate, batch size, epochs, weight decay), and the trading logic itself (confidence threshold, stop-loss and take-profit in pips, position sizing).
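To make the encoding concrete, here is a minimal sketch of such a genome in Python. The field names, the three options listed per gene, and the `random_genome`, `mutate`, and `crossover` helpers are all illustrative assumptions; the text above names the parameter groups but not the actual ranges or operators.

```python
import random
from dataclasses import dataclass, fields, replace

# Hypothetical option lists, three per gene. The real ranges are not given here.
SEARCH_SPACE = {
    # model architecture
    "hidden_size": [64, 128, 256],
    "num_layers": [2, 4, 6],
    "num_heads": [2, 4, 8],
    "seq_len": [32, 64, 128],
    # training recipe
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "batch_size": [32, 64, 128],
    "epochs": [5, 10, 20],
    "weight_decay": [0.0, 0.01, 0.1],
    # trading logic
    "confidence_threshold": [0.55, 0.65, 0.75],
    "stop_loss_pips": [10, 20, 40],
    "take_profit_pips": [20, 40, 80],
    "position_size": [0.01, 0.02, 0.05],
}


@dataclass(frozen=True)  # frozen makes genomes hashable, so they can be dict keys
class Genome:
    hidden_size: int
    num_layers: int
    num_heads: int
    seq_len: int
    learning_rate: float
    batch_size: int
    epochs: int
    weight_decay: float
    confidence_threshold: float
    stop_loss_pips: int
    take_profit_pips: int
    position_size: float


def random_genome(rng: random.Random) -> Genome:
    """Sample every gene independently from its option list."""
    return Genome(**{name: rng.choice(opts) for name, opts in SEARCH_SPACE.items()})


def mutate(genome: Genome, rng: random.Random, rate: float = 0.2) -> Genome:
    """Resample each gene with probability `rate`; architecture and trading genes are treated alike."""
    changes = {
        f.name: rng.choice(SEARCH_SPACE[f.name])
        for f in fields(genome)
        if rng.random() < rate
    }
    return replace(genome, **changes)


def crossover(a: Genome, b: Genome, rng: random.Random) -> Genome:
    """Uniform crossover: each gene is copied from one parent chosen at random."""
    return Genome(**{f.name: getattr(rng.choice((a, b)), f.name) for f in fields(a)})
```

With three illustrative options for each of the twelve genes, this toy space has 3^12 = 531,441 points, so even a coarse grid per parameter lands in the range discussed here.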

Folding architecture, training, and strategy into one genome is the decision that makes this neural architecture search and strategy search at once. A mutation might widen the model, or it might tighten the confidence threshold, and the search treats both as the same kind of move. The cost is a brutal evaluation step: scoring one genome means training a transformer from scratch and then backtesting it. There is no cheap fitness proxy. That single fact dominates every other design choice, because it means you cannot afford to evaluate a candidate twice or waste evaluations on obvious duplicates.
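One simple way to avoid paying for the same genome twice is to memoize scores by genome. The sketch below assumes genomes are hashable and that a single evaluation per genome is acceptable even though training is stochastic; `FitnessCache` and `fake_train_and_backtest` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, Hashable


class FitnessCache:
    """Memoize an expensive evaluation so each distinct genome is scored once."""

    def __init__(self, evaluate: Callable[[Hashable], float]) -> None:
        self._evaluate = evaluate
        self._scores: Dict[Hashable, float] = {}
        self.evaluations = 0  # number of real (uncached) evaluations performed

    def score(self, genome: Hashable) -> float:
        if genome not in self._scores:
            self._scores[genome] = self._evaluate(genome)
            self.evaluations += 1
        return self._scores[genome]


def fake_train_and_backtest(genome: Hashable) -> float:
    # Stand-in for "train a transformer, then backtest it".
    return float(sum(genome))


cache = FitnessCache(fake_train_and_backtest)
population = [(1, 2, 3), (1, 2, 3), (0, 0, 1)]  # contains one duplicate
scores = [cache.score(g) for g in population]
# Three lookups, but only two real evaluations.
```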

Fitness: a composite, not a single number

The fitness function is a weighted blend, not raw return:

```plaintext
Fitness = 0.4 * Sharpe + 0.3 * WinRate + 0.3 * (1 - MaxDrawdown)
```

Optimizing raw return is how you breed a population of strategies that bet the account on one lucky regime. Sharpe rewards return per unit of risk, win rate rewards consistency, and the drawdown term punishes the strategies that look great right up until the one month they blow up. Combining them means the search cannot win by being reckless in a way that happened to pay off in the sample. Tuning those three weights is tuning what kind of strategy survives, which is why they are configurable rather than baked in.
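As a small sketch, the blend can be written as a function with configurable weights. It assumes win rate and maximum drawdown are expressed as fractions in [0, 1] and that Sharpe is used as-is; whether the real harness rescales Sharpe before blending is not stated here. `FitnessWeights` and `composite_fitness` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FitnessWeights:
    sharpe: float = 0.4
    win_rate: float = 0.3
    drawdown: float = 0.3


def composite_fitness(
    sharpe: float,
    win_rate: float,
    max_drawdown: float,
    weights: FitnessWeights = FitnessWeights(),
) -> float:
    """Weighted blend of Sharpe, win rate, and (1 - max drawdown).

    Assumes win_rate and max_drawdown are fractions in [0, 1].
    """
    return (
        weights.sharpe * sharpe
        + weights.win_rate * win_rate
        + weights.drawdown * (1.0 - max_drawdown)
    )


# Same Sharpe and win rate, very different drawdowns.
steady = composite_fitness(sharpe=1.5, win_rate=0.55, max_drawdown=0.10)    # about 1.035
reckless = composite_fitness(sharpe=1.5, win_rate=0.55, max_drawdown=0.60)  # about 0.885
```

Raising the drawdown weight widens the gap between those two scores, which is the sense in which tuning the weights tunes what survives.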

The validation is where the search stays honest

This is the part that matters most. Each candidate is not scored on the data it trained on. It is backtested on five held-out weeks deliberately drawn from different market regimes: the Covid crash of March 2020, the start of the Ukraine war in February 2022, summer chop in July 2021, fall volatility in October 2023, and a recent slice from June 2024.

The regime choice is the actual anti-overfitting mechanism, and it is more thoughtful than a random train/test split. A random split leaks regime: if calm 2021 data sits in both train and validation, a strategy that only works in calm markets scores well and you never find out it is fragile. By forcing every candidate to perform across a crash, a geopolitical shock, a quiet chop, a volatile autumn, and recent data, the fitness signal rewards strategies that generalize across conditions and starves the ones that fit a single market mood. A strategy that survives all five regimes is not guaranteed to be good, but a strategy that only survives one is reliably bad, and the validation set is built to expose exactly that.
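A sketch of how per-regime scoring might look follows. The regime labels and months come from the list above; the exact week boundaries are not given, so only months appear. How the five scores are combined is also not stated, so reporting both the mean and the worst week is an assumption, as are the names `REGIME_WEEKS`, `regime_scores`, and `summarize`.

```python
from statistics import mean
from typing import Callable, Dict

# Held-out weeks, keyed by regime. Values are the months named in the text;
# the exact week within each month is not specified.
REGIME_WEEKS: Dict[str, str] = {
    "covid_crash": "2020-03",
    "ukraine_war_start": "2022-02",
    "summer_chop": "2021-07",
    "fall_volatility": "2023-10",
    "recent": "2024-06",
}


def regime_scores(backtest_week: Callable[[str], float]) -> Dict[str, float]:
    """Score one trained candidate on every held-out regime week."""
    return {name: backtest_week(month) for name, month in REGIME_WEEKS.items()}


def summarize(scores: Dict[str, float]) -> Dict[str, float]:
    """Assumed aggregation: the mean across regimes, plus the worst single week."""
    return {"mean": mean(scores.values()), "worst": min(scores.values())}


# Toy candidates: one only works in the calm July 2021 week, one is mediocre everywhere.
def calm_only(month: str) -> float:
    return 1.2 if month == "2021-07" else 0.1


def robust(month: str) -> float:
    return 0.6


calm_summary = summarize(regime_scores(calm_only))
robust_summary = summarize(regime_scores(robust))
```

The calm-only candidate has the best single-week score of the two, yet it loses on both the mean and the worst week, which is the behavior the regime set is meant to produce.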

Evolution mechanics, and why exploration is explicit

Each generation keeps the top quarter of the population, recombines and mutates to fill most of the rest, and reserves a slice for fresh random configurations. That last part is the easy thing to skip and the wrong thing to skip. Pure selection-plus-crossover converges fast onto whatever neighborhood the early generations happened to favor, and in a space of more than half a million points that neighborhood is almost certainly not where the good strategies are. Injecting random explorers every generation keeps a channel open to regions the population would otherwise never reach. It trades some short-term fitness for not getting permanently stuck.
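The generation step can be sketched as follows. The top-quarter elite fraction comes from the description above; the 10% explorer share, drawing parents only from the elites, and the toy tuple genomes with their operators are assumptions made so the example runs on its own.

```python
import random
from typing import Callable, List, Tuple

Genome = Tuple[int, ...]


def next_generation(
    population: List[Genome],
    fitness: Callable[[Genome], float],
    random_genome: Callable[[random.Random], Genome],
    crossover: Callable[[Genome, Genome, random.Random], Genome],
    mutate: Callable[[Genome, random.Random], Genome],
    rng: random.Random,
    elite_frac: float = 0.25,     # "top quarter" from the text
    explorer_frac: float = 0.10,  # assumed size of the random slice
) -> List[Genome]:
    size = len(population)
    n_elite = max(1, int(size * elite_frac))
    n_explorers = max(1, int(size * explorer_frac))
    n_children = size - n_elite - n_explorers
    if n_children < 0:
        raise ValueError("population too small for the requested fractions")

    ranked = sorted(population, key=fitness, reverse=True)
    elites = ranked[:n_elite]
    # Children: recombine two elite parents, then mutate the result.
    children = [
        mutate(crossover(rng.choice(elites), rng.choice(elites), rng), rng)
        for _ in range(n_children)
    ]
    # Explorers: fresh random genomes that ignore the current population entirely.
    explorers = [random_genome(rng) for _ in range(n_explorers)]
    return elites + children + explorers


# Toy stand-ins: 4 genes, 3 options each, fitness = sum of genes.
GENES, OPTIONS = 4, 3

def toy_random(rng: random.Random) -> Genome:
    return tuple(rng.randrange(OPTIONS) for _ in range(GENES))

def toy_crossover(a: Genome, b: Genome, rng: random.Random) -> Genome:
    return tuple(rng.choice(pair) for pair in zip(a, b))

def toy_mutate(g: Genome, rng: random.Random, rate: float = 0.2) -> Genome:
    return tuple(rng.randrange(OPTIONS) if rng.random() < rate else x for x in g)


rng = random.Random(42)
population = [toy_random(rng) for _ in range(20)]
new_population = next_generation(
    population, sum, toy_random, toy_crossover, toy_mutate, rng
)
```

With a population of 20 this yields 5 elites, 13 children, and 2 explorers. In a real run, `fitness` would be the expensive train-and-backtest score, so the ranking step is where the evaluation budget is spent.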

The model architecture is almost incidental to all of this. Swap the transformer for something else and the hard problems are unchanged: represent the candidate so it can be evolved, score it so honesty beats luck, and validate it so the search cannot lie to you. Get those three right and the search is worth running. Get them wrong and a bigger model just overfits faster.