Deep RL for forex swing trading, and where the reward function bites back

Reinforcement learning for trading is attractive for a bad reason: it lets you avoid designing labels. You hand the agent a reward and let it figure out the rest. That freedom is exactly where it hurts you, because the reward is now the only thing standing between you and an agent that games your simulator instead of trading.

State, action, reward, in the order that matters

The state is 100-plus inputs over a 60-bar H1 lookback: price action, 69 technical indicators, volatility measures. The extractor stacks an LSTM (1024-wide, 8 layers) and a Transformer (768-dim, 12 blocks) before splitting into an actor and a critic head. The action space is discrete intent (BUY, SELL, CLOSE, HOLD) coupled with continuous position size, stop, and target.
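To make the hybrid action concrete, here is a minimal sketch of how raw policy outputs might be decoded into one trade instruction. The names (Intent, Action, decode_action), the sigmoid squashing, and the pip caps are my assumptions for illustration, not the engine's actual interface.

```python
import math
from dataclasses import dataclass
from enum import Enum


class Intent(Enum):
    BUY = 0
    SELL = 1
    CLOSE = 2
    HOLD = 3


@dataclass(frozen=True)
class Action:
    intent: Intent
    size: float         # fraction of the maximum allowed position, in [0, 1]
    stop_pips: float    # stop distance from entry, in pips
    target_pips: float  # target distance from entry, in pips


def _sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode_action(logits, raw_size, raw_stop, raw_target,
                  max_stop_pips=100.0, max_target_pips=300.0) -> Action:
    """Greedy decode: argmax over the discrete head, squash the continuous heads.

    The pip caps are hypothetical defaults, not values from the engine.
    """
    intent = Intent(max(range(len(logits)), key=lambda i: logits[i]))
    # CLOSE and HOLD open no new position, so the continuous heads are ignored.
    if intent in (Intent.CLOSE, Intent.HOLD):
        return Action(intent, 0.0, 0.0, 0.0)
    return Action(
        intent,
        _sigmoid(raw_size),
        max_stop_pips * _sigmoid(raw_stop),
        max_target_pips * _sigmoid(raw_target),
    )
```

Zeroing the continuous heads for CLOSE and HOLD is a design choice in this sketch: it keeps the agent from being credited or blamed for numbers that had no effect on the trade.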

State and action are tractable. The reward is where projects die.

Naive reward shaping blows up, predictably

The obvious reward is realized PnL: pay the agent when a trade closes green. Try it and the agent learns to do almost nothing, because HOLD has zero variance and every other action carries downside, so the safest policy is to never trade. You have built an expensive flat line.

The fix everyone reaches for next is shaping: hand out small rewards for unrealized gains, for holding winners, for cutting losers. This is where it bites back. Reward every tick of unrealized profit and the agent discovers it can farm reward by holding an open position through noise, never closing, because the mark-to-market reward keeps paying out for as long as the trade stays open. Reward trade frequency to break the do-nothing equilibrium and the agent overtrades into the spread until transaction costs eat it alive. Every shaping term you add is a new surface the optimizer will exploit in the most literal way available, and PPO is very good at finding the literal reading you did not intend.
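The farming exploit is easy to see with arithmetic. The sketch below compares two readings of "reward unrealized profit" on one made-up path of unrealized PnL in pips: paying the level on every bar, and paying only the bar-to-bar change minus a per-bar carry cost. Both function names and the numbers are hypothetical.

```python
def level_reward(unrealized_path):
    """Pay the current unrealized profit on every bar (the exploitable reading)."""
    return sum(max(u, 0.0) for u in unrealized_path)


def delta_reward(unrealized_path, carry_cost_per_bar=0.0):
    """Pay only the change in unrealized PnL per bar, minus a holding cost."""
    total = 0.0
    prev = 0.0  # the position opens flat
    for u in unrealized_path:
        total += (u - prev) - carry_cost_per_bar
        prev = u
    return total


# Unrealized PnL in pips over five bars; the trade is worth 10 pips at the end.
path = [10.0, 12.0, 9.0, 11.0, 10.0]

farmed = level_reward(path)                 # 52.0: same profit paid five times
honest = delta_reward(path)                 # 10.0: telescopes to the final mark
with_carry = delta_reward(path, 0.5)        # 7.5: holding is no longer free
```

The level reward pays 52 for a position that is only worth 10, and it keeps paying for as long as the position stays open. The delta form telescopes to the final mark, so sitting in noise earns nothing, and the carry cost makes it slightly negative.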

The discipline that survives: shape toward the objective you actually have, which is risk-adjusted return inside a drawdown ceiling, not raw PnL. Penalize drawdown directly. Make the cost of an open position real by charging spread and swap inside the environment, so the agent cannot farm unrealized marks for free. A reward term is a specification, and the agent will satisfy the specification you wrote rather than the one you meant.
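One way such a reward could look, as a sketch rather than the engine's actual function: per-step return on equity after spread and swap are charged, minus a penalty on drawdown beyond a ceiling. The signature, the 10% ceiling, and the penalty weight are all assumptions, and equity is assumed to stay positive.

```python
def step_reward(prev_equity, gross_pnl, spread_cost, swap_cost, peak_equity,
                dd_ceiling=0.10, dd_weight=5.0):
    """Return (reward, new_equity, new_peak) for one environment step.

    gross_pnl is the mark-to-market change before costs; spread_cost and
    swap_cost are what the environment charges on this step. All amounts are
    in account currency. dd_ceiling and dd_weight are hypothetical.
    """
    # Costs come out of equity first, so unrealized marks are never free.
    new_equity = prev_equity + gross_pnl - spread_cost - swap_cost
    step_return = (new_equity - prev_equity) / prev_equity

    # Drawdown is measured from the running equity peak.
    new_peak = max(peak_equity, new_equity)
    drawdown = 1.0 - new_equity / new_peak

    # Only drawdown beyond the ceiling is penalized.
    penalty = dd_weight * max(0.0, drawdown - dd_ceiling)
    return step_return - penalty, new_equity, new_peak
```

With these defaults, a step that takes the account from 10,000 to 8,500 scores roughly -0.15 for the loss plus -0.25 for the five points of drawdown past the ceiling. The penalty weight is the knob that says how much return you would trade for staying inside the limit, and it is a specification like any other term.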

Curriculum learning, because volatility is a difficulty axis

Training runs a curriculum: calm regimes, then normal, then volatile, across 128 parallel environments in bfloat16. The reason is concrete. Drop a fresh agent straight into volatile data and the gradient signal is dominated by moves it has no policy for yet, so it either freezes or learns superstition. Letting it master calm conditions first gives it a stable base policy that the harder regimes then perturb rather than overwhelm.
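A curriculum like this needs some way to decide which data counts as calm. The sketch below labels return windows by realized-volatility tercile and widens the sampling pool stage by stage. The tercile cut and the cumulative pools (earlier regimes stay in the mix) are my assumptions; the engine may define its regimes differently.

```python
import statistics

REGIMES = ("calm", "normal", "volatile")


def realized_vol(returns):
    """Population standard deviation of a window of bar returns."""
    return statistics.pstdev(returns)


def label_regimes(windows):
    """Label each window calm, normal, or volatile by volatility tercile."""
    vols = [realized_vol(w) for w in windows]
    ranked = sorted(vols)
    lo_cut = ranked[len(ranked) // 3]
    hi_cut = ranked[(2 * len(ranked)) // 3]
    labels = []
    for v in vols:
        if v < lo_cut:
            labels.append("calm")
        elif v < hi_cut:
            labels.append("normal")
        else:
            labels.append("volatile")
    return labels


def curriculum_pool(labels, stage):
    """Indices of windows the agent may sample at a given stage (0, 1, or 2).

    Pools are cumulative, so earlier regimes remain available at later stages.
    """
    allowed = set(REGIMES[: stage + 1])
    return [i for i, lab in enumerate(labels) if lab in allowed]
```

Keeping calm windows in the later pools is one way to stop the agent from forgetting the base policy while it learns the harder regimes. A strict hand-off between stages is the other obvious option.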

Walk-forward is the only evaluation that survives contact

The engine validates with a rolling 6-month train and a 1-month walk-forward test, with backtesting done on GPU in a vectorized environment. The rolling window is the point. A single train-test split on years of FX lets you tune, consciously or not, until the test period looks good, and then you have fit the test set instead of the market. Walk-forward forces the agent to face each new month cold, with parameters frozen from data that strictly precedes it. If performance only holds when the test window is fixed, it does not hold.
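The fold logic is small enough to write down. This is a sketch of a rolling 6-month train and 1-month test generator over (year, month) tuples. The function names and the choice to step forward by one test window are assumptions.

```python
def month_add(ym, n):
    """Shift a (year, month) tuple forward by n months."""
    year, month = ym
    idx = year * 12 + (month - 1) + n
    return (idx // 12, idx % 12 + 1)


def walk_forward(start, end, train_months=6, test_months=1):
    """Yield (train_start, test_start, test_end) folds as (year, month) tuples.

    Training covers [train_start, test_start) and testing covers
    [test_start, test_end). `end` is exclusive. Every test month lies
    strictly after the data used to fit that fold.
    """
    cur = start
    while True:
        test_start = month_add(cur, train_months)
        test_end = month_add(test_start, test_months)
        if test_end > end:  # tuples compare year first, then month
            break
        yield cur, test_start, test_end
        cur = month_add(cur, test_months)


# One calendar year of data gives six folds with the default windows.
folds = list(walk_forward((2023, 1), (2024, 1)))
```

The first fold trains on January through June 2023 and tests on July; the last trains on June through November and tests on December. No fold ever sees a bar from its own test month.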

The gap between backtest and live is structural

Even a clean vectorized backtest assumes things a broker will not honor: fills at the mid, infinite liquidity at your size, no requotes, stable spread through news. Live, the spread widens exactly when your volatility features fire, which is exactly when the agent most wants to act. That correlation is invisible in a backtest that models spread as a constant and brutal in production.
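A backtest can at least stop pretending the spread is constant. The sketch below widens the modeled spread when volatility exceeds a reference level and fills at mid plus or minus half of it. The linear widening rule, the parameter names, and the defaults are illustrative assumptions, not calibrated to any broker.

```python
def modeled_spread_pips(vol, vol_ref, base_spread_pips=1.0, widen_k=2.0):
    """Spread in pips that widens linearly once vol exceeds vol_ref.

    vol and vol_ref share any unit (for example ATR in pips). Both defaults
    are hypothetical.
    """
    excess = max(0.0, vol / vol_ref - 1.0)
    return base_spread_pips * (1.0 + widen_k * excess)


def fill_price(mid, side, spread_pips, pip_size=0.0001):
    """Fill a market order at mid plus or minus half the modeled spread."""
    half = 0.5 * spread_pips * pip_size
    if side == "BUY":
        return mid + half
    if side == "SELL":
        return mid - half
    raise ValueError(f"unknown side: {side!r}")


calm_spread = modeled_spread_pips(vol=10.0, vol_ref=10.0)   # 1.0 pip
news_spread = modeled_spread_pips(vol=30.0, vol_ref=10.0)   # 5.0 pips
news_buy = fill_price(1.1000, "BUY", news_spread)           # about 1.10025
```

Under these made-up parameters, a tripling of volatility costs five times the calm spread on the same trade. Even a crude model like this puts the cost where the agent most wants to act, which is the correlation a constant-spread backtest hides.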

So I treat backtest and walk-forward numbers as upper bounds on what a demo account might show, never as a forecast for a real one. The repo claims a target return range; a target is a design goal, not a result, and I will not present it as one. What is demonstrated is an agent that trains stably under curriculum, respects a drawdown ceiling in evaluation, and degrades honestly out of sample. That is the engineering claim. Profit is not.