Ideas for running big MoE models on small hardware

prismLLM is where I park ideas for one specific problem: a sparse mixture-of-experts model can have hundreds of billions of total parameters while activating only a small fraction of them per token. The compute per token is modest. The memory footprint is not. So the question that interests me is narrow and concrete: how much of that giant parameter set can you keep off the hot path and still feed the GPU fast enough to matter?

I want to be upfront about what this is. The repo carries a target table with large throughput numbers for very large models on a single consumer GPU. Treat those as a design target I was sketching toward, not a measured result. A 671B-parameter model does not fit in 32 GB of VRAM even at 2-bit, and I am not going to pretend otherwise. What follows is the part worth keeping: the design ideas, why each one helps, and where each one stops helping.
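
The arithmetic behind that statement is short enough to write down. A minimal sketch, counting weight storage only and ignoring KV cache, activations, and quantization metadata, all of which make the picture worse. The function name is mine for illustration, not something from the repo, and GB here means 10**9 bytes.

```python
def weight_gigabytes(total_params, bits_per_weight):
    """Storage for the weights alone, in GB (10**9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9

VRAM_GB = 32  # the single consumer GPU discussed above

for bits in (16, 8, 4, 2):
    need = weight_gigabytes(671e9, bits)
    print(f"{bits:>2}-bit: {need:8.1f} GB  ({need / VRAM_GB:.1f}x VRAM)")
```

At 2-bit the weights alone come to about 167.75 GB, a bit over five times the 32 GB card.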

The actual constraint

For a sparse MoE at small batch sizes, decode is memory-bound, not compute-bound. Each token routes to a few experts; the FLOPs are small. The bottleneck is moving the right expert weights into fast memory before the GPU needs them. Every design choice below is a different answer to one question: how do you make the working set small, or make fetching it cheap enough to hide?
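
One way to make "memory-bound" concrete is a roofline-style ceiling: if every active weight has to be read once per token, tokens per second cannot exceed bandwidth divided by bytes read per token. The sketch below assumes batch size 1 and no reuse or overlap. Every number in it is a hypothetical placeholder chosen for illustration, not a measurement or the spec of any particular part.

```python
def decode_ceiling_tok_per_s(active_params, bits_per_weight, bandwidth_gb_per_s):
    """Upper bound on decode rate when every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_per_s * 1e9 / bytes_per_token

# Hypothetical inputs, chosen only to show the shape of the trade-off.
ACTIVE_PARAMS = 30e9   # parameters touched per token (assumed)
BITS = 4               # average bits per weight (assumed)

for label, bw in (("fast local memory", 1000), ("slower link", 50)):
    rate = decode_ceiling_tok_per_s(ACTIVE_PARAMS, BITS, bw)
    print(f"{label:>18} at {bw:>4} GB/s: at most {rate:.1f} tok/s")
```

The ceiling scales linearly with bandwidth and inversely with bytes per token, which is why the ideas below attack either the size of the working set or the cost of fetching it.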

Mixed-precision experts

Not all experts deserve the same number of bits. Routing in a trained MoE is heavily skewed: a handful of experts fire constantly, a long tail fires rarely. So keep the hot experts at higher precision where quality is sensitive, and push the cold tail down to aggressive quantization where the occasional accuracy hit costs little because the path is taken rarely.

The non-obvious part is that you can drive the precision assignment from observed routing statistics, not from a static guess. Profile which experts a representative workload actually selects, weight precision by activation frequency, and you spend your bit budget where tokens spend their time. The failure mode to watch: a distribution shift at inference where the “cold” experts you starved suddenly become hot, and quality drops on exactly the inputs you did not profile.
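
A minimal sketch of frequency-driven bit assignment, under assumptions that are mine rather than the repo's: all experts are the same size, the budget is expressed as an average bit-width, and the allowed levels are 2, 3, 4, and 8 bits. It is a greedy heuristic, not an optimal allocator.

```python
def assign_bits(activation_counts, avg_bit_budget, levels=(2, 3, 4, 8)):
    """Start every expert at the lowest level, then walk experts from
    hottest to coldest and give each the highest level that still fits
    the total budget."""
    n = len(activation_counts)
    bits = [levels[0]] * n
    budget = avg_bit_budget * n
    used = sum(bits)
    # Hottest experts get first claim on the remaining budget.
    order = sorted(range(n), key=lambda e: activation_counts[e], reverse=True)
    for e in order:
        for target in sorted(levels, reverse=True):
            if used - bits[e] + target <= budget:
                used += target - bits[e]
                bits[e] = target
                break
    return bits

# Hypothetical routing counts from a profiling run: heavily skewed.
counts = [900, 50, 30, 10, 5, 3, 1, 1]
print(assign_bits(counts, avg_bit_budget=3.0))
```

With these counts and an average budget of 3 bits, the hottest expert gets 8 bits, the next gets 4, and the tail stays at 2. Re-running the same function on counts from a different workload is a cheap way to see how much the assignment would move under distribution shift.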

Low-rank expert compression

Expert weight matrices in a trained model are often closer to low-rank than the parameter count suggests. Factor an expert into two thin matrices and you store and move far less per expert, at the cost of an extra small matmul. For a memory-bound decode loop that trade is usually favorable: you are spending cheap compute to buy back scarce bandwidth.
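
The storage side of that trade is simple to compute. A dense m x n matrix holds m*n values; a rank-r factorization holds r*(m+n), so it saves memory whenever r is below m*n/(m+n). The matrix shape and rank below are hypothetical, picked only to make the numbers round.

```python
def dense_params(m, n):
    return m * n

def lowrank_params(m, n, r):
    # Two thin factors: (m x r) and (r x n).
    return r * (m + n)

def break_even_rank(m, n):
    """Rank below which the factored form stores fewer values than the dense one."""
    return (m * n) / (m + n)

M, N, R = 4096, 4096, 256  # hypothetical expert matrix shape and rank
print("dense:     ", dense_params(M, N))
print("low-rank:  ", lowrank_params(M, N, R))
print("reduction: ", dense_params(M, N) / lowrank_params(M, N, R))
print("break-even:", break_even_rank(M, N))
```

For this shape, rank 256 stores and moves one eighth of the dense values. Below the break-even rank a matrix-vector product through the two factors also needs fewer multiply-adds than the dense one, so the added cost is a second, smaller matmul and the reconstruction error rather than more total arithmetic.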

This composes with mixed precision rather than competing with it. Rank is one knob, bit-width is another, and the cold tail is where you can be aggressive on both. The limit is reconstruction error compounding across layers, so the rank floor has to be set per-layer from real activations, not uniformly.

Predictive expert prefetch

If experts live in slower memory (system RAM, or a second NUMA node) and stream into VRAM on demand, the naive version stalls every token waiting on the fetch. The idea is to predict the next token’s likely experts from the router’s behavior on the current context and prefetch them while the current token is still computing, so the transfer hides behind compute instead of blocking it.

Routing has real temporal structure inside a single generation, which is what makes a cheap predictor worth running. The honest caveat: a mispredict costs you a wasted transfer plus the on-demand stall you were trying to avoid, so the predictor has to be cheap and the fetch has to be cancellable. A predictor that is wrong half the time can be worse than no predictor, because its wasted transfers compete for the same link the on-demand fetches need.
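
To get a feel for how much temporal structure there is to exploit, a baseline predictor can be scored offline against a recorded routing trace. The sketch below is a stand-in, not the repo's predictor: it guesses the next token's experts as the k most frequently routed experts in a sliding window, for a single layer. The class name, the window size, and the toy trace are all assumptions for illustration; a real predictor would more likely look at router logits or hidden states.

```python
from collections import Counter, deque

class WindowPredictor:
    """Predict the next token's experts as the k most frequently routed
    experts over the last `window` tokens (one layer)."""
    def __init__(self, k, window=8):
        self.k = k
        self.history = deque(maxlen=window)

    def observe(self, experts):
        self.history.append(tuple(experts))

    def predict(self):
        counts = Counter(e for step in self.history for e in step)
        return {e for e, _ in counts.most_common(self.k)}

def hit_rate(trace, k, window=8):
    """Fraction of routed experts that the predictor had already guessed."""
    pred = WindowPredictor(k, window)
    hits = total = 0
    for experts in trace:
        guess = pred.predict()          # what would have been prefetched
        hits += sum(1 for e in experts if e in guess)
        total += len(experts)
        pred.observe(experts)           # learn from the real routing
    return hits / total if total else 0.0

# Toy trace: the experts each successive token routed to.
trace = [(0, 1), (0, 1), (0, 2), (0, 1)]
print(hit_rate(trace, k=2))
```

Every expert outside the hit set is an on-demand stall, and every guessed expert that goes unused is a wasted transfer, so the hit rate from a trace like this is the first number to check before building anything more elaborate.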

NUMA-aware allocation

On an APU or a dual-socket box, “system memory” is not uniform. Pulling expert weights across a NUMA boundary on the critical path adds latency you cannot hide cheaply. The idea is to pin the hottest experts to the memory closest to the compute that consumes them, and let only the cold tail live across the link where its rare access pays the penalty. It is the same hot/cold split as the precision and rank decisions, applied to physical placement instead of bits.
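
The placement decision itself can be sketched as a greedy fill: hottest experts go to the near node until it is full, everything else goes across the link. This only decides which expert goes where; the actual pinning would be done with something like numactl or libnuma, which this sketch does not call. The node labels, sizes, and counts are hypothetical.

```python
def place_experts(activation_counts, expert_bytes, near_capacity_bytes):
    """Fill the near node with the hottest experts that fit;
    send the rest to the far node."""
    order = sorted(range(len(activation_counts)),
                   key=lambda e: activation_counts[e], reverse=True)
    placement, used = {}, 0
    for e in order:
        if used + expert_bytes[e] <= near_capacity_bytes:
            placement[e] = "near"
            used += expert_bytes[e]
        else:
            placement[e] = "far"
    return placement

def far_access_fraction(activation_counts, placement):
    """Share of expert activations that have to cross the NUMA link."""
    total = sum(activation_counts)
    far = sum(c for e, c in enumerate(activation_counts)
              if placement[e] == "far")
    return far / total if total else 0.0

# Hypothetical profile: 8 equally sized experts, room for 3 near the compute.
counts = [900, 50, 30, 10, 5, 3, 1, 1]
sizes = [1] * 8
placement = place_experts(counts, sizes, near_capacity_bytes=3)
print(placement)
print(far_access_fraction(counts, placement))
```

With routing this skewed, keeping three of eight experts local leaves only 2% of activations crossing the link, which is the whole argument for splitting by heat rather than by index.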

Why I keep it framed as exploration

It would be easy to write a confident benchmark post here. The ideas are real and individually defensible, which is exactly why it is tempting to overclaim the sum of them. But “we made a 671B model run entirely in a 5090’s 32 GB of VRAM” is a memory claim that arithmetic refutes before any benchmark runs. The useful contribution is the decomposition: separate the experts that matter from the ones that do not, and spend precision, rank, bandwidth, and locality accordingly. That framework survives whether or not any single throughput number does.

A few experts active per token, at mixed precision