ACTIVE · 2026 · Rust · MoE inference

PrismLLM

A research exploration into running large mixture-of-experts models on modest hardware: mixed-precision experts, low-rank compression, and predictive expert prefetch.

Problem

A sparse MoE model like DeepSeek activates only a handful of its experts per token, yet the whole model has to be resident to serve it. That gap between what is stored and what is used per token is where an inference engine has room to be clever on hardware that cannot fit the dense weights comfortably.

Architecture

PrismLLM is a Rust inference engine that explores several ideas against that gap: mixed-precision quantisation that keeps hot experts at higher precision and cold ones at lower precision, low-rank (SVD) compression of expert weights, a shadow router that predicts the experts needed a layer or two ahead so they can be prefetched, and NUMA-aware allocation for APU systems. The interesting work is the architecture of where precision and bandwidth are spent, not a leaderboard number.
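As an illustration of the shadow-router idea, here is a minimal sketch, assuming the predictor is a small linear gate that scores experts for a later layer from the current hidden state. `ShadowRouter`, `predict_top_k`, and `prefetch_list` are hypothetical names for this example, not taken from the repository, and a real predictor may look quite different.

```rust
/// Hypothetical shadow router: a small linear gate that scores the experts
/// of a later layer from the current hidden state. Illustrative only.
pub struct ShadowRouter {
    /// One weight row per expert, each row as long as the hidden state.
    weights: Vec<Vec<f32>>,
}

impl ShadowRouter {
    pub fn new(weights: Vec<Vec<f32>>) -> Self {
        Self { weights }
    }

    /// Returns the indices of the `k` highest-scoring experts, best first.
    pub fn predict_top_k(&self, hidden: &[f32], k: usize) -> Vec<usize> {
        let mut scored: Vec<(usize, f32)> = self
            .weights
            .iter()
            .enumerate()
            .map(|(i, row)| {
                // Dot product of the expert's gate row with the hidden state.
                let score: f32 = row.iter().zip(hidden).map(|(w, h)| w * h).sum();
                (i, score)
            })
            .collect();
        // Stable sort, descending by score; ties keep the lower expert index first.
        scored.sort_by(|a, b| b.1.total_cmp(&a.1));
        scored.into_iter().take(k).map(|(i, _)| i).collect()
    }
}

/// Experts worth prefetching: predicted, but not already resident in fast memory.
pub fn prefetch_list(predicted: &[usize], resident: &[usize]) -> Vec<usize> {
    predicted
        .iter()
        .copied()
        .filter(|e| !resident.contains(e))
        .collect()
}
```

With three experts whose gate rows are `[1, 0]`, `[0, 1]`, and `[1, 1]` and a hidden state of `[2, 3]`, the scores are 2, 3, and 5, so `predict_top_k(&[2.0, 3.0], 2)` picks experts 2 and 1. If expert 1 is already resident, only expert 2 needs to be fetched. A wrong prediction costs wasted bandwidth rather than correctness, since the real router still makes the final choice.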

[Diagram: each token passes through a top-2 gating router. Of eight experts (e0 to e7), e1 and e5 are hot and held at Q6, while the other six are cold at Q2. A speculative shadow router prefetches the experts expected next.]
A few experts active per token, at mixed precision
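The hot/cold split can be sketched as a simple policy, assuming "hot" means "routed to most often" over some profiling window. The `Precision` enum, `assign_precision`, and `mean_bits` are hypothetical names for this example, and the bit widths are nominal: real quantisation formats also store per-block scales, which this ignores.

```rust
/// Nominal weight precision for an expert. Hypothetical type for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Precision {
    Q2,
    Q6,
}

impl Precision {
    /// Nominal bits per weight, ignoring scale and zero-point overhead.
    pub fn bits(self) -> u32 {
        match self {
            Precision::Q2 => 2,
            Precision::Q6 => 6,
        }
    }
}

/// Gives the `hot_count` most frequently routed experts Q6 and the rest Q2.
/// `hits[i]` is how many tokens were routed to expert `i` while profiling.
pub fn assign_precision(hits: &[u64], hot_count: usize) -> Vec<Precision> {
    let mut order: Vec<usize> = (0..hits.len()).collect();
    // Stable sort, descending by hit count; ties keep the lower index first.
    order.sort_by(|&a, &b| hits[b].cmp(&hits[a]));
    let mut plan = vec![Precision::Q2; hits.len()];
    for &expert in order.iter().take(hot_count) {
        plan[expert] = Precision::Q6;
    }
    plan
}

/// Average nominal bits per weight, assuming all experts are the same size.
pub fn mean_bits(plan: &[Precision]) -> f64 {
    if plan.is_empty() {
        return 0.0;
    }
    plan.iter().map(|p| p.bits() as f64).sum::<f64>() / plan.len() as f64
}
```

For the eight-expert layout in the diagram, two experts at Q6 and six at Q2 average out to (2 × 6 + 6 × 2) / 8 = 3 nominal bits per weight, which is the kind of budget this policy trades against accuracy on the frequently used experts.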

Outcome

This is a research direction, not a benchmarked product. The repository’s headline throughput figures are a design target, not a measured result (a 671B-parameter model does not fit in a consumer GPU’s memory even at two bits per weight), and I keep that distinction explicit. The writeup, “Ideas for running big MoE models on small hardware”, covers the design ideas honestly.
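The memory claim above is easy to check with back-of-envelope arithmetic. The sketch below counts weight bits only, ignoring quantisation scales, activations, and the KV cache. The 24 GiB capacity is an assumed illustrative figure for a consumer GPU, not a number from the project, and `weight_bytes` is a hypothetical helper.

```rust
/// Bytes needed to store `params` weights at `bits_per_weight` bits each,
/// rounded up to a whole byte. Weights only; no scales, activations, or KV cache.
pub fn weight_bytes(params: u64, bits_per_weight: u64) -> u64 {
    (params * bits_per_weight + 7) / 8
}

/// Assumed capacity for illustration: 24 GiB of GPU memory.
pub const ASSUMED_GPU_BYTES: u64 = 24 * 1024 * 1024 * 1024;

/// True if the weights alone would fit in `capacity_bytes`.
pub fn weights_fit(params: u64, bits_per_weight: u64, capacity_bytes: u64) -> bool {
    weight_bytes(params, bits_per_weight) <= capacity_bytes
}
```

At two bits per weight, 671 billion parameters need 671 × 10⁹ × 2 / 8 = 167.75 × 10⁹ bytes, roughly 168 GB for the weights alone, several times the assumed 24 GiB.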