PrismLLM

Problem

A sparse MoE model like DeepSeek activates only a handful of its experts per token, yet the whole model has to be resident to serve it. That gap between what is stored and what is used per token is where an inference engine has room to be clever on hardware that cannot fit the dense weights comfortably.
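To make "a handful of experts per token" concrete, here is a minimal sketch of top-k expert selection. The scores, the expert count of 8 and k = 2 are made-up illustrative values, and `top_k` is a hypothetical helper, not a function from the repository.

```rust
/// Return the indices of the `k` largest router scores, highest first.
fn top_k(scores: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    // Sort descending by score; total_cmp gives a total order over f32.
    idx.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    idx.truncate(k);
    idx
}

fn main() {
    // Hypothetical router scores for one token over 8 experts.
    let scores = [0.1, 2.3, -0.7, 1.9, 0.0, 0.4, -1.2, 0.8];
    let active = top_k(&scores, 2);
    // Only the selected experts run for this token; the other six still
    // occupy memory, which is the gap described above.
    println!("active experts: {:?} of {}", active, scores.len());
}
```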

Architecture

The project is a Rust inference engine that explores several ideas for exploiting that gap: mixed-precision quantisation that keeps hot experts at higher precision and cold ones at lower precision, low-rank (SVD) compression of expert weights, a shadow router that predicts which experts will be needed a layer or two ahead so they can be prefetched, and NUMA-aware allocation for APU systems. The interesting work is the architecture of where precision and bandwidth are spent, not a leaderboard number.
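One way the hot/cold precision split could look is sketched below. Everything here is an assumption for illustration: the three tiers (8, 4 and 2 bits), the ranking by raw routing counts, and the names `Precision`, `assign_precision` and `mean_bits` are hypothetical and not taken from the repository.

```rust
/// Hypothetical precision tiers; the bit widths are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Precision {
    Bits8,
    Bits4,
    Bits2,
}

impl Precision {
    fn bits(self) -> u32 {
        match self {
            Precision::Bits8 => 8,
            Precision::Bits4 => 4,
            Precision::Bits2 => 2,
        }
    }
}

/// Assign a tier to each expert from observed routing counts.
/// The hottest `hot_frac` of experts get 8 bits, the next `warm_frac`
/// get 4 bits, and the remainder get 2 bits.
fn assign_precision(hits: &[u64], hot_frac: f64, warm_frac: f64) -> Vec<Precision> {
    let n = hits.len();
    // Rank expert indices from most to least frequently routed.
    let mut order: Vec<usize> = (0..n).collect();
    order.sort_by(|&a, &b| hits[b].cmp(&hits[a]));

    let hot = (n as f64 * hot_frac).ceil() as usize;
    let warm = (n as f64 * warm_frac).ceil() as usize;

    let mut tiers = vec![Precision::Bits2; n];
    for (rank, &expert) in order.iter().enumerate() {
        tiers[expert] = if rank < hot {
            Precision::Bits8
        } else if rank < hot + warm {
            Precision::Bits4
        } else {
            Precision::Bits2
        };
    }
    tiers
}

/// Average bits per weight, assuming all experts are the same size.
fn mean_bits(tiers: &[Precision]) -> f64 {
    let total: u32 = tiers.iter().map(|p| p.bits()).sum();
    total as f64 / tiers.len() as f64
}

fn main() {
    // Made-up routing counts for 8 experts.
    let hits = [900, 20, 450, 5, 30, 10, 700, 15];
    let tiers = assign_precision(&hits, 0.25, 0.25);
    for (i, t) in tiers.iter().enumerate() {
        println!("expert {}: {:?}", i, t);
    }
    println!("mean bits per weight: {:.2}", mean_bits(&tiers));
}
```

The point of the sketch is the budget trade-off: the average storage cost stays low because most experts sit in the cheapest tier, while the experts that handle most tokens keep more precision.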

A few experts active per token, at mixed precision

Outcome

This is a research direction, not a benchmarked product. The repository’s headline throughput figures are a design target, not a measured result (a 671B-parameter model does not fit in a consumer GPU’s memory even at two bits), and I keep that distinction explicit. The writeup covers the design ideas honestly: Ideas for running big MoE models on small hardware.
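The two-bit remark above can be checked with quick arithmetic. The sketch below counts weight storage only, ignoring quantisation scales, activations and the KV cache, all of which add more. The 24 GiB figure is an assumed consumer-GPU memory size for comparison, not a number from the repository.

```rust
/// Bytes needed to hold `params` weights at `bits_per_weight`.
/// Weights only: scales, activations and KV cache are not counted.
fn weight_bytes(params: u64, bits_per_weight: u64) -> u64 {
    params * bits_per_weight / 8
}

/// Convert bytes to GiB (2^30 bytes).
fn gib(bytes: u64) -> f64 {
    bytes as f64 / (1u64 << 30) as f64
}

fn main() {
    let params: u64 = 671_000_000_000; // 671B parameters, from the text
    let vram_gib: f64 = 24.0; // assumed consumer GPU memory, for comparison only
    for bits in [16u64, 8, 4, 2] {
        let need = gib(weight_bytes(params, bits));
        println!("{:>2} bits: {:>7.1} GiB, fits: {}", bits, need, need <= vram_gib);
    }
}
```

At two bits the weights alone come to about 168 GB (roughly 156 GiB), several times the assumed 24 GiB, which is why the throughput figures are treated as a design target.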