

What Is a Mixture of Experts Model?


_A plain-English explainer for people who want to run LLMs locally._


_Last updated: 2026-03-02_


The Short Version


A Mixture of Experts (MoE) model is a neural network where only a small fraction of the total parameters are used for any given input. Instead of one giant brain doing everything, it's more like a team of specialists — and a router decides which specialists to consult for each question.


This means a 700-billion-parameter MoE model might activate only 40 billion parameters per token. You get the quality of a huge model with the speed of a much smaller one.


How It Works


The Three Key Parts


1. Experts — Individual sub-networks (typically feed-forward layers) that each specialize in different types of information. A model might have 64, 128, or even 256 experts.


2. Router (Gate) — A small network that looks at each input token and decides which experts should handle it. The router outputs a probability distribution over experts and selects the top-K.


3. Top-K Selection — Only K experts (typically 2-8) are activated per token. The rest sit idle. Their outputs are weighted by the router's confidence scores and combined.
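The three parts above can be sketched in a few lines of NumPy. This is an illustrative toy, not any real model's implementation; the dimensions, expert count, and the `moe_layer` name are all made up for the example:

```python
import numpy as np

def moe_layer(x, experts, router_w, k=2):
    """One MoE layer applied to a single token vector x.

    experts:  list of callables, one per expert sub-network
    router_w: (d_model, n_experts) routing weight matrix
    k:        number of experts activated per token (top-K)
    """
    logits = x @ router_w                        # one router score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over all experts
    top_k = np.argsort(probs)[-k:]               # indices of the K best experts
    weights = probs[top_k] / probs[top_k].sum()  # renormalize over the selected K
    # Only the selected experts run; the rest sit idle for this token.
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

# Toy demo: 4 experts, each a simple linear map on an 8-dim token.
rng = np.random.default_rng(0)
d, n_experts = 8, 4
expert_mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, M=M: x @ M for M in expert_mats]
router_w = rng.normal(size=(d, n_experts))
token = rng.normal(size=d)
out = moe_layer(token, experts, router_w, k=2)
print(out.shape)  # (8,)
```

The output has the same shape as the input, so MoE layers drop into a transformer exactly where a dense feed-forward layer would go.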


A Concrete Example: GLM-5


GLM-5 has 744 billion total parameters, but only ~40 billion are active per token. That is roughly 5% of the model doing the work for any given token; the other ~95% of the experts sit idle in memory.



The Math (Simplified)


For a dense 70B model, every one of the 70 billion parameters participates in every token, so each token costs roughly 2 × 70B ≈ 140 GFLOPs.

For a 235B MoE model with 22B active, only the routed 22 billion parameters participate, so each token costs roughly 2 × 22B ≈ 44 GFLOPs — about 3.2x less compute, despite the model being over three times larger.
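As a rough rule of thumb, a transformer forward pass costs about 2 FLOPs per *active* parameter per token (one multiply plus one add). Under that assumption, the dense-vs-MoE comparison works out to:

```python
# Rough rule of thumb: ~2 FLOPs per active parameter per token.
def flops_per_token(active_params):
    return 2 * active_params

dense = flops_per_token(70e9)   # dense 70B: every parameter is active
moe = flops_per_token(22e9)     # 235B MoE with 22B active per token

print(f"dense 70B:      {dense / 1e9:.0f} GFLOPs/token")  # 140
print(f"MoE 22B active: {moe / 1e9:.0f} GFLOPs/token")    # 44
print(f"speedup:        ~{dense / moe:.1f}x")             # ~3.2x
```

This is only a first-order estimate (it ignores attention cost and memory bandwidth), but it matches the ~3x speedup quoted in the comparison table below.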


Why MoE Matters for Local LLM Users


The Good


Much faster inference — Since only a fraction of parameters are active, you get significantly fewer floating-point operations per token. A 235B MoE model can be faster than a dense 70B model while potentially being smarter.


Better quality per compute — MoE models can store more knowledge across their experts. Each expert can specialize in certain domains (code, math, languages, etc.), leading to better performance across diverse tasks.


Scaling without proportional cost — You can make models much bigger without the inference cost scaling linearly. This is why we're seeing 400B–744B parameter open models that are actually runnable.


The Bad


Memory is still proportional to total parameters — Even though only 40B parameters are active, you still need to store all 744B parameters in memory (or swap them). This is the primary constraint for local deployment.


Router overhead — The routing computation adds a small fixed cost per token. Negligible for large models, but measurable for very small ones.


Expert load balancing — If the router consistently favors certain experts while ignoring others, you waste parameters. Training MoE models requires careful auxiliary loss functions to encourage balanced expert usage.


Quantization is more complex — Different experts may have different sensitivity to quantization. Uniform quantization (same bits for all experts) may disproportionately hurt some specialists.
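The memory figures quoted throughout this article follow from simple arithmetic: total parameters times bits per weight. The bits-per-weight values below are my assumptions for typical Q4/Q2 GGUF-style quants (including some metadata overhead), not exact specs:

```python
# Back-of-envelope weight memory: total params (billions) x bits-per-weight / 8.
# 4.5 and 2.7 bits/weight are assumed averages for typical Q4/Q2 quant
# formats; real files vary by format and per-tensor bit allocation.
def weight_memory_gb(total_params_b, bits_per_weight):
    return total_params_b * bits_per_weight / 8

for name, total in [("Qwen3-235B-A22B", 235), ("GLM-5", 744)]:
    q4 = weight_memory_gb(total, 4.5)
    q2 = weight_memory_gb(total, 2.7)
    print(f"{name}: Q4 ~{q4:.0f} GB, Q2 ~{q2:.0f} GB")
```

Note that this counts *total* parameters, not active ones — which is exactly the MoE memory problem described above.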


The Game-Changing Part for Unified Memory


Here's why MoE is a big deal for machines with shared CPU/GPU memory:


MoE offloading: Since only a few experts are active per token, you can keep the frequently used experts in fast memory (GPU/unified) and store the rest in slower memory (CPU RAM or even NVMe). The performance hit can stay small because each MoE layer only needs the router's chosen 2-8 experts for a given token, not all 128+.


On a discrete GPU (like an RTX 4090), this means storing experts in system RAM and shuttling them across the PCIe bus — which is slow.


On a unified memory machine (like Strix Halo or Apple Silicon), there's no bus to cross. CPU and GPU share the same physical memory. Loading inactive experts is nearly as fast as accessing active ones. This is why unified memory machines are disproportionately good at running MoE models.
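One way to see why the transfer cost stays bounded is to model expert placement as an LRU cache over a small pool of fast-memory slots. Everything below — the slot count, expert count, and the `ExpertCache` class — is an illustrative assumption, not how any real runtime is implemented:

```python
import random

class ExpertCache:
    """Toy LRU cache modeling fast-memory slots for MoE experts.

    Inactive experts live in slower memory; fetching one that is not
    resident counts as one transfer (a slow -> fast copy).
    """
    def __init__(self, slots):
        self.slots = slots
        self.resident = []   # expert ids in fast memory, LRU order
        self.transfers = 0

    def fetch(self, eid):
        if eid in self.resident:
            self.resident.remove(eid)    # refresh LRU position
        else:
            self.transfers += 1          # pulled from slow memory
            if len(self.resident) >= self.slots:
                self.resident.pop(0)     # evict least-recently-used expert
        self.resident.append(eid)

# 128 experts, room for only 16 in fast memory, 8 active per token.
random.seed(0)
cache = ExpertCache(slots=16)
n_experts, k, tokens = 128, 8, 1000
for _ in range(tokens):
    for eid in random.sample(range(n_experts), k):  # stand-in for the router
        cache.fetch(eid)
print(f"transfers per token: {cache.transfers / tokens:.1f} (worst case {k})")
```

Even with completely random routing (the worst case for caching), transfers per token are capped at K, never the full expert count — and on unified memory each "transfer" is nearly free because there is no bus to cross.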


[Chart: MoE total vs. active parameters]

[Chart: dense vs. MoE speed]


Dense vs MoE: Side-by-Side


| Property | Dense Model | MoE Model |
|---|---|---|
| Example | Llama 3.3 70B | Qwen3-235B-A22B |
| Total parameters | 70B | 235B |
| Active parameters/token | 70B (all) | 22B |
| Inference speed | Baseline | ~3x faster (fewer active params) |
| Memory for weights (Q4) | ~40 GB | ~130 GB |
| Memory for weights (Q2) | ~24 GB | ~80 GB |
| Quality | Strong | Potentially stronger (more total knowledge) |
| Best hardware fit | High-bandwidth GPU | Large unified memory |

[Chart: flagship model memory at different quantizations]


The Current MoE Landscape (2026)


| Model | Total Params | Active/Token | Experts | Q4 Memory | Q2 Memory |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | 30B | 3B | 128 (8 active) | ~18 GB | ~11 GB |
| Qwen3-80B-A3B | 80B | 3B | 128 (8 active) | ~48 GB | ~28 GB |
| Llama 4 Scout | 109B | 17B | 16 | ~58 GB | ~35 GB |
| Qwen3-235B-A22B | 235B | 22B | 128 (8 active) | ~130 GB | ~80 GB |
| Llama 4 Maverick | 400B | 17B | 128 | ~220 GB | ~130 GB |
| DeepSeek R1 / V3.1 | 671B | ~37B | 256 | ~400 GB | ~176 GB |
| GLM-5 | 744B | ~40B | MoE | ~430 GB | ~241 GB |

[Chart: model requirements vs. hardware limits]


Practical Advice: MoE on Your Hardware


If you have 24GB VRAM (RTX 4090): Qwen3-30B-A3B fits entirely on the GPU at Q4 (~18 GB) and runs very fast. Anything larger means offloading experts to system RAM across PCIe, which costs real speed.

If you have 128GB unified memory (Strix Halo / M4 Max): Llama 4 Scout runs comfortably at Q4 (~58 GB), and Qwen3-235B-A22B fits at Q2 (~80 GB) with room left over for context.

If you have 192GB+ (M3 Ultra): Qwen3-235B-A22B becomes practical at Q4 (~130 GB), as does Llama 4 Maverick at Q2 (~130 GB).

If you have 512GB (M3 Ultra maxed): DeepSeek R1 / V3.1 at Q4 (~400 GB) and GLM-5 at Q2 (~241 GB) are within reach, with headroom left for KV cache.


Key Takeaway


MoE models are the reason 700B+ parameter models are even remotely possible to run locally. They trade memory capacity for compute efficiency — and unified memory machines with large RAM pools are the ideal hardware to exploit this tradeoff.


If you're buying hardware for local LLM inference in 2026 and care about running the biggest open models, buy memory, not compute. A $2,000 mini PC with 128GB of unified RAM will run bigger models than a $20,000 multi-GPU server with 96GB of split VRAM.



