Mixture of experts: the architecture redefining efficiency in LLMs

When Z.ai's GLM-5.1 took first place on SWE-Bench Pro in April 2026 with 744 billion total parameters but only 40 billion active per token, many readers saw the two numbers and were unsure how to reconcile them. The gap between total and active parameters is the central idea of the Mixture of Experts (MoE) architecture, and the reason it matters to anyone operating AI infrastructure.

The problem MoE solves

Conventional large language models are dense neural networks: every token processed activates all of the model's parameters. A dense 70-billion-parameter model uses all 70 billion parameters for each token it processes, whether the prompt is a simple question or a complex software engineering problem.

This is computationally expensive and, in many cases, unnecessary. You do not need a database specialist to answer a question about cooking. MoE solves exactly that.

How it works in practice

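A Mixture of Experts architecture splits parts of the network, typically the feed-forward blocks, into multiple "experts": separate sub-networks with their own parameters. Experts are not assigned tasks by hand; whatever specialization they have emerges during training. Each MoE layer has a router that, for every token, scores the experts and activates only a few of them.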
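The routing step can be illustrated with a small sketch. It assumes top-2 routing with softmax weights renormalized over the selected experts, which is one common scheme and not necessarily the one GLM-5.1 uses. The names route_token and moe_layer, the toy experts, and the logit values are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k=2):
    """Pick the top_k experts for one token and renormalize their weights.

    router_logits: one score per expert (hypothetical router output).
    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]

def moe_layer(token, experts, router_logits, top_k=2):
    """Weighted sum of the outputs of the selected experts only.

    Experts that were not selected are never called, which is where
    the compute saving comes from.
    """
    out = [0.0] * len(token)
    for idx, weight in route_token(router_logits, top_k):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

if __name__ == "__main__":
    # Eight toy "experts": each one just scales its input by a different factor.
    experts = [lambda x, s=s: [s * v for v in x] for s in range(1, 9)]
    logits = [0.1, 2.0, -1.0, 0.3, 1.5, -0.5, 0.0, 0.2]
    print(route_token(logits))  # experts 1 and 4 have the highest scores
    print(moe_layer([1.0, 2.0], experts, logits))
```

In a real model the router logits come from a small learned projection of the token's hidden state, and each expert is a feed-forward network rather than a scaling function.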

The result: the model has enormous total capacity, but uses only a fraction of it for each token. GLM-5.1, for example, has 744 billion parameters but activates only 40 billion per token, about 5.4% of the total. Since per-token compute scales roughly with the number of active parameters, this cuts the compute cost of inference by more than 90% compared with a dense model of the same total size.
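The arithmetic behind that figure is easy to check. The sketch below uses the parameter counts quoted above and assumes the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass, which ignores attention over the context and other overheads. The function names are illustrative.

```python
def active_fraction(total_params, active_params):
    """Share of the model's parameters used for a single token."""
    return active_params / total_params

def approx_forward_flops_per_token(active_params):
    """Rule-of-thumb estimate: ~2 FLOPs per active parameter per token."""
    return 2 * active_params

TOTAL = 744e9   # total parameters quoted for GLM-5.1
ACTIVE = 40e9   # active parameters per token quoted for GLM-5.1

frac = active_fraction(TOTAL, ACTIVE)
print(f"{frac:.1%} active, {1 - frac:.1%} idle per token")  # 5.4% active, 94.6% idle

moe_flops = approx_forward_flops_per_token(ACTIVE)
dense_flops = approx_forward_flops_per_token(TOTAL)  # dense model of equal size
print(f"dense / MoE compute ratio: {dense_flops / moe_flops:.1f}x")  # 18.6x
```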

Who is using it

GLM-5.1 was not a pioneer. Mistral's Mixtral 8x7B, released in December 2023, was one of the first widely adopted open-weight MoE models. GPT-4 is widely reported to use some variant of the architecture, though OpenAI has not confirmed it. Alibaba's Qwen, with 397 billion parameters, is also based on MoE.

In 2026, the trend is clear: the largest models on the market are practically all MoE. Dense models are being reserved for smaller sizes, where the overhead of routing does not pay off.

Implications for infrastructure

For anyone operating data centers or planning AI infrastructure, MoE has direct implications:

GPU memory: all parameters must be loaded into memory even though only a fraction is used per token. A 744-billion-parameter model in FP16 (2 bytes per parameter) needs approximately 1.5 TB of VRAM for the weights alone, which means at least 19 GPUs with 80 GB each (such as the A100 80GB or H100), even though only 40 billion parameters are active per token.
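A back-of-the-envelope sizing for the weights can be sketched as follows. It assumes 80 GB of memory per GPU and counts weights only, so KV cache, activations, and framework overhead would push the real number higher. The helper names are hypothetical.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}

def weight_memory_gb(total_params, dtype="fp16"):
    """Memory for the weights alone, in decimal gigabytes."""
    return total_params * BYTES_PER_PARAM[dtype] / 1e9

def min_gpus_for_weights(total_params, gpu_memory_gb=80, dtype="fp16"):
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    return math.ceil(weight_memory_gb(total_params, dtype) / gpu_memory_gb)

TOTAL = 744e9  # all parameters must be resident, not just the 40B active ones

print(f"{weight_memory_gb(TOTAL):.0f} GB in FP16")        # 1488 GB in FP16
print(min_gpus_for_weights(TOTAL))                         # 19
print(min_gpus_for_weights(TOTAL, dtype="fp8"))            # 10
```

Note that the active parameter count does not appear anywhere in this calculation: memory is sized by total parameters, while compute is sized by active ones.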

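Latency: the routing decision itself adds little latency per token, and the throughput gain from computing fewer parameters is far greater. When experts are spread across GPUs, tokens must be sent to the devices that hold their experts, so interconnect bandwidth matters. For batch workloads, MoE is clearly superior to a dense model of the same total size.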

Rack temperature: MoE workloads can have an irregular power consumption pattern, with spikes when rarely used experts are activated and lower draw at other times. Thermal management needs to account for this behavior.

Why this changes the cost calculation

In the cloud, you pay for compute used. A 744-billion-parameter MoE model processing a request uses significantly less compute than a dense model of the same size, because most of the parameters are not activated for any given token.

For high-volume operations, this efficiency changes the TCO substantially. It is one of the reasons MoE models tend to have lower per-token prices on the market's leading APIs.

The direction of the market

MoE is not an architectural curiosity — it is the path the market chose to scale capacity without scaling cost in the same proportion. Understanding how it works is increasingly relevant for anyone making decisions about AI infrastructure.