glm_moe_dsa — 754B-A40B coding/agentic flagship — MIT license

GLM-5.1 is Z.ai (Zhipu)'s second-generation agentic-coding flagship and the first member of the glm_moe_dsa model family in HuggingFace transformers. It is a 754B-total / ~40B-active sparse Mixture-of-Experts decoder, trained on 28.5 trillion tokens and optimized for long-horizon agentic engineering: hundreds of rounds, thousands of tool calls, full-repository edits. Architecturally, it sits at the intersection of three independent lineages -- it inherits its body from GLM-4.6, its attention from DeepSeek V3, and its sparse-attention indexer from DeepSeek V3.2 -- and combines them into a single unified design.
Every GLM-5.1 decoder layer is built from three independent technologies, each addressing a different bottleneck of trillion-parameter models:
MLA (Multi-head Latent Attention): keys and values are compressed into a shared low-rank latent (kv_lora_rank=512), and the query through a separate latent (q_lora_rank=2048). At decode time the cached state is the compressed latent, not the expanded K/V -- roughly a 10x KV-cache reduction versus a comparable GQA model. RoPE is applied only to a small decoupled subspace (qk_rope_head_dim=64) of each head.
DSA (DeepSeek Sparse Attention): a lightweight indexer scores every cached key against the current query and selects the top 2,048; every other position is masked to -inf. This makes the per-token attention cost independent of sequence length beyond ~2K, enabling the published 200K context.
Sigmoid-routed MoE: each token activates 8 of 256 routed experts plus 1 shared expert, selected by sigmoid scores corrected with e_score_correction_bias. Each routed expert is small (moe_intermediate_size=2048) -- only 1/6 of the dense FFN width -- so 8 routed experts deliver only ~1.3x the compute of one dense FFN, while the model has access to the full 256-expert pool. The shared expert provides always-on baseline capacity.
Dense prefix: the first 3 layers use a dense FFN (intermediate_size=12288) instead of MoE, inherited from the GLM-4 family. The remaining 75 layers are sparse. The rationale (originally from DeepSeek-V3): early layers do general feature extraction where routing decisions are unstable, so the gradient signal benefits from a fully dense path before routing kicks in.
Each layer applies pre-norm RMSNorm, then MLA+DSA self-attention, residual, post-norm RMSNorm, then dense-or-MoE MLP, residual. The structure is otherwise classical: there are no parallel residual streams, no per-layer embeddings, no sliding/global hybrid -- just one homogeneous block repeated 78 times. The novelty is entirely in what happens inside the attention and MLP subblocks.
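The active-compute claim above (8 small routed experts ≈ 1.3× one dense FFN) can be checked with config arithmetic. A back-of-envelope sketch counting SwiGLU's three matmuls, not a profiler measurement:

```python
# Config values quoted in the text above
hidden = 6144
dense_ffn = 12288       # dense FFN intermediate size
expert_ffn = 2048       # moe_intermediate_size
top_k = 8               # routed experts per token

# A SwiGLU FFN costs ~3 matmuls (gate, up, down), each hidden x intermediate
dense_flops = 3 * hidden * dense_ffn
routed_flops = top_k * 3 * hidden * expert_ffn
shared_flops = 3 * hidden * expert_ffn          # one always-on shared expert

print(routed_flops / dense_flops)                   # ~1.33, as claimed
print((routed_flops + shared_flops) / dense_flops)  # 1.5 including the shared expert
```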
GLM-5.1's predecessor inside transformers is glm4_moe_lite (the GLM-5 architecture, also MLA-based). Before that, GLM-4.5/4.6 used the simpler glm4_moe (standard GQA with QK-norm, inherited from Cohere/DeepSeek-V3). The deltas at each step:
- glm4_moe → glm4_moe_lite (GLM-5): switched from GQA to MLA (q_lora=768, kv_lora=512, head split into nope=192 + rope=64). Doubled context to ~200K. Switched routing to top-4 of 64 experts (lite) / top-8 of 256 (full). Adopted interleaved RoPE.
- glm4_moe_lite → glm_moe_dsa (GLM-5.1):
  - GlmMoeDsaIndexer: a small MLP (32 heads × 128 dim, separate from main attention) that produces a per-query top-2048 mask. The mask is added to the causal mask as a -inf/0 sparse pattern. Mainline attention is then computed only over those 2048 keys per query, even if the cache contains 200K. Borrowed directly from DeepSeek V3.2.
  - q_lora_rank grew from 768 → 2048. The query latent has to feed both the main attention and the new Indexer's wq_b, so it needs more capacity.
  - RoPE: glm4_moe_lite uses rope_interleave=True (DeepSeek V3 style: alternate even/odd pairs). GLM-5.1 explicitly removes that attribute (rope_interleave = AttributeError()) and uses split-half NeoX/Llama RoPE. The Indexer applies the same NeoX RoPE to its own decoupled q_pe/k_pe.
  - mlp_layer_types default went from ["dense"] + ["sparse"]*(L-1) to ["dense"]*3 + ["sparse"]*(L-3). The GLM-5.1 paper attributes this to improved router stability at trillion-param scale.
  - routed_scaling_factor raised from 1.8 → 2.5. The factor multiplies the post-normalization expert weights, inflating the contribution of routed experts relative to the residual stream. With deeper models and more experts, the per-expert weights shrink (one of 8 of 256 instead of one of 4 of 64), so the scaling factor compensates.
  - max_position_embeddings=202,752 (~200K).
  - indexer.weights_proj stays in plain bf16/fp32 because the reference implementation uses fp32 for it and the FP8 quantizer uses _keep_in_fp32_modules to preserve it.

| Parameter | GLM-5.1 (glm_moe_dsa) | GLM-5 (glm4_moe_lite defaults) | GLM-4.6 (glm4_moe defaults) | DeepSeek V3.2 |
|---|---|---|---|---|
| Total Params | ~754B | ~745B (flagship) | ~355B (flagship) | ~671B |
| Active Params | ~40B | ~32B | ~32B | ~37B |
| Context | 202,752 | 202,752 | 131,072 | 163,840 |
| Vocab | 154,880 | 154,880 | 151,552 | 129,280 |
| Hidden Size | 6,144 | 2,048 | 4,096 | 7,168 |
| Layers | 78 | 47 | 46 | 61 |
| Dense Layers | 3 | 1 | 1 | 3 |
| MoE Layers | 75 | 46 | 45 | 58 |
| Attention Type | MLA + DSA | MLA | GQA | MLA + DSA |
| Q Heads | 64 | 20 | 96 | 128 |
| KV Heads | 64 (MLA) | 20 (MLA) | 8 | 128 (MLA) |
| q_lora_rank | 2,048 | 768 | -- | 1,536 |
| kv_lora_rank | 512 | 512 | -- | 512 |
| qk_nope_head_dim | 192 | 192 | -- | 128 |
| qk_rope_head_dim | 64 | 64 | -- | 64 |
| v_head_dim | 256 | 256 | head_dim | 128 |
| qk_head_dim (total) | 256 | 256 | head_dim | 192 |
| DSA Indexer | yes | -- | -- | yes |
| index_topk | 2,048 | -- | -- | 2,048 |
| index_n_heads | 32 | -- | -- | 64 |
| index_head_dim | 128 | -- | -- | 128 |
| FFN Type | MoE + shared | MoE + shared | MoE + shared | MoE + shared |
| Dense FFN hidden | 12,288 | 10,240 | 10,944 | 18,432 |
| MoE expert hidden | 2,048 | 1,536 | 1,408 | 2,048 |
| Routed experts | 256 | 64 | 128 | 256 |
| Shared experts | 1 | 1 | 1 | 1 |
| Experts per token | 8 | 4 | 8 | 8 |
| Routed scaling factor | 2.5 | 1.8 | 1.0 | 2.5 |
| norm_topk_prob | True | True | True | True |
| e_score_correction_bias | yes | yes | yes | yes |
| Group routing | 1 group | 1 group | 1 group | 8 groups |
| Activation | SiLU | SiLU | SiLU | SiLU |
| RoPE style | NeoX/Llama (split-half) | interleaved | standard, partial=0.5 | interleaved |
| Norm | RMSNorm (eps=1e-5) | RMSNorm | RMSNorm | RMSNorm |
| QK Norm (in attn) | q_a / kv_a only | q_a / kv_a only | optional | q_a / kv_a only |
| Indexer k_norm | LayerNorm (eps=1e-6) | -- | -- | LayerNorm |
| Attention bias | False | False | False | False |
| Tie Embeddings | False | False | False | False |
| FP8 native | yes (bf16 escape: indexer.weights_proj) | -- | -- | yes |
Source: zai-org/GLM-5.1 model card and z.ai blog post. Comparison numbers are the best publicly available for each named generation; GLM-5.1 sets state-of-the-art on SWE-Bench Pro, CyberGym, and BrowseComp.
| Benchmark | GLM-5.1 | GLM-5 | GLM-4.6 | Claude 3.7 Sonnet (ref.) |
|---|---|---|---|---|
| SWE-Bench Pro (verified) | 58.4% (SOTA) | 53.2% | 40.1% | 54.7% |
| NL2Repo | 42.7 | 35.9 | -- | -- |
| Terminal-Bench 2.0 | 63.5 (66.5*) | 54.6 | -- | 62.3 |
| CyberGym | 68.7% (SOTA) | 52.1% | -- | -- |
| BrowseComp | 68.0% (SOTA) | 61.4% | -- | -- |
| AIME 2026 (no tools) | 95.3 | 91.0 | 78.4 | -- |
| GPQA Diamond | 86.2 | 83.4 | 75.7 | 84.8 |
| LiveCodeBench v6 | 75.4 | -- | -- | -- |
*Terminal-Bench 2.0 self-reported via Claude Code-style harness. The most striking improvement is the compounding of long-horizon agent loops: GLM-5.1's training emphasizes "hundreds of rounds" and "thousands of tool calls", which is reflected in SWE-Bench Pro and Terminal-Bench more than in single-shot reasoning tasks.
Computed from the public config (configuration_glm_moe_dsa.py): hidden=6144, q_lora=2048, kv_lora=512, qk_nope=192, qk_rope=64, v_dim=256, 64 heads, 256 routed experts (top-8), shared expert hidden=2048, dense FFN hidden=12288, indexer 32×128. RMSNorm/LayerNorm bias terms are negligible and excluded.
| Component | Params | Formula |
|---|---|---|
| q_a_proj | 12.58M | 6144 × 2048 |
| q_a_layernorm | 2.0K | 2048 (RMSNorm) |
| q_b_proj | 33.55M | 2048 × (64 × 256) |
| kv_a_proj_with_mqa | 3.54M | 6144 × (512 + 64) |
| kv_a_layernorm | 0.5K | 512 (RMSNorm) |
| kv_b_proj | 14.68M | 512 × (64 × (192+256)) |
| o_proj | 100.66M | (64 × 256) × 6144 |
| MLA subtotal | 165.0M | |
| indexer.wq_b | 8.39M | 2048 × (32 × 128) |
| indexer.wk | 0.79M | 6144 × 128 |
| indexer.k_norm | 0.3K | 128 (LayerNorm) |
| indexer.weights_proj | 0.20M | 6144 × 32 |
| Indexer subtotal | 9.4M | |
| Block total | ~174.4M | per layer × 78 = 13.6B |
| Component | Params | Formula |
|---|---|---|
| gate_proj | 75.50M | 6144 × 12288 |
| up_proj | 75.50M | 6144 × 12288 |
| down_proj | 75.50M | 12288 × 6144 |
| SwiGLU FFN subtotal | ~226.5M | 3 × 6144 × 12288 |
| input_layernorm | 6.1K | RMSNorm 6144 |
| post_attention_layernorm | 6.1K | RMSNorm 6144 |
| Dense layer total | ~400.9M | attn + dense FFN + 2 norms |
The dense FFN width is 6144 × 2 = 12288, narrower than the 4× convention. The first three layers ground the residual stream in general features before sparse routing kicks in.

| Component | Params | Formula |
|---|---|---|
| shared expert (1×) | 37.75M | 3 × 6144 × 2048 |
| routed experts (256×) | 9,663.7M | 256 × 3 × 6144 × 2048 |
| router.weight | 1.57M | 256 × 6144 |
| e_score_correction_bias | 0.3K | 256 (fp32 buffer) |
| FFN total capacity | ~9.70B | shared + 256 routed + router |
| FFN active (top-8 + shared) | ~341.3M | (8 + 1) × 37.75M + router |
| input_layernorm | 6.1K | RMSNorm 6144 |
| post_attention_layernorm | 6.1K | RMSNorm 6144 |
| MoE layer capacity | ~9.88B | attn + MoE + 2 norms |
| MoE layer active | ~515.7M | attn + (top-8+shared) |
| Component | Capacity | Active / token |
|---|---|---|
| embed_tokens | 951.6M | ~6.1K (1 row) |
| 3 dense layers | 1.20B | 1.20B |
| 75 MoE layers | 740.8B | 38.7B |
| final norm | 6.1K | 6.1K |
| lm_head (untied) | 951.6M | 951.6M |
| Total | ~743.9B* | ~40.8B |

*Config-derived sum; the headline figure is 754B.
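The totals in the tables can be reproduced from the config constants. A pure-Python sketch following the same formulas (norm weights omitted, so the sum lands slightly off the rounded table entries):

```python
hidden, vocab, n_layers = 6144, 154_880, 78

# MLA projections (per layer): q_a, q_b, kv_a(+mqa), kv_b, o
mla = (hidden * 2048 + 2048 * 64 * 256 + hidden * (512 + 64)
       + 512 * 64 * (192 + 256) + 64 * 256 * hidden)
# DSA indexer: wq_b, wk, weights_proj
indexer = 2048 * 32 * 128 + hidden * 128 + hidden * 32
attn = mla + indexer                                  # ~174.4M per layer

dense_ffn = 3 * hidden * 12288                        # SwiGLU: gate/up/down
moe_total = 257 * 3 * hidden * 2048 + 256 * hidden    # 256 routed + 1 shared + router
moe_active = 9 * 3 * hidden * 2048 + 256 * hidden     # top-8 + shared + router

embed = vocab * hidden                                # untied, counted twice below
total = 2 * embed + 3 * (attn + dense_ffn) + 75 * (attn + moe_total)
active = embed + hidden + 3 * (attn + dense_ffn) + 75 * (attn + moe_active)
#        ^ full lm_head + one embedding row

print(round(total / 1e9, 1), round(active / 1e9, 1))  # 743.9 40.8
```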
Estimates for the full 754B model. KV cache numbers reflect MLA's compressed cache: per token only the latent kv_lora_rank=512 + qk_rope_head_dim=64 = 576 elements per layer are actually needed for decode (the reference implementation can run from compressed). The current transformers implementation expands K/V before storing, so the figures below show both regimes.
| Precision | GLM-5.1 754B | GLM-5 ~745B | GLM-4.6 ~355B | DSV3.2 671B |
|---|---|---|---|---|
| BF16 | ~1,508 GB | ~1,490 GB | ~710 GB | ~1,342 GB |
| FP8 (native) | ~754 GB | ~745 GB | ~355 GB | ~671 GB |
| INT4 | ~377 GB | ~373 GB | ~178 GB | ~336 GB |
| Context | Expanded (HF default) | MLA compressed | Reduction |
|---|---|---|---|
| 4K | ~20.9 GB | ~0.37 GB | ~57× |
| 32K | ~167.5 GB | ~2.9 GB | ~57× |
| 128K | ~670 GB | ~11.8 GB | ~57× |
| 200K | ~1,036 GB | ~18.2 GB | ~57× |
Expanded: 78 layers × 64 heads × (256 + 256) elems × 2 bytes per token = ~5.11 MB/token. Compressed (latent only): 78 layers × (512 + 64) elems × 2 bytes = ~89.9 KB/token. The Indexer maintains its own ~20 KB/token side cache (78 layers × 128 dims × 2 bytes), negligible next to the main cache.
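The per-token figures follow directly from the formula above; a small script to reproduce them (bf16, 2 bytes/element):

```python
layers, heads, bytes_bf16 = 78, 64, 2

expanded = layers * heads * (256 + 256) * bytes_bf16   # K + V per head, bytes/token
compressed = layers * (512 + 64) * bytes_bf16          # kv latent + shared k_pe
indexer = layers * 128 * bytes_bf16                    # indexer side cache

print(expanded, compressed, indexer)        # 5111808 89856 19968 bytes/token
for ctx in (4_096, 32_768, 131_072, 202_752):
    print(ctx, round(expanded * ctx / 1e9, 1), round(compressed * ctx / 1e9, 2))
```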
| Scenario | Weights | +KV @128K | Total | Hardware |
|---|---|---|---|---|
| FP8, expanded KV | 754 GB | +670 GB | 1,424 GB | 18× H100 80GB or 11× H200 141GB |
| FP8, MLA compressed | 754 GB | +11.8 GB | 766 GB | 10× H100 80GB or 6× H200 141GB |
| INT4, MLA compressed | 377 GB | +5.7 GB | 383 GB | 5× H100 80GB or 3× H200 141GB |
| FP8 + DSA active KV (~2K) | 754 GB | +0.18 GB | 754 GB | 10× H100 80GB |
The DSA Indexer doesn't reduce KV-cache storage -- the cache still holds all 200K tokens. What it reduces is compute: the main attention only multiplies queries against ~2K selected keys per query position, regardless of context length. This makes DSA orthogonal to MLA: MLA compresses what is stored, DSA compresses what is read. Both are needed for practical 200K-context decode at 754B scale.
q_a_proj compresses hidden → 2048, then q_b_proj expands to 64 heads × 256 dim. kv_a_proj_with_mqa packs the kv latent (512) and the rope-only k stream (64) in a single matmul; kv_b_proj expands the latent into both K-nope (192) and V (256) per head. The cached state can be just the latents (~10× smaller than expanded GQA), though the HF transformers code currently stores expanded K/V for backend compatibility.
The Indexer's query path reuses MLA's q_resid latent through its own projection wq_b. Its key projection wk reads directly from hidden_states with its own LayerNorm (eps=1e-6, distinct from RMSNorm). Per-head weights from weights_proj (kept in fp32 even when the rest of the model is FP8) fuse the head scores via a sum.
DynamicCache is sized to exactly num_hidden_layers attention layers, leaving no room for the indexer's keys. So the indexer stores its own _cached_keys tensor as a plain attribute, concatenating along sequence dimension on each decode step. On prefill (when seq_len > 1) the cache is reset to avoid stale data. This is invisible to user code but a critical correctness detail.
The sparse mask is built by allocating a full -inf matrix of shape [B, S, T], then scatter_-ing zeros into the top-2048 positions selected by the indexer. The result is added to the regular causal mask. The main attention then runs a normal SDPA forward, and any non-top-k tokens contribute zero through the softmax. The indices=topk_indices kwarg is also forwarded to specialized flash-mla kernels (kernels-community/flash-mla) which can skip the masked positions entirely.
Each indexer score is relu(softmax_scale · q·k) · head_weight, summed over its 32 heads. The ReLU is the key non-linearity -- it lets negative dot products contribute exactly zero rather than dragging the score down, and matches what the FP8 reference kernel does. weights_proj produces per-token, per-head weights from hidden_states, scaled by n_heads^(-0.5).
GLM-5.1 drops interleaved RoPE (the attribute is deleted via rope_interleave = AttributeError() in the modular config) and adopts split-half RoPE: the head_dim is split in two halves and the second half is rotated against the first. This matches Llama, GPT-NeoX, and most other transformers models, simplifying interop.
Expert selection is biased by a learnable per-expert term e_score_correction_bias (kept in fp32, initialized to zero, updated heuristically during training). The bias only influences which experts are selected, not the post-selection weights. This eliminates auxiliary load-balancing losses and lets the model train without dropping any tokens.
The normalized top-8 weights are multiplied by routed_scaling_factor=2.5. The shared expert output is added without the routed scaling. Net effect: the residual stream sees roughly 8× expert capacity for routed work plus a constant baseline from the shared expert.
Group routing (n_group=1)
The router still runs the DeepSeek-V3 group-routing logic -- partition the 256 experts into n_group groups, select top-2 experts per group to compute group scores, pick the top topk_group groups, then top-k experts within selected groups. With n_group=1 and topk_group=1 (the GLM-5.1 default) this collapses to plain top-8 over all 256 experts. The machinery is preserved for forward compatibility.
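To see the collapse concretely, here is a pure-Python model of the group-routing procedure (the `route` function is an illustrative sketch, not the transformers code):

```python
import random

def route(scores, n_group, topk_group, top_k):
    """Illustrative DeepSeek-V3 group routing over a flat list of expert scores."""
    n = len(scores)
    size = n // n_group
    groups = [scores[g * size:(g + 1) * size] for g in range(n_group)]
    # group score = sum of the top-2 corrected scores inside the group
    group_scores = [sum(sorted(g, reverse=True)[:2]) for g in groups]
    kept = sorted(range(n_group), key=lambda i: group_scores[i], reverse=True)[:topk_group]
    allowed = [i for g in kept for i in range(g * size, (g + 1) * size)]
    return sorted(allowed, key=lambda i: scores[i], reverse=True)[:top_k]

random.seed(0)
scores = [random.random() for _ in range(256)]
grouped = route(scores, n_group=1, topk_group=1, top_k=8)   # GLM-5.1 defaults
plain = sorted(range(256), key=lambda i: scores[i], reverse=True)[:8]
print(grouped == plain)   # True -- n_group=1 collapses to plain top-8
```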
Layers 0-2 use a dense FFN with intermediate_size=12288; layers 3-77 are MoE. The dense prefix gives the early residual stream a stable, smooth feature space before sparse routing introduces gradient noise. mlp_layer_types = ["dense"]*3 + ["sparse"]*75 -- up from GLM-5's single dense layer.
Experts live in two stacked 3-D parameters, gate_up_proj[num_experts, 2*intermediate, hidden] and down_proj[num_experts, hidden, intermediate]. The forward dispatches each token to its top-k experts via a one-hot mask + Python loop over hit experts. This is slow but correct; production deployment uses fused kernels via the @use_experts_implementation decorator.
The one FP8 escape hatch is indexer.weights_proj, listed in _keep_in_fp32_modules. The reference implementation uses fp32 for it because the weights gate per-head contributions to a sparse top-k selection -- small precision errors there cascade into incorrect token selections. The model also keeps e_score_correction_bias in fp32 via _keep_in_fp32_modules_strict.
tie_word_embeddings=False. The 951M-parameter input embedding and 951M-parameter lm_head are independent matrices. With a 154,880 vocab and 6,144 hidden, each is 6.144 × 154.88 ≈ 951M parameters -- ~1.9B total just for token-IO.
max_position_embeddings=202752, no YARN, no NTK, no length extrapolation tricks. The context is supported natively because (a) MLA's compressed cache makes long sequences memory-tractable, and (b) DSA's top-2048 selection makes them compute-tractable. Without DSA the attention compute would be quadratic in sequence length even with MLA.
modular_glm_moe_dsa.py
The implementation lives in a "modular" file -- a transformers convention where the new model imports symbols from related models and only overrides what differs. GlmMoeDsaConfig inherits from Glm4MoeLiteConfig; the decoder layer inherits from Glm4MoeLiteDecoderLayer; the model body inherits from Glm4MoeModel. Only the attention class, the indexer, and the config additions are new code. The full modeling_glm_moe_dsa.py is auto-generated from this modular file by CI.
MLA is a 2024 invention by DeepSeek that compresses keys and values into a low-rank latent before per-head expansion. GLM-5.1 inherits the design directly from glm4_moe_lite and through it from DeepSeek V3. Unlike GQA -- which reduces the number of KV heads -- MLA reduces the rank of the K/V projections, then uses a small upcasting matrix (kv_b_proj) to recover full per-head representations on demand. This decouples cache size from head count and lets GLM-5.1 keep all 64 heads "full" while caching only ~580 elements per token per layer.
The query path is x → q_a_proj → q_a_layernorm → q_b_proj. The q_resid intermediate (after the layer-norm) is reused by the DSA indexer below, so MLA's query LoRA is shared between main attention and the sparsity selector. Output is reshaped to [B, H, S, qk_head_dim] = [B, 64, S, 256], then split into nope (192) + rope (64).
kv_a_proj_with_mqa produces [B, S, 576] -- the concatenation of the 512-dim KV latent and the 64-dim shared K-rope stream. The latent is normed by kv_a_layernorm then expanded by kv_b_proj into [B, S, 64 × (192+256)], which is then split into K-nope and V. The K-rope stream stays as a single shared head across all 64 query heads, broadcast at the dot-product step.
Specialized kernels (kernels-community/flash-mla) consume the latent directly.
| Aspect | GLM-4.6 GQA (96/8 heads, d=128) | GLM-5.1 MLA (64 heads, d=256) |
|---|---|---|
| Q params | ~50.3M (4096×96×128) | ~46.1M (q_a + q_b LoRA path) |
| K params | ~4.2M (4096×8×128) | ~3.5M + 14.7M (kv_a + kv_b shared with V) |
| V params | ~4.2M | (shared with K via kv_b_proj) |
| O params | ~50.3M | ~100.7M |
| Cache (per token) | 2,048 elems | 576 elems (latent) / 32,768 (expanded) |
| Compute (decode) | 96 × (1+T)×128 dot products | 64 × (1+T)×256 dot products |
The DSA Indexer is the single biggest novelty separating GLM-5.1 from GLM-5. Borrowed directly from DeepSeek V3.2, it's a small parallel network -- only ~9.4M params per layer -- that scores every past token against the current query and selects the top 2,048 to actually attend to. The main attention then runs on a sparse mask: queries see only the indexer-selected keys, everything else is -inf. This makes per-token attention compute roughly constant beyond ~2K context.
The indexer takes MLA's q_resid as input, avoiding a redundant linear, and reads keys from raw hidden_states through its own wk.
The per-head scores fuse as

score[s,t] = Σ_h weights[s,h] · ReLU(softmax_scale · q[s,h]·k[t])

where weights[s,h] comes from a separate weights_proj linear (kept in fp32). The ReLU is critical -- it prevents negative dot products from polluting positive ones during the per-head sum, and matches the FP8 reference kernel's behavior.
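A toy pure-Python version of this scoring rule (the `index_score` helper is illustrative; the real shapes are [B,S,32,128] queries against [B,T,128] shared keys):

```python
def index_score(q, k, weights, softmax_scale):
    """score[t] = sum_h weights[h] * relu(scale * q[h].k[t]) -- one query position."""
    out = []
    for kt in k:                       # one shared key vector per cached token
        s = 0.0
        for h, qh in enumerate(q):     # per-head query vectors
            dot = softmax_scale * sum(a * b for a, b in zip(qh, kt))
            s += weights[h] * max(dot, 0.0)   # ReLU: negative dots add exactly 0
        out.append(s)
    return out

# two heads, two cached tokens, 2-dim toy vectors
scores = index_score(q=[[1.0, 0.0], [0.0, 1.0]],
                     k=[[1.0, -1.0], [-1.0, -1.0]],
                     weights=[0.5, 0.5], softmax_scale=1.0)
print(scores)   # [0.5, 0.0] -- the all-negative key scores zero, not negative
```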
Given topk_indices of shape [B, S, 2048], the main attention mask is built by:
index_mask = full(-inf, [B, S, T])
index_mask.scatter_(-1, topk_indices, 0.0)
combined = index_mask + causal_mask

DynamicCache is sized to exactly num_hidden_layers attention layers, with no slot for the indexer. So the indexer maintains its own _cached_keys tensor as a regular Python attribute. On prefill (seq_len > 1) it resets the cache; on decode it concatenates new keys along the sequence dimension. The cached state is only the indexer's small 128-dim post-norm keys -- ~256 bytes per token per layer in bf16, ~4 GB across all 78 layers at 200K context -- small next to the main KV cache.
def forward(hidden_states, q_resid, position_embeddings, mask, use_cache):  # GlmMoeDsaIndexer.forward
cos, sin = position_embeddings
# === Queries (reuse MLA's q_resid latent) ===
q = wq_b(q_resid).view(B, S, 32, 128)
q_pe, q_nope = split(q, [64, 64], dim=-1)
q_pe = apply_rotary_pos_emb(q_pe, cos, sin, unsqueeze_dim=2)
q = cat([q_pe, q_nope], dim=-1) # [B, S, 32, 128]
# === Keys (own projection from raw hidden_states) ===
k = k_norm(wk(hidden_states)) # LayerNorm, eps=1e-6
k_pe, k_nope = split(k, [64, 64], dim=-1)
k_pe = apply_rotary_pos_emb(k_pe.unsqueeze(2), cos, sin, dim=2).squeeze(2)
k = cat([k_pe, k_nope], dim=-1) # [B, S, 128]
# === Indexer's own KV cache (NOT in DynamicCache) ===
if seq_len > 1:
self._cached_keys = None # reset on prefill
if use_cache:
k_cached = cat([self._cached_keys, k], dim=1) if self._cached_keys is not None else k
self._cached_keys = k_cached
else:
k_cached = k
# === Score (FP32 in critical path) ===
weights = weights_proj(hidden_states).float() * (32**-0.5) # [B, S, 32]
scores = einsum("bshd,btd->bsht", q.float(), k_cached.float()) * (128**-0.5)
scores = F.relu(scores)
index_scores = einsum("bsht,bsh->bst", scores, weights) # [B, S, T]
if mask is not None:
index_scores = index_scores + mask # apply causal
return index_scores.topk(min(2048, T), dim=-1).indices # [B, S, 2048]
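Downstream, the returned indices become the 0/-inf mask. A list-based sketch of the scatter step (`dsa_mask` is a hypothetical helper, equivalent in spirit to the scatter_ call described earlier):

```python
NEG_INF = float("-inf")

def dsa_mask(topk_indices, total_len):
    """List-based equivalent of full(-inf).scatter_(topk, 0): 0 at kept keys."""
    mask = [[NEG_INF] * total_len for _ in topk_indices]
    for s, kept in enumerate(topk_indices):
        for t in kept:
            mask[s][t] = 0.0
    return mask

mask = dsa_mask([[0, 2], [1, 3]], total_len=4)
print(mask[0])   # [0.0, -inf, 0.0, -inf]
```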
GLM-5.1 inherits its routing mechanism from DeepSeek V3 via the GLM-4 lineage. There are two distinguishing choices: (a) the per-expert score is a sigmoid probability, not a softmax, and (b) selection (but not weighting) is biased by a learnable per-expert bias e_score_correction_bias. The bias is updated by an external balancing rule during training -- not by gradient descent -- so the model trains without any auxiliary load-balancing loss. This is the "auxiliary-loss-free" balancing of DeepSeek V3.
The router computes router_logits = hidden_states @ router.weight.T in fp32, then applies sigmoid(). Each expert independently produces a score in [0,1], so multiple experts can be "highly relevant" without competing for normalization mass like softmax would force.
e_score_correction_bias (initialized to zero, updated heuristically during training) is added to the sigmoid scores -- but only for the purpose of choosing which experts to fire. Underused experts get bumped up; overused experts get pushed down. The selection bias prevents expert collapse without any gradient on the bias.
The corrected scores are partitioned into n_group groups; each group's score is the sum of its top-2 corrected scores; the top topk_group groups are kept; everything outside those groups is masked out; finally topk(k=8) picks the active experts. For GLM-5.1 n_group=topk_group=1, so this collapses to plain top-8 over all 256 experts. The machinery is preserved for compatibility.
Final routing weights are gathered from the raw sigmoid scores (router_logits.sigmoid()), not the corrected score. So the bias only changes which experts fire, never how much they contribute. After top-k, the weights are normalized to sum to 1 then multiplied by routed_scaling_factor=2.5.
The layer output is routed_output + shared_output -- the shared expert is added without the routing scale, so it behaves like a constant bias path that the routed pool augments.
def forward(hidden_states):
    residuals = hidden_states
    flat_x = hidden_states.view(-1, hidden_size)          # flatten [B,S,d] -> [B*S,d]
    router_logits = self.gate(flat_x)                     # [B*S, 256], fp32
    topk_idx, topk_w = self.route_tokens_to_experts(router_logits)
    routed = self.experts(flat_x, topk_idx, topk_w).view_as(hidden_states)
    shared = self.shared_experts(residuals)               # always-on
    return routed + shared
def route_tokens_to_experts(router_logits):
# 1) Sigmoid (NOT softmax)
router_logits = router_logits.sigmoid() # [N, 256]
# 2) Bias-corrected scores for SELECTION ONLY
scores_for_choice = router_logits + self.gate.e_score_correction_bias # [N, 256]
# 3) Group routing (degenerate at n_group=1)
group_scores = (
scores_for_choice.view(N, 1, 256).topk(2, dim=-1)[0].sum(dim=-1)
) # [N, 1]
group_idx = group_scores.topk(k=1, dim=-1, sorted=False)[1]
group_mask = zeros_like(group_scores)
group_mask.scatter_(1, group_idx, 1) # all groups kept
score_mask = group_mask.unsqueeze(-1).expand(-1, 1, 256).reshape(-1, 256)
masked_scores = scores_for_choice.masked_fill(~score_mask.bool(), 0.0)
# 4) Top-k (k=8)
topk_indices = masked_scores.topk(k=8, dim=-1, sorted=False)[1] # [N, 8]
# 5) Re-gather UNCORRECTED sigmoid weights (bias is for selection only!)
topk_weights = router_logits.gather(1, topk_indices) # [N, 8]
# 6) Normalize then scale
topk_weights = topk_weights / (topk_weights.sum(dim=-1, keepdim=True) + 1e-20)
topk_weights = topk_weights * 2.5 # routed_scaling_factor
return topk_indices, topk_weights
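A worked toy example of the selection-vs-weighting split (pure Python; the `glm_route` helper is illustrative, with top-2 of 4 experts instead of top-8 of 256):

```python
def glm_route(sig_scores, bias, top_k=2, scaling=2.5):
    """Selection uses bias-corrected scores; weights come from the raw sigmoids."""
    corrected = [s + b for s, b in zip(sig_scores, bias)]
    idx = sorted(range(len(sig_scores)), key=lambda i: corrected[i], reverse=True)[:top_k]
    raw = [sig_scores[i] for i in idx]              # NOT the corrected scores
    total = sum(raw) + 1e-20
    return idx, [r / total * scaling for r in raw]

scores = [0.9, 0.8, 0.7, 0.1]                       # post-sigmoid expert scores
idx0, w0 = glm_route(scores, bias=[0.0] * 4)
print(idx0)   # [0, 1] -- plain top-2 with zero bias
idx1, w1 = glm_route(scores, bias=[-0.5, 0.0, 0.0, 0.0])
print(idx1)   # [1, 2] -- bias pushes the overused expert 0 out of selection
print(w1)     # weights still from raw 0.8/0.7, normalized then scaled by 2.5
```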
Verified against transformers/models/glm_moe_dsa. Code quotes are exact, with comments preserved. Files: configuration_glm_moe_dsa.py, modular_glm_moe_dsa.py, modeling_glm_moe_dsa.py.
q_a_layernorm / kv_a_layernorm use GlmMoeDsaRMSNorm (RMS, with learned weight). The DSA Indexer's k_norm is a standard nn.LayerNorm with eps=1e-6 -- a deliberate departure to match the DeepSeek V3.2 reference.

class GlmMoeDsaRMSNorm(nn.Module):
def __init__(self, hidden_size, eps: float = 1e-6) -> None:
super().__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
self.variance_epsilon = eps
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
input_dtype = hidden_states.dtype
hidden_states = hidden_states.to(torch.float32)
variance = hidden_states.pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
return self.weight * hidden_states.to(input_dtype)
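For intuition, the same computation on a plain Python list -- note that RMSNorm divides by the root-mean-square without subtracting the mean (a pure-Python sketch, not the module above):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Pure-Python RMSNorm for one vector: scale to unit RMS, no mean-centering."""
    variance = sum(v * v for v in x) / len(x)       # mean of squares
    inv = 1.0 / math.sqrt(variance + eps)
    return [w * v * inv for w, v in zip(weight, x)]

out = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
rms = math.sqrt(sum(v * v for v in out) / len(out))
print(round(rms, 6))   # 1.0 (up to eps) -- the output has unit root-mean-square
```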
Rotation reuses llama.modeling_llama.rotate_half, which splits the head_dim in two halves and rotates the second half against the first via cos/sin. The function takes a single tensor (not a pair) so it can be applied to q_pe and k_pe independently. The unsqueeze_dim argument is 1 for BHSD layout (main attention) and 2 for BSHD layout (indexer).

def apply_rotary_pos_emb(x, cos, sin, unsqueeze_dim: int = 1) -> torch.Tensor:
"""
This is the transformers equivalent of DeepSeek V3.2's `apply_rotary_emb(x, freqs_cis, interleaved)`.
Instead of using complex-number `freqs_cis`, we use pre-split `(cos, sin)` tensors from RotaryEmbedding.
"""
cos = cos.unsqueeze(unsqueeze_dim)
sin = sin.unsqueeze(unsqueeze_dim)
# Split-half (NeoX/Llama style): (x[:d/2], x[d/2:])
return (x * cos) + (rotate_half(x) * sin)
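A self-contained pure-Python rendering of split-half RoPE (illustrative; the real code vectorizes over [B,H,S,64] and takes cos/sin from the rotary embedding module):

```python
import math

def rotate_half(x):
    """Split-half: [x1 | x2] -> [-x2 | x1]."""
    h = len(x) // 2
    return [-v for v in x[h:]] + list(x[:h])

def apply_rope(x, pos, theta=10000.0):
    """Split-half (NeoX/Llama) RoPE on a single head vector of even dim."""
    d = len(x)
    inv_freq = [theta ** (-2.0 * i / d) for i in range(d // 2)]
    cos = [math.cos(pos * f) for f in inv_freq] * 2   # same angle for both halves
    sin = [math.sin(pos * f) for f in inv_freq] * 2
    r = rotate_half(x)
    return [x[i] * cos[i] + r[i] * sin[i] for i in range(d)]

print(apply_rope([1.0, 2.0, 3.0, 4.0], pos=0))   # identity at position 0
out = apply_rope([1.0, 2.0, 3.0, 4.0], pos=7)
print(round(sum(v * v for v in out), 6))         # 30.0 -- rotations preserve norm
```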
The config defines attribute_map = {"head_dim": "qk_rope_head_dim"}, so when the rotary embedding asks config.head_dim it actually receives qk_rope_head_dim=64. Combined with the optional partial_rotary_factor from rope_parameters, this lets the RoPE generator emit cos/sin tensors for exactly the rotary subspace, not the full 256-dim head.

class GlmMoeDsaRotaryEmbedding(nn.Module):
@staticmethod
def compute_default_rope_parameters(config, device=None, seq_len=None):
base = config.rope_parameters["rope_theta"]
partial_rotary_factor = config.rope_parameters.get("partial_rotary_factor", 1.0)
head_dim = getattr(config, "head_dim", None) or config.hidden_size // config.num_attention_heads
dim = int(head_dim * partial_rotary_factor)
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.int64).to(...) / dim))
return inv_freq, 1.0
Queries go through the q_a/q_b LoRA path; KV is split into a 512-dim latent + 64-dim shared k_pe stream; the latent is normed and expanded by kv_b_proj; the k_pe stream is RoPE'd separately and broadcast across all 64 heads. The Indexer is invoked on q_resid (the q_a output) and the raw hidden_states; its output (top-k indices) becomes a sparse mask combined with the causal mask.

# ===== Query path =====
q_resid = self.q_a_layernorm(self.q_a_proj(hidden_states)) # [B,S,2048]
query_states = self.q_b_proj(q_resid)
query_states = query_states.view(B, S, -1, self.qk_head_dim).transpose(1,2)
q_nope, q_pe = torch.split(query_states,
[self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1)
q_pe = apply_rotary_pos_emb(q_pe, cos, sin, unsqueeze_dim=1)
# ===== KV path =====
compressed_kv = self.kv_a_proj_with_mqa(hidden_states) # [B,S,576]
k_compressed, k_pe = torch.split(compressed_kv,
[self.kv_lora_rank, self.qk_rope_head_dim], dim=-1)
k_compressed = self.kv_a_layernorm(k_compressed) # [B,S,512]
kv_expanded = self.kv_b_proj(k_compressed) # [B,S,64*448]
kv_expanded = kv_expanded.view(B, S, -1, self.qk_nope_head_dim + self.v_head_dim)
k_nope, value_states = torch.split(kv_expanded,
[self.qk_nope_head_dim, self.v_head_dim], dim=-1)
k_nope = k_nope.transpose(1,2)
value_states = value_states.transpose(1,2)
# RoPE on the single shared k_pe stream, then broadcast across heads
k_pe = k_pe.view(B, 1, S, self.qk_rope_head_dim)
k_pe = apply_rotary_pos_emb(k_pe, cos, sin, unsqueeze_dim=1)
k_pe = k_pe.expand(-1, k_nope.shape[1], -1, -1) # [B, 64, S, 64]
query_states = torch.cat([q_nope, q_pe], dim=-1) # [B,64,S,256]
key_states = torch.cat([k_nope, k_pe], dim=-1) # [B,64,S,256]
if past_key_values is not None:
key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx)
# ===== Indexer (DSA sparse mask) =====
indexer_mask = ... # broadcast attention_mask to [B,S,T]
topk_indices = self.indexer(
hidden_states, q_resid, position_embeddings, indexer_mask,
use_cache=past_key_values is not None,
) # [B, S, 2048]
# Build combined DSA + causal mask: -inf except top-k
index_mask = torch.full((B, S, T), float("-inf"), ...)
index_mask.scatter_(-1, topk_indices, 0.0)
index_mask = index_mask.unsqueeze(1) # [B,1,S,T]
combined_mask = index_mask + causal_mask
attn_output, attn_weights = attention_interface(
self, query_states, key_states, value_states, combined_mask,
scaling=self.scaling, indices=topk_indices, **kwargs)
attn_output = self.o_proj(attn_output.reshape(B, S, -1))
Given [B, H, S, 256] Q and K and [B, H, S, 256] V, the attention computation itself is plain Llama-style SDPA: scaled dot product, mask add, softmax in fp32, dropout, value matmul. repeat_kv is a no-op for GLM-5.1 because num_attention_heads == num_key_value_heads == 64.

def eager_attention_forward(module, query, key, value, attention_mask, scaling, dropout=0.0, **kwargs):
key_states = repeat_kv(key, module.num_key_value_groups) # no-op (groups=1)
value_states = repeat_kv(value, module.num_key_value_groups)
attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
if attention_mask is not None:
attn_weights = attn_weights + attention_mask # ← combined DSA+causal here
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
attn_output = torch.matmul(attn_weights, value_states)
return attn_output.transpose(1, 2).contiguous(), attn_weights
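The reason masked positions vanish exactly: adding -inf before the softmax drives their exponentials to zero. A minimal pure-Python check:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 3.0, 0.5]
dsa_mask = [0.0, float("-inf"), 0.0, float("-inf")]   # indexer kept keys 0 and 2
probs = softmax([l + mk for l, mk in zip(logits, dsa_mask)])
print(probs[1], probs[3])   # 0.0 0.0 -- masked keys contribute exactly nothing
```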
GlmMoeDsaMLP is instantiated twice: (a) as the dense FFN of the first three layers (intermediate_size=12288); (b) as the shared expert inside GlmMoeDsaMoE with intermediate_size = moe_intermediate_size * n_shared_experts = 2048. No biases.

class GlmMoeDsaMLP(nn.Module):
def __init__(self, config, intermediate_size=None):
super().__init__()
self.hidden_size = config.hidden_size
self.intermediate_size = config.intermediate_size if intermediate_size is None else intermediate_size
self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
self.act_fn = ACT2FN[config.hidden_act] # silu
def forward(self, x):
return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
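The SwiGLU forward on plain lists, for reference (illustrative `matvec` helper; the real projections are 6144-wide nn.Linear layers):

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))        # x * sigmoid(x)

def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def mlp(x, gate_w, up_w, down_w):
    """List-based SwiGLU forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(v) for v in matvec(gate_w, x)]
    up = matvec(up_w, x)
    return matvec(down_w, [g * u for g, u in zip(gate, up)])

eye = [[1.0, 0.0], [0.0, 1.0]]             # identity weights make the math visible
out = mlp([1.0, 2.0], eye, eye, eye)
print(out)   # [silu(1)*1, silu(2)*2]
```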
The router is a single bias-free linear from hidden_size to n_routed_experts, computed in fp32 even when the rest of the model is in bf16/FP8. The e_score_correction_bias buffer (also fp32, listed in _keep_in_fp32_modules_strict) is the auxiliary-loss-free balancing knob from DeepSeek V3.

class GlmMoeDsaTopkRouter(nn.Module):
def __init__(self, config: GlmMoeDsaConfig):
super().__init__()
self.config = config
self.top_k = config.num_experts_per_tok # 8
self.n_routed_experts = config.n_routed_experts # 256
self.routed_scaling_factor = config.routed_scaling_factor # 2.5
self.n_group = config.n_group # 1
self.topk_group = config.topk_group # 1
self.norm_topk_prob = config.norm_topk_prob # True
self.weight = nn.Parameter(torch.empty((self.n_routed_experts, config.hidden_size)))
self.register_buffer("e_score_correction_bias",
torch.zeros((self.n_routed_experts), dtype=torch.float32))
def forward(self, hidden_states):
hidden_states = hidden_states.view(-1, self.config.hidden_size)
router_logits = F.linear(hidden_states.type(torch.float32),
self.weight.type(torch.float32))
return router_logits
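Note that the router module only produces raw logits; the actual expert selection happens in route_tokens_to_experts (called from GlmMoeDsaMoE below, not shown here). The following pure-Python helper is an illustrative sketch of the DeepSeek-V3-style bias-corrected selection the config values imply, not the module's actual code: the correction bias influences only *which* experts are picked, while the output weights come from the uncorrected sigmoid scores.

```python
import math

def route_tokens_to_experts_sketch(logits, bias, top_k=8, scaling=2.5):
    """Hypothetical sketch of auxiliary-loss-free top-k routing for ONE token.

    logits: per-expert router logits (list of floats)
    bias:   e_score_correction_bias -- added only for SELECTION,
            never used in the output weights.
    """
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]   # sigmoid gates
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i], reverse=True)
    picked = ranked[:top_k]
    denom = sum(scores[i] for i in picked)                  # norm_topk_prob=True
    weights = {i: scaling * scores[i] / denom for i in picked}
    return picked, weights

# Toy example: 4 experts, top-2. The bias on expert 2 flips the selection.
picked, weights = route_tokens_to_experts_sketch(
    [2.0, 1.0, 0.5, -1.0], bias=[0.0, 0.0, 0.6, 0.0], top_k=2, scaling=1.0)
```

Without the bias, experts 0 and 1 would win; the +0.6 correction promotes the underused expert 2 instead, which is exactly how the balancer steers load without an auxiliary loss term.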
gate_up_proj[256, 4096, 6144] (gate and up packed) and down_proj[256, 6144, 2048]. The forward dispatches each token to its top-k experts via a one-hot mask plus a Python loop over the experts that were hit -- correct but slow. Production deployments swap this path out through the @use_experts_implementation hook for fused MoE kernels (e.g. SGLang, vLLM).

@use_experts_implementation
class GlmMoeDsaNaiveMoe(nn.Module):
def __init__(self, config):
super().__init__()
self.num_experts = config.num_local_experts # 256
self.hidden_dim = config.hidden_size # 6144
self.intermediate_dim = config.moe_intermediate_size # 2048
self.gate_up_proj = nn.Parameter(torch.empty(self.num_experts,
2 * self.intermediate_dim, self.hidden_dim))
self.down_proj = nn.Parameter(torch.empty(self.num_experts,
self.hidden_dim, self.intermediate_dim))
self.act_fn = ACT2FN[config.hidden_act]
def forward(self, hidden_states, top_k_index, top_k_weights):
final_hidden_states = torch.zeros_like(hidden_states)
with torch.no_grad():
expert_mask = F.one_hot(top_k_index, num_classes=self.num_experts).permute(2,1,0)
expert_hit = torch.greater(expert_mask.sum(dim=(-1,-2)), 0).nonzero()
for expert_idx in expert_hit:
expert_idx = expert_idx[0]
top_k_pos, token_idx = torch.where(expert_mask[expert_idx])
current_state = hidden_states[token_idx]
gate, up = F.linear(current_state, self.gate_up_proj[expert_idx]).chunk(2, dim=-1)
current_hidden_states = self.act_fn(gate) * up
current_hidden_states = F.linear(current_hidden_states, self.down_proj[expert_idx])
current_hidden_states = current_hidden_states * top_k_weights[token_idx, top_k_pos, None]
final_hidden_states.index_add_(0, token_idx, current_hidden_states.to(...))
return final_hidden_states
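The dispatch semantics above can be checked with a toy pure-Python model (scalar "experts", hypothetical helper names): looping over the hit experts and scatter-adding weighted outputs, as GlmMoeDsaNaiveMoe does, is equivalent to the more obvious per-token loop.

```python
# Toy sketch: expert-major dispatch (what the naive MoE does) vs. token-major.
def moe_by_expert(tokens, top_k_index, top_k_weights, experts):
    out = [0.0] * len(tokens)
    for e in range(len(experts)):                  # loop over hit experts
        for t, choices in enumerate(top_k_index):
            if e in choices:                       # tokens routed to expert e
                pos = choices.index(e)
                out[t] += top_k_weights[t][pos] * experts[e](tokens[t])
    return out

def moe_by_token(tokens, top_k_index, top_k_weights, experts):
    return [sum(w * experts[e](x) for e, w in zip(idx, ws))
            for x, idx, ws in zip(tokens, top_k_index, top_k_weights)]

experts = [lambda x, k=k: (k + 1) * x for k in range(4)]  # expert k scales by k+1
tokens = [1.0, 2.0]
idx = [[0, 2], [1, 3]]          # top-2 routing per token
w = [[0.6, 0.4], [0.5, 0.5]]    # normalized routing weights
```

The expert-major order is what makes the real implementation batch all tokens hitting the same expert into one matmul, at the cost of the Python loop.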
residuals = hidden_states here is the input after post_attention_layernorm).

class GlmMoeDsaMoE(nn.Module):
def __init__(self, config):
super().__init__()
self.experts = GlmMoeDsaNaiveMoe(config)
self.gate = GlmMoeDsaTopkRouter(config)
self.shared_experts = GlmMoeDsaMLP(
config=config,
intermediate_size=config.moe_intermediate_size * config.n_shared_experts, # 2048*1
)
def forward(self, hidden_states):
residuals = hidden_states
orig_shape = hidden_states.shape
router_logits = self.gate(hidden_states)
topk_indices, topk_weights = self.route_tokens_to_experts(router_logits)
hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
hidden_states = self.experts(hidden_states, topk_indices, topk_weights).view(*orig_shape)
hidden_states = hidden_states + self.shared_experts(residuals)
return hidden_states
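The compute arithmetic behind the overview's "~1.3x" claim falls out of the widths quoted above: FFN matmul cost scales linearly with intermediate width, and each routed expert is 2048-wide versus the 12288-wide dense FFN.

```python
# Per-token FFN compute of the MoE block relative to one dense FFN
# (gate/up/down matmul FLOPs all scale with intermediate width).
dense_width = 12288           # intermediate_size of the 3 dense layers
expert_width = 2048           # moe_intermediate_size

routed = 8 * expert_width / dense_width         # 8 routed experts -> ~1.33x
total = (8 + 1) * expert_width / dense_width    # + 1 shared expert -> 1.5x
print(f"routed only: {routed:.2f}x dense; with shared expert: {total:.2f}x")
```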
config.mlp_layer_types[layer_idx]. Inherits from Glm4MoeLiteDecoderLayer; the only thing GLM-5.1 changes is which attention class is instantiated (GlmMoeDsaAttention with the embedded indexer).

class GlmMoeDsaDecoderLayer(GradientCheckpointingLayer):
def __init__(self, config, layer_idx):
super().__init__()
self.hidden_size = config.hidden_size
self.self_attn = GlmMoeDsaAttention(config, layer_idx)
if config.mlp_layer_types[layer_idx] == "sparse":
self.mlp = GlmMoeDsaMoE(config)
else:
self.mlp = GlmMoeDsaMLP(config)
self.input_layernorm = GlmMoeDsaRMSNorm(config.hidden_size, config.rms_norm_eps)
self.post_attention_layernorm = GlmMoeDsaRMSNorm(config.hidden_size, config.rms_norm_eps)
def forward(self, hidden_states, attention_mask=None, position_ids=None,
past_key_values=None, use_cache=False, position_embeddings=None, **kwargs):
residual = hidden_states
hidden_states = self.input_layernorm(hidden_states)
hidden_states, _ = self.self_attn(
hidden_states=hidden_states, attention_mask=attention_mask,
position_ids=position_ids, past_key_values=past_key_values,
use_cache=use_cache, position_embeddings=position_embeddings, **kwargs)
hidden_states = residual + hidden_states
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
return hidden_states
* sqrt(d) scaling). Builds 78 layers, runs them in sequence, applies a final RMSNorm. Cache management uses transformers' standard DynamicCache -- the indexer's separate cache lives inside each attention module, invisible to the model body.

class GlmMoeDsaModel(GlmMoeDsaPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
self.layers = nn.ModuleList(
[GlmMoeDsaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
)
self.norm = GlmMoeDsaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.rotary_emb = GlmMoeDsaRotaryEmbedding(config=config)
self.gradient_checkpointing = False
self.post_init()
def forward(self, input_ids=None, attention_mask=None, position_ids=None,
past_key_values=None, inputs_embeds=None, use_cache=None, **kwargs):
if inputs_embeds is None:
inputs_embeds = self.embed_tokens(input_ids)
if use_cache and past_key_values is None:
past_key_values = DynamicCache(config=self.config)
...
causal_mask = create_causal_mask(...)
hidden_states = inputs_embeds
position_embeddings = self.rotary_emb(hidden_states, position_ids=position_ids)
for decoder_layer in self.layers[:self.config.num_hidden_layers]:
hidden_states = decoder_layer(
hidden_states,
attention_mask=causal_mask,
position_embeddings=position_embeddings,
position_ids=position_ids,
past_key_values=past_key_values,
use_cache=use_cache, **kwargs)
hidden_states = self.norm(hidden_states)
return BaseModelOutputWithPast(last_hidden_state=hidden_states,
past_key_values=past_key_values)
hidden_size to vocab_size, no bias, no logit soft-capping (unlike Gemma 4, which clamps logits with tanh(x/30)*30). The lm_head is not tied to the input embedding -- both are full 6144 × 154880 matrices, together contributing ~1.9B parameters.

class GlmMoeDsaForCausalLM(GlmMoeDsaPreTrainedModel, GenerationMixin):
_tied_weights_keys = {"lm_head.weight": "model.embed_tokens.weight"} # placeholder, tie_word_embeddings=False
def __init__(self, config):
super().__init__(config)
self.model = GlmMoeDsaModel(config)
self.vocab_size = config.vocab_size
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
self.post_init()
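The ~1.9B figure for the untied embedding pair is straightforward arithmetic on the dimensions quoted above:

```python
# Untied input embedding + output head: two full hidden x vocab matrices.
hidden_size, vocab_size = 6144, 154880
one_matrix = hidden_size * vocab_size   # embed_tokens OR lm_head
both = 2 * one_matrix                   # untied -> both are stored in full
print(f"{both / 1e9:.2f}B params across embed_tokens + lm_head")
```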
indexer.weights_proj is preserved in fp32 (the per-head weighting that gates the top-k selection), and e_score_correction_bias is preserved in fp32 (the routing balancer). Flash attention is disabled -- the only flash backend supported is the dedicated kernels-community/flash-mla kernel, which understands the latent format and the top-k indices. SDPA is supported as a generic fallback.

class GlmMoeDsaPreTrainedModel(PreTrainedModel):
config: GlmMoeDsaConfig
base_model_prefix = "model"
supports_gradient_checkpointing = True
_no_split_modules = ["GlmMoeDsaDecoderLayer"]
_supports_flash_attn = False # flash-mla kernels need a bit more work...
_supports_sdpa = True
_supports_flex_attn = False
_can_compile_fullgraph = True
_supports_attention_backend = True
# FP8 quantization uses _keep_in_fp32_modules to decide what NOT to convert
_keep_in_fp32_modules = ["indexer.weights_proj"]
_keep_in_fp32_modules_strict = ["e_score_correction_bias"]
_keys_to_ignore_on_load_unexpected = [r"model\.layers\.78.*"]
_compatible_flash_implementations = ["kernels-community/flash-mla"]
RotaryEmbedding reads config.head_dim; without intervention this would be the full 256-dim head. But GLM-5.1 only wants RoPE applied to the 64-dim rope subspace. The attribute_map = {"head_dim": "qk_rope_head_dim"} at config-class level rewrites every external read of config.head_dim to return config.qk_rope_head_dim instead, so the RoPE generator emits cos/sin tensors of length 64. The same trick is used in DeepSeek V3.

attribute_map = {
"num_local_experts": "n_routed_experts", # for MoE generic interfaces
"head_dim": "qk_rope_head_dim", # for RotaryEmbedding -- only the rope subspace
}
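The indirection itself is just attribute-name rewriting at lookup time. The toy class below illustrates the idea; it is not the actual transformers PretrainedConfig machinery, only a minimal sketch of the same behavior:

```python
class ConfigSketch:
    """Toy illustration of attribute_map indirection (NOT the real
    transformers implementation): reads of a mapped name are redirected."""
    attribute_map = {"head_dim": "qk_rope_head_dim"}

    def __init__(self):
        self.qk_rope_head_dim = 64    # the rope subspace
        self.qk_nope_head_dim = 192   # full head is 192 + 64 = 256

    def __getattribute__(self, name):
        amap = object.__getattribute__(self, "attribute_map")
        if name in amap:
            name = amap[name]         # head_dim -> qk_rope_head_dim
        return object.__getattribute__(self, name)

cfg = ConfigSketch()
print(cfg.head_dim)   # the RoPE generator sees 64, not 256
```

Any generic code that reads config.head_dim (like the shared RotaryEmbedding) transparently gets the 64-dim rope width, while MLA-aware code can still read qk_nope_head_dim and qk_rope_head_dim directly.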
"mla_kv_a_proj" shard type for kv_a_proj_with_mqa -- it has to be sharded carefully because the output is a concatenation of the 512-dim KV latent and the 64-dim k_pe stream, which need different replication strategies. The router is not sharded (each TP rank computes the full 256-expert score). The experts use a custom "moe_tp_experts" plan that distributes the 256 experts across ranks.base_model_tp_plan = {
"layers.*.self_attn.q_b_proj": "colwise",
"layers.*.self_attn.kv_a_proj_with_mqa": "mla_kv_a_proj",
"layers.*.self_attn.kv_b_proj": "colwise",
"layers.*.self_attn.o_proj": "rowwise",
"layers.*.mlp.experts.gate_up_proj": "packed_colwise",
"layers.*.mlp.experts.down_proj": "rowwise",
"layers.*.mlp.experts": "moe_tp_experts",
"layers.*.mlp.shared_experts.gate_proj": "colwise",
"layers.*.mlp.shared_experts.up_proj": "colwise",
"layers.*.mlp.shared_experts.down_proj": "rowwise",
"layers.*.mlp.gate_proj": "colwise", # dense layers
"layers.*.mlp.up_proj": "colwise",
"layers.*.mlp.down_proj": "rowwise",
}
__post_init__ if the user did not provide it: the first min(3, num_layers) layers are dense, the rest are sparse. This is the difference between GLM-5.1 (3 dense) and GLM-5/lite (1 dense). The list is checked at decoder-layer instantiation time.

def __post_init__(self, **kwargs):
self.qk_head_dim = self.qk_nope_head_dim + self.qk_rope_head_dim # 192 + 64 = 256
# MLP layer types: first 3 dense, rest sparse
if self.mlp_layer_types is None:
self.mlp_layer_types = (
["dense"] * min(3, self.num_hidden_layers)
+ ["sparse"] * (self.num_hidden_layers - 3)
)
super().__post_init__(**kwargs)
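For the 78-layer flagship config, this default works out as follows (restating the arithmetic above in runnable form):

```python
# What __post_init__ computes for the flagship config (78 layers):
num_hidden_layers = 78
mlp_layer_types = (["dense"] * min(3, num_hidden_layers)
                   + ["sparse"] * (num_hidden_layers - 3))
qk_head_dim = 192 + 64   # qk_nope_head_dim + qk_rope_head_dim

print(mlp_layer_types.count("dense"),    # 3 dense layers
      mlp_layer_types.count("sparse"),   # 75 sparse layers
      qk_head_dim)                       # 256-dim full head
```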
topk_indices tensor is forwarded as a kwarg. flash-mla can then skip the gather/scatter mask machinery entirely: it indexes directly into the cached K/V tensors via the top-k indices, multiplies only against those keys, and returns the attention output. The non-flash code path has to materialize the full -inf mask; the optimized kernel doesn't.

attn_output, attn_weights = attention_interface(
self,
query_states,
key_states,
value_states,
combined_mask,
dropout=0.0 if not self.training else self.attention_dropout,
scaling=self.scaling,
indices=topk_indices, # flash_mla_with_kvcache reads this
**kwargs,
)
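The reason both paths agree can be shown in a few lines of pure Python: softmax over only the gathered top-k scores equals softmax over all scores with the non-selected positions masked to -inf.

```python
import math

def softmax(xs):
    # Numerically stable softmax that tolerates -inf entries.
    m = max(x for x in xs if x != float("-inf"))
    exps = [0.0 if x == float("-inf") else math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

scores = [1.2, -0.3, 0.7, 2.1, 0.0]   # toy indexer scores for 5 cached keys
topk = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:3]

# Path 1 (non-flash): materialize a -inf mask over the non-selected keys.
masked = [s if i in topk else float("-inf") for i, s in enumerate(scores)]
dense_weights = softmax(masked)

# Path 2 (flash-mla-style): gather only the selected keys, softmax those.
gathered = softmax([scores[i] for i in topk])
sparse_weights = {i: w for i, w in zip(topk, gathered)}
```

The sparse path never touches the masked-out keys at all, which is where the length-independent decode cost comes from.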
seq_len > 1) the cache must be reset -- otherwise stale keys from a previous request would pollute the top-k selection. The check is a single if seq_len > 1: self._cached_keys = None at the top of indexer.forward.

# Reset cache on prefill (new prompt) to avoid stale keys / batch-size mismatch
if seq_len > 1:
self._cached_keys = None
if use_cache:
if self._cached_keys is not None:
k_cached = torch.cat([self._cached_keys, k], dim=1) # [B, T, D]
else:
k_cached = k
self._cached_keys = k_cached
else:
k_cached = k
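The lifecycle this implements -- prefill resets, decode appends -- can be modeled with a toy class (lists standing in for key tensors; an illustrative sketch, not the module's actual code):

```python
class IndexerKeyCacheSketch:
    """Toy model of the indexer's side cache: a prefill (seq_len > 1)
    resets before caching; decode steps (seq_len == 1) append."""
    def __init__(self):
        self._cached_keys = None

    def forward(self, keys, use_cache=True):   # keys: one entry per position
        if len(keys) > 1:                      # prefill: drop stale state
            self._cached_keys = None
        if not use_cache:
            return keys
        self._cached_keys = (self._cached_keys or []) + keys
        return self._cached_keys

idx = IndexerKeyCacheSketch()
idx.forward(["p0", "p1", "p2"])     # prefill of a 3-token prompt
idx.forward(["d0"])                 # decode step appends
full = idx.forward(["d1"])          # prompt + both decoded tokens
reset = idx.forward(["q0", "q1"])   # new prompt: stale keys were dropped
```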
r"model\.layers\.78.*" tells from_pretrained to silently skip those keys. Identical mechanism to GLM-4.6 which uses r"model\.layers\.46.*" for the same reason._keys_to_ignore_on_load_unexpected = [r"model\.layers\.78.*"]
q_a_proj + q_a_layernorm, it reuses MLA's q_resid -- the output of the main attention's q_a_layernorm(q_a_proj(hidden_states)). The indexer only owns wq_b (a single 2048→4096 linear) for its query side. This saves ~12.6M params per layer (~1B model-wide) and guarantees the indexer sees the same latent representation as the main attention.

# In GlmMoeDsaAttention.forward:
q_resid = self.q_a_layernorm(self.q_a_proj(hidden_states)) # [B, S, q_lora_rank]
query_states = self.q_b_proj(q_resid)
...
# Same q_resid is forwarded to the indexer:
topk_indices = self.indexer(
hidden_states,
q_resid, # ← reused!
position_embeddings,
indexer_mask,
use_cache=past_key_values is not None,
)
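The savings figure is just the cost of the q_a_proj the indexer would otherwise need, scaled by the layer count:

```python
# Parameter cost of the down-projection the indexer does NOT duplicate.
hidden_size, q_lora_rank, num_layers = 6144, 2048, 78
per_layer = hidden_size * q_lora_rank          # hidden -> q_lora_rank, no bias
model_wide = per_layer * num_layers
print(f"{per_layer / 1e6:.1f}M per layer, {model_wide / 1e9:.2f}B model-wide")
```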