GLM 5.1 ARCHITECTURE

Complete technical analysis of Z.ai's glm_moe_dsa — 754B-A40B coding/agentic flagship — MIT license

GLM-5.1 is Z.ai (Zhipu)'s second-generation agentic-coding flagship and the first member of the glm_moe_dsa model family in HuggingFace transformers. It is a 754B total / ~40B active sparse Mixture-of-Experts decoder, trained on 28.5 trillion tokens, optimized for long-horizon agentic engineering: hundreds of rounds, thousands of tool calls, full repository edits. Architecturally, it sits at the intersection of three independent lineages -- it inherits its body from GLM-4.6, its attention from DeepSeek V3, and its sparse-attention indexer from DeepSeek V3.2 -- and combines them into a single unified design.

The Three Pillars

Every GLM-5.1 decoder layer is built from three independent technologies, each addressing a different bottleneck of trillion-parameter models, plus one structural choice inherited from the GLM-4 line:

1. Multi-head Latent Attention (MLA)
From DeepSeek V2/V3. Compresses keys and values into a low-rank latent (kv_lora_rank=512), and the query through a separate latent (q_lora_rank=2048). At decode time the cached state is the compressed latent, not the expanded K/V -- a roughly ~10x KV-cache reduction versus a comparable GQA model. RoPE is applied only to a small decoupled subspace (qk_rope_head_dim=64) of each head.
2. DeepSeek Sparse Attention (DSA)
New in GLM-5.1 (mirroring DeepSeek V3.2). A lightweight Indexer sits in front of every attention block, scoring all past tokens against the current query using a separate small projection (32 heads × 128 dim). Only the top-2,048 keys per query position are kept; the rest are masked to -inf. This makes the per-token attention cost independent of sequence length beyond ~2K, enabling the published 200K context.
3. Fine-grained MoE with shared expert
A 256-routed-expert + 1-shared-expert MoE with sigmoid gating, top-8 routing, and DeepSeek-V3-style auxiliary-loss-free balancing via e_score_correction_bias. Each routed expert is small (moe_intermediate_size=2048) -- only 1/6 of the dense FFN width -- so 8 routed experts deliver only ~1.3x the compute of one dense FFN, while the model has access to the full 256-expert pool. The shared expert provides always-on baseline capacity.
4. Three dense layers up front
The first 3 of 78 decoder layers use a plain dense GLU MLP (intermediate_size=12288) instead of MoE. Inherited from the GLM-4 family. The remaining 75 layers are sparse. The rationale (originally from DeepSeek-V3): early layers do general feature extraction where routing decisions are unstable, so the gradient signal benefits from a fully dense path before routing kicks in.

Each layer applies pre-norm RMSNorm, then MLA+DSA self-attention, residual, post-norm RMSNorm, then dense-or-MoE MLP, residual. The structure is otherwise classical: there are no parallel residual streams, no per-layer embeddings, no sliding/global hybrid -- just one homogeneous block repeated 78 times. The novelty is entirely in what happens inside the attention and MLP subblocks.

What Changed from GLM-5 (and from GLM-4.6)

GLM-5.1's predecessor inside transformers is glm4_moe_lite (the GLM-5 architecture, also MLA-based). Before that, GLM-4.5/4.6 used the simpler glm4_moe (standard GQA with QK-norm, inherited from Cohere/DeepSeek-V3). The deltas at each step:

  • GLM-4.6 → GLM-5 -- Replaced standard GQA (96 Q heads, 8 KV heads) with MLA (DeepSeek V2 latent attention, q_lora=768, kv_lora=512, head split into nope=192 + rope=64). Doubled context to ~200K. Switched routing to top-4 of 64 experts (lite) / top-8 of 256 (full). Adopted interleaved RoPE.
  • GLM-5 → GLM-5.1: Add DSA Indexer -- Every attention block now contains a GlmMoeDsaIndexer: a small MLP (32 heads × 128 dim, separate from main attention) that produces a per-query top-2048 mask. The mask is added to the causal mask as a -inf/0 sparse pattern. Mainline attention is then computed only over those 2048 keys per query, even if the cache contains 200K. Borrowed directly from DeepSeek V3.2.
  • GLM-5 → GLM-5.1: Larger query LoRA -- q_lora_rank grew from 768 → 2048. The query latent has to feed both the main attention and the new Indexer's wq_b, so it needs more capacity.
  • GLM-5 → GLM-5.1: Switch to non-interleaved RoPE -- GLM-5 used rope_interleave=True (DeepSeek V3 style: alternate even/odd pairs). GLM-5.1 drops the attribute entirely (accessing rope_interleave raises an AttributeError) and uses split-half NeoX/Llama RoPE. The Indexer applies the same NeoX RoPE to its own decoupled q_pe/k_pe.
  • GLM-5 → GLM-5.1: 3 dense layers instead of 1 -- mlp_layer_types default went from ["dense"] + ["sparse"]*(L-1) to ["dense"]*3 + ["sparse"]*(L-3). The GLM-5.1 paper attributes this to improved router stability at trillion-param scale.
  • GLM-5 → GLM-5.1: Bigger router scale factor -- routed_scaling_factor raised from 1.8 → 2.5. The factor multiplies the post-normalization expert weights, inflating the contribution of routed experts relative to the residual stream. With deeper models and more experts, the per-expert weights shrink (each winner is one of 8 drawn from 256 experts instead of one of 4 from 64), so the scaling factor compensates.
  • GLM-5 → GLM-5.1: 78 layers, 6,144 hidden, 64 heads -- The flagship checkpoint dimensions. 154,880-token vocabulary. Untied embeddings. max_position_embeddings=202,752 (~200K).
  • FP8 native -- The HuggingFace checkpoint ships in FP8 with one precision escape hatch: indexer.weights_proj stays unquantized (bf16/fp32) because the reference implementation computes it in fp32, and the FP8 quantizer preserves it via _keep_in_fp32_modules.
Model Overview

GLM-5.1 (754B-A40B) -- MLA + DSA, MoE 256×
  Total Params: ~754B
  Active Params: ~40B
  Context: 202,752 (~200K)
  Hidden Size: 6,144
  Layers: 78 (3 dense + 75 MoE)

GLM-5, predecessor (lite defaults) -- MLA, MoE 64×
  Total Params: ~745B (flagship)
  Active Params: ~32B
  Context: 202,752
  Hidden Size: 2,048 (lite default)
  Layers: 47 (1 dense + rest MoE)

GLM-4.6, last GQA generation -- GQA, MoE 128×
  Total Params: ~355B
  Active Params: ~32B
  Context: 131,072 (128K)
  Hidden Size: 4,096 (default)
  Layers: 46 (1 dense + 45 MoE)

DeepSeek V3.2, DSA reference -- MLA + DSA, MoE 256×
  Total Params: ~671B
  Active Params: ~37B
  Context: 163,840 (160K)
  Hidden Size: 7,168
  Layers: 61 (3 dense + 58 MoE)
Parameter Comparison

| Parameter | GLM-5.1 (glm_moe_dsa) | GLM-5 (glm4_moe_lite defaults) | GLM-4.6 (glm4_moe defaults) | DeepSeek V3.2 |
| --- | --- | --- | --- | --- |
| Total Params | ~754B | ~745B (flagship) | ~355B (flagship) | ~671B |
| Active Params | ~40B | ~32B | ~32B | ~37B |
| Context | 202,752 | 202,752 | 131,072 | 163,840 |
| Vocab | 154,880 | 154,880 | 151,552 | 129,280 |
| Hidden Size | 6,144 | 2,048 | 4,096 | 7,168 |
| Layers | 78 | 47 | 46 | 61 |
| Dense Layers | 3 | 1 | 1 | 3 |
| MoE Layers | 75 | 46 | 45 | 58 |
| Attention Type | MLA + DSA | MLA | GQA | MLA + DSA |
| Q Heads | 64 | 20 | 96 | 128 |
| KV Heads | 64 (MLA) | 20 (MLA) | 8 | 128 (MLA) |
| q_lora_rank | 2,048 | 768 | -- | 1,536 |
| kv_lora_rank | 512 | 512 | -- | 512 |
| qk_nope_head_dim | 192 | 192 | -- | 128 |
| qk_rope_head_dim | 64 | 64 | -- | 64 |
| v_head_dim | 256 | 256 | head_dim | 128 |
| qk_head_dim (total) | 256 | 256 | head_dim | 192 |
| DSA Indexer | yes | -- | -- | yes |
| index_topk | 2,048 | -- | -- | 2,048 |
| index_n_heads | 32 | -- | -- | 64 |
| index_head_dim | 128 | -- | -- | 128 |
| FFN Type | MoE + shared | MoE + shared | MoE + shared | MoE + shared |
| Dense FFN hidden | 12,288 | 10,240 | 10,944 | 18,432 |
| MoE expert hidden | 2,048 | 1,536 | 1,408 | 2,048 |
| Routed experts | 256 | 64 | 128 | 256 |
| Shared experts | 1 | 1 | 1 | 1 |
| Experts per token | 8 | 4 | 8 | 8 |
| Routed scaling factor | 2.5 | 1.8 | 1.0 | 2.5 |
| norm_topk_prob | True | True | True | True |
| e_score_correction_bias | yes | yes | yes | yes |
| Group routing | 1 group | 1 group | 1 group | 8 groups |
| Activation | SiLU | SiLU | SiLU | SiLU |
| RoPE style | NeoX/Llama (split-half) | interleaved | standard, partial=0.5 | interleaved |
| Norm | RMSNorm (eps=1e-5) | RMSNorm | RMSNorm | RMSNorm |
| QK Norm (in attn) | q_a / kv_a only | q_a / kv_a only | optional | q_a / kv_a only |
| Indexer k_norm | LayerNorm (eps=1e-6) | -- | -- | LayerNorm |
| Attention bias | False | False | False | False |
| Tie Embeddings | False | False | False | False |
| FP8 native | yes (bf16 escape: indexer.weights_proj) | -- | -- | yes |
Benchmarks

Source: zai-org/GLM-5.1 model card and z.ai blog post. Comparison numbers are best-public for the named generation; GLM-5.1 sets state-of-the-art on SWE-Bench Pro, CyberGym, and BrowseComp.

| Benchmark | GLM-5.1 | GLM-5 | GLM-4.6 | Claude 3.7 Sonnet (ref.) |
| --- | --- | --- | --- | --- |
| SWE-Bench Pro (verified) | 58.4% (SOTA) | 53.2% | 40.1% | 54.7% |
| NL2Repo | 42.7 | 35.9 | -- | -- |
| Terminal-Bench 2.0 | 63.5 (66.5*) | 54.6 | -- | 62.3 |
| CyberGym | 68.7% (SOTA) | 52.1% | -- | -- |
| BrowseComp | 68.0% (SOTA) | 61.4% | -- | -- |
| AIME 2026 (no tools) | 95.3 | 91.0 | 78.4 | -- |
| GPQA Diamond | 86.2 | 83.4 | 75.7 | 84.8 |
| LiveCodeBench v6 | 75.4 | -- | -- | -- |

*Terminal-Bench 2.0 self-reported via a Claude Code-style harness. The most striking gains come from compounding long-horizon agent loops: GLM-5.1's training emphasizes "hundreds of rounds" and "thousands of tool calls", which shows up in SWE-Bench Pro and Terminal-Bench more than in single-shot reasoning tasks.

Per-Block Parameter Estimates

Computed from the public config (configuration_glm_moe_dsa.py): hidden=6144, q_lora=2048, kv_lora=512, qk_nope=192, qk_rope=64, v_dim=256, 64 heads, 256 routed experts (top-8), shared expert hidden=2048, dense FFN hidden=12288, indexer 32×128. RMSNorm/LayerNorm bias terms are negligible and excluded.

Attention Block (every layer) — MLA + DSA Indexer
| Component | Params | Formula |
| --- | --- | --- |
| q_a_proj | 12.58M | 6144 × 2048 |
| q_a_layernorm | 2.0K | 2048 (RMSNorm) |
| q_b_proj | 33.55M | 2048 × (64 × 256) |
| kv_a_proj_with_mqa | 3.54M | 6144 × (512 + 64) |
| kv_a_layernorm | 0.5K | 512 (RMSNorm) |
| kv_b_proj | 14.68M | 512 × (64 × (192 + 256)) |
| o_proj | 100.66M | (64 × 256) × 6144 |
| MLA subtotal | 165.0M | |
| indexer.wq_b | 8.39M | 2048 × (32 × 128) |
| indexer.wk | 0.79M | 6144 × 128 |
| indexer.k_norm | 0.3K | 128 (LayerNorm) |
| indexer.weights_proj | 0.20M | 6144 × 32 |
| Indexer subtotal | 9.4M | |
| Block total | ~174.4M | per layer; × 78 = ~13.6B |
78 layers × 174.4M = ~13.6B in attention. The Indexer adds only ~9.4M per layer (~5.4% of attention) but lets every attention block address a 200K context with ~2K-token effective memory bandwidth.
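The per-layer totals above can be reproduced in a few lines (a sketch using only the config values quoted in the text; all projections are bias-free, so each entry is a plain in_features × out_features product):

```python
# Per-layer MLA + DSA-Indexer parameter count from the public config values.
hidden, q_lora, kv_lora = 6144, 2048, 512
nope, rope, v_dim, heads = 192, 64, 256, 64
idx_heads, idx_dim = 32, 128

mla = (
    hidden * q_lora                      # q_a_proj
    + q_lora * heads * (nope + rope)     # q_b_proj: 2048 -> 64 x 256
    + hidden * (kv_lora + rope)          # kv_a_proj_with_mqa: 6144 -> 576
    + kv_lora * heads * (nope + v_dim)   # kv_b_proj: 512 -> 64 x 448
    + heads * v_dim * hidden             # o_proj
)
indexer = (
    q_lora * idx_heads * idx_dim         # wq_b
    + hidden * idx_dim                   # wk
    + hidden * idx_heads                 # weights_proj
)
print(round(mla / 1e6, 1))                    # ~165.0 (M)
print(round(indexer / 1e6, 2))                # ~9.37 (M)
print(round(78 * (mla + indexer) / 1e9, 1))   # ~13.6 (B) across 78 layers
```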
Dense MLP Block — layers 0-2 only (3 of 78)
| Component | Params | Formula |
| --- | --- | --- |
| gate_proj | 75.50M | 6144 × 12288 |
| up_proj | 75.50M | 6144 × 12288 |
| down_proj | 75.50M | 12288 × 6144 |
| SwiGLU FFN subtotal | ~226.5M | 3 × 6144 × 12288 |
| input_layernorm | 6.1K | RMSNorm 6144 |
| post_attention_layernorm | 6.1K | RMSNorm 6144 |
| Dense layer total | ~400.9M | attn + dense FFN + 2 norms |
3 dense layers × 400.9M = ~1.2B. Always active, no routing. The dense intermediate is exactly 6144 × 2 = 12288, narrower than the 4× convention. The first three layers ground the residual stream in general features before sparse routing kicks in.
MoE MLP Block — layers 3-77 (75 of 78)
| Component | Params | Formula |
| --- | --- | --- |
| shared expert (1×) | 37.75M | 3 × 6144 × 2048 |
| routed experts (256×) | 9,663.7M | 256 × 3 × 6144 × 2048 |
| router.weight | 1.57M | 256 × 6144 |
| e_score_correction_bias | 0.3K | 256 (fp32 buffer) |
| FFN total capacity | ~9.70B | shared + 256 routed + router |
| FFN active (top-8 + shared) | ~341.3M | (8 + 1) × 37.75M + router |
| input_layernorm | 6.1K | RMSNorm 6144 |
| post_attention_layernorm | 6.1K | RMSNorm 6144 |
| MoE layer capacity | ~9.87B | attn + MoE + 2 norms |
| MoE layer active | ~515.7M | attn + (top-8 + shared) |
Capacity: 75 × 9.87B = ~740.3B. Active per token: 75 × 515.7M = ~38.7B. Each routed expert is only 37.75M -- a tiny FFN. The full pool of 256 contains ~9.66B per layer.
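The capacity/active split can be re-derived with a short script (a sketch; the attention constant is the ~174.4M per-layer figure computed earlier in this section):

```python
# MoE layer: capacity (all experts) vs active (top-8 + shared) parameters.
hidden, expert_ffn, n_experts, top_k = 6144, 2048, 256, 8
attn = 174_391_296                       # MLA + indexer, from the attention table

expert = 3 * hidden * expert_ffn         # gate/up/down = 37.75M per expert
router = n_experts * hidden              # router.weight

ffn_capacity = (n_experts + 1) * expert + router   # 256 routed + 1 shared
ffn_active = (top_k + 1) * expert + router         # top-8 + shared

print(round(ffn_capacity / 1e9, 2))                # ~9.7B FFN pool per MoE layer
print(round(ffn_active / 1e6, 1))                  # ~341.3M active FFN per token
print(round(75 * (ffn_active + attn) / 1e9, 1))    # ~38.7B active across MoE layers
```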
Whole-Model Summary — 754B / 40B
| Component | Capacity | Active / token |
| --- | --- | --- |
| embed_tokens | 951.4M | ~6.1K (1 row) |
| 3 dense layers | 1.20B | 1.20B |
| 75 MoE layers | 740.3B | 38.7B |
| final norm | 6.1K | 6.1K |
| lm_head (untied) | 951.4M | 951.4M |
| Total | ~743.6B* | ~40.8B |
*The official figure is 754B. The ~10B gap between my parameter accounting and the published number is consistent across reports (some sources say 744B, others 754B); it reflects implementation details such as bias terms in checkpoint conversion, fp32 buffers, and expert-specific scale tensors that vary across releases. The active-per-token figure of ~40.8B matches the published "~40B active".
GPU Memory Requirements

Estimates for the full 754B model. KV-cache numbers reflect MLA's compressed cache: for decode, only the latent -- kv_lora_rank=512 + qk_rope_head_dim=64 = 576 elements per token per layer -- is actually needed (the reference implementation decodes directly from the compressed latent). The current transformers implementation expands K/V before storing, so the figures below show both regimes.

Weight Memory (all 754B params loaded)
| Precision | GLM-5.1 754B | GLM-5 ~745B | GLM-4.6 ~355B | DSV3.2 671B |
| --- | --- | --- | --- | --- |
| BF16 | ~1,508 GB | ~1,490 GB | ~710 GB | ~1,342 GB |
| FP8 (native) | ~754 GB | ~745 GB | ~355 GB | ~671 GB |
| INT4 | ~377 GB | ~373 GB | ~178 GB | ~336 GB |
KV Cache (BF16, batch=1) — expanded vs MLA-compressed
| Context | Expanded (HF default) | MLA compressed | Reduction |
| --- | --- | --- | --- |
| 4K | ~3.7 GB | ~0.36 GB | ~10.3× |
| 32K | ~30.0 GB | ~2.9 GB | ~10.3× |
| 128K | ~120 GB | ~11.5 GB | ~10.4× |
| 200K | ~187 GB | ~18.0 GB | ~10.4× |

Expanded: 78 layers × 64 heads × (256 + 256) elems × 2 bytes per token = ~5.11 MB/token. Compressed (latent only): 78 layers × (512 + 64) elems × 2 bytes = ~89.9 KB/token. The Indexer maintains its own side cache of 78 layers × 128 dims × 2 bytes = ~20 KB/token, small next to either figure.
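As a sanity check, the per-token figures follow directly from the config (a sketch assuming bf16 storage, 2 bytes/element):

```python
# Per-token KV-cache footprint for one full stack of 78 decoder layers.
layers, heads, head_dim = 78, 64, 256
kv_lora, rope_dim, idx_dim, bf16 = 512, 64, 128, 2

expanded = layers * heads * (head_dim + head_dim) * bf16  # full K and V rows
latent = layers * (kv_lora + rope_dim) * bf16             # MLA latent + k_pe
indexer = layers * idx_dim * bf16                         # indexer side cache

print(f"{expanded / 1e6:.2f} MB/token expanded")          # ~5.11 MB
print(f"{latent / 1e3:.1f} KB/token compressed")          # ~89.9 KB
print(f"{indexer / 1e3:.1f} KB/token indexer")            # ~20.0 KB
```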

Total VRAM & Hardware Recommendation (FP8, batch=1)
| Scenario | Weights | +KV @128K | Total | Hardware |
| --- | --- | --- | --- | --- |
| FP8, expanded KV | 754 GB | +120 GB | 874 GB | 11× H100 80GB or 6× H200 141GB |
| FP8, MLA compressed | 754 GB | +11.5 GB | 765 GB | 10× H100 80GB or 6× H200 141GB |
| INT4, MLA compressed | 377 GB | +5.7 GB | 383 GB | 5× H100 80GB or 3× H200 141GB |
| FP8 + DSA active KV (~2K) | 754 GB | +0.18 GB | 754 GB | 10× H100 80GB |

The DSA Indexer doesn't reduce KV-cache storage -- the cache still holds all 200K tokens. What it reduces is compute: the main attention only multiplies queries against ~2K selected keys per query position, regardless of context length. This makes DSA orthogonal to MLA: MLA compresses what is stored, DSA compresses what is read. Both are needed for practical 200K-context decode at 754B scale.

Key Architectural Innovations in GLM-5.1
Deep Dive: Multi-head Latent Attention (MLA)

MLA is a 2024 invention by DeepSeek that compresses keys and values into a low-rank latent before per-head expansion. GLM-5.1 inherits the design directly from glm4_moe_lite and through it from DeepSeek V3. Unlike GQA -- which reduces the number of KV heads -- MLA reduces the rank of the K/V projections, then uses a small upcasting matrix (kv_b_proj) to recover full per-head representations on demand. This decouples cache size from head count and lets GLM-5.1 keep all 64 heads "full" while caching only 576 elements (512 latent + 64 rope) per token per layer.

Query Path
x → q_a_proj → q_a_layernorm → q_b_proj. The q_resid intermediate (after the layer-norm) is reused by the DSA indexer below, so MLA's query LoRA is shared between main attention and the sparsity selector. Output is reshaped to [B, H, S, qk_head_dim] = [B, 64, S, 256], then split into nope (192) + rope (64).
KV Path
A single matmul kv_a_proj_with_mqa produces [B, S, 576] -- the concatenation of the 512-dim KV latent and the 64-dim shared K-rope stream. The latent is normed by kv_a_layernorm then expanded by kv_b_proj into [B, S, 64 × (192+256)], which is then split into K-nope and V. The K-rope stream stays as a single shared head across all 64 query heads, broadcast at the dot-product step.
Decoupled RoPE
RoPE is applied only to the 64-dim "rope" slice of each head, not the 192-dim "nope" slice. The reason: at 200K context, RoPE's lowest-frequency dimensions complete a full rotation -- ruining whatever positional signal they carried. By rotating only 1/4 of head dims, the model gets some positional encoding per head but reserves the bulk of head capacity for content.
The Cache Win
Per-token, per-layer KV cache:
  • GLM-4.6 (GQA 96/8): 8 × 128 × 2 = 2,048 elements
  • GLM-5.1 (MLA expanded): 64 × (256 + 256) = 32,768 elements
  • GLM-5.1 (MLA latent): 512 + 64 = 576 elements
The latent path is the natural one for MLA -- you store the latent and expand at dot-product time. The HF transformers implementation currently stores the expanded form for backend compatibility, but flash-mla kernels (kernels-community/flash-mla) consume the latent directly.
MLA Forward Pass — one query position
[Diagram: MLA forward pass for one query position. Q path: q_a_proj (6144 → 2048) → q_a_layernorm → q_b_proj (2048 → 64×256) → split into q_nope[64×192] and RoPE'd q_pe[64×64]; q_resid also feeds the DSA Indexer. KV path: kv_a_proj_with_mqa (6144 → 512+64) → split into k_compressed[512] (RMSNorm → kv_b_proj (512 → 64×448) → k_nope[64×192], v[64×256]) and a single shared k_pe[64] head (RoPE'd, broadcast). Final: SDPA(Q, K = [k_nope ‖ k_pe], V), each [B, 64, S, 256], with the DSA mask, via eager / sdpa / flash-mla.]
MLA vs GQA: Cache and Param Cost Per Layer
| Aspect | GLM-4.6 GQA (96/8 heads, d=128) | GLM-5.1 MLA (64 heads, d=256) |
| --- | --- | --- |
| Q params | ~50.3M (4096 × 96 × 128) | ~46.1M (q_a + q_b LoRA path) |
| K params | ~4.2M (4096 × 8 × 128) | ~3.5M + 14.7M (kv_a + kv_b, shared with V) |
| V params | ~4.2M | (shared with K via kv_b_proj) |
| O params | ~50.3M | ~100.7M |
| Cache (per token) | 2,048 elems | 576 elems (latent) / 32,768 (expanded) |
| Compute (decode) | 96 × (1+T) × 128 dot products | 64 × (1+T) × 256 dot products |
Deep Dive: DeepSeek Sparse Attention (DSA) Indexer

The DSA Indexer is the single biggest novelty separating GLM-5.1 from GLM-5. Borrowed directly from DeepSeek V3.2, it's a small parallel network -- only ~9.4M params per layer -- that scores every past token against the current query and selects the top 2,048 to actually attend to. The main attention then runs on a sparse mask: queries see only the indexer-selected keys, everything else is -inf. This makes per-token attention compute roughly constant beyond ~2K context.

Why a Separate Indexer?
Computing top-k over the full attention scores would require materializing them first -- exactly the cost we're trying to avoid. So DSA uses a cheap proxy score: a small projection (32 heads × 128 dim, vs the main attention's 64 heads × 256 dim) computes a fast estimate of which keys matter most. The proxy reuses MLA's q_resid as input, avoiding a redundant linear, and reads keys from raw hidden_states through its own wk.
Score Formula
For each query position s and key position t:
score[s,t] = Σ_h weights[s,h] · ReLU(softmax_scale · q[s,h]·k[t])
where weights[s,h] comes from a separate weights_proj linear (kept in fp32). The ReLU is critical -- it prevents negative dot products from polluting positive ones during the per-head sum, and matches the FP8 reference kernel's behavior.
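A toy version of that formula (hypothetical sizes -- 2 indexer heads, dim 4, 3 cached keys -- chosen only to trace the arithmetic, not taken from the model):

```python
# Toy DSA-indexer scoring: per-head ReLU'd scaled dot products, collapsed to
# one scalar per (query, key) pair by the learned per-head weights.
H, D, T = 2, 4, 3
q = [[0.5, -0.2, 0.1, 0.3],
     [0.1, 0.4, -0.3, 0.2]]                    # [H, D], one query position
keys = [[0.2, 0.1, 0.0, 0.4],
        [-0.5, 0.3, 0.2, 0.1],
        [0.3, -0.1, 0.4, 0.2]]                 # [T, D] cached indexer keys
w = [0.7, 0.3]                                 # weights_proj output for this token
scale = D ** -0.5

scores = []
for k in keys:
    per_head = [max(0.0, scale * sum(a * b for a, b in zip(q[h], k)))
                for h in range(H)]             # ReLU(softmax_scale * q.k)
    scores.append(sum(wh * ph for wh, ph in zip(w, per_head)))

topk = sorted(range(T), key=lambda t: -scores[t])[:2]   # keep only top-2 keys
print(topk)                                    # [2, 0]
```

Note how the ReLU zeroes the negative head-1 dot product for key 1, so that key cannot drag the summed score down.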
Top-K Mask Construction
Once the indexer returns topk_indices of shape [B, S, 2048], the main attention mask is built by:
  1. Allocate index_mask = full(-inf, [B, S, T])
  2. index_mask.scatter_(-1, topk_indices, 0.0)
  3. Add to causal mask: combined = index_mask + causal_mask
The result: zero where the indexer says "attend here", -inf everywhere else, plus normal causal masking. Standard SDPA handles the rest.
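The three steps in a runnable miniature (toy sizes -- 6 keys, top-3 -- instead of a 200K cache and top-2048):

```python
import torch

# Build the sparse DSA mask: -inf everywhere, 0 at the indexer-selected keys,
# then add a (here trivial) causal mask.
B, S, T, K = 1, 2, 6, 3
topk_indices = torch.tensor([[[0, 2, 5],
                              [1, 3, 4]]])                 # [B, S, K]

index_mask = torch.full((B, S, T), float("-inf"))          # step 1
index_mask.scatter_(-1, topk_indices, 0.0)                 # step 2
causal_mask = torch.zeros(B, S, T)                         # toy: nothing masked
combined = index_mask + causal_mask                        # step 3

print(combined[0, 0].tolist())   # [0.0, -inf, 0.0, -inf, -inf, 0.0]
```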
Independent Cache
The transformers DynamicCache is sized to exactly num_hidden_layers attention layers, with no slot for the indexer. So the indexer maintains its own _cached_keys tensor as a regular Python attribute. On prefill (seq_len > 1) it resets the cache; on decode it concatenates new keys along the sequence dimension. The cached state is only the indexer's small 128-dim post-norm keys -- 256 bytes per token per layer in bf16, ~20 KB per token across all 78 layers -- small next to the main KV cache.
DSA Indexer Forward Pass
[Diagram: DSA Indexer forward pass. K path (own keys): wk (6144 → 128) → k_norm (LayerNorm, eps=1e-6, NOT RMSNorm) → split, RoPE on k_pe[64], concat back to [B,S,128] → own key cache _cached_keys [B,T,128], separate from DynamicCache. Q path (reuses MLA's q_resid after q_a_layernorm): wq_b (2048 → 32×128) → view/split, RoPE on q_pe → q[B,S,32,128]. Per-head weights: weights_proj (6144 → 32, fp32 escape), scaled by n_heads^(-0.5). Scoring (fp32): scores = ReLU(einsum("bshd,btd→bsht", q, k_cached) · softmax_scale); index_scores = einsum(scores, weights) + causal_mask; topk(k=2048, dim=-1) → indices [B,S,2048] → scattered into the attention mask.]
DSA Indexer Pseudocode (verified against modeling_glm_moe_dsa.py:144-229)
def indexer.forward(hidden_states, q_resid, position_embeddings, mask, use_cache):
    cos, sin = position_embeddings

    # === Queries (reuse MLA's q_resid latent) ===
    q = wq_b(q_resid).view(B, S, 32, 128)
    q_pe, q_nope = split(q, [64, 64], dim=-1)
    q_pe = apply_rotary_pos_emb(q_pe, cos, sin, unsqueeze_dim=2)
    q = cat([q_pe, q_nope], dim=-1)                          # [B, S, 32, 128]

    # === Keys (own projection from raw hidden_states) ===
    k = k_norm(wk(hidden_states))                            # LayerNorm, eps=1e-6
    k_pe, k_nope = split(k, [64, 64], dim=-1)
    k_pe = apply_rotary_pos_emb(k_pe.unsqueeze(2), cos, sin, dim=2).squeeze(2)
    k = cat([k_pe, k_nope], dim=-1)                          # [B, S, 128]

    # === Indexer's own KV cache (NOT in DynamicCache) ===
    if seq_len > 1:
        self._cached_keys = None                              # reset on prefill
    if use_cache:
        k_cached = cat([self._cached_keys, k], dim=1) if self._cached_keys is not None else k
        self._cached_keys = k_cached
    else:
        k_cached = k

    # === Score (FP32 in critical path) ===
    weights = weights_proj(hidden_states).float() * (32**-0.5)            # [B, S, 32]
    scores  = einsum("bshd,btd->bsht", q.float(), k_cached.float()) * (128**-0.5)
    scores  = F.relu(scores)
    index_scores = einsum("bsht,bsh->bst", scores, weights)              # [B, S, T]

    if mask is not None:
        index_scores = index_scores + mask                                # apply causal

    return index_scores.topk(min(2048, T), dim=-1).indices                # [B, S, 2048]
Deep Dive: MoE Routing — sigmoid + bias correction

GLM-5.1 inherits its routing mechanism from DeepSeek V3 via the GLM-4 lineage. There are two distinguishing choices: (a) the per-expert score is a sigmoid probability, not a softmax, and (b) selection (but not weighting) is biased by a learnable per-expert bias e_score_correction_bias. The bias is updated by an external balancing rule during training -- not by gradient descent -- so the model trains without any auxiliary load-balancing loss. This is the "auxiliary-loss-free" balancing of DeepSeek V3.

Step 1: Sigmoid Scores
The router computes router_logits = hidden_states @ router.weight.T in fp32, then applies sigmoid(). Each expert independently produces a score in [0,1], so multiple experts can be "highly relevant" without competing for normalization mass like softmax would force.
Step 2: Bias-Corrected Selection
The fp32 buffer e_score_correction_bias (initialized to zero, updated heuristically during training) is added to the sigmoid scores -- but only for the purpose of choosing which experts to fire. Underused experts get bumped up; overused experts get pushed down. The selection bias prevents expert collapse without any gradient on the bias.
Step 3: Group-Then-Top-K
Experts are partitioned into n_group groups; each group's score is the sum of its top-2 corrected scores; the top topk_group groups are kept; everything outside those groups is masked out; finally topk(k=8) picks the active experts. For GLM-5.1 n_group=topk_group=1, so this collapses to plain top-8 over all 256 experts. The machinery is preserved for compatibility.
Step 4: Reweight (use original sigmoid, not bias)
Critical detail: the weight applied to each selected expert's output is the original, non-bias-corrected sigmoid score (gathered from router_logits.sigmoid()), not the corrected score. So the bias only changes which experts fire, never how much they contribute. After top-k, the weights are normalized to sum to 1 then multiplied by routed_scaling_factor=2.5.
Step 5: Expert Forward + Shared Expert
For each token, the 8 selected experts produce SwiGLU outputs which are summed weighted by the (normalized, scaled) top-k weights. Independently, the always-on shared expert produces its own SwiGLU output. The total MLP output is routed_output + shared_output -- the shared expert is added without the routing scale, so it behaves like a constant bias path that the routed pool augments.
def forward(self, hidden_states):
    residuals = hidden_states
    flat_x = hidden_states.view(-1, hidden_states.shape[-1])
    router_logits = self.gate(flat_x)                     # [B*S, 256], fp32
    topk_idx, topk_w = self.route_tokens_to_experts(router_logits)
    routed = self.experts(flat_x, topk_idx, topk_w).view(residuals.shape)
    shared = self.shared_experts(residuals)               # always-on
    return routed + shared
Routing Algorithm (verified against modeling_glm_moe_dsa.py:547-580)
def route_tokens_to_experts(router_logits):
    # 1) Sigmoid (NOT softmax)
    router_logits = router_logits.sigmoid()                                  # [N, 256]

    # 2) Bias-corrected scores for SELECTION ONLY
    scores_for_choice = router_logits + self.gate.e_score_correction_bias    # [N, 256]

    # 3) Group routing (degenerate at n_group=1)
    group_scores = (
        scores_for_choice.view(N, 1, 256).topk(2, dim=-1)[0].sum(dim=-1)
    )                                                                         # [N, 1]
    group_idx  = group_scores.topk(k=1, dim=-1, sorted=False)[1]
    group_mask = zeros_like(group_scores)
    group_mask.scatter_(1, group_idx, 1)                                     # all groups kept
    score_mask = group_mask.unsqueeze(-1).expand(-1, 1, 256).reshape(-1, 256)
    masked_scores = scores_for_choice.masked_fill(~score_mask.bool(), 0.0)

    # 4) Top-k (k=8)
    topk_indices = masked_scores.topk(k=8, dim=-1, sorted=False)[1]          # [N, 8]

    # 5) Re-gather UNCORRECTED sigmoid weights (bias is for selection only!)
    topk_weights = router_logits.gather(1, topk_indices)                     # [N, 8]

    # 6) Normalize then scale
    topk_weights = topk_weights / (topk_weights.sum(dim=-1, keepdim=True) + 1e-20)
    topk_weights = topk_weights * 2.5                                        # routed_scaling_factor

    return topk_indices, topk_weights
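A tiny numeric sketch (a hypothetical 4-expert, top-2 router in pure Python, not model values) shows why the split between corrected selection and uncorrected weighting matters:

```python
import math

# The correction bias changes WHICH experts fire, but the applied weights
# are the raw (uncorrected) sigmoid scores.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

logits = [2.0, 1.5, 1.4, -1.0]
bias = [0.0, 0.0, 0.5, 0.0]               # expert 2 was underused in training

scores = [sigmoid(x) for x in logits]
corrected = [s + b for s, b in zip(scores, bias)]
idx = sorted(range(4), key=lambda i: -corrected[i])[:2]   # selection: biased
w = [scores[i] for i in idx]                              # weighting: unbiased
w = [2.5 * x / sum(w) for x in w]         # normalize, then routed_scaling_factor

print(idx)                                # [2, 0]: bias flipped the second pick
print([round(x, 3) for x in w])
```

Without the bias, the top-2 would be experts 0 and 1; with it, expert 2 displaces expert 1, yet its weight is still its plain sigmoid score.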
Code-Verified Architecture Details

Verified against transformers/models/glm_moe_dsa. Code quotes are exact, with comments preserved. Files: configuration_glm_moe_dsa.py, modular_glm_moe_dsa.py, modeling_glm_moe_dsa.py.

1. RMSNorm — classical, eps=1e-5 (model), eps=1e-6 (q_a/kv_a inside attention)
GLM-5.1 uses two flavors of normalization. The decoder body and the attention's q_a_layernorm / kv_a_layernorm use GlmMoeDsaRMSNorm (RMS, with learned weight). The DSA Indexer's k_norm is a standard nn.LayerNorm with eps=1e-6 -- a deliberate departure to match the DeepSeek V3.2 reference.
class GlmMoeDsaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps: float = 1e-6) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
2. apply_rotary_pos_emb — NeoX/Llama split-half (not interleaved)
Reused from llama.modeling_llama.rotate_half. Splits the head_dim in two halves; rotates the second half against the first via cos/sin. The function takes a single tensor (not a pair) so it can be applied to q_pe and k_pe independently. The unsqueeze_dim argument is 1 for BHSD layout (main attention) and 2 for BSHD layout (indexer).
def apply_rotary_pos_emb(x, cos, sin, unsqueeze_dim: int = 1) -> torch.Tensor:
    """
    This is the transformers equivalent of DeepSeek V3.2's `apply_rotary_emb(x, freqs_cis, interleaved)`.
    Instead of using complex-number `freqs_cis`, we use pre-split `(cos, sin)` tensors from RotaryEmbedding.
    """
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    # Split-half (NeoX/Llama style): (x[:d/2], x[d/2:])
    return (x * cos) + (rotate_half(x) * sin)
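For reference, rotate_half (reused from the Llama modeling file, as noted above) negates the second half of the last dimension and swaps it in front of the first; a quick check on a 4-dim vector:

```python
import torch

# rotate_half as defined in transformers' Llama modeling code: split the last
# dim in half, negate the second half, and move it in front of the first.
def rotate_half(x):
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

x = torch.tensor([1.0, 2.0, 3.0, 4.0])
print(rotate_half(x).tolist())   # [-3.0, -4.0, 1.0, 2.0]
```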
3. RotaryEmbedding — partial_rotary_factor honored via head_dim attribute_map
The config sets attribute_map = {"head_dim": "qk_rope_head_dim"}, so when the rotary embedding asks config.head_dim it actually receives qk_rope_head_dim=64. Combined with the optional partial_rotary_factor from rope_parameters, this lets the RoPE generator emit cos/sin tensors for exactly the rotary subspace, not the full 256-dim head.
class GlmMoeDsaRotaryEmbedding(nn.Module):
    @staticmethod
    def compute_default_rope_parameters(config, device=None, seq_len=None):
        base = config.rope_parameters["rope_theta"]
        partial_rotary_factor = config.rope_parameters.get("partial_rotary_factor", 1.0)
        head_dim = getattr(config, "head_dim", None) or config.hidden_size // config.num_attention_heads
        dim = int(head_dim * partial_rotary_factor)
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.int64).to(...) / dim))
        return inv_freq, 1.0
4. GlmMoeDsaAttention — the full forward pass
The complete attention forward, showing how the Indexer is wired in. Q comes from the q_a/q_b LoRA path; KV is split into a 512-dim latent + 64-dim shared k_pe stream; the latent is normed and expanded by kv_b_proj; the K-pe stream is RoPE'd separately and broadcast across all 64 heads. The Indexer is invoked on q_resid (the q_a output) and the raw hidden_states; its output (top-k indices) becomes a sparse mask combined with the causal mask.
# ===== Query path =====
q_resid = self.q_a_layernorm(self.q_a_proj(hidden_states))  # [B,S,2048]
query_states = self.q_b_proj(q_resid)
query_states = query_states.view(B, S, -1, self.qk_head_dim).transpose(1,2)
q_nope, q_pe = torch.split(query_states,
    [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1)
q_pe = apply_rotary_pos_emb(q_pe, cos, sin, unsqueeze_dim=1)

# ===== KV path =====
compressed_kv = self.kv_a_proj_with_mqa(hidden_states)        # [B,S,576]
k_compressed, k_pe = torch.split(compressed_kv,
    [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1)
k_compressed = self.kv_a_layernorm(k_compressed)              # [B,S,512]

kv_expanded = self.kv_b_proj(k_compressed)                    # [B,S,64*448]
kv_expanded = kv_expanded.view(B, S, -1, self.qk_nope_head_dim + self.v_head_dim)
k_nope, value_states = torch.split(kv_expanded,
    [self.qk_nope_head_dim, self.v_head_dim], dim=-1)
k_nope        = k_nope.transpose(1,2)
value_states  = value_states.transpose(1,2)

# RoPE on the single shared k_pe stream, then broadcast across heads
k_pe = k_pe.view(B, 1, S, self.qk_rope_head_dim)
k_pe = apply_rotary_pos_emb(k_pe, cos, sin, unsqueeze_dim=1)
k_pe = k_pe.expand(-1, k_nope.shape[1], -1, -1)               # [B, 64, S, 64]

query_states = torch.cat([q_nope, q_pe], dim=-1)              # [B,64,S,256]
key_states   = torch.cat([k_nope, k_pe], dim=-1)              # [B,64,S,256]

if past_key_values is not None:
    key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx)

# ===== Indexer (DSA sparse mask) =====
indexer_mask = ...   # broadcast attention_mask to [B,S,T]
topk_indices = self.indexer(
    hidden_states, q_resid, position_embeddings, indexer_mask,
    use_cache=past_key_values is not None,
)                                                              # [B, S, 2048]

# Build combined DSA + causal mask: -inf except top-k
index_mask = torch.full((B, S, T), float("-inf"), ...)
index_mask.scatter_(-1, topk_indices, 0.0)
index_mask = index_mask.unsqueeze(1)                          # [B,1,S,T]
combined_mask = index_mask + causal_mask

attn_output, attn_weights = attention_interface(
    self, query_states, key_states, value_states, combined_mask,
    scaling=self.scaling, indices=topk_indices, **kwargs)
attn_output = self.o_proj(attn_output.reshape(B, S, -1))
[Diagram: attention block data flow. Q: q_a_proj → RMSNorm → q_b_proj → split + RoPE into [H, nope | pe]. KV: kv_a (mqa) → split [512 | 64] → RMSNorm → kv_b_proj → K_nope, V, plus separately RoPE'd k_pe. DSA indexer selects top-2K; SDPA(Q, K, V) with the combined DSA + causal mask → o_proj → attn_output [B, S, 6144]. flash-mla additionally receives topk_indices to skip masked keys.]
5. eager_attention_forward — standard SDPA after MLA expansion
After MLA has produced full [B, H, S, 256] Q and K and [B, H, S, 256] V, the attention computation itself is plain Llama-style SDPA: scaled dot product, mask add, softmax in fp32, dropout, value matmul. repeat_kv is a no-op for GLM-5.1 because num_attention_heads == num_key_value_heads == 64.
def eager_attention_forward(module, query, key, value, attention_mask, scaling, dropout=0.0, **kwargs):
    key_states   = repeat_kv(key, module.num_key_value_groups)    # no-op (groups=1)
    value_states = repeat_kv(value, module.num_key_value_groups)

    attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask              # ← combined DSA+causal here
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output  = torch.matmul(attn_weights, value_states)
    return attn_output.transpose(1, 2).contiguous(), attn_weights
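A toy numeric check of this path (shapes shrunk, groups=1 so repeat_kv is the identity): positions masked with -inf receive exactly zero attention weight after the fp32 softmax, so they cannot leak into the output.

```python
import torch
import torch.nn.functional as F

B, H, S, T, D = 1, 2, 3, 5, 4
q = torch.randn(B, H, S, D)
k = torch.randn(B, H, T, D)
v = torch.randn(B, H, T, D)
scaling = D ** -0.5

# A mask that forbids the last two keys for every query.
mask = torch.zeros(B, 1, S, T)
mask[..., -2:] = float("-inf")

attn = torch.matmul(q, k.transpose(2, 3)) * scaling + mask
w = F.softmax(attn, dim=-1, dtype=torch.float32).to(q.dtype)
out = torch.matmul(w, v)

print(w[..., -2:].abs().max().item())  # 0.0 — masked keys contribute nothing
print(out.shape)                       # torch.Size([1, 2, 3, 4])
```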
6. GlmMoeDsaMLP — SwiGLU dense FFN (used by 3 dense layers and as the shared expert)
A vanilla SwiGLU MLP. The same class is reused twice in different roles: (a) as the standalone MLP for layers 0-2 (intermediate_size=12288); (b) as the shared expert inside GlmMoeDsaMoE with intermediate_size = moe_intermediate_size * n_shared_experts = 2048. No biases.
class GlmMoeDsaMLP(nn.Module):
    def __init__(self, config, intermediate_size=None):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size if intermediate_size is None else intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj   = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]   # silu

    def forward(self, x):
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
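Functionally the block computes down(silu(gate(x)) * up(x)). A self-contained sketch with toy dimensions (6144/12288 shrunk to 8/16); the same function with the intermediate width set to 2048 plays the shared-expert role:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, inter = 8, 16  # toy stand-ins for 6144 / 12288

gate_proj = nn.Linear(hidden, inter, bias=False)
up_proj   = nn.Linear(hidden, inter, bias=False)
down_proj = nn.Linear(inter, hidden, bias=False)

def swiglu(x):
    # SiLU-gated linear unit: elementwise gate on the up projection.
    return down_proj(F.silu(gate_proj(x)) * up_proj(x))

x = torch.randn(2, 3, hidden)
y = swiglu(x)
print(y.shape)  # torch.Size([2, 3, 8])
```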
7. GlmMoeDsaTopkRouter — FP32 logits, sigmoid, score-correction bias
The router is a single linear projection from hidden_size to n_routed_experts, computed in fp32 even when the rest of the model is in bf16/FP8. The e_score_correction_bias buffer (also fp32, listed in _keep_in_fp32_modules_strict) is the auxiliary-loss-free balancing knob from DeepSeek V3.
class GlmMoeDsaTopkRouter(nn.Module):
    def __init__(self, config: GlmMoeDsaConfig):
        super().__init__()
        self.config = config
        self.top_k = config.num_experts_per_tok                     # 8
        self.n_routed_experts = config.n_routed_experts             # 256
        self.routed_scaling_factor = config.routed_scaling_factor   # 2.5
        self.n_group   = config.n_group                              # 1
        self.topk_group = config.topk_group                          # 1
        self.norm_topk_prob = config.norm_topk_prob                  # True

        self.weight = nn.Parameter(torch.empty((self.n_routed_experts, config.hidden_size)))
        self.register_buffer("e_score_correction_bias",
                             torch.zeros((self.n_routed_experts), dtype=torch.float32))

    def forward(self, hidden_states):
        hidden_states = hidden_states.view(-1, self.config.hidden_size)
        router_logits = F.linear(hidden_states.type(torch.float32),
                                 self.weight.type(torch.float32))
        return router_logits
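The route_tokens_to_experts step that consumes these logits is not shown above; the following is a sketch of the selection semantics the surrounding prose describes (sigmoid gating, top-8, normalization, ×2.5), under the DeepSeek-V3 assumption that e_score_correction_bias shifts which experts are selected but is excluded from the mixing weights:

```python
import torch

n_experts, top_k, scale = 16, 4, 2.5  # toy sizes; the real model uses 256 / 8 / 2.5

router_logits = torch.randn(5, n_experts)               # 5 tokens
bias = torch.randn(n_experts)                           # e_score_correction_bias

scores = router_logits.sigmoid()                        # sigmoid gating, not softmax
topk_idx = (scores + bias).topk(top_k, dim=-1).indices  # bias shifts selection only
topk_w = scores.gather(-1, topk_idx)                    # weights from unbiased scores
topk_w = topk_w / topk_w.sum(-1, keepdim=True)          # norm_topk_prob=True
topk_w = topk_w * scale                                 # routed_scaling_factor

print(topk_w.sum(-1))  # each row sums to 2.5
```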
8. GlmMoeDsaNaiveMoe — 3D expert tensors, naive Python loop
The default expert collection stores all expert weights as 3D parameter tensors: gate_up_proj[256, 4096, 6144] (gate and up packed) and down_proj[256, 6144, 2048]. The forward dispatches each token to its top-k experts via a one-hot mask plus a Python loop over hit experts -- correct but slow. Production deployments replace this via @use_experts_implementation with fused MoE kernels (e.g. SGLang, vLLM).
@use_experts_implementation
class GlmMoeDsaNaiveMoe(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_experts     = config.num_local_experts                # 256
        self.hidden_dim      = config.hidden_size                       # 6144
        self.intermediate_dim = config.moe_intermediate_size            # 2048
        self.gate_up_proj = nn.Parameter(torch.empty(self.num_experts,
                                  2 * self.intermediate_dim, self.hidden_dim))
        self.down_proj    = nn.Parameter(torch.empty(self.num_experts,
                                  self.hidden_dim, self.intermediate_dim))
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, hidden_states, top_k_index, top_k_weights):
        final_hidden_states = torch.zeros_like(hidden_states)
        with torch.no_grad():
            expert_mask = F.one_hot(top_k_index, num_classes=self.num_experts).permute(2,1,0)
            expert_hit  = torch.greater(expert_mask.sum(dim=(-1,-2)), 0).nonzero()

        for expert_idx in expert_hit:
            expert_idx = expert_idx[0]
            top_k_pos, token_idx = torch.where(expert_mask[expert_idx])
            current_state = hidden_states[token_idx]
            gate, up = F.linear(current_state, self.gate_up_proj[expert_idx]).chunk(2, dim=-1)
            current_hidden_states = self.act_fn(gate) * up
            current_hidden_states = F.linear(current_hidden_states, self.down_proj[expert_idx])
            current_hidden_states = current_hidden_states * top_k_weights[token_idx, top_k_pos, None]
            final_hidden_states.index_add_(0, token_idx, current_hidden_states.to(...))

        return final_hidden_states
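The loop dispatch can be validated on toy sizes — a sketch reimplementing the same one-hot / index_add_ logic and checking it against an explicit per-token loop over each token's chosen experts:

```python
import torch
import torch.nn.functional as F

E, H, I, N, K = 4, 6, 5, 7, 2   # experts, hidden, intermediate, tokens, top-k
gate_up = torch.randn(E, 2 * I, H)
down = torch.randn(E, H, I)
x = torch.randn(N, H)
top_k_index = torch.stack([torch.randperm(E)[:K] for _ in range(N)])  # [N, K]
top_k_weights = torch.rand(N, K)

def expert(e, h):
    g, u = F.linear(h, gate_up[e]).chunk(2, dim=-1)
    return F.linear(F.silu(g) * u, down[e])

# Loop-over-experts dispatch (mirrors GlmMoeDsaNaiveMoe.forward)
out = torch.zeros_like(x)
mask = F.one_hot(top_k_index, num_classes=E).permute(2, 1, 0)  # [E, K, N]
for e in (mask.sum(dim=(-1, -2)) > 0).nonzero().flatten():
    pos, tok = torch.where(mask[e])
    out.index_add_(0, tok, expert(e, x[tok]) * top_k_weights[tok, pos, None])

# Reference: loop over tokens instead of experts
ref = torch.stack([sum(top_k_weights[n, j] * expert(top_k_index[n, j], x[n])
                       for j in range(K)) for n in range(N)])
print(torch.allclose(out, ref, atol=1e-5))  # True
```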
9. GlmMoeDsaMoE — routed pool + always-on shared expert
The composite MoE module that the decoder layer actually instantiates. It owns the router, the routed experts, and the shared expert. Note that the shared expert and the routed experts see the same input: residuals is simply a saved copy of the post-attention-layernorm hidden states, so the always-on shared expert's output is added on top of the routed combination.
class GlmMoeDsaMoE(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.experts = GlmMoeDsaNaiveMoe(config)
        self.gate    = GlmMoeDsaTopkRouter(config)
        self.shared_experts = GlmMoeDsaMLP(
            config=config,
            intermediate_size=config.moe_intermediate_size * config.n_shared_experts,  # 2048*1
        )

    def forward(self, hidden_states):
        residuals = hidden_states
        orig_shape = hidden_states.shape
        router_logits = self.gate(hidden_states)
        topk_indices, topk_weights = self.route_tokens_to_experts(router_logits)
        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
        hidden_states = self.experts(hidden_states, topk_indices, topk_weights).view(*orig_shape)
        hidden_states = hidden_states + self.shared_experts(residuals)
        return hidden_states
10. GlmMoeDsaDecoderLayer — classic pre-norm transformer block
Two RMSNorms, two residuals, attention, then dense MLP or MoE depending on config.mlp_layer_types[layer_idx]. Inherits from Glm4MoeLiteDecoderLayer; the only thing GLM-5.1 changes is which attention class is instantiated (GlmMoeDsaAttention with the embedded indexer).
class GlmMoeDsaDecoderLayer(GradientCheckpointingLayer):
    def __init__(self, config, layer_idx):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.self_attn = GlmMoeDsaAttention(config, layer_idx)

        if config.mlp_layer_types[layer_idx] == "sparse":
            self.mlp = GlmMoeDsaMoE(config)
        else:
            self.mlp = GlmMoeDsaMLP(config)

        self.input_layernorm           = GlmMoeDsaRMSNorm(config.hidden_size, config.rms_norm_eps)
        self.post_attention_layernorm  = GlmMoeDsaRMSNorm(config.hidden_size, config.rms_norm_eps)

    def forward(self, hidden_states, attention_mask=None, position_ids=None,
                past_key_values=None, use_cache=False, position_embeddings=None, **kwargs):
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        hidden_states, _ = self.self_attn(
            hidden_states=hidden_states, attention_mask=attention_mask,
            position_ids=position_ids, past_key_values=past_key_values,
            use_cache=use_cache, position_embeddings=position_embeddings, **kwargs)
        hidden_states = residual + hidden_states

        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states
        return hidden_states
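The pre-norm wiring has a useful invariant: the residual stream passes through unnormalized, so the block reduces to the identity when both sublayers output zero. A toy sketch (LayerNorm standing in for RMSNorm, linear layers standing in for attention and the MLP/MoE):

```python
import torch
import torch.nn as nn

class ToyPreNormBlock(nn.Module):
    """Pre-norm wiring as in GlmMoeDsaDecoderLayer: norm -> sublayer -> add residual."""
    def __init__(self, d):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.Linear(d, d, bias=False)   # stand-in for attention
        self.mlp  = nn.Linear(d, d, bias=False)   # stand-in for dense MLP / MoE

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

blk = ToyPreNormBlock(8)
# Zero both sublayers: the pre-norm block then reduces to the identity.
with torch.no_grad():
    blk.attn.weight.zero_(); blk.mlp.weight.zero_()
x = torch.randn(2, 3, 8)
print(torch.equal(blk(x), x))  # True
```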
11. GlmMoeDsaModel — embedding, 78 decoder layers, final norm
Top-level model body. Untied embeddings (no * sqrt(d) scaling). Builds 78 layers, runs them in sequence, applies a final RMSNorm. Cache management uses transformers' standard DynamicCache -- the indexer's separate cache lives inside each attention module, invisible to the model body.
class GlmMoeDsaModel(GlmMoeDsaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
        self.layers = nn.ModuleList(
            [GlmMoeDsaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )
        self.norm = GlmMoeDsaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.rotary_emb = GlmMoeDsaRotaryEmbedding(config=config)
        self.gradient_checkpointing = False
        self.post_init()

    def forward(self, input_ids=None, attention_mask=None, position_ids=None,
                past_key_values=None, inputs_embeds=None, use_cache=None, **kwargs):
        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)
        if use_cache and past_key_values is None:
            past_key_values = DynamicCache(config=self.config)
        ...
        causal_mask = create_causal_mask(...)
        hidden_states = inputs_embeds
        position_embeddings = self.rotary_emb(hidden_states, position_ids=position_ids)

        for decoder_layer in self.layers[:self.config.num_hidden_layers]:
            hidden_states = decoder_layer(
                hidden_states,
                attention_mask=causal_mask,
                position_embeddings=position_embeddings,
                position_ids=position_ids,
                past_key_values=past_key_values,
                use_cache=use_cache, **kwargs)

        hidden_states = self.norm(hidden_states)
        return BaseModelOutputWithPast(last_hidden_state=hidden_states,
                                       past_key_values=past_key_values)
12. GlmMoeDsaForCausalLM — untied lm_head, no logit cap
A standard causal LM head: a single linear from hidden_size to vocab_size, no bias, no logit soft-capping (unlike Gemma 4 which clamps logits with tanh(x/30)*30). The lm_head is not tied to the input embedding -- both are full 6144 × 154880 matrices, contributing ~1.9B parameters between them.
class GlmMoeDsaForCausalLM(GlmMoeDsaPreTrainedModel, GenerationMixin):
    _tied_weights_keys = {"lm_head.weight": "model.embed_tokens.weight"}  # placeholder, tie_word_embeddings=False

    def __init__(self, config):
        super().__init__(config)
        self.model = GlmMoeDsaModel(config)
        self.vocab_size = config.vocab_size
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.post_init()
13. GlmMoeDsaPreTrainedModel — FP8 escape hatches and flash-mla integration
Sets up the FP8/fp32 boundary: indexer.weights_proj is preserved in fp32 (the per-head weighting that gates the top-k selection), and e_score_correction_bias is preserved in fp32 (the routing balancer). Flash attention is disabled -- the only flash backend supported is the dedicated kernels-community/flash-mla kernel which understands the latent format and the topk indices. SDPA is supported as a generic fallback.
class GlmMoeDsaPreTrainedModel(PreTrainedModel):
    config: GlmMoeDsaConfig
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    _no_split_modules = ["GlmMoeDsaDecoderLayer"]

    _supports_flash_attn = False  # flash-mla kernels need a bit more work...
    _supports_sdpa = True
    _supports_flex_attn = False
    _can_compile_fullgraph = True
    _supports_attention_backend = True

    # FP8 quantization uses _keep_in_fp32_modules to decide what NOT to convert
    _keep_in_fp32_modules         = ["indexer.weights_proj"]
    _keep_in_fp32_modules_strict  = ["e_score_correction_bias"]
    _keys_to_ignore_on_load_unexpected = [r"model\.layers\.78.*"]
    _compatible_flash_implementations  = ["kernels-community/flash-mla"]
14. attribute_map — how RoPE finds the rope-only head_dim
A small but consequential trick. Transformers' RotaryEmbedding reads config.head_dim; without intervention this would be the full 256-dim head. But GLM-5.1 only wants RoPE applied to the 64-dim rope subspace. The attribute_map = {"head_dim": "qk_rope_head_dim"} at config-class level rewrites every external read of config.head_dim to return config.qk_rope_head_dim instead, so the RoPE generator emits cos/sin tensors of length 64. Same trick used in DeepSeek V3.
attribute_map = {
    "num_local_experts": "n_routed_experts",   # for MoE generic interfaces
    "head_dim":          "qk_rope_head_dim",   # for RotaryEmbedding -- only the rope subspace
}
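The redirection can be illustrated with a toy config class — a sketch of the mechanism only (transformers implements it inside PretrainedConfig's attribute access, not like this), showing that every read of head_dim transparently returns qk_rope_head_dim:

```python
class ToyConfig:
    attribute_map = {"head_dim": "qk_rope_head_dim"}

    def __init__(self):
        self.qk_rope_head_dim = 64
        self.qk_nope_head_dim = 192

    def __getattribute__(self, name):
        # Remap any external read of a mapped attribute name.
        amap = super().__getattribute__("attribute_map")
        if name != "attribute_map" and name in amap:
            name = amap[name]
        return super().__getattribute__(name)

cfg = ToyConfig()
print(cfg.head_dim)  # 64 — the RoPE generator sees only the rope subspace
```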
15. base_model_tp_plan — tensor-parallel sharding plan
The config also declares the TP plan for distributed inference. Notice the special "mla_kv_a_proj" shard type for kv_a_proj_with_mqa -- it has to be sharded carefully because the output is a concatenation of the 512-dim KV latent and the 64-dim k_pe stream, which need different replication strategies. The router is not sharded (each TP rank computes the full 256-expert score). The experts use a custom "moe_tp_experts" plan that distributes the 256 experts across ranks.
base_model_tp_plan = {
    "layers.*.self_attn.q_b_proj":             "colwise",
    "layers.*.self_attn.kv_a_proj_with_mqa":   "mla_kv_a_proj",
    "layers.*.self_attn.kv_b_proj":            "colwise",
    "layers.*.self_attn.o_proj":               "rowwise",
    "layers.*.mlp.experts.gate_up_proj":       "packed_colwise",
    "layers.*.mlp.experts.down_proj":          "rowwise",
    "layers.*.mlp.experts":                    "moe_tp_experts",
    "layers.*.mlp.shared_experts.gate_proj":   "colwise",
    "layers.*.mlp.shared_experts.up_proj":     "colwise",
    "layers.*.mlp.shared_experts.down_proj":   "rowwise",
    "layers.*.mlp.gate_proj":                  "colwise",   # dense layers
    "layers.*.mlp.up_proj":                    "colwise",
    "layers.*.mlp.down_proj":                  "rowwise",
}
16. mlp_layer_types — 3 dense + 75 sparse, post-init computed
Set in __post_init__ if the user did not provide it: the first min(3, num_layers) layers are dense, the rest are sparse. This is the difference between GLM-5.1 (3 dense) and GLM-5/lite (1 dense). The list is checked at decoder-layer instantiation time.
def __post_init__(self, **kwargs):
    self.qk_head_dim = self.qk_nope_head_dim + self.qk_rope_head_dim    # 192 + 64 = 256

    # MLP layer types: first 3 dense, rest sparse
    if self.mlp_layer_types is None:
        self.mlp_layer_types = (
            ["dense"] * min(3, self.num_hidden_layers)
            + ["sparse"] * (self.num_hidden_layers - 3)
        )
    super().__post_init__(**kwargs)
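For the published 78-layer config, the default can be checked directly:

```python
num_hidden_layers = 78

# Same expression as __post_init__: first min(3, n) dense, rest sparse.
mlp_layer_types = (["dense"] * min(3, num_hidden_layers)
                   + ["sparse"] * (num_hidden_layers - 3))

print(mlp_layer_types.count("dense"), mlp_layer_types.count("sparse"))  # 3 75
print(mlp_layer_types[:4])  # ['dense', 'dense', 'dense', 'sparse']
```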
17. flash-mla integration — topk_indices kwarg
When the attention interface is the dedicated flash-mla kernel, the indexer's topk_indices tensor is forwarded as a kwarg. flash-mla can then skip the gather/scatter mask machinery entirely: it indexes directly into the cached K/V tensors via the top-k indices, multiplies only against those keys, and returns the attention output. The non-flash code path has to materialize the full -inf mask, but the optimized kernel doesn't.
attn_output, attn_weights = attention_interface(
    self,
    query_states,
    key_states,
    value_states,
    combined_mask,
    dropout=0.0 if not self.training else self.attention_dropout,
    scaling=self.scaling,
    indices=topk_indices,        # flash_mla_with_kvcache reads this
    **kwargs,
)
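The equivalence the kernel relies on can be checked on toy tensors — a sketch (not the flash-mla API) showing that gathering only the top-k keys/values and attending densely over them matches the full -inf-mask path:

```python
import torch
import torch.nn.functional as F

B, H, S, T, D, k = 1, 2, 3, 8, 4, 3
q = torch.randn(B, H, S, D)
K = torch.randn(B, H, T, D)
V = torch.randn(B, H, T, D)
scale = D ** -0.5
idx = torch.rand(B, S, T).topk(k, dim=-1).indices          # [B, S, k] distinct keys

# Mask path: full scores, -inf everywhere except the top-k positions.
mask = torch.full((B, 1, S, T), float("-inf"))
mask.scatter_(-1, idx.unsqueeze(1), 0.0)
w = F.softmax(q @ K.transpose(2, 3) * scale + mask, dim=-1)
out_mask = w @ V

# Gather path: index only the top-k keys/values, attend densely over k.
g = idx[:, None, :, :, None].expand(B, H, S, k, D)
K_sel = torch.gather(K[:, :, None].expand(B, H, S, T, D), 3, g)
V_sel = torch.gather(V[:, :, None].expand(B, H, S, T, D), 3, g)
s = (q.unsqueeze(3) * K_sel).sum(-1) * scale               # [B, H, S, k]
out_gather = (F.softmax(s, dim=-1).unsqueeze(-1) * V_sel).sum(3)

print(torch.allclose(out_mask, out_gather, atol=1e-5))  # True
```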
18. Indexer cache reset on prefill
A subtle correctness rule. The indexer maintains its own key cache across decode steps, growing it by one per forward. But on a fresh prompt (prefill, where seq_len > 1) the cache must be reset -- otherwise stale keys from a previous request would pollute the top-k selection. The check is a single if seq_len > 1: self._cached_keys = None at the top of indexer.forward.
# Reset cache on prefill (new prompt) to avoid stale keys / batch-size mismatch
if seq_len > 1:
    self._cached_keys = None

if use_cache:
    if self._cached_keys is not None:
        k_cached = torch.cat([self._cached_keys, k], dim=1)  # [B, T, D]
    else:
        k_cached = k
    self._cached_keys = k_cached
else:
    k_cached = k
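The lifecycle can be exercised with a toy stand-in — a hypothetical IndexerCache class (not the real indexer) showing reset on prefill and growth during decode:

```python
import torch

class IndexerCache:
    """Toy stand-in for the indexer's private key cache."""
    def __init__(self):
        self._cached_keys = None

    def step(self, k, use_cache=True):
        seq_len = k.shape[1]
        if seq_len > 1:                    # prefill: drop stale keys
            self._cached_keys = None
        if use_cache:
            self._cached_keys = k if self._cached_keys is None \
                else torch.cat([self._cached_keys, k], dim=1)
            return self._cached_keys
        return k

cache = IndexerCache()
cache.step(torch.randn(1, 5, 8))           # prefill: 5 tokens
cache.step(torch.randn(1, 1, 8))           # decode step
out = cache.step(torch.randn(1, 1, 8))     # decode step
print(out.shape[1])                        # 7
out2 = cache.step(torch.randn(1, 4, 8))    # new prompt resets the cache
print(out2.shape[1])                       # 4
```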
19. _keys_to_ignore_on_load_unexpected: layer 78 absent from checkpoint
A small detail of the checkpoint layout. The published 754B model has 78 hidden layers (indexed 0-77), but the checkpoint may contain a "78" placeholder from earlier training that should be ignored on load. The pattern r"model\.layers\.78.*" tells from_pretrained to silently skip those keys. Identical mechanism to GLM-4.6, which uses r"model\.layers\.46.*" for the same reason.
_keys_to_ignore_on_load_unexpected = [r"model\.layers\.78.*"]
20. Indexer reuses MLA's q_resid — param efficiency
A clever piece of weight sharing. The Indexer's "query side" needs a low-rank latent of the input, but instead of adding its own q_a_proj + q_a_layernorm, it reuses MLA's q_resid -- the output of the main attention's q_a_layernorm(q_a_proj(hidden_states)). The Indexer only owns wq_b (a single 2048→4096 linear) for its query side. This saves ~12.6M params per layer (~1B model-wide) and guarantees the indexer sees the same latent representation as the main attention.
# In GlmMoeDsaAttention.forward:
q_resid = self.q_a_layernorm(self.q_a_proj(hidden_states))   # [B, S, q_lora_rank]
query_states = self.q_b_proj(q_resid)
...
# Same q_resid is forwarded to the indexer:
topk_indices = self.indexer(
    hidden_states,
    q_resid,                                                  # ← reused!
    position_embeddings,
    indexer_mask,
    use_cache=past_key_values is not None,
)
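Back-of-envelope arithmetic for the savings claim (counting the q_a_proj weight a private indexer query latent would have needed, plus its RMSNorm weight):

```python
hidden_size, q_lora_rank, num_layers = 6144, 2048, 78

# Cost of a private indexer q_a_proj (6144 x 2048) plus its RMSNorm weight (2048)
per_layer = hidden_size * q_lora_rank + q_lora_rank
print(f"{per_layer / 1e6:.1f}M per layer")           # 12.6M per layer
print(f"{per_layer * num_layers / 1e9:.2f}B total")  # 0.98B total
```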
Architecture Stack — one full pass through GLM-5.1
- Tokenizer (SentencePiece + special tokens, vocab=154,880) → embed_tokens: 154,880 → 6,144 (untied with lm_head)
- Decoder layer ×78 — layers 0-2: dense MLP; layers 3-77: MoE
  - input_layernorm (RMSNorm, eps=1e-5, weight=ones)
  - GlmMoeDsaAttention — MLA + DSA indexer
    - Q path: q_a_proj 6144→2048 → q_a_layernorm (q_resid, also fed to the indexer) → q_b_proj 2048→64×256 → split [nope=192 | pe=64] → RoPE on q_pe, cat back
    - KV path: kv_a_proj_with_mqa 6144→512+64 → split [latent=512 | k_pe=64] → kv_a_layernorm on the latent → kv_b_proj 512→64×448 → split [k_nope=192 | v=256]; separate RoPE on k_pe
    - DSA indexer: wq_b 2048→32×128 (uses q_resid); wk 6144→128 + LayerNorm; RoPE on the 64-dim subspaces; scoring ReLU(q·k) · weights_proj; top-2,048 → indices; own _cached_keys (separate cache)
    - SDPA(Q, K, V) with DSA + causal mask → o_proj 64×256→6144 → + residual
  - post_attention_layernorm (RMSNorm)
  - Dense path (layers 0-2): GlmMoeDsaMLP (SwiGLU), gate/up 6144→12288, down 12288→6144
  - MoE path (layers 3-77): router (sigmoid + e_score_correction_bias), top-8 of 256, normalize, ×2.5; 256 routed experts (SwiGLU, h=2048) + always-on shared expert (SwiGLU, h=2048) → + residual
- final norm (RMSNorm) → lm_head (Linear, no bias) 6,144 → 154,880 (untied) → logits → softmax → next token (no logit cap, no length penalty)

Repeated 78 times. The DSA indexer is the only component without a Gemma 4 / Llama 3 analog. ~754B params total / ~40B active per token / 200K context.
MADL Architecture Diagram

Rendered from models/glm_5_1.madl via the same MADL parser and SVG renderer used by gemma4.html and the dashboard. The MADL string declares the architecture, which the renderer interprets as a vertical block stack with attention/MoE substructure expanded inline.