glm_moe_dsa — 754B-A40B coding/agentic flagship — MIT license

GLM-5.1 is Z.ai (Zhipu)'s second-generation agentic-coding flagship and the first member of the glm_moe_dsa model family in HuggingFace transformers. It is a 754B-total / ~40B-active sparse Mixture-of-Experts decoder, trained on 28.5 trillion tokens and optimized for long-horizon agentic engineering: hundreds of rounds, thousands of tool calls, full-repository edits. Architecturally, it sits at the intersection of three independent lineages -- it inherits its body from GLM-4.6, its attention from DeepSeek V3, and its sparse-attention indexer from DeepSeek V3.2 -- and combines them into a single unified design.
Every GLM-5.1 decoder layer is built from three independent technologies, each addressing a different bottleneck of trillion-parameter models:
MLA (Multi-head Latent Attention): keys and values are compressed into a shared low-rank latent (kv_lora_rank=512), and the query through a separate latent (q_lora_rank=2048). At decode time the cached state is the compressed latent, not the expanded K/V -- roughly a 10x KV-cache reduction versus a comparable GQA model. RoPE is applied only to a small decoupled subspace (qk_rope_head_dim=64) of each head.
DSA (DeepSeek Sparse Attention): a lightweight indexer scores every cached key against the current query and selects the top 2,048; every other position is masked to -inf. This makes the per-token attention cost independent of sequence length beyond ~2K, enabling the published 200K context.
Sigmoid-routed MoE: each token activates 8 of 256 routed experts plus 1 shared expert, selected by sigmoid scores corrected with e_score_correction_bias. Each routed expert is small (moe_intermediate_size=2048) -- only 1/6 of the dense FFN width -- so 8 routed experts deliver only ~1.3x the compute of one dense FFN, while the model has access to the full 256-expert pool. The shared expert provides always-on baseline capacity.
Dense prefix: the first 3 layers use a dense FFN (intermediate_size=12288) instead of MoE, inherited from the GLM-4 family. The remaining 75 layers are sparse. The rationale (originally from DeepSeek-V3): early layers do general feature extraction where routing decisions are unstable, so the gradient signal benefits from a fully dense path before routing kicks in.
Each layer applies pre-norm RMSNorm, then MLA+DSA self-attention, residual, post-norm RMSNorm, then dense-or-MoE MLP, residual. The structure is otherwise classical: there are no parallel residual streams, no per-layer embeddings, no sliding/global hybrid -- just one homogeneous block repeated 78 times. The novelty is entirely in what happens inside the attention and MLP subblocks.
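The active-compute claim above (8 small routed experts ≈ 1.3× one dense FFN) can be checked with config arithmetic. A back-of-envelope sketch counting SwiGLU's three matmuls, not a profiler measurement:

```python
# Config values quoted in the text above
hidden = 6144
dense_ffn = 12288       # dense FFN intermediate size
expert_ffn = 2048       # moe_intermediate_size
top_k = 8               # routed experts per token

# A SwiGLU FFN costs ~3 matmuls (gate, up, down), each hidden x intermediate
dense_flops = 3 * hidden * dense_ffn
routed_flops = top_k * 3 * hidden * expert_ffn
shared_flops = 3 * hidden * expert_ffn          # one always-on shared expert

print(routed_flops / dense_flops)                   # ~1.33, as claimed
print((routed_flops + shared_flops) / dense_flops)  # 1.5 including the shared expert
```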
GLM-5.1's predecessor inside transformers is glm4_moe_lite (the GLM-5 architecture, also MLA-based). Before that, GLM-4.5/4.6 used the simpler glm4_moe (standard GQA with QK-norm, inherited from Cohere/DeepSeek-V3). The deltas at each step:
- glm4_moe → glm4_moe_lite (GLM-5): switched from GQA to MLA (q_lora=768, kv_lora=512, head split into nope=192 + rope=64). Doubled context to ~200K. Switched routing to top-4 of 64 experts (lite) / top-8 of 256 (full). Adopted interleaved RoPE.
- glm4_moe_lite → glm_moe_dsa (GLM-5.1):
  - GlmMoeDsaIndexer: a small MLP (32 heads × 128 dim, separate from main attention) that produces a per-query top-2048 mask. The mask is added to the causal mask as a -inf/0 sparse pattern. Mainline attention is then computed only over those 2048 keys per query, even if the cache contains 200K. Borrowed directly from DeepSeek V3.2.
  - q_lora_rank grew from 768 → 2048. The query latent has to feed both the main attention and the new Indexer's wq_b, so it needs more capacity.
  - RoPE: glm4_moe_lite uses rope_interleave=True (DeepSeek V3 style: alternate even/odd pairs). GLM-5.1 explicitly removes that attribute (rope_interleave = AttributeError()) and uses split-half NeoX/Llama RoPE. The Indexer applies the same NeoX RoPE to its own decoupled q_pe/k_pe.
  - mlp_layer_types default went from ["dense"] + ["sparse"]*(L-1) to ["dense"]*3 + ["sparse"]*(L-3). The GLM-5.1 paper attributes this to improved router stability at trillion-param scale.
  - routed_scaling_factor raised from 1.8 → 2.5. The factor multiplies the post-normalization expert weights, inflating the contribution of routed experts relative to the residual stream. With deeper models and more experts, the per-expert weights shrink (one of 8 of 256 instead of one of 4 of 64), so the scaling factor compensates.
  - max_position_embeddings=202,752 (~200K).
  - indexer.weights_proj stays in plain bf16/fp32 because the reference implementation uses fp32 for it and the FP8 quantizer uses _keep_in_fp32_modules to preserve it.

| Parameter | GLM-5.1 (glm_moe_dsa) | GLM-5 (glm4_moe_lite defaults) | GLM-4.6 (glm4_moe defaults) | DeepSeek V3.2 |
|---|---|---|---|---|
| Total Params | ~754B | ~745B (flagship) | ~355B (flagship) | ~671B |
| Active Params | ~40B | ~32B | ~32B | ~37B |
| Context | 202,752 | 202,752 | 131,072 | 163,840 |
| Vocab | 154,880 | 154,880 | 151,552 | 129,280 |
| Hidden Size | 6,144 | 2,048 | 4,096 | 7,168 |
| Layers | 78 | 47 | 46 | 61 |
| Dense Layers | 3 | 1 | 1 | 3 |
| MoE Layers | 75 | 46 | 45 | 58 |
| Attention Type | MLA + DSA | MLA | GQA | MLA + DSA |
| Q Heads | 64 | 20 | 96 | 128 |
| KV Heads | 64 (MLA) | 20 (MLA) | 8 | 128 (MLA) |
| q_lora_rank | 2,048 | 768 | -- | 1,536 |
| kv_lora_rank | 512 | 512 | -- | 512 |
| qk_nope_head_dim | 192 | 192 | -- | 128 |
| qk_rope_head_dim | 64 | 64 | -- | 64 |
| v_head_dim | 256 | 256 | head_dim | 128 |
| qk_head_dim (total) | 256 | 256 | head_dim | 192 |
| DSA Indexer | yes | -- | -- | yes |
| index_topk | 2,048 | -- | -- | 2,048 |
| index_n_heads | 32 | -- | -- | 64 |
| index_head_dim | 128 | -- | -- | 128 |
| FFN Type | MoE + shared | MoE + shared | MoE + shared | MoE + shared |
| Dense FFN hidden | 12,288 | 10,240 | 10,944 | 18,432 |
| MoE expert hidden | 2,048 | 1,536 | 1,408 | 2,048 |
| Routed experts | 256 | 64 | 128 | 256 |
| Shared experts | 1 | 1 | 1 | 1 |
| Experts per token | 8 | 4 | 8 | 8 |
| Routed scaling factor | 2.5 | 1.8 | 1.0 | 2.5 |
| norm_topk_prob | True | True | True | True |
| e_score_correction_bias | yes | yes | yes | yes |
| Group routing | 1 group | 1 group | 1 group | 8 groups |
| Activation | SiLU | SiLU | SiLU | SiLU |
| RoPE style | NeoX/Llama (split-half) | interleaved | standard, partial=0.5 | interleaved |
| Norm | RMSNorm (eps=1e-5) | RMSNorm | RMSNorm | RMSNorm |
| QK Norm (in attn) | q_a / kv_a only | q_a / kv_a only | optional | q_a / kv_a only |
| Indexer k_norm | LayerNorm (eps=1e-6) | -- | -- | LayerNorm |
| Attention bias | False | False | False | False |
| Tie Embeddings | False | False | False | False |
| FP8 native | yes (bf16 escape: indexer.weights_proj) | -- | -- | yes |
Source: zai-org/GLM-5.1 model card and z.ai blog post. Comparison numbers are the best publicly available for each named generation; GLM-5.1 sets state-of-the-art on SWE-Bench Pro, CyberGym, and BrowseComp.
| Benchmark | GLM-5.1 | GLM-5 | GLM-4.6 | Claude 3.7 Sonnet (ref.) |
|---|---|---|---|---|
| SWE-Bench Pro (verified) | 58.4% (SOTA) | 53.2% | 40.1% | 54.7% |
| NL2Repo | 42.7 | 35.9 | -- | -- |
| Terminal-Bench 2.0 | 63.5 (66.5*) | 54.6 | -- | 62.3 |
| CyberGym | 68.7% (SOTA) | 52.1% | -- | -- |
| BrowseComp | 68.0% (SOTA) | 61.4% | -- | -- |
| AIME 2026 (no tools) | 95.3 | 91.0 | 78.4 | -- |
| GPQA Diamond | 86.2 | 83.4 | 75.7 | 84.8 |
| LiveCodeBench v6 | 75.4 | -- | -- | -- |
*Terminal-Bench 2.0 self-reported via Claude Code-style harness. The most striking improvement is the compounding of long-horizon agent loops: GLM-5.1's training emphasizes "hundreds of rounds" and "thousands of tool calls", which is reflected in SWE-Bench Pro and Terminal-Bench more than in single-shot reasoning tasks.
Computed from the public config (configuration_glm_moe_dsa.py): hidden=6144, q_lora=2048, kv_lora=512, qk_nope=192, qk_rope=64, v_dim=256, 64 heads, 256 routed experts (top-8), shared expert hidden=2048, dense FFN hidden=12288, indexer 32×128. RMSNorm/LayerNorm bias terms are negligible and excluded.
| Component | Params | Formula |
|---|---|---|
| q_a_proj | 12.58M | 6144 × 2048 |
| q_a_layernorm | 2.0K | 2048 (RMSNorm) |
| q_b_proj | 33.55M | 2048 × (64 × 256) |
| kv_a_proj_with_mqa | 3.54M | 6144 × (512 + 64) |
| kv_a_layernorm | 0.5K | 512 (RMSNorm) |
| kv_b_proj | 14.68M | 512 × (64 × (192+256)) |
| o_proj | 100.66M | (64 × 256) × 6144 |
| MLA subtotal | 165.0M | |
| indexer.wq_b | 8.39M | 2048 × (32 × 128) |
| indexer.wk | 0.79M | 6144 × 128 |
| indexer.k_norm | 0.3K | 128 (LayerNorm) |
| indexer.weights_proj | 0.20M | 6144 × 32 |
| Indexer subtotal | 9.4M | |
| Block total | ~174.4M | per layer × 78 = 13.6B |
| Component | Params | Formula |
|---|---|---|
| gate_proj | 75.50M | 6144 × 12288 |
| up_proj | 75.50M | 6144 × 12288 |
| down_proj | 75.50M | 12288 × 6144 |
| SwiGLU FFN subtotal | ~226.5M | 3 × 6144 × 12288 |
| input_layernorm | 6.1K | RMSNorm 6144 |
| post_attention_layernorm | 6.1K | RMSNorm 6144 |
| Dense layer total | ~400.9M | attn + dense FFN + 2 norms |
The dense FFN width is 6144 × 2 = 12288, narrower than the 4× convention. The first three layers ground the residual stream in general features before sparse routing kicks in.

| Component | Params | Formula |
|---|---|---|
| shared expert (1×) | 37.75M | 3 × 6144 × 2048 |
| routed experts (256×) | 9,663.7M | 256 × 3 × 6144 × 2048 |
| router.weight | 1.57M | 256 × 6144 |
| e_score_correction_bias | 0.3K | 256 (fp32 buffer) |
| FFN total capacity | ~9.70B | shared + 256 routed + router |
| FFN active (top-8 + shared) | ~341.3M | (8 + 1) × 37.75M + router |
| input_layernorm | 6.1K | RMSNorm 6144 |
| post_attention_layernorm | 6.1K | RMSNorm 6144 |
| MoE layer capacity | ~9.88B | attn + MoE + 2 norms |
| MoE layer active | ~515.7M | attn + (top-8+shared) |
| Component | Capacity | Active / token |
|---|---|---|
| embed_tokens | 951.6M | ~6.1K (1 row) |
| 3 dense layers | 1.20B | 1.20B |
| 75 MoE layers | 740.8B | 38.7B |
| final norm | 6.1K | 6.1K |
| lm_head (untied) | 951.6M | 951.6M |
| Total | ~743.9B* | ~40.8B |

*Config-derived sum; the headline figure is 754B.
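The totals in the tables can be reproduced from the config constants. A pure-Python sketch following the same formulas (norm weights omitted, so the sum lands slightly off the rounded table entries):

```python
hidden, vocab, n_layers = 6144, 154_880, 78

# MLA projections (per layer): q_a, q_b, kv_a(+mqa), kv_b, o
mla = (hidden * 2048 + 2048 * 64 * 256 + hidden * (512 + 64)
       + 512 * 64 * (192 + 256) + 64 * 256 * hidden)
# DSA indexer: wq_b, wk, weights_proj
indexer = 2048 * 32 * 128 + hidden * 128 + hidden * 32
attn = mla + indexer                                  # ~174.4M per layer

dense_ffn = 3 * hidden * 12288                        # SwiGLU: gate/up/down
moe_total = 257 * 3 * hidden * 2048 + 256 * hidden    # 256 routed + 1 shared + router
moe_active = 9 * 3 * hidden * 2048 + 256 * hidden     # top-8 + shared + router

embed = vocab * hidden                                # untied, counted twice below
total = 2 * embed + 3 * (attn + dense_ffn) + 75 * (attn + moe_total)
active = embed + hidden + 3 * (attn + dense_ffn) + 75 * (attn + moe_active)
#        ^ full lm_head + one embedding row

print(round(total / 1e9, 1), round(active / 1e9, 1))  # 743.9 40.8
```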
Estimates for the full 754B model. KV cache numbers reflect MLA's compressed cache: per token only the latent kv_lora_rank=512 + qk_rope_head_dim=64 = 576 elements per layer are actually needed for decode (the reference implementation can run from compressed). The current transformers implementation expands K/V before storing, so the figures below show both regimes.
| Precision | GLM-5.1 754B | GLM-5 ~745B | GLM-4.6 ~355B | DSV3.2 671B |
|---|---|---|---|---|
| BF16 | ~1,508 GB | ~1,490 GB | ~710 GB | ~1,342 GB |
| FP8 (native) | ~754 GB | ~745 GB | ~355 GB | ~671 GB |
| INT4 | ~377 GB | ~373 GB | ~178 GB | ~336 GB |
| Context | Expanded (HF default) | MLA compressed | Reduction |
|---|---|---|---|
| 4K | ~20.9 GB | ~0.37 GB | ~57× |
| 32K | ~167.5 GB | ~2.9 GB | ~57× |
| 128K | ~670 GB | ~11.8 GB | ~57× |
| 200K | ~1,036 GB | ~18.2 GB | ~57× |
Expanded: 78 layers × 64 heads × (256 + 256) elems × 2 bytes per token = ~5.11 MB/token. Compressed (latent only): 78 layers × (512 + 64) elems × 2 bytes = ~89.9 KB/token. The Indexer maintains its own ~20 KB/token side cache (78 layers × 128 dims × 2 bytes), negligible next to the main cache.
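The per-token figures follow directly from the formula above; a small script to reproduce them (bf16, 2 bytes/element):

```python
layers, heads, bytes_bf16 = 78, 64, 2

expanded = layers * heads * (256 + 256) * bytes_bf16   # K + V per head, bytes/token
compressed = layers * (512 + 64) * bytes_bf16          # kv latent + shared k_pe
indexer = layers * 128 * bytes_bf16                    # indexer side cache

print(expanded, compressed, indexer)        # 5111808 89856 19968 bytes/token
for ctx in (4_096, 32_768, 131_072, 202_752):
    print(ctx, round(expanded * ctx / 1e9, 1), round(compressed * ctx / 1e9, 2))
```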
| Scenario | Weights | +KV @128K | Total | Hardware |
|---|---|---|---|---|
| FP8, expanded KV | 754 GB | +670 GB | 1,424 GB | 18× H100 80GB or 11× H200 141GB |
| FP8, MLA compressed | 754 GB | +11.8 GB | 766 GB | 10× H100 80GB or 6× H200 141GB |
| INT4, MLA compressed | 377 GB | +5.7 GB | 383 GB | 5× H100 80GB or 3× H200 141GB |
| FP8 + DSA active KV (~2K) | 754 GB | +0.18 GB | 754 GB | 10× H100 80GB |
The DSA Indexer doesn't reduce KV-cache storage -- the cache still holds all 200K tokens. What it reduces is compute: the main attention only multiplies queries against ~2K selected keys per query position, regardless of context length. This makes DSA orthogonal to MLA: MLA compresses what is stored, DSA compresses what is read. Both are needed for practical 200K-context decode at 754B scale.
q_a_proj compresses hidden → 2048, then q_b_proj expands to 64 heads × 256 dim. kv_a_proj_with_mqa packs the kv latent (512) and the rope-only k stream (64) in a single matmul; kv_b_proj expands the latent into both K-nope (192) and V (256) per head. The cached state can be just the latents (~10× smaller than expanded GQA), though the HF transformers code currently stores expanded K/V for backend compatibility.
The Indexer's query path reuses MLA's q_resid latent through its own projection wq_b. Its key projection wk reads directly from hidden_states with its own LayerNorm (eps=1e-6, distinct from RMSNorm). Per-head weights from weights_proj (kept in fp32 even when the rest of the model is FP8) fuse the head scores via a sum.
DynamicCache is sized to exactly num_hidden_layers attention layers, leaving no room for the indexer's keys. So the indexer stores its own _cached_keys tensor as a plain attribute, concatenating along sequence dimension on each decode step. On prefill (when seq_len > 1) the cache is reset to avoid stale data. This is invisible to user code but a critical correctness detail.
The sparse mask is built by allocating a full -inf matrix of shape [B, S, T], then scatter_-ing zeros into the top-2048 positions selected by the indexer. The result is added to the regular causal mask. The main attention then runs a normal SDPA forward, and any non-top-k tokens contribute zero through the softmax. The indices=topk_indices kwarg is also forwarded to specialized flash-mla kernels (kernels-community/flash-mla) which can skip the masked positions entirely.
Each indexer score is relu(softmax_scale · q·k) · head_weight, summed over its 32 heads. The ReLU is the key non-linearity -- it lets negative dot products contribute exactly zero rather than dragging the score down, and matches what the FP8 reference kernel does. weights_proj produces per-token, per-head weights from hidden_states, scaled by n_heads^(-0.5).
GLM-5.1 drops interleaved RoPE (the attribute is deleted via rope_interleave = AttributeError() in the modular config) and adopts split-half RoPE: the head_dim is split in two halves and the second half is rotated against the first. This matches Llama, GPT-NeoX, and most other transformers models, simplifying interop.
Expert selection is biased by a learnable per-expert term e_score_correction_bias (kept in fp32, initialized to zero, updated heuristically during training). The bias only influences which experts are selected, not the post-selection weights. This eliminates auxiliary load-balancing losses and lets the model train without dropping any tokens.
The normalized top-8 weights are multiplied by routed_scaling_factor=2.5. The shared expert output is added without the routed scaling. Net effect: the residual stream sees roughly 8× expert capacity for routed work plus a constant baseline from the shared expert.
Group routing (n_group=1)
The router still runs the DeepSeek-V3 group-routing logic -- partition the 256 experts into n_group groups, select top-2 experts per group to compute group scores, pick the top topk_group groups, then top-k experts within selected groups. With n_group=1 and topk_group=1 (the GLM-5.1 default) this collapses to plain top-8 over all 256 experts. The machinery is preserved for forward compatibility.
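To see the collapse concretely, here is a pure-Python model of the group-routing procedure (the `route` function is an illustrative sketch, not the transformers code):

```python
import random

def route(scores, n_group, topk_group, top_k):
    """Illustrative DeepSeek-V3 group routing over a flat list of expert scores."""
    n = len(scores)
    size = n // n_group
    groups = [scores[g * size:(g + 1) * size] for g in range(n_group)]
    # group score = sum of the top-2 corrected scores inside the group
    group_scores = [sum(sorted(g, reverse=True)[:2]) for g in groups]
    kept = sorted(range(n_group), key=lambda i: group_scores[i], reverse=True)[:topk_group]
    allowed = [i for g in kept for i in range(g * size, (g + 1) * size)]
    return sorted(allowed, key=lambda i: scores[i], reverse=True)[:top_k]

random.seed(0)
scores = [random.random() for _ in range(256)]
grouped = route(scores, n_group=1, topk_group=1, top_k=8)   # GLM-5.1 defaults
plain = sorted(range(256), key=lambda i: scores[i], reverse=True)[:8]
print(grouped == plain)   # True -- n_group=1 collapses to plain top-8
```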
Layers 0-2 use a dense FFN with intermediate_size=12288; layers 3-77 are MoE. The dense prefix gives the early residual stream a stable, smooth feature space before sparse routing introduces gradient noise. mlp_layer_types = ["dense"]*3 + ["sparse"]*75 -- up from GLM-5's single dense layer.
Experts live in two stacked 3-D parameters, gate_up_proj[num_experts, 2*intermediate, hidden] and down_proj[num_experts, hidden, intermediate]. The forward dispatches each token to its top-k experts via a one-hot mask + Python loop over hit experts. This is slow but correct; production deployment uses fused kernels via the @use_experts_implementation decorator.
The one FP8 escape hatch is indexer.weights_proj, listed in _keep_in_fp32_modules. The reference implementation uses fp32 for it because the weights gate per-head contributions to a sparse top-k selection -- small precision errors there cascade into incorrect token selections. The model also keeps e_score_correction_bias in fp32 via _keep_in_fp32_modules_strict.
tie_word_embeddings=False. The 951M-parameter input embedding and 951M-parameter lm_head are independent matrices. With a 154,880 vocab and 6,144 hidden, each is 6.144 × 154.88 ≈ 951M parameters -- ~1.9B total just for token-IO.
max_position_embeddings=202752, no YARN, no NTK, no length extrapolation tricks. The context is supported natively because (a) MLA's compressed cache makes long sequences memory-tractable, and (b) DSA's top-2048 selection makes them compute-tractable. Without DSA the attention compute would be quadratic in sequence length even with MLA.
modular_glm_moe_dsa.py
The implementation lives in a "modular" file -- a transformers convention where the new model imports symbols from related models and only overrides what differs. GlmMoeDsaConfig inherits from Glm4MoeLiteConfig; the decoder layer inherits from Glm4MoeLiteDecoderLayer; the model body inherits from Glm4MoeModel. Only the attention class, the indexer, and the config additions are new code. The full modeling_glm_moe_dsa.py is auto-generated from this modular file by CI.
MLA is a 2024 invention by DeepSeek that compresses keys and values into a low-rank latent before per-head expansion. GLM-5.1 inherits the design directly from glm4_moe_lite and through it from DeepSeek V3. Unlike GQA -- which reduces the number of KV heads -- MLA reduces the rank of the K/V projections, then uses a small upcasting matrix (kv_b_proj) to recover full per-head representations on demand. This decouples cache size from head count and lets GLM-5.1 keep all 64 heads "full" while caching only ~580 elements per token per layer.
The query path is x → q_a_proj → q_a_layernorm → q_b_proj. The q_resid intermediate (after the layer-norm) is reused by the DSA indexer below, so MLA's query LoRA is shared between main attention and the sparsity selector. Output is reshaped to [B, H, S, qk_head_dim] = [B, 64, S, 256], then split into nope (192) + rope (64).
kv_a_proj_with_mqa produces [B, S, 576] -- the concatenation of the 512-dim KV latent and the 64-dim shared K-rope stream. The latent is normed by kv_a_layernorm then expanded by kv_b_proj into [B, S, 64 × (192+256)], which is then split into K-nope and V. The K-rope stream stays as a single shared head across all 64 query heads, broadcast at the dot-product step.
Specialized kernels (kernels-community/flash-mla) consume the latent directly.
| Aspect | GLM-4.6 GQA (96/8 heads, d=128) | GLM-5.1 MLA (64 heads, d=256) |
|---|---|---|
| Q params | ~50.3M (4096×96×128) | ~46.1M (q_a + q_b LoRA path) |
| K params | ~4.2M (4096×8×128) | ~3.5M + 14.7M (kv_a + kv_b shared with V) |
| V params | ~4.2M | (shared with K via kv_b_proj) |
| O params | ~50.3M | ~100.7M |
| Cache (per token) | 2,048 elems | 576 elems (latent) / 32,768 (expanded) |
| Compute (decode) | 96 × (1+T)×128 dot products | 64 × (1+T)×256 dot products |
The DSA Indexer is the single biggest novelty separating GLM-5.1 from GLM-5. Borrowed directly from DeepSeek V3.2, it's a small parallel network -- only ~9.4M params per layer -- that scores every past token against the current query and selects the top 2,048 to actually attend to. The main attention then runs on a sparse mask: queries see only the indexer-selected keys, everything else is -inf. This makes per-token attention compute roughly constant beyond ~2K context.
The indexer takes MLA's q_resid as input, avoiding a redundant linear, and reads keys from raw hidden_states through its own wk.
The per-head scores fuse as

score[s,t] = Σ_h weights[s,h] · ReLU(softmax_scale · q[s,h]·k[t])

where weights[s,h] comes from a separate weights_proj linear (kept in fp32). The ReLU is critical -- it prevents negative dot products from polluting positive ones during the per-head sum, and matches the FP8 reference kernel's behavior.
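A toy pure-Python version of this scoring rule (the `index_score` helper is illustrative; the real shapes are [B,S,32,128] queries against [B,T,128] shared keys):

```python
def index_score(q, k, weights, softmax_scale):
    """score[t] = sum_h weights[h] * relu(scale * q[h].k[t]) -- one query position."""
    out = []
    for kt in k:                       # one shared key vector per cached token
        s = 0.0
        for h, qh in enumerate(q):     # per-head query vectors
            dot = softmax_scale * sum(a * b for a, b in zip(qh, kt))
            s += weights[h] * max(dot, 0.0)   # ReLU: negative dots add exactly 0
        out.append(s)
    return out

# two heads, two cached tokens, 2-dim toy vectors
scores = index_score(q=[[1.0, 0.0], [0.0, 1.0]],
                     k=[[1.0, -1.0], [-1.0, -1.0]],
                     weights=[0.5, 0.5], softmax_scale=1.0)
print(scores)   # [0.5, 0.0] -- the all-negative key scores zero, not negative
```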
Given topk_indices of shape [B, S, 2048], the main attention mask is built by:
index_mask = full(-inf, [B, S, T])
index_mask.scatter_(-1, topk_indices, 0.0)
combined = index_mask + causal_mask

DynamicCache is sized to exactly num_hidden_layers attention layers, with no slot for the indexer. So the indexer maintains its own _cached_keys tensor as a regular Python attribute. On prefill (seq_len > 1) it resets the cache; on decode it concatenates new keys along the sequence dimension. The cached state is only the indexer's small 128-dim post-norm keys -- ~256 bytes per token per layer in bf16, ~4 GB across all 78 layers at 200K context -- small next to the main KV cache.
def forward(hidden_states, q_resid, position_embeddings, mask, use_cache):  # GlmMoeDsaIndexer.forward
cos, sin = position_embeddings
# === Queries (reuse MLA's q_resid latent) ===
q = wq_b(q_resid).view(B, S, 32, 128)
q_pe, q_nope = split(q, [64, 64], dim=-1)
q_pe = apply_rotary_pos_emb(q_pe, cos, sin, unsqueeze_dim=2)
q = cat([q_pe, q_nope], dim=-1) # [B, S, 32, 128]
# === Keys (own projection from raw hidden_states) ===
k = k_norm(wk(hidden_states)) # LayerNorm, eps=1e-6
k_pe, k_nope = split(k, [64, 64], dim=-1)
k_pe = apply_rotary_pos_emb(k_pe.unsqueeze(2), cos, sin, dim=2).squeeze(2)
k = cat([k_pe, k_nope], dim=-1) # [B, S, 128]
# === Indexer's own KV cache (NOT in DynamicCache) ===
if seq_len > 1:
self._cached_keys = None # reset on prefill
if use_cache:
k_cached = cat([self._cached_keys, k], dim=1) if self._cached_keys is not None else k
self._cached_keys = k_cached
else:
k_cached = k
# === Score (FP32 in critical path) ===
weights = weights_proj(hidden_states).float() * (32**-0.5) # [B, S, 32]
scores = einsum("bshd,btd->bsht", q.float(), k_cached.float()) * (128**-0.5)
scores = F.relu(scores)
index_scores = einsum("bsht,bsh->bst", scores, weights) # [B, S, T]
if mask is not None:
index_scores = index_scores + mask # apply causal
return index_scores.topk(min(2048, T), dim=-1).indices # [B, S, 2048]
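Downstream, the returned indices become the 0/-inf mask. A list-based sketch of the scatter step (`dsa_mask` is a hypothetical helper, equivalent in spirit to the scatter_ call described earlier):

```python
NEG_INF = float("-inf")

def dsa_mask(topk_indices, total_len):
    """List-based equivalent of full(-inf).scatter_(topk, 0): 0 at kept keys."""
    mask = [[NEG_INF] * total_len for _ in topk_indices]
    for s, kept in enumerate(topk_indices):
        for t in kept:
            mask[s][t] = 0.0
    return mask

mask = dsa_mask([[0, 2], [1, 3]], total_len=4)
print(mask[0])   # [0.0, -inf, 0.0, -inf]
```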
GLM-5.1 inherits its routing mechanism from DeepSeek V3 via the GLM-4 lineage. There are two distinguishing choices: (a) the per-expert score is a sigmoid probability, not a softmax, and (b) selection (but not weighting) is biased by a learnable per-expert bias e_score_correction_bias. The bias is updated by an external balancing rule during training -- not by gradient descent -- so the model trains without any auxiliary load-balancing loss. This is the "auxiliary-loss-free" balancing of DeepSeek V3.
The router computes router_logits = hidden_states @ router.weight.T in fp32, then applies sigmoid(). Each expert independently produces a score in [0,1], so multiple experts can be "highly relevant" without competing for normalization mass like softmax would force.
e_score_correction_bias (initialized to zero, updated heuristically during training) is added to the sigmoid scores -- but only for the purpose of choosing which experts to fire. Underused experts get bumped up; overused experts get pushed down. The selection bias prevents expert collapse without any gradient on the bias.
The corrected scores are partitioned into n_group groups; each group's score is the sum of its top-2 corrected scores; the top topk_group groups are kept; everything outside those groups is masked out; finally topk(k=8) picks the active experts. For GLM-5.1 n_group=topk_group=1, so this collapses to plain top-8 over all 256 experts. The machinery is preserved for compatibility.
Final routing weights are gathered from the raw sigmoid scores (router_logits.sigmoid()), not the corrected score. So the bias only changes which experts fire, never how much they contribute. After top-k, the weights are normalized to sum to 1 then multiplied by routed_scaling_factor=2.5.
The layer output is routed_output + shared_output -- the shared expert is added without the routing scale, so it behaves like a constant bias path that the routed pool augments.
def forward(hidden_states):
    residuals = hidden_states
    flat_x = hidden_states.view(-1, hidden_size)          # flatten [B,S,d] -> [B*S,d]
    router_logits = self.gate(flat_x)                     # [B*S, 256], fp32
    topk_idx, topk_w = self.route_tokens_to_experts(router_logits)
    routed = self.experts(flat_x, topk_idx, topk_w).view_as(hidden_states)
    shared = self.shared_experts(residuals)               # always-on
    return routed + shared
def route_tokens_to_experts(router_logits):
# 1) Sigmoid (NOT softmax)
router_logits = router_logits.sigmoid() # [N, 256]
# 2) Bias-corrected scores for SELECTION ONLY
scores_for_choice = router_logits + self.gate.e_score_correction_bias # [N, 256]
# 3) Group routing (degenerate at n_group=1)
group_scores = (
scores_for_choice.view(N, 1, 256).topk(2, dim=-1)[0].sum(dim=-1)
) # [N, 1]
group_idx = group_scores.topk(k=1, dim=-1, sorted=False)[1]
group_mask = zeros_like(group_scores)
group_mask.scatter_(1, group_idx, 1) # all groups kept
score_mask = group_mask.unsqueeze(-1).expand(-1, 1, 256).reshape(-1, 256)
masked_scores = scores_for_choice.masked_fill(~score_mask.bool(), 0.0)
# 4) Top-k (k=8)
topk_indices = masked_scores.topk(k=8, dim=-1, sorted=False)[1] # [N, 8]
# 5) Re-gather UNCORRECTED sigmoid weights (bias is for selection only!)
topk_weights = router_logits.gather(1, topk_indices) # [N, 8]
# 6) Normalize then scale
topk_weights = topk_weights / (topk_weights.sum(dim=-1, keepdim=True) + 1e-20)
topk_weights = topk_weights * 2.5 # routed_scaling_factor
return topk_indices, topk_weights
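A worked toy example of the selection-vs-weighting split (pure Python; the `glm_route` helper is illustrative, with top-2 of 4 experts instead of top-8 of 256):

```python
def glm_route(sig_scores, bias, top_k=2, scaling=2.5):
    """Selection uses bias-corrected scores; weights come from the raw sigmoids."""
    corrected = [s + b for s, b in zip(sig_scores, bias)]
    idx = sorted(range(len(sig_scores)), key=lambda i: corrected[i], reverse=True)[:top_k]
    raw = [sig_scores[i] for i in idx]              # NOT the corrected scores
    total = sum(raw) + 1e-20
    return idx, [r / total * scaling for r in raw]

scores = [0.9, 0.8, 0.7, 0.1]                       # post-sigmoid expert scores
idx0, w0 = glm_route(scores, bias=[0.0] * 4)
print(idx0)   # [0, 1] -- plain top-2 with zero bias
idx1, w1 = glm_route(scores, bias=[-0.5, 0.0, 0.0, 0.0])
print(idx1)   # [1, 2] -- bias pushes the overused expert 0 out of selection
print(w1)     # weights still from raw 0.8/0.7, normalized then scaled by 2.5
```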
Verified against transformers/models/glm_moe_dsa. Code quotes are exact, with comments preserved. Files: configuration_glm_moe_dsa.py, modular_glm_moe_dsa.py, modeling_glm_moe_dsa.py.
q_a_layernorm / kv_a_layernorm use GlmMoeDsaRMSNorm (RMS, with learned weight). The DSA Indexer's k_norm is a standard nn.LayerNorm with eps=1e-6 -- a deliberate departure to match the DeepSeek V3.2 reference.

class GlmMoeDsaRMSNorm(nn.Module):
def __init__(self, hidden_size, eps: float = 1e-6) -> None:
super().__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
self.variance_epsilon = eps
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
input_dtype = hidden_states.dtype
hidden_states = hidden_states.to(torch.float32)
variance = hidden_states.pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
return self.weight * hidden_states.to(input_dtype)
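For intuition, the same computation on a plain Python list -- note that RMSNorm divides by the root-mean-square without subtracting the mean (a pure-Python sketch, not the module above):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Pure-Python RMSNorm for one vector: scale to unit RMS, no mean-centering."""
    variance = sum(v * v for v in x) / len(x)       # mean of squares
    inv = 1.0 / math.sqrt(variance + eps)
    return [w * v * inv for w, v in zip(weight, x)]

out = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
rms = math.sqrt(sum(v * v for v in out) / len(out))
print(round(rms, 6))   # 1.0 (up to eps) -- the output has unit root-mean-square
```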
Rotation reuses llama.modeling_llama.rotate_half, which splits the head_dim in two halves and rotates the second half against the first via cos/sin. The function takes a single tensor (not a pair) so it can be applied to q_pe and k_pe independently. The unsqueeze_dim argument is 1 for BHSD layout (main attention) and 2 for BSHD layout (indexer).

def apply_rotary_pos_emb(x, cos, sin, unsqueeze_dim: int = 1) -> torch.Tensor:
"""
This is the transformers equivalent of DeepSeek V3.2's `apply_rotary_emb(x, freqs_cis, interleaved)`.
Instead of using complex-number `freqs_cis`, we use pre-split `(cos, sin)` tensors from RotaryEmbedding.
"""
cos = cos.unsqueeze(unsqueeze_dim)
sin = sin.unsqueeze(unsqueeze_dim)
# Split-half (NeoX/Llama style): (x[:d/2], x[d/2:])
return (x * cos) + (rotate_half(x) * sin)
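A self-contained pure-Python rendering of split-half RoPE (illustrative; the real code vectorizes over [B,H,S,64] and takes cos/sin from the rotary embedding module):

```python
import math

def rotate_half(x):
    """Split-half: [x1 | x2] -> [-x2 | x1]."""
    h = len(x) // 2
    return [-v for v in x[h:]] + list(x[:h])

def apply_rope(x, pos, theta=10000.0):
    """Split-half (NeoX/Llama) RoPE on a single head vector of even dim."""
    d = len(x)
    inv_freq = [theta ** (-2.0 * i / d) for i in range(d // 2)]
    cos = [math.cos(pos * f) for f in inv_freq] * 2   # same angle for both halves
    sin = [math.sin(pos * f) for f in inv_freq] * 2
    r = rotate_half(x)
    return [x[i] * cos[i] + r[i] * sin[i] for i in range(d)]

print(apply_rope([1.0, 2.0, 3.0, 4.0], pos=0))   # identity at position 0
out = apply_rope([1.0, 2.0, 3.0, 4.0], pos=7)
print(round(sum(v * v for v in out), 6))         # 30.0 -- rotations preserve norm
```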
The config defines attribute_map = {"head_dim": "qk_rope_head_dim"}, so when the rotary embedding asks config.head_dim it actually receives qk_rope_head_dim=64. Combined with the optional partial_rotary_factor from rope_parameters, this lets the RoPE generator emit cos/sin tensors for exactly the rotary subspace, not the full 256-dim head.

class GlmMoeDsaRotaryEmbedding(nn.Module):
@staticmethod
def compute_default_rope_parameters(config, device=None, seq_len=None):
base = config.rope_parameters["rope_theta"]
partial_rotary_factor = config.rope_parameters.get("partial_rotary_factor", 1.0)
head_dim = getattr(config, "head_dim", None) or config.hidden_size // config.num_attention_heads
dim = int(head_dim * partial_rotary_factor)
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.int64).to(...) / dim))
return inv_freq, 1.0
Queries go through the q_a/q_b LoRA path; KV is split into a 512-dim latent + 64-dim shared k_pe stream; the latent is normed and expanded by kv_b_proj; the k_pe stream is RoPE'd separately and broadcast across all 64 heads. The Indexer is invoked on q_resid (the q_a output) and the raw hidden_states; its output (top-k indices) becomes a sparse mask combined with the causal mask.

# ===== Query path =====
q_resid = self.q_a_layernorm(self.q_a_proj(hidden_states)) # [B,S,2048]
query_states = self.q_b_proj(q_resid)
query_states = query_states.view(B, S, -1, self.qk_head_dim).transpose(1,2)
q_nope, q_pe = torch.split(query_states,
[self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1)
q_pe = apply_rotary_pos_emb(q_pe, cos, sin, unsqueeze_dim=1)
# ===== KV path =====
compressed_kv = self.kv_a_proj_with_mqa(hidden_states) # [B,S,576]
k_compressed, k_pe = torch.split(compressed_kv,
[self.kv_lora_rank, self.qk_rope_head_dim], dim=-1)
k_compressed = self.kv_a_layernorm(k_compressed) # [B,S,512]
kv_expanded = self.kv_b_proj(k_compressed) # [B,S,64*448]
kv_expanded = kv_expanded.view(B, S, -1, self.qk_nope_head_dim + self.v_head_dim)
k_nope, value_states = torch.split(kv_expanded,
[self.qk_nope_head_dim, self.v_head_dim], dim=-1)
k_nope = k_nope.transpose(1,2)
value_states = value_states.transpose(1,2)
# RoPE on the single shared k_pe stream, then broadcast across heads
k_pe = k_pe.view(B, 1, S, self.qk_rope_head_dim)
k_pe = apply_rotary_pos_emb(k_pe, cos, sin, unsqueeze_dim=1)
k_pe = k_pe.expand(-1, k_nope.shape[1], -1, -1) # [B, 64, S, 64]
query_states = torch.cat([q_nope, q_pe], dim=-1) # [B,64,S,256]
key_states = torch.cat([k_nope, k_pe], dim=-1) # [B,64,S,256]
if past_key_values is not None:
key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx)
# ===== Indexer (DSA sparse mask) =====
indexer_mask = ... # broadcast attention_mask to [B,S,T]
topk_indices = self.indexer(
hidden_states, q_resid, position_embeddings, indexer_mask,
use_cache=past_key_values is not None,
) # [B, S, 2048]
# Build combined DSA + causal mask: -inf except top-k
index_mask = torch.full((B, S, T), float("-inf"), ...)
index_mask.scatter_(-1, topk_indices, 0.0)
index_mask = index_mask.unsqueeze(1) # [B,1,S,T]
combined_mask = index_mask + causal_mask
attn_output, attn_weights = attention_interface(
self, query_states, key_states, value_states, combined_mask,
scaling=self.scaling, indices=topk_indices, **kwargs)
attn_output = self.o_proj(attn_output.reshape(B, S, -1))
Given [B, H, S, 256] Q and K and [B, H, S, 256] V, the attention computation itself is plain Llama-style SDPA: scaled dot product, mask add, softmax in fp32, dropout, value matmul. repeat_kv is a no-op for GLM-5.1 because num_attention_heads == num_key_value_heads == 64.

def eager_attention_forward(module, query, key, value, attention_mask, scaling, dropout=0.0, **kwargs):
key_states = repeat_kv(key, module.num_key_value_groups) # no-op (groups=1)
value_states = repeat_kv(value, module.num_key_value_groups)
attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
if attention_mask is not None:
attn_weights = attn_weights + attention_mask # ← combined DSA+causal here
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
attn_output = torch.matmul(attn_weights, value_states)
return attn_output.transpose(1, 2).contiguous(), attn_weights
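The reason masked positions vanish exactly: adding -inf before the softmax drives their exponentials to zero. A minimal pure-Python check:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 3.0, 0.5]
dsa_mask = [0.0, float("-inf"), 0.0, float("-inf")]   # indexer kept keys 0 and 2
probs = softmax([l + mk for l, mk in zip(logits, dsa_mask)])
print(probs[1], probs[3])   # 0.0 0.0 -- masked keys contribute exactly nothing
```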
GlmMoeDsaMLP is instantiated twice: (a) as the dense FFN of the first three layers (intermediate_size=12288); (b) as the shared expert inside GlmMoeDsaMoE with intermediate_size = moe_intermediate_size * n_shared_experts = 2048. No biases.

class GlmMoeDsaMLP(nn.Module):
def __init__(self, config, intermediate_size=None):
super().__init__()
self.hidden_size = config.hidden_size
self.intermediate_size = config.intermediate_size if intermediate_size is None else intermediate_size
self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
self.act_fn = ACT2FN[config.hidden_act] # silu
def forward(self, x):
return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
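The SwiGLU forward on plain lists, for reference (illustrative `matvec` helper; the real projections are 6144-wide nn.Linear layers):

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))        # x * sigmoid(x)

def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def mlp(x, gate_w, up_w, down_w):
    """List-based SwiGLU forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(v) for v in matvec(gate_w, x)]
    up = matvec(up_w, x)
    return matvec(down_w, [g * u for g, u in zip(gate, up)])

eye = [[1.0, 0.0], [0.0, 1.0]]             # identity weights make the math visible
out = mlp([1.0, 2.0], eye, eye, eye)
print(out)   # [silu(1)*1, silu(2)*2]
```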
The router is a single bias-free linear from hidden_size to n_routed_experts, computed in fp32 even when the rest of the model is in bf16/FP8. The e_score_correction_bias buffer (also fp32, listed in _keep_in_fp32_modules_strict) is the auxiliary-loss-free balancing knob from DeepSeek V3.

class GlmMoeDsaTopkRouter(nn.Module):
def __init__(self, config: GlmMoeDsaConfig):
super().__init__()
self.config = config
self.top_k = config.num_experts_per_tok # 8
self.n_routed_experts = config.n_routed_experts # 256
self.routed_scaling_factor = config.routed_scaling_factor # 2.5
self.n_group = config.n_group # 1
self.topk_group = config.topk_group # 1
self.norm_topk_prob = config.norm_topk_prob # True
self.weight = nn.Parameter(torch.empty((self.n_routed_experts, config.hidden_size)))
self.register_buffer("e_score_correction_bias",
torch.zeros((self.n_routed_experts), dtype=torch.float32))
def forward(self, hidden_states):
hidden_states = hidden_states.view(-1, self.config.hidden_size)
router_logits = F.linear(hidden_states.type(torch.float32),
self.weight.type(torch.float32))
return router_logits
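Note that the router module only produces raw logits; the actual expert selection happens in route_tokens_to_experts (called from GlmMoeDsaMoE below, not shown here). The following pure-Python helper is an illustrative sketch of the DeepSeek-V3-style bias-corrected selection the config values imply, not the module's actual code: the correction bias influences only *which* experts are picked, while the output weights come from the uncorrected sigmoid scores.

```python
import math

def route_tokens_to_experts_sketch(logits, bias, top_k=8, scaling=2.5):
    """Hypothetical sketch of auxiliary-loss-free top-k routing for ONE token.

    logits: per-expert router logits (list of floats)
    bias:   e_score_correction_bias -- added only for SELECTION,
            never used in the output weights.
    """
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]   # sigmoid gates
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i], reverse=True)
    picked = ranked[:top_k]
    denom = sum(scores[i] for i in picked)                  # norm_topk_prob=True
    weights = {i: scaling * scores[i] / denom for i in picked}
    return picked, weights

# Toy example: 4 experts, top-2. The bias on expert 2 flips the selection.
picked, weights = route_tokens_to_experts_sketch(
    [2.0, 1.0, 0.5, -1.0], bias=[0.0, 0.0, 0.6, 0.0], top_k=2, scaling=1.0)
```

Without the bias, experts 0 and 1 would win; the +0.6 correction promotes the underused expert 2 instead, which is exactly how the balancer steers load without an auxiliary loss term.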
gate_up_proj[256, 4096, 6144] (gate and up packed) and down_proj[256, 6144, 2048]. The forward dispatches each token to its top-k experts via a one-hot mask plus a Python loop over the experts that were hit -- correct but slow. Production deployments swap this path out through the @use_experts_implementation hook for fused MoE kernels (e.g. SGLang, vLLM).

@use_experts_implementation
class GlmMoeDsaNaiveMoe(nn.Module):
def __init__(self, config):
super().__init__()
self.num_experts = config.num_local_experts # 256
self.hidden_dim = config.hidden_size # 6144
self.intermediate_dim = config.moe_intermediate_size # 2048
self.gate_up_proj = nn.Parameter(torch.empty(self.num_experts,
2 * self.intermediate_dim, self.hidden_dim))
self.down_proj = nn.Parameter(torch.empty(self.num_experts,
self.hidden_dim, self.intermediate_dim))
self.act_fn = ACT2FN[config.hidden_act]
def forward(self, hidden_states, top_k_index, top_k_weights):
final_hidden_states = torch.zeros_like(hidden_states)
with torch.no_grad():
expert_mask = F.one_hot(top_k_index, num_classes=self.num_experts).permute(2,1,0)
expert_hit = torch.greater(expert_mask.sum(dim=(-1,-2)), 0).nonzero()
for expert_idx in expert_hit:
expert_idx = expert_idx[0]
top_k_pos, token_idx = torch.where(expert_mask[expert_idx])
current_state = hidden_states[token_idx]
gate, up = F.linear(current_state, self.gate_up_proj[expert_idx]).chunk(2, dim=-1)
current_hidden_states = self.act_fn(gate) * up
current_hidden_states = F.linear(current_hidden_states, self.down_proj[expert_idx])
current_hidden_states = current_hidden_states * top_k_weights[token_idx, top_k_pos, None]
final_hidden_states.index_add_(0, token_idx, current_hidden_states.to(...))
return final_hidden_states
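The dispatch semantics above can be checked with a toy pure-Python model (scalar "experts", hypothetical helper names): looping over the hit experts and scatter-adding weighted outputs, as GlmMoeDsaNaiveMoe does, is equivalent to the more obvious per-token loop.

```python
# Toy sketch: expert-major dispatch (what the naive MoE does) vs. token-major.
def moe_by_expert(tokens, top_k_index, top_k_weights, experts):
    out = [0.0] * len(tokens)
    for e in range(len(experts)):                  # loop over hit experts
        for t, choices in enumerate(top_k_index):
            if e in choices:                       # tokens routed to expert e
                pos = choices.index(e)
                out[t] += top_k_weights[t][pos] * experts[e](tokens[t])
    return out

def moe_by_token(tokens, top_k_index, top_k_weights, experts):
    return [sum(w * experts[e](x) for e, w in zip(idx, ws))
            for x, idx, ws in zip(tokens, top_k_index, top_k_weights)]

experts = [lambda x, k=k: (k + 1) * x for k in range(4)]  # expert k scales by k+1
tokens = [1.0, 2.0]
idx = [[0, 2], [1, 3]]          # top-2 routing per token
w = [[0.6, 0.4], [0.5, 0.5]]    # normalized routing weights
```

The expert-major order is what makes the real implementation batch all tokens hitting the same expert into one matmul, at the cost of the Python loop.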
residuals = hidden_states here is the input after post_attention_layernorm).

class GlmMoeDsaMoE(nn.Module):
def __init__(self, config):
super().__init__()
self.experts = GlmMoeDsaNaiveMoe(config)
self.gate = GlmMoeDsaTopkRouter(config)
self.shared_experts = GlmMoeDsaMLP(
config=config,
intermediate_size=config.moe_intermediate_size * config.n_shared_experts, # 2048*1
)
def forward(self, hidden_states):
residuals = hidden_states
orig_shape = hidden_states.shape
router_logits = self.gate(hidden_states)
topk_indices, topk_weights = self.route_tokens_to_experts(router_logits)
hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
hidden_states = self.experts(hidden_states, topk_indices, topk_weights).view(*orig_shape)
hidden_states = hidden_states + self.shared_experts(residuals)
return hidden_states
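The compute arithmetic behind the overview's "~1.3x" claim falls out of the widths quoted above: FFN matmul cost scales linearly with intermediate width, and each routed expert is 2048-wide versus the 12288-wide dense FFN.

```python
# Per-token FFN compute of the MoE block relative to one dense FFN
# (gate/up/down matmul FLOPs all scale with intermediate width).
dense_width = 12288           # intermediate_size of the 3 dense layers
expert_width = 2048           # moe_intermediate_size

routed = 8 * expert_width / dense_width         # 8 routed experts -> ~1.33x
total = (8 + 1) * expert_width / dense_width    # + 1 shared expert -> 1.5x
print(f"routed only: {routed:.2f}x dense; with shared expert: {total:.2f}x")
```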
config.mlp_layer_types[layer_idx]. Inherits from Glm4MoeLiteDecoderLayer; the only thing GLM-5.1 changes is which attention class is instantiated (GlmMoeDsaAttention with the embedded indexer).

class GlmMoeDsaDecoderLayer(GradientCheckpointingLayer):
def __init__(self, config, layer_idx):
super().__init__()
self.hidden_size = config.hidden_size
self.self_attn = GlmMoeDsaAttention(config, layer_idx)
if config.mlp_layer_types[layer_idx] == "sparse":
self.mlp = GlmMoeDsaMoE(config)
else:
self.mlp = GlmMoeDsaMLP(config)
self.input_layernorm = GlmMoeDsaRMSNorm(config.hidden_size, config.rms_norm_eps)
self.post_attention_layernorm = GlmMoeDsaRMSNorm(config.hidden_size, config.rms_norm_eps)
def forward(self, hidden_states, attention_mask=None, position_ids=None,
past_key_values=None, use_cache=False, position_embeddings=None, **kwargs):
residual = hidden_states
hidden_states = self.input_layernorm(hidden_states)
hidden_states, _ = self.self_attn(
hidden_states=hidden_states, attention_mask=attention_mask,
position_ids=position_ids, past_key_values=past_key_values,
use_cache=use_cache, position_embeddings=position_embeddings, **kwargs)
hidden_states = residual + hidden_states
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
return hidden_states
* sqrt(d) scaling). Builds 78 layers, runs them in sequence, applies a final RMSNorm. Cache management uses transformers' standard DynamicCache -- the indexer's separate cache lives inside each attention module, invisible to the model body.

class GlmMoeDsaModel(GlmMoeDsaPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
self.layers = nn.ModuleList(
[GlmMoeDsaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
)
self.norm = GlmMoeDsaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.rotary_emb = GlmMoeDsaRotaryEmbedding(config=config)
self.gradient_checkpointing = False
self.post_init()
def forward(self, input_ids=None, attention_mask=None, position_ids=None,
past_key_values=None, inputs_embeds=None, use_cache=None, **kwargs):
if inputs_embeds is None:
inputs_embeds = self.embed_tokens(input_ids)
if use_cache and past_key_values is None:
past_key_values = DynamicCache(config=self.config)
...
causal_mask = create_causal_mask(...)
hidden_states = inputs_embeds
position_embeddings = self.rotary_emb(hidden_states, position_ids=position_ids)
for decoder_layer in self.layers[:self.config.num_hidden_layers]:
hidden_states = decoder_layer(
hidden_states,
attention_mask=causal_mask,
position_embeddings=position_embeddings,
position_ids=position_ids,
past_key_values=past_key_values,
use_cache=use_cache, **kwargs)
hidden_states = self.norm(hidden_states)
return BaseModelOutputWithPast(last_hidden_state=hidden_states,
past_key_values=past_key_values)
hidden_size to vocab_size, no bias, no logit soft-capping (unlike Gemma 4, which clamps logits with tanh(x/30)*30). The lm_head is not tied to the input embedding -- both are full 6144 × 154880 matrices, together contributing ~1.9B parameters.

class GlmMoeDsaForCausalLM(GlmMoeDsaPreTrainedModel, GenerationMixin):
_tied_weights_keys = {"lm_head.weight": "model.embed_tokens.weight"} # placeholder, tie_word_embeddings=False
def __init__(self, config):
super().__init__(config)
self.model = GlmMoeDsaModel(config)
self.vocab_size = config.vocab_size
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
self.post_init()
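The ~1.9B figure for the untied embedding pair is straightforward arithmetic on the dimensions quoted above:

```python
# Untied input embedding + output head: two full hidden x vocab matrices.
hidden_size, vocab_size = 6144, 154880
one_matrix = hidden_size * vocab_size   # embed_tokens OR lm_head
both = 2 * one_matrix                   # untied -> both are stored in full
print(f"{both / 1e9:.2f}B params across embed_tokens + lm_head")
```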
indexer.weights_proj is preserved in fp32 (the per-head weighting that gates the top-k selection), and e_score_correction_bias is preserved in fp32 (the routing balancer). Flash attention is disabled -- the only flash backend supported is the dedicated kernels-community/flash-mla kernel, which understands the latent format and the top-k indices. SDPA is supported as a generic fallback.

class GlmMoeDsaPreTrainedModel(PreTrainedModel):
config: GlmMoeDsaConfig
base_model_prefix = "model"
supports_gradient_checkpointing = True
_no_split_modules = ["GlmMoeDsaDecoderLayer"]
_supports_flash_attn = False # flash-mla kernels need a bit more work...
_supports_sdpa = True
_supports_flex_attn = False
_can_compile_fullgraph = True
_supports_attention_backend = True
# FP8 quantization uses _keep_in_fp32_modules to decide what NOT to convert
_keep_in_fp32_modules = ["indexer.weights_proj"]
_keep_in_fp32_modules_strict = ["e_score_correction_bias"]
_keys_to_ignore_on_load_unexpected = [r"model\.layers\.78.*"]
_compatible_flash_implementations = ["kernels-community/flash-mla"]
RotaryEmbedding reads config.head_dim; without intervention this would be the full 256-dim head. But GLM-5.1 only wants RoPE applied to the 64-dim rope subspace. The attribute_map = {"head_dim": "qk_rope_head_dim"} at config-class level rewrites every external read of config.head_dim to return config.qk_rope_head_dim instead, so the RoPE generator emits cos/sin tensors of length 64. The same trick is used in DeepSeek V3.

attribute_map = {
"num_local_experts": "n_routed_experts", # for MoE generic interfaces
"head_dim": "qk_rope_head_dim", # for RotaryEmbedding -- only the rope subspace
}
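The indirection itself is just attribute-name rewriting at lookup time. The toy class below illustrates the idea; it is not the actual transformers PretrainedConfig machinery, only a minimal sketch of the same behavior:

```python
class ConfigSketch:
    """Toy illustration of attribute_map indirection (NOT the real
    transformers implementation): reads of a mapped name are redirected."""
    attribute_map = {"head_dim": "qk_rope_head_dim"}

    def __init__(self):
        self.qk_rope_head_dim = 64    # the rope subspace
        self.qk_nope_head_dim = 192   # full head is 192 + 64 = 256

    def __getattribute__(self, name):
        amap = object.__getattribute__(self, "attribute_map")
        if name in amap:
            name = amap[name]         # head_dim -> qk_rope_head_dim
        return object.__getattribute__(self, name)

cfg = ConfigSketch()
print(cfg.head_dim)   # the RoPE generator sees 64, not 256
```

Any generic code that reads config.head_dim (like the shared RotaryEmbedding) transparently gets the 64-dim rope width, while MLA-aware code can still read qk_nope_head_dim and qk_rope_head_dim directly.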
"mla_kv_a_proj" shard type for kv_a_proj_with_mqa -- it has to be sharded carefully because the output is a concatenation of the 512-dim KV latent and the 64-dim k_pe stream, which need different replication strategies. The router is not sharded (each TP rank computes the full 256-expert score). The experts use a custom "moe_tp_experts" plan that distributes the 256 experts across ranks.base_model_tp_plan = {
"layers.*.self_attn.q_b_proj": "colwise",
"layers.*.self_attn.kv_a_proj_with_mqa": "mla_kv_a_proj",
"layers.*.self_attn.kv_b_proj": "colwise",
"layers.*.self_attn.o_proj": "rowwise",
"layers.*.mlp.experts.gate_up_proj": "packed_colwise",
"layers.*.mlp.experts.down_proj": "rowwise",
"layers.*.mlp.experts": "moe_tp_experts",
"layers.*.mlp.shared_experts.gate_proj": "colwise",
"layers.*.mlp.shared_experts.up_proj": "colwise",
"layers.*.mlp.shared_experts.down_proj": "rowwise",
"layers.*.mlp.gate_proj": "colwise", # dense layers
"layers.*.mlp.up_proj": "colwise",
"layers.*.mlp.down_proj": "rowwise",
}
__post_init__ if the user did not provide it: the first min(3, num_layers) layers are dense, the rest are sparse. This is the difference between GLM-5.1 (3 dense) and GLM-5/lite (1 dense). The list is checked at decoder-layer instantiation time.

def __post_init__(self, **kwargs):
self.qk_head_dim = self.qk_nope_head_dim + self.qk_rope_head_dim # 192 + 64 = 256
# MLP layer types: first 3 dense, rest sparse
if self.mlp_layer_types is None:
self.mlp_layer_types = (
["dense"] * min(3, self.num_hidden_layers)
+ ["sparse"] * (self.num_hidden_layers - 3)
)
super().__post_init__(**kwargs)
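For the 78-layer flagship config, this default works out as follows (restating the arithmetic above in runnable form):

```python
# What __post_init__ computes for the flagship config (78 layers):
num_hidden_layers = 78
mlp_layer_types = (["dense"] * min(3, num_hidden_layers)
                   + ["sparse"] * (num_hidden_layers - 3))
qk_head_dim = 192 + 64   # qk_nope_head_dim + qk_rope_head_dim

print(mlp_layer_types.count("dense"),    # 3 dense layers
      mlp_layer_types.count("sparse"),   # 75 sparse layers
      qk_head_dim)                       # 256-dim full head
```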
topk_indices tensor is forwarded as a kwarg. flash-mla can then skip the gather/scatter mask machinery entirely: it indexes directly into the cached K/V tensors via the top-k indices, multiplies only against those keys, and returns the attention output. The non-flash code path has to materialize the full -inf mask; the optimized kernel doesn't.

attn_output, attn_weights = attention_interface(
self,
query_states,
key_states,
value_states,
combined_mask,
dropout=0.0 if not self.training else self.attention_dropout,
scaling=self.scaling,
indices=topk_indices, # flash_mla_with_kvcache reads this
**kwargs,
)
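The reason both paths agree can be shown in a few lines of pure Python: softmax over only the gathered top-k scores equals softmax over all scores with the non-selected positions masked to -inf.

```python
import math

def softmax(xs):
    # Numerically stable softmax that tolerates -inf entries.
    m = max(x for x in xs if x != float("-inf"))
    exps = [0.0 if x == float("-inf") else math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

scores = [1.2, -0.3, 0.7, 2.1, 0.0]   # toy indexer scores for 5 cached keys
topk = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:3]

# Path 1 (non-flash): materialize a -inf mask over the non-selected keys.
masked = [s if i in topk else float("-inf") for i, s in enumerate(scores)]
dense_weights = softmax(masked)

# Path 2 (flash-mla-style): gather only the selected keys, softmax those.
gathered = softmax([scores[i] for i in topk])
sparse_weights = {i: w for i, w in zip(topk, gathered)}
```

The sparse path never touches the masked-out keys at all, which is where the length-independent decode cost comes from.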
seq_len > 1) the cache must be reset -- otherwise stale keys from a previous request would pollute the top-k selection. The check is a single if seq_len > 1: self._cached_keys = None at the top of indexer.forward.

# Reset cache on prefill (new prompt) to avoid stale keys / batch-size mismatch
if seq_len > 1:
self._cached_keys = None
if use_cache:
if self._cached_keys is not None:
k_cached = torch.cat([self._cached_keys, k], dim=1) # [B, T, D]
else:
k_cached = k
self._cached_keys = k_cached
else:
k_cached = k
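The lifecycle this implements -- prefill resets, decode appends -- can be modeled with a toy class (lists standing in for key tensors; an illustrative sketch, not the module's actual code):

```python
class IndexerKeyCacheSketch:
    """Toy model of the indexer's side cache: a prefill (seq_len > 1)
    resets before caching; decode steps (seq_len == 1) append."""
    def __init__(self):
        self._cached_keys = None

    def forward(self, keys, use_cache=True):   # keys: one entry per position
        if len(keys) > 1:                      # prefill: drop stale state
            self._cached_keys = None
        if not use_cache:
            return keys
        self._cached_keys = (self._cached_keys or []) + keys
        return self._cached_keys

idx = IndexerKeyCacheSketch()
idx.forward(["p0", "p1", "p2"])     # prefill of a 3-token prompt
idx.forward(["d0"])                 # decode step appends
full = idx.forward(["d1"])          # prompt + both decoded tokens
reset = idx.forward(["q0", "q1"])   # new prompt: stale keys were dropped
```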
r"model\.layers\.78.*" tells from_pretrained to silently skip those keys. Identical mechanism to GLM-4.6 which uses r"model\.layers\.46.*" for the same reason._keys_to_ignore_on_load_unexpected = [r"model\.layers\.78.*"]
q_a_proj + q_a_layernorm, it reuses MLA's q_resid -- the output of the main attention's q_a_layernorm(q_a_proj(hidden_states)). The indexer only owns wq_b (a single 2048→4096 linear) for its query side. This saves ~12.6M params per layer (~1B model-wide) and guarantees the indexer sees the same latent representation as the main attention.

# In GlmMoeDsaAttention.forward:
q_resid = self.q_a_layernorm(self.q_a_proj(hidden_states)) # [B, S, q_lora_rank]
query_states = self.q_b_proj(q_resid)
...
# Same q_resid is forwarded to the indexer:
topk_indices = self.indexer(
hidden_states,
q_resid, # ← reused!
position_embeddings,
indexer_mask,
use_cache=past_key_values is not None,
)
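The savings figure is just the cost of the q_a_proj the indexer would otherwise need, scaled by the layer count:

```python
# Parameter cost of the down-projection the indexer does NOT duplicate.
hidden_size, q_lora_rank, num_layers = 6144, 2048, 78
per_layer = hidden_size * q_lora_rank          # hidden -> q_lora_rank, no bias
model_wide = per_layer * num_layers
print(f"{per_layer / 1e6:.1f}M per layer, {model_wide / 1e9:.2f}B model-wide")
```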