Mojo struct
MlaConfigV2
struct MlaConfigV2
Shape configuration for MlaPrefillV2Core. Companion to MhaConfigV2.
DeepSeek-V3-style MLA: Q is the concatenation q_nope || q_rope, with
d_qk = d_nope + d_rope. K is stored in a latent cache that is
cache_depth columns wide, with k_nope at [:, :d_nope] and
k_rope at [:, rope_cache_offset:rope_cache_offset + d_rope]
(the gap between the two segments is padded / reserved but counted
in the cache stride). V is v_nope only, so V and O stay at
d_pv = depth, identical to the MHA path, because DeepSeek-V3 MLA does
not apply RoPE to V.
The MFMA-shape / SMEM-sub-block / K-loader / V-loader / PV-path
machinery is shared with MhaPrefillV2 via
MhaMmaOp[T, config.mha()]. mha() derives an MhaConfigV2 from
Self for that sharing. MLA's divergence is the Q load at
d_qk, the two K segments, and the cluster schedule that
interleaves k_nope / k_rope DMAs with V.
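The two-segment QK product described above can be illustrated with a small Python sketch. It is only a model of the arithmetic, not the Mojo kernel: the helper names (`dot`, `qk_two_segment`) and the toy sizes are assumptions made for illustration. It shows that scoring a Q row against separately stored k_nope and k_rope segments gives the same result as a single dot product over the concatenated key.

```python
# Toy segment sizes for readability (assumption); DeepSeek-V3 uses d_nope=128, d_rope=64.
D_NOPE, D_ROPE = 4, 2


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def qk_two_segment(q, k_nope, k_rope):
    """Score one Q row (q_nope || q_rope) against a key held as two segments."""
    q_nope = q[:D_NOPE]
    q_rope = q[D_NOPE:D_NOPE + D_ROPE]
    return dot(q_nope, k_nope) + dot(q_rope, k_rope)


q = [1, 2, 3, 4, 5, 6]   # d_qk = D_NOPE + D_ROPE = 6
k_nope = [1, 0, 1, 0]
k_rope = [2, 1]

score = qk_two_segment(q, k_nope, k_rope)
reference = dot(q, k_nope + k_rope)  # same key, concatenated
print(score, reference)
```

Because the two partial products simply add, the kernel is free to load k_nope and k_rope with separate DMAs and accumulate into the same score tile.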
The latent-cache layout (576 columns wide, with k_rope at offset 512) is
fixed by DeepSeek-V3 and matches the existing BF16 MLA path in
mla_prefill.mojo (cache_depth = 576, head_dim_offset = cache_depth - rope_depth = 512).
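As a rough illustration of this layout, the following Python sketch slices one latent cache row into its k_nope and k_rope segments using the DeepSeek-V3 numbers quoted above. The function name `split_latent_row` and the use of a plain list as the row are assumptions for illustration; the real kernel works on tiles in device memory.

```python
CACHE_DEPTH = 576                          # latent K cache row width
D_NOPE = 128                               # non-RoPE segment depth
D_ROPE = 64                                # RoPE segment depth
ROPE_CACHE_OFFSET = CACHE_DEPTH - D_ROPE   # 512


def split_latent_row(row):
    """Return (k_nope, k_rope) from one latent cache row (hypothetical helper)."""
    if len(row) != CACHE_DEPTH:
        raise ValueError("row must be cache_depth columns wide")
    k_nope = row[:D_NOPE]
    k_rope = row[ROPE_CACHE_OFFSET:ROPE_CACHE_OFFSET + D_ROPE]
    return k_nope, k_rope


# Fill the row with its own column indices so the slices are easy to inspect.
row = list(range(CACHE_DEPTH))
k_nope, k_rope = split_latent_row(row)

# Columns between the two segments are reserved but still part of the stride.
gap = ROPE_CACHE_OFFSET - D_NOPE
print(len(k_nope), k_rope[0], k_rope[-1], gap)
```

The two slices together cover d_qk = 192 columns, while the row stride remains the full 576.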
Fields
- q_block_size (Int): Q rows per warp.
- kv_block (Int): K/V rows per tile (64 at d_pv=128).
- depth (Int): V / O head depth (d_pv = d_nope). For DeepSeek-V3 MLA: 128.
- num_heads (Int): Q num_heads.
- num_kv_heads (Int): K/V num_heads. 1 (full GQA) or equal to num_heads (MHA); other ratios need a stride-aware DMA loader (TODO).
- num_warps (Int): Warps per block.
- rescale_threshold (Float32): Lazy-rescale threshold in log2 units of the running max (identical semantics to MhaConfigV2.rescale_threshold).
- dtype (DType): Element dtype of Q / K (both q_nope || q_rope and k_nope || k_rope) / V input tiles. DType.bfloat16 for parity with the existing BF16 MLA prefill; DType.float8_e4m3fn for the FP8 MLA prefill path.
- output_dtype (DType): Element dtype of the output tile o. FP32 by default; BF16 for inference dispatchers holding a BF16 output buffer. The cast from the FP32 accumulator happens per-lane inside the output store.
- fp8_mma_k_128 (Bool): Mirror of MhaConfigV2.fp8_mma_k_128. Architecturally blocked for this attention path by the QK-output / PV-B-input lane geometry mismatch; kept for symmetry so MLA inherits the same comptime hook if a cross-lane shuffle becomes available.
- d_qk (Int): Q / K depth (d_nope + d_rope). For DeepSeek-V3 MLA: 192.
- d_rope (Int): RoPE-applied segment depth on Q and K. For DeepSeek-V3 MLA: 64.
- cache_depth (Int): Latent K cache row width. For DeepSeek-V3 MLA: 576. The gap between d_nope (128) and rope_cache_offset (512) is reserved / unused but present in the cache stride. Must match the production latent cache layout; see mla_prefill.mojo:54.
- rope_cache_offset (Int): Column offset of k_rope within the latent cache row. For DeepSeek-V3 MLA: 512 (layout: k_nope at [:, :128], gap, k_rope at [:, 512:576]).
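The relationships between these fields can be checked with a small Python model. This is a sketch under stated assumptions, not part of the Mojo API: the class `MlaShapeSketch` and its `problems` method are hypothetical, the consistency rules are read off the field descriptions above rather than taken from the struct's own validation, and `num_heads=16` is an arbitrary placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MlaShapeSketch:
    """Hypothetical Python mirror of a few MlaConfigV2 shape fields."""
    depth: int               # d_pv = d_nope
    num_heads: int
    num_kv_heads: int
    d_qk: int
    d_rope: int
    cache_depth: int
    rope_cache_offset: int

    def problems(self):
        """List violated shape rules (rules inferred from the field docs)."""
        d_nope = self.depth
        issues = []
        if self.d_qk != d_nope + self.d_rope:
            issues.append("d_qk != d_nope + d_rope")
        if self.rope_cache_offset < d_nope:
            issues.append("k_rope overlaps k_nope")
        if self.rope_cache_offset + self.d_rope > self.cache_depth:
            issues.append("k_rope runs past cache_depth")
        if self.num_kv_heads not in (1, self.num_heads):
            issues.append("num_kv_heads must be 1 or num_heads")
        return issues


# DeepSeek-V3 values from the field list; num_heads is a placeholder.
deepseek_v3 = MlaShapeSketch(
    depth=128, num_heads=16, num_kv_heads=1,
    d_qk=192, d_rope=64, cache_depth=576, rope_cache_offset=512,
)
bad = MlaShapeSketch(
    depth=128, num_heads=16, num_kv_heads=1,
    d_qk=128, d_rope=64, cache_depth=576, rope_cache_offset=512,
)
print(deepseek_v3.problems(), bad.problems())
```

The DeepSeek-V3 values satisfy every rule, while the second configuration is flagged because its d_qk omits the RoPE segment.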
Implemented traits
AnyType,
Copyable,
ImplicitlyCopyable,
ImplicitlyDeletable,
Movable
Methods
__init__
def __init__(out self, *, q_block_size: Int, kv_block: Int, depth: Int, num_heads: Int, num_kv_heads: Int, d_qk: Int, d_rope: Int, cache_depth: Int, rope_cache_offset: Int, num_warps: Int = Int(8), rescale_threshold: Float32 = 8, dtype: DType = DType.bfloat16, output_dtype: DType = DType.float32, fp8_mma_k_128: Bool = False)
d_nope
def d_nope(self) -> Int
Returns the non-RoPE segment depth (= depth = d_pv).
For DeepSeek-V3 MLA d_nope == depth == 128. Exposed as an
accessor so MlaPrefillV2Core body code can reference the
nope-segment depth by its semantic name without committing to
an additional field.
Returns: Int
mha
def mha(self) -> MhaConfigV2
Returns an MhaConfigV2 derived from Self for sharing the MhaMmaOp[T, ...] machinery.
The MFMA shape, SMEM sub-block geometry, K loader, V loader,
and PV path all live on MhaMmaOp; MLA's divergence is
purely in MlaPrefillV2Core (Q load at d_qk, K_rope DMA,
two-segment QK). The derived MhaConfigV2 carries
depth = d_pv = d_nope; the MLA-specific d_qk / d_rope /
cache_depth / rope_cache_offset fields stay on
MlaConfigV2 only.
Returns: MhaConfigV2
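The derivation can be pictured with a Python sketch in which the MLA config copies only the shared shape fields into an MHA-style config. Both classes here (`MlaConfigSketch`, `MhaConfigSketch`) are hypothetical stand-ins: the actual field set of MhaConfigV2 is not given on this page, so the shared fields below are an assumption, and `q_block_size=32` and `num_heads=16` are placeholder values.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MhaConfigSketch:
    """Assumed shared shape fields; not the real MhaConfigV2 definition."""
    q_block_size: int
    kv_block: int
    depth: int
    num_heads: int
    num_kv_heads: int
    num_warps: int = 8


@dataclass(frozen=True)
class MlaConfigSketch:
    """Hypothetical mirror of MlaConfigV2 with the MLA-only fields added."""
    q_block_size: int
    kv_block: int
    depth: int
    num_heads: int
    num_kv_heads: int
    d_qk: int
    d_rope: int
    cache_depth: int
    rope_cache_offset: int
    num_warps: int = 8

    def d_nope(self):
        # The non-RoPE segment depth equals the V / O depth.
        return self.depth

    def mha(self):
        # Copy only the fields the MHA-style config declares.
        shared = {f.name: getattr(self, f.name) for f in fields(MhaConfigSketch)}
        return MhaConfigSketch(**shared)


mla = MlaConfigSketch(
    q_block_size=32, kv_block=64, depth=128, num_heads=16, num_kv_heads=1,
    d_qk=192, d_rope=64, cache_depth=576, rope_cache_offset=512,
)
derived = mla.mha()
print(derived)
```

In this model the derived config keeps depth = d_nope = 128 and has no d_qk, d_rope, cache_depth, or rope_cache_offset attributes, mirroring the split of responsibilities described above.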