
Mojo struct

MlaConfigV2

struct MlaConfigV2

Shape configuration for MlaPrefillV2Core. Companion to MhaConfigV2.

DeepSeek-V3-style MLA: Q is the concatenation q_nope || q_rope, with d_qk = d_nope + d_rope. K is stored in a latent cache that is cache_depth columns wide, with k_nope at [:, :d_nope] and k_rope at [:, rope_cache_offset:rope_cache_offset + d_rope] (the gap between the two segments is padded / reserved but still counted in the cache stride). V is v_nope only, so V and O stay at d_pv = depth, identical to the MHA path, because DeepSeek-V3 MLA does not apply RoPE to V.

The MFMA-shape / SMEM-sub-block / K-loader / V-loader / PV-path machinery is shared with MhaPrefillV2 via MhaMmaOp[T, config.mha()]. mha() derives an MhaConfigV2 from Self for that sharing. MLA's divergence is the Q load at d_qk, the two K segments, and the cluster schedule that interleaves k_nope / k_rope DMAs with V.

The latent-cache layout (576 columns wide, with k_rope at offset 512) is fixed by DeepSeek-V3 and matches the existing BF16 MLA path in mla_prefill.mojo (cache_depth = 576, head_dim_offset = cache_depth - rope_depth = 512).
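The layout described above can be modelled in a few lines of plain Python. This is only an illustrative sketch of the indexing, not the Mojo kernel code: the helper names `split_cache_row` and `assemble_k` are hypothetical, and the real kernel may never materialize a concatenated K vector (the text describes a two-segment QK instead).

```python
# Illustrative Python model of one latent-cache row; not the Mojo kernel code.
D_NOPE = 128                              # k_nope width (DeepSeek-V3 value)
D_ROPE = 64                               # k_rope width (DeepSeek-V3 value)
CACHE_DEPTH = 576                         # latent cache row width
ROPE_CACHE_OFFSET = CACHE_DEPTH - D_ROPE  # 512, same formula as head_dim_offset


def split_cache_row(row):
    """Return (k_nope, k_rope) slices of one cache row (hypothetical helper)."""
    if len(row) != CACHE_DEPTH:
        raise ValueError("row must be CACHE_DEPTH columns wide")
    k_nope = row[:D_NOPE]
    k_rope = row[ROPE_CACHE_OFFSET:ROPE_CACHE_OFFSET + D_ROPE]
    return k_nope, k_rope


def assemble_k(row):
    """Logical K vector k_nope || k_rope, d_qk = D_NOPE + D_ROPE wide."""
    k_nope, k_rope = split_cache_row(row)
    return k_nope + k_rope


# Use the column index as the stored value so the mapping is easy to read.
row = list(range(CACHE_DEPTH))
k = assemble_k(row)
print(len(k), k[0], k[127], k[128], k[191])  # 192 0 127 512 575
```

Columns 128 through 511 of the row are never read by either slice, which is the reserved gap that still contributes to the cache stride.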

Fields​

  • ​q_block_size (Int): Q rows per warp.
  • ​kv_block (Int): K/V rows per tile (64 at d_pv=128).
  • ​depth (Int): V / O head depth (d_pv = d_nope). For DeepSeek-V3 MLA: 128.
  • ​num_heads (Int): Q num_heads.
  • ​num_kv_heads (Int): K/V num_heads. 1 (full GQA) or equal to num_heads (MHA); other ratios need a stride-aware DMA loader (TODO).
  • ​num_warps (Int): Warps per block.
  • ​rescale_threshold (Float32): Lazy-rescale threshold in log2 units of the running max (identical semantics to MhaConfigV2.rescale_threshold).
  • dtype (DType): Element dtype of Q / K (both q_nope || q_rope and k_nope || k_rope) / V input tiles. DType.bfloat16 for parity with the existing BF16 MLA prefill; DType.float8_e4m3fn for the FP8 MLA prefill path.
  • ​output_dtype (DType): Element dtype of the output tile o. FP32 by default; BF16 for inference dispatchers holding a BF16 output buffer. The cast from the FP32 accumulator happens per-lane inside the output store.
  • fp8_mma_k_128 (Bool): Mirror of MhaConfigV2.fp8_mma_k_128. Architecturally blocked for this attention path by the QK-output / PV-B-input lane geometry mismatch; kept for symmetry so that MLA inherits the same comptime hook if a cross-lane shuffle becomes available.
  • ​d_qk (Int): Q / K depth (d_nope + d_rope). For DeepSeek-V3 MLA: 192.
  • ​d_rope (Int): RoPE-applied segment depth on Q and K. For DeepSeek-V3 MLA: 64.
  • cache_depth (Int): Latent K cache row width. For DeepSeek-V3 MLA: 576. The gap between the end of k_nope (column d_nope = 128) and rope_cache_offset (512) is reserved / unused but present in the cache stride. Must match the production latent cache layout; see mla_prefill.mojo:54.
  • ​rope_cache_offset (Int): Column offset of k_rope within the latent cache row. For DeepSeek-V3 MLA: 512 (layout: k_nope at [:, :128], gap, k_rope at [:, 512:576]).
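The relationships between these fields can be checked mechanically. The sketch below is a hypothetical Python mirror of the shape-related fields, not the Mojo struct itself; the class name `MlaShape` and the `problems` method are assumptions made for illustration, and the checks encode only the constraints stated in the field descriptions above. The `num_heads=128` value is a placeholder.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MlaShape:
    """Hypothetical Python mirror of the shape fields; not the Mojo struct."""
    depth: int
    num_heads: int
    num_kv_heads: int
    d_qk: int
    d_rope: int
    cache_depth: int
    rope_cache_offset: int

    @property
    def d_nope(self) -> int:
        # The non-RoPE segment depth equals the V / O depth (d_pv).
        return self.depth

    def problems(self) -> list:
        """Return a list of violated constraints (empty if consistent)."""
        found = []
        if self.d_qk != self.d_nope + self.d_rope:
            found.append("d_qk != d_nope + d_rope")
        if self.rope_cache_offset < self.d_nope:
            found.append("k_rope overlaps k_nope")
        if self.rope_cache_offset + self.d_rope > self.cache_depth:
            found.append("k_rope runs past cache_depth")
        if self.num_kv_heads not in (1, self.num_heads):
            found.append("num_kv_heads must be 1 or num_heads")
        return found


# DeepSeek-V3 values from the field list; num_heads is a placeholder.
deepseek = MlaShape(depth=128, num_heads=128, num_kv_heads=1, d_qk=192,
                    d_rope=64, cache_depth=576, rope_cache_offset=512)
print(deepseek.problems())                    # []
print(replace(deepseek, d_qk=128).problems())  # ['d_qk != d_nope + d_rope']
```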

Implemented traits​

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDeletable, Movable

Methods​

__init__​

def __init__(out self, *, q_block_size: Int, kv_block: Int, depth: Int, num_heads: Int, num_kv_heads: Int, d_qk: Int, d_rope: Int, cache_depth: Int, rope_cache_offset: Int, num_warps: Int = 8, rescale_threshold: Float32 = 8, dtype: DType = DType.bfloat16, output_dtype: DType = DType.float32, fp8_mma_k_128: Bool = False)

d_nope​

def d_nope(self) -> Int

Returns the non-RoPE segment depth (= depth = d_pv).

For DeepSeek-V3 MLA, d_nope == depth == 128. Exposed as an accessor so that MlaPrefillV2Core body code can reference the nope-segment depth by its semantic name without committing to an additional field.

Returns:

Int

mha​

def mha(self) -> MhaConfigV2

Returns an MhaConfigV2 derived from Self for sharing the MhaMmaOp[T, ...] machinery.

The MFMA shape, SMEM sub-block geometry, K loader, V loader, and PV path all live on MhaMmaOp; MLA's divergence is purely in MlaPrefillV2Core (Q load at d_qk, k_rope DMA, two-segment QK). The derived MhaConfigV2 carries depth = d_pv = d_nope; the MLA-specific d_qk / d_rope / cache_depth / rope_cache_offset fields stay on MlaConfigV2 only.

Returns:

MhaConfigV2
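The derivation performed by mha() can be pictured as a projection onto the shared fields. The following Python sketch is a model under stated assumptions, not the Mojo implementation: `MhaShape` and `MlaShape` are hypothetical stand-ins, the exact field set of MhaConfigV2 is not given here and is assumed to be the non-MLA-specific shape fields, and the `q_block_size` value is a placeholder.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MhaShape:
    """Hypothetical stand-in for MhaConfigV2 (field set is an assumption)."""
    q_block_size: int
    kv_block: int
    depth: int
    num_heads: int
    num_kv_heads: int
    num_warps: int


@dataclass(frozen=True)
class MlaShape:
    """Hypothetical stand-in for MlaConfigV2."""
    q_block_size: int
    kv_block: int
    depth: int
    num_heads: int
    num_kv_heads: int
    num_warps: int
    d_qk: int
    d_rope: int
    cache_depth: int
    rope_cache_offset: int

    def mha(self) -> MhaShape:
        # Copy only the fields MhaShape declares; the MLA-specific ones
        # (d_qk, d_rope, cache_depth, rope_cache_offset) are left behind.
        shared = {f.name: getattr(self, f.name) for f in fields(MhaShape)}
        return MhaShape(**shared)


# q_block_size=32 and num_heads=128 are placeholders for illustration.
mla = MlaShape(q_block_size=32, kv_block=64, depth=128, num_heads=128,
               num_kv_heads=1, num_warps=8, d_qk=192, d_rope=64,
               cache_depth=576, rope_cache_offset=512)
mha_cfg = mla.mha()
print(mha_cfg.depth, hasattr(mha_cfg, "d_qk"))  # 128 False
```

Keeping the projection one-way means the shared machinery only ever sees depth = d_pv, so it needs no knowledge of the wider d_qk used for the Q load.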