
Mojo struct

MlaConfigV2

struct MlaConfigV2

Shape configuration for MlaPrefillV2Core. Companion to MhaConfigV2.

Implements DeepSeek-V3-style MLA. Q is the concatenation q_nope || q_rope, with depth d_qk = d_nope + d_rope. K is stored in a latent cache whose rows are cache_depth columns wide, with k_nope at [:, :d_nope] and k_rope at [:, rope_cache_offset:rope_cache_offset + d_rope]; the gap between the two segments is padded or reserved but still counted in the cache stride. V is v_nope only, because DeepSeek-V3 MLA does not apply RoPE to V, so V and O stay at d_pv = depth, identical to the MHA path.
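The layout described above can be illustrated with a small Python model of a single latent cache row. This is only a sketch for reasoning about the column ranges: it is not Mojo and not the kernel's API. The constants are the DeepSeek-V3 values quoted on this page, and `split_cache_row` is a hypothetical helper introduced for illustration.

```python
# Illustrative Python model of one latent K cache row; not the Mojo kernel API.
D_NOPE = 128             # non-RoPE segment depth (d_nope = depth = d_pv)
D_ROPE = 64              # RoPE segment depth
CACHE_DEPTH = 576        # latent cache row width
ROPE_CACHE_OFFSET = 512  # column where k_rope starts


def split_cache_row(row):
    """Hypothetical helper: slice k_nope and k_rope out of one cache row."""
    if len(row) != CACHE_DEPTH:
        raise ValueError("row width must equal cache_depth")
    k_nope = row[:D_NOPE]
    k_rope = row[ROPE_CACHE_OFFSET:ROPE_CACHE_OFFSET + D_ROPE]
    return k_nope, k_rope


row = list(range(CACHE_DEPTH))       # each element stores its own column index
k_nope, k_rope = split_cache_row(row)
k = k_nope + k_rope                  # K operand for QK: d_nope + d_rope wide
```

Columns 128 through 511 are never read by either slice, which is the reserved gap that still contributes to the row stride.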

The MFMA-shape / SMEM-sub-block / K-loader / V-loader / PV-path machinery is shared with MhaPrefillV2 via MhaMmaOp[T, config.mha()]; mha() derives an MhaConfigV2 from Self for that purpose. MLA diverges in three places: the Q load at d_qk, the two K segments, and the cluster schedule that interleaves k_nope / k_rope DMAs with V.

The latent-cache layout (576 columns wide, with k_rope at offset 512) is fixed by DeepSeek-V3 and matches the existing BF16 MLA path in mla_prefill.mojo (cache_depth = 576, head_dim_offset = cache_depth - rope_depth = 512).

Fields

q_block_size (Int): Q rows per warp.
kv_block (Int): K/V rows per tile (64 at d_pv=128).
depth (Int): V / O head depth (d_pv = d_nope). For DeepSeek-V3 MLA: 128.
num_heads (Int): Number of Q heads.
num_kv_heads (Int): Number of K/V heads. Must be 1 (full GQA) or equal to num_heads (MHA); other ratios need a stride-aware DMA loader (TODO).
num_warps (Int): Warps per block.
rescale_threshold (Float32): Lazy-rescale threshold in log2 units of the running max (identical semantics to MhaConfigV2.rescale_threshold).
dtype (DType): Element dtype of the Q, K, and V input tiles, covering both q_nope || q_rope and k_nope || k_rope. DType.bfloat16 for parity with the existing BF16 MLA prefill; DType.float8_e4m3fn for the FP8 MLA prefill path.
output_dtype (DType): Element dtype of the output tile o. FP32 by default; BF16 for inference dispatchers holding a BF16 output buffer. The cast from the FP32 accumulator happens per-lane inside the output store.
fp8_mma_k_128 (Bool): Mirror of MhaConfigV2.fp8_mma_k_128. Architecturally blocked for this attention path by the QK-output / PV-B-input lane geometry mismatch; kept for symmetry so that MLA inherits the same comptime hook if a cross-lane shuffle becomes available.
d_qk (Int): Q / K depth (d_nope + d_rope). For DeepSeek-V3 MLA: 192.
d_rope (Int): RoPE-applied segment depth on Q and K. For DeepSeek-V3 MLA: 64.
cache_depth (Int): Latent K cache row width. For DeepSeek-V3 MLA: 576. The columns between the end of k_nope (128) and rope_cache_offset (512) are reserved and unused but still present in the cache stride. Must match the production latent cache layout; see mla_prefill.mojo:54.
rope_cache_offset (Int): Column offset of k_rope within the latent cache row. For DeepSeek-V3 MLA: 512 (layout: k_nope at [:, :128], gap, k_rope at [:, 512:576]).
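The rescale_threshold field refers to lazy rescaling of the online-softmax state. The kernel's actual logic is not shown on this page, so the following Python sketch is only one plausible reading, under the assumption that the running sum and accumulator are rescaled only when a new block maximum exceeds the current reference maximum by more than the threshold, with scores measured in log2 units. The function name and its scalar-valued interface are hypothetical.

```python
import math


def lazy_softmax_weighted_sum(score_blocks, value_blocks, rescale_threshold=8.0):
    """Assumed model of lazy rescaling; returns (softmax-weighted sum, rescale count)."""
    log2e = math.log2(math.e)
    m_ref = None   # reference max in log2 units used inside the exponentials
    denom = 0.0    # running softmax denominator, relative to m_ref
    acc = 0.0      # running weighted sum of values, relative to m_ref
    rescales = 0
    for scores, values in zip(score_blocks, value_blocks):
        s2 = [s * log2e for s in scores]   # convert so that exp(s) == 2 ** s2
        block_max = max(s2)
        if m_ref is None:
            m_ref = block_max
        elif block_max - m_ref > rescale_threshold:
            # Max moved far enough: pay for a rescale of the running state.
            scale = 2.0 ** (m_ref - block_max)
            denom *= scale
            acc *= scale
            m_ref = block_max
            rescales += 1
        # Otherwise keep the stale reference; terms may exceed 1 but stay
        # bounded by 2 ** rescale_threshold.
        for s, v in zip(s2, values):
            p = 2.0 ** (s - m_ref)
            denom += p
            acc += p * v
    return acc / denom, rescales


result, rescales = lazy_softmax_weighted_sum(
    [[0.0, 1.0], [2.0, 3.0], [20.0, 1.0]],
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
)
```

Because the numerator and denominator always share the same reference, the final ratio is the ordinary softmax-weighted sum regardless of how often the state was rescaled; the threshold only trades rescale work against the dynamic range of the intermediate terms.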

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDeletable, Movable

Methods

`__init__`

def __init__(out self, *, q_block_size: Int, kv_block: Int, depth: Int, num_heads: Int, num_kv_heads: Int, d_qk: Int, d_rope: Int, cache_depth: Int, rope_cache_offset: Int, num_warps: Int = Int(8), rescale_threshold: Float32 = 8, dtype: DType = DType.bfloat16, output_dtype: DType = DType.float32, fp8_mma_k_128: Bool = False)
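To make the relationships between the constructor arguments concrete, here is a Python stand-in that mirrors the field names and defaults from the signature above. It is not the Mojo struct. The `check` method encodes invariants inferred from the field descriptions on this page (an assumption; the real struct may or may not enforce them), and the q_block_size and num_heads values in the example are arbitrary placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MlaConfigModel:
    """Python stand-in mirroring the MlaConfigV2 fields; not the Mojo struct."""
    q_block_size: int
    kv_block: int
    depth: int
    num_heads: int
    num_kv_heads: int
    d_qk: int
    d_rope: int
    cache_depth: int
    rope_cache_offset: int
    num_warps: int = 8
    rescale_threshold: float = 8.0
    dtype: str = "bfloat16"
    output_dtype: str = "float32"
    fp8_mma_k_128: bool = False

    def d_nope(self) -> int:
        # Non-RoPE segment depth, equal to depth (d_pv).
        return self.depth

    def check(self) -> None:
        """Assumed invariants, inferred from the field descriptions."""
        if self.d_qk != self.d_nope() + self.d_rope:
            raise ValueError("d_qk must equal d_nope + d_rope")
        if self.rope_cache_offset < self.d_nope():
            raise ValueError("k_rope must not overlap k_nope")
        if self.rope_cache_offset + self.d_rope > self.cache_depth:
            raise ValueError("k_rope must fit inside the cache row")
        if self.num_kv_heads not in (1, self.num_heads):
            raise ValueError("num_kv_heads must be 1 or num_heads")


cfg = MlaConfigModel(
    q_block_size=16,   # placeholder value, not taken from the source
    kv_block=64,       # 64 at d_pv = 128
    depth=128,
    num_heads=16,      # placeholder value, not taken from the source
    num_kv_heads=1,
    d_qk=192,
    d_rope=64,
    cache_depth=576,
    rope_cache_offset=512,
)
cfg.check()
```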

`d_nope`

def d_nope(self) -> Int

Returns the non-RoPE segment depth (= depth = d_pv).

For DeepSeek-V3 MLA d_nope == depth == 128. Exposed as an accessor so MlaPrefillV2Core body code can reference the nope-segment depth by its semantic name without committing to an additional field.

Returns:

Int

`mha`

def mha(self) -> MhaConfigV2

Returns an MhaConfigV2 derived from Self for sharing the MhaMmaOp[T, ...] machinery.

The MFMA shape, SMEM sub-block geometry, K loader, V loader, and PV path all live on MhaMmaOp — MLA's divergence is purely in MlaPrefillV2Core (Q load at d_qk, K_rope DMA, two-segment QK). The derived MhaConfigV2 carries depth = d_pv = d_nope; the MLA-specific d_qk / d_rope / cache_depth / rope_cache_offset fields stay on MlaConfigV2 only.
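The two-segment QK mentioned above relies on a simple identity: a dot product over the concatenated d_qk axis equals the sum of the dot products over the nope and rope segments. The Python sketch below checks that identity on integer data. It is an illustration only; `dot` and `two_segment_score` are hypothetical helpers, not functions of the kernel.

```python
D_NOPE, D_ROPE = 128, 64   # DeepSeek-V3 values quoted on this page


def dot(a, b):
    """Plain dot product over two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def two_segment_score(q, k_nope, k_rope):
    """Score one Q row against one K row held as two separate segments."""
    q_nope = q[:D_NOPE]
    q_rope = q[D_NOPE:D_NOPE + D_ROPE]
    return dot(q_nope, k_nope) + dot(q_rope, k_rope)


# Small integer test vectors so the comparison is exact.
q = [i % 7 - 3 for i in range(D_NOPE + D_ROPE)]
k_nope = [i % 5 for i in range(D_NOPE)]
k_rope = [i % 3 - 1 for i in range(D_ROPE)]

segmented = two_segment_score(q, k_nope, k_rope)
concatenated = dot(q, k_nope + k_rope)
```

This is why the K segments can be fetched from non-adjacent cache columns without first materializing a contiguous d_qk-wide K row.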

Returns:

MhaConfigV2
