Mojo module

mla_decode_combine

MLA Decode Split-K Combine Kernel for SM100 (B200).

This kernel combines the partial outputs of a split-K attention computation. Each split computes attention over one portion of the KV cache. The combine kernel merges these partial results by rescaling each one with its LSE (Log-Sum-Exp) value, which keeps the merge numerically stable.

Algorithm:

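1. Load the partial LSE values (lse_i) for all splits.
2. Compute the global LSE: global_lse = log2(sum(exp2(lse_i - max_lse))) + max_lse, where max_lse is the maximum of lse_i over all splits.
3. Compute the per-split scale factors: scale_i = exp2(lse_i - global_lse).
4. Compute the weighted sum: output = sum(scale_i * partial_output_i).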
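
The steps above can be sketched in plain Python as a reference for the arithmetic only. This is not the Mojo kernel: the function name `combine_partial_outputs` and its list-based arguments are hypothetical. It assumes each partial output is already normalized within its split and that every LSE is finite and in base 2.

```python
import math


def combine_partial_outputs(partial_lse, partial_outputs):
    """Merge per-split attention outputs using base-2 LSE values.

    Hypothetical reference helper, not part of the module's API.
    partial_lse: one base-2 log-sum-exp value per split (assumed finite).
    partial_outputs: one output vector per split, normalized within the split.
    """
    # Subtracting the maximum keeps every exponent <= 0, so exp2 cannot overflow.
    max_lse = max(partial_lse)
    sum_exp = sum(2.0 ** (lse - max_lse) for lse in partial_lse)
    global_lse = math.log2(sum_exp) + max_lse

    # Each scale is the fraction of the total softmax mass held by that split.
    scales = [2.0 ** (lse - global_lse) for lse in partial_lse]

    # Weighted sum of the partial outputs, one element at a time.
    dim = len(partial_outputs[0])
    output = [
        sum(scale * out[d] for scale, out in zip(scales, partial_outputs))
        for d in range(dim)
    ]
    return output, global_lse


# Two splits with equal LSE contribute equally, so the result is their average.
print(combine_partial_outputs([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]]))
# prints ([2.0, 3.0], 1.0)
```

Because the scale factors sum to 1, the result matches what a single pass over the whole KV cache would produce, up to floating-point rounding.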

Structs

CombineParams:
SplitParallelCombineParams:

Functions

launch_mla_combine_kernel:
launch_mla_combine_kernel_split_parallel:
mla_combine_kernel:
mla_combine_kernel_split_parallel: Split-parallel combine: 8 warps process different splits in parallel.
mla_decode_combine_partial_outputs:
