
Mojo module

kernel

Kernel entry point for depth=256/512 pair-CTA SM100 (Blackwell) MHA prefill.

Two neighboring SMs cooperate via pair-CTA MMA (cta_group=2, cluster_shape=(2,1,1)).

Depth-dependent geometry:

- depth=512: MMA_M=128, BM=64, BN=256. O split into O_lo/O_hi.
- depth=256: MMA_M=256, BM=128, BN=128. Single O accumulator.
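The depth-dependent geometry can be read as a small lookup table. The following is a minimal host-side Python sketch, not the kernel's own code: the names `Geometry`, `GEOMETRY`, and `geometry_for` are hypothetical and only restate the values listed above. In both configurations MMA_M equals 2 × BM, which is consistent with two CTAs each contributing BM rows to one pair-CTA MMA; that reading is an assumption drawn from the numbers, not something the module states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Geometry:
    """Tile geometry for one head depth (hypothetical helper type)."""
    mma_m: int     # MMA_M as listed in the module description
    bm: int        # BM as listed in the module description
    bn: int        # BN as listed in the module description
    split_o: bool  # True when O is split into O_lo/O_hi


# Values copied from the module description; nothing here is derived.
GEOMETRY = {
    512: Geometry(mma_m=128, bm=64, bn=256, split_o=True),
    256: Geometry(mma_m=256, bm=128, bn=128, split_o=False),
}


def geometry_for(depth: int) -> Geometry:
    """Return the geometry for a supported depth, or raise ValueError."""
    try:
        return GEOMETRY[depth]
    except KeyError:
        raise ValueError(f"unsupported depth: {depth} (expected 256 or 512)") from None


if __name__ == "__main__":
    for depth in (256, 512):
        print(depth, geometry_for(depth))
```

Keeping the table explicit, rather than computing the values from `depth`, avoids implying a formula the source does not give.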

Warp assignment (384 threads = 12 warps, 3 warp groups of 128):

- Warps 0-3: Softmax (warp group 0)
- Warps 4-7: Correction (warp group 1)
- Warp 8: MMA (leader CTA issues pair-CTA MMA; peer early-returns)
- Warp 9: Load (both CTAs issue TMA multicast; leader calls expect_bytes)
- Warps 10-11: Spare (no-op)
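The warp assignment amounts to a dispatch on the warp index within the block. The sketch below illustrates that mapping in plain Python, assuming the standard NVIDIA warp size of 32 threads (consistent with 384 threads = 12 warps). The function name `warp_role` and the role strings are hypothetical labels for illustration; the actual kernel performs this dispatch on the device.

```python
WARP_SIZE = 32     # threads per warp on NVIDIA GPUs
NUM_THREADS = 384  # 12 warps per CTA, as stated in the module description


def warp_role(thread_idx: int) -> str:
    """Map a thread index within the CTA to the role of its warp."""
    if not 0 <= thread_idx < NUM_THREADS:
        raise ValueError(f"thread index out of range: {thread_idx}")
    warp = thread_idx // WARP_SIZE
    if warp < 4:
        return "softmax"     # warps 0-3, warp group 0
    if warp < 8:
        return "correction"  # warps 4-7, warp group 1
    if warp == 8:
        return "mma"         # leader CTA issues the pair-CTA MMA
    if warp == 9:
        return "load"        # both CTAs issue TMA multicast
    return "spare"           # warps 10-11, no-op


if __name__ == "__main__":
    # Print the role of the first thread of each warp.
    for warp in range(NUM_THREADS // WARP_SIZE):
        print(warp, warp_role(warp * WARP_SIZE))
```

Under this mapping, threads 0-127 are softmax, 128-255 are correction, 256-287 belong to the MMA warp, 288-319 to the load warp, and 320-383 are spare.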

Structs