Mojo module

mha_structured

TileTensor-based MHA prefill kernel for gfx950.

Uses TileTensor instead of LayoutTensor for the KV buffer layer. This eliminates RuntimeLayout overhead and reduces VGPR usage by 2 registers (from 254 to 252), which gives the LLVM scheduler more freedom to produce a better instruction ordering. Supports depth values of 64, 128, and 256.

Functions