Mojo module
mha_structured
TileTensor-based MHA prefill kernel for gfx950.
Uses TileTensor instead of LayoutTensor for the KV buffer layer. This eliminates RuntimeLayout overhead and reduces VGPR usage by 2 (from 254 to 252), which gives the LLVM scheduler more freedom to produce better instruction ordering. Supports depth=64, 128, and 256.
Functions