Mojo module

mha_structured

TileTensor-based MHA prefill kernel for gfx950.

Uses TileTensor instead of LayoutTensor for the KV buffer layer. This eliminates RuntimeLayout overhead and reduces VGPR usage by 2 (254 -> 252), which gives the LLVM scheduler more freedom to produce better instruction ordering. Supports depth values of 64, 128, and 256.

Functions

barrier:
block_sync_lds_direct_load:
set_priority:
