Mojo module
mha_decode_streaming
MHA streaming decode kernel for gfx950.
Per-tile loop: K strips move DRAM→LDS→REG for the QK MMA, P scores pass through SMEM for the PV MMA, and the work is divided by split-K partitioning.
Uses DecodeStreamingKVBuffer for single-buffer, per-strip DRAM→SMEM staging. No KVCacheIterator is involved: strips are sub-tiled from an external DRAM tile.
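The data flow described above (a QK step per tile, a PV step per tile, split-K partitions merged at the end) can be illustrated with a small numerical reference. This is a plain-Python sketch of the general technique, not the kernel's code. It assumes that split-K divides the KV sequence and that each partition keeps a running (online) softmax, neither of which this page confirms. The function names `partial_attention`, `split_k_decode`, and `reference_attention`, the `tile` and `num_partitions` parameters, and the 1/sqrt(d) scaling are all hypothetical. The sketch does not model LDS/SMEM staging, MMA instructions, or DecodeStreamingKVBuffer.

```python
import math


def partial_attention(q, keys, values, tile=4):
    """Process one KV partition tile by tile (hypothetical helper).

    Returns (running_max, running_sum, unnormalized_output) so that
    partitions can be merged later. Assumes an online softmax.
    """
    scale = 1.0 / math.sqrt(len(q))  # assumed scaling, not from the source
    m = float("-inf")                # running max of the scores
    l = 0.0                          # running sum of exp(score - m)
    acc = [0.0] * len(values[0])     # unnormalized P @ V accumulator
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        # "QK" step: one score per key row in this tile.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        m_new = max(m, max(scores))
        correction = math.exp(m - m_new)  # rescales earlier tiles
        p = [math.exp(s - m_new) for s in scores]
        l = l * correction + sum(p)
        # "PV" step: accumulate the weighted value rows.
        acc = [
            a * correction + sum(pj * v[c] for pj, v in zip(p, v_tile))
            for c, a in enumerate(acc)
        ]
        m = m_new
    return m, l, acc


def split_k_decode(q, keys, values, num_partitions=2, tile=4):
    """Split the KV sequence into partitions and merge the partial results."""
    n = len(keys)
    per = -(-n // num_partitions)  # ceiling division
    partials = [
        partial_attention(q, keys[s:s + per], values[s:s + per], tile)
        for s in range(0, n, per)
    ]
    m_glob = max(m for m, _, _ in partials)
    denom = sum(l * math.exp(m - m_glob) for m, l, _ in partials)
    return [
        sum(acc[c] * math.exp(m - m_glob) for m, _, acc in partials) / denom
        for c in range(len(values[0]))
    ]


def reference_attention(q, keys, values):
    """Single-pass softmax attention for one query, used as a check."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    mx = max(scores)
    w = [math.exp(s - mx) for s in scores]
    z = sum(w)
    return [
        sum(wj * v[c] for wj, v in zip(w, values)) / z
        for c in range(len(values[0]))
    ]
```

Under these assumptions, `split_k_decode` and `reference_attention` agree up to floating-point rounding for any tile size and partition count, which is the property that lets partitions be computed independently and merged afterwards.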