
Mojo module

mha_decode_streaming

MHA streaming decode kernel for gfx950.

Per-tile loop: K strips are staged DRAM→LDS→REG for the QK MMA, P scores pass through SMEM for the PV MMA, and the kernel uses split-K partitioning.

Uses DecodeStreamingKVBuffer for single-buffer, per-strip DRAM→SMEM staging. No KVCacheIterator is involved; strips are sub-tiled from an external DRAM tile.
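
The data flow described above can be illustrated with a small host-side sketch. It is written in plain Python rather than Mojo and is not the kernel's implementation. It assumes that "split-K" means partitioning the KV sequence into independent splits whose partial results are merged with a running-max/running-sum (online softmax) rescale, and that a "strip" is a group of K rows inside a tile; neither detail is confirmed by this page. All function names and the `tile`, `strip`, and `num_splits` parameters are hypothetical.

```python
import math


def partial_attention(q, keys, values, tile=4, strip=2):
    """One split: walk K/V in tiles, score each tile strip by strip,
    and keep an online-softmax state (running max, running sum, accumulator)."""
    m = -math.inf                      # running max of the scores seen so far
    l = 0.0                            # running sum of exp(score - m)
    acc = [0.0] * len(values[0])       # unnormalized P @ V accumulator
    scale = 1.0 / math.sqrt(len(q))
    for t0 in range(0, len(keys), tile):
        t1 = min(t0 + tile, len(keys))
        scores = []
        # QK step: the tile is consumed one strip of K rows at a time
        # (assumed stand-in for the per-strip staging).
        for s0 in range(t0, t1, strip):
            for k in keys[s0:min(s0 + strip, t1)]:
                scores.append(scale * sum(a * b for a, b in zip(q, k)))
        # Rescale the old state to the new running max.
        m_new = max(m, max(scores))
        corr = math.exp(m - m_new)     # exp(-inf) == 0.0 on the first tile
        l *= corr
        acc = [a * corr for a in acc]
        # PV step: the scores for the whole tile are reused here.
        for i, s in enumerate(scores):
            p = math.exp(s - m_new)
            l += p
            acc = [a + p * x for a, x in zip(acc, values[t0 + i])]
        m = m_new
    return m, l, acc


def merge_partials(partials):
    """Combine per-split (max, sum, accumulator) triples into the final output."""
    m_all = max(m for m, _, _ in partials)
    l_all = 0.0
    out = [0.0] * len(partials[0][2])
    for m, l, acc in partials:
        w = math.exp(m - m_all)
        l_all += w * l
        out = [o + w * a for o, a in zip(out, acc)]
    return [o / l_all for o in out]


def split_k_decode(q, keys, values, num_splits=2, tile=4, strip=2):
    """Partition the KV sequence into splits, process each, then merge."""
    n = len(keys)
    per = (n + num_splits - 1) // num_splits
    partials = [
        partial_attention(q, keys[i:i + per], values[i:i + per], tile, strip)
        for i in range(0, n, per)
    ]
    return merge_partials(partials)


def reference_attention(q, keys, values):
    """Plain softmax(q . K^T / sqrt(d)) @ V for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    probs = [math.exp(s - m) for s in scores]
    total = sum(probs)
    return [
        sum(p * v[j] for p, v in zip(probs, values)) / total
        for j in range(len(values[0]))
    ]
```

Calling `split_k_decode(q, keys, values, num_splits=3)` should agree with `reference_attention(q, keys, values)` up to floating-point rounding, because the merge step only rescales each split's partial sums to a common maximum.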
