Mojo module
hw_ops
TileTensor-native AMD GPU hardware operations for MHA.
Ports of the LayoutTensor-based HW load functions from amd/utils.mojo to TileTensor. These use new-style layouts (from tile_layout.mojo) for thread distribution and operate on TileTensor SMEM/register tiles.
Functions:
- ds_read_tr16_b64_row: 4×16 transposed LDS read (raw rocdl intrinsic)
- ds_read_tr16_b64_warp: warp-level transposed LDS read
- tt_load_b_tr: transposed B operand load (split into halves)
- tt_load_b_tile: single MMA tile load from SMEM with swizzle
- tt_load_b: full B operand load from SMEM warp tile
- tt_copy_dram_to_sram_lds: fully TileTensor DMA (both dst and src)
Functions
- ds_read_tr16_b64_row: 4×16 transposed LDS read via rocdl.ds.read.tr16.b64.
- ds_read_tr16_b64_warp: Warp-level transposed LDS read distributing across 16-lane rows.
- tt_copy_dram_to_sram_lds: DMA from DRAM to LDS with TileTensor for both dst and src.
- tt_load_b: Full B operand load from a SMEM warp tile.
- tt_load_b_tile: Single MMA tile load from SMEM with optional swizzle.
- tt_load_b_tr: Transposed B operand load for double-rate MFMA shapes.