Mojo module

hw_ops

TileTensor-native AMD GPU hardware operations for MHA.

Ports of the LayoutTensor-based HW load functions from amd/utils.mojo to TileTensor. These use new-style layouts (from tile_layout.mojo) for thread distribution and operate on TileTensor SMEM/register tiles.

Functions:
ds_read_tr16_b64_row: 4×16 transposed LDS read (raw rocdl intrinsic)
ds_read_tr16_b64_warp: warp-level transposed LDS read
tt_load_b_tr: transposed B operand load (split into halves)
tt_load_b_tile: single MMA tile load from SMEM with swizzle
tt_load_b: full B operand load from SMEM warp tile
tt_copy_dram_to_sram_lds: fully TileTensor DMA (both dst and src)

Functions

ds_read_tr16_b64_row: 4×16 transposed LDS read via rocdl.ds.read.tr16.b64.
ds_read_tr16_b64_warp: Warp-level transposed LDS read distributing across 16-lane rows.
tt_copy_dram_to_sram_lds: DMA from DRAM to LDS with TileTensor for both dst and src.
tt_load_b: Full B operand load from a SMEM warp tile.
tt_load_b_tile: Single MMA tile load from SMEM with optional swizzle.
tt_load_b_tr: Transposed B operand load for double-rate MFMA shapes.
