Mojo module
amd_tile_io
TileTensor data movement and AMD GPU hardware operations.
Provides reusable building blocks for TileTensor-based DMA, LDS reads, and MMA operand loads on AMD CDNA GPUs (gfx950+).
Low-level LDS read primitives:
ds_read_tr16_b64_row - 4x16 transposed LDS read (raw rocdl intrinsic).
ds_read_tr16_b64_warp - Warp-level transposed LDS read.
load_lds_fragment - Generic MFMA-fragment LDS read with swizzle.
DRAM→LDS cooperative DMA loaders (expert objects, structurally composed):
TileLoaderLDS - Warp-group cooperative, coord-indexed tile iteration (half-tile BK-wide steps, per-iter swizzle). Matmul's pattern. Uses stdlib AMDBufferResource.load_to_lds.
SubTileLoaderLDS - Single sub-tile DMA, TileTensor-indexed. Attention's pattern. Uses rocdl.raw.ptr.buffer.load.lds with the amdgpu.AsyncCopies alias scope so consumers carrying noalias_scopes=_alias_scope_attr can skip s_waitcnt vmcnt(0) (PR #74537).
SMEM→register MMA-fragment loader (expert object, static methods):
TiledMmaLoader - Sibling to TiledMmaOp. Parameterized by operand dtype, MMA shape, and optional swizzle. Static load_b, load_b_tr, load_v_fp8_strip methods cover the B-operand and V-operand MFMA-fragment load patterns (attention's QK / PV matmuls).
DRAM↔register loaders:
RegTileLoader - AMD buffer-resource load from DRAM to registers.
RegTileWriter - AMD buffer-resource store from registers to DRAM. Buffer-resource OOB clamping handles the M boundary cleanly but cannot distinguish row/col straddle, so use this only when N is BN-aligned and no fused lambda is needed.
RegTileEpilogue - Per-lane epilogue writer with optional fused elementwise lambda. Caller passes (m_global, n_global) per call; the writer handles the fully-in-bounds chunk store, the partial-chunk-straddling-N per-element fallback, and the lambda dispatch. Use this for any kernel that needs to support N-misaligned shapes or a fused epilogue.
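The RegTileWriter restriction above (N must be BN-aligned) can be illustrated with a small Python model. This is a hedged sketch, not the Mojo API: it assumes a row-major M x N output and a bounds check that compares each element's linear offset against the buffer size and drops out-of-range stores. The names `clamped_store` and `store_chunk` are hypothetical.

```python
# Simplified model (an assumption, not the hardware spec): a buffer resource
# bounds-checks each element's linear offset against num_records and silently
# drops out-of-range stores.
def clamped_store(buf, num_records, offset, values):
    for i, v in enumerate(values):
        if offset + i < num_records:
            buf[offset + i] = v


def store_chunk(buf, M, N, m, n, values):
    # Row-major M x N output: element (m, n) lives at m * N + n.
    clamped_store(buf, M * N, m * N + n, values)


M, N = 2, 6                      # N is not a multiple of the 4-wide chunk
out = [0] * (M * N)

# Row past M: every offset is >= M * N, so the whole chunk is dropped.
store_chunk(out, M, N, 2, 0, [9, 9, 9, 9])
after_m_overflow = list(out)

# Row 0, columns 4..7: columns 6 and 7 do not exist, but their linear
# offsets (6 and 7) are inside the buffer, so they land in row 1.
store_chunk(out, M, N, 0, 4, [1, 1, 1, 1])
row0, row1 = out[:N], out[N:]
print(row0)  # [0, 0, 0, 0, 1, 1]
print(row1)  # [1, 1, 0, 0, 0, 0]
```

Under these assumptions the M overflow is harmless, while the N straddle overwrites the start of the next row. That is the case a per-element fallback has to cover.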
Register→LDS writer (expert object, static methods):
RegTileWriterLDS - Sibling to RegTileLoader / RegTileWriter. Stateless; parameterized by thread_layout + swizzle. .copy handles plain SMEM; .copy_blocked[block_cols] handles the blocked_product-mismatched-layout case.
SMEM layout helpers: smem_subtile / smem_mma_subtile / smem_mma_subtile_offset - blocked SMEM navigation (TileTensor views + offset math).
comptime values
elementwise_epilogue_type
comptime elementwise_epilogue_type = def[dtype: DType, width: Int, *, alignment: Int = 1](IndexList[2], SIMD[dtype, width]) capturing -> None
Type alias for a fused elementwise epilogue lambda.
Local re-declaration of linalg.utils.elementwise_epilogue_type. structured_kernels is a dependency of linalg, so the canonical definition cannot be imported without creating a cyclic Bazel dependency. Mojo function-pointer types are structural, so this duplicate alias is interchangeable with the canonical one at every call site that hands a lambda across the package boundary.
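To make the shape of such a lambda concrete, here is a minimal Python model of a writer that dispatches to a fused epilogue. It is a sketch under stated assumptions, not the Mojo implementation: `Epilogue`, `epilogue_write`, and `add_one` are hypothetical names, the output is a list of rows, and the per-element fallback is assumed to call the lambda with width-1 chunks.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for the lambda type: (coords, values) -> None.
Epilogue = Callable[[Tuple[int, int], List[float]], None]


def epilogue_write(out, M, N, m_global, n_global, values,
                   epilogue: Optional[Epilogue] = None):
    if m_global >= M or n_global >= N:
        return                                   # chunk starts out of bounds
    width = len(values)
    if n_global + width <= N:
        # Fully in bounds: one chunk-wide store or one lambda call.
        if epilogue is not None:
            epilogue((m_global, n_global), values)
        else:
            out[m_global][n_global:n_global + width] = values
        return
    # Chunk straddles N: fall back to one element at a time (assumed width 1).
    for i in range(N - n_global):
        if epilogue is not None:
            epilogue((m_global, n_global + i), [values[i]])
        else:
            out[m_global][n_global + i] = values[i]


M, N = 2, 6
fused = [[0] * N for _ in range(M)]
calls = []


def add_one(coords, vals):
    # Example fused epilogue: add 1 to every value, then store it.
    calls.append((coords, len(vals)))
    r, c = coords
    for i, v in enumerate(vals):
        fused[r][c + i] = v + 1


epilogue_write(fused, M, N, 1, 0, [1, 2, 3, 4], add_one)  # in-bounds chunk
epilogue_write(fused, M, N, 1, 4, [5, 6, 7, 8], add_one)  # straddles N
print(fused[1])  # [2, 3, 4, 5, 6, 7]
print(calls)     # [((1, 0), 4), ((1, 4), 1), ((1, 5), 1)]
```

The lambda receives the global coordinates of the first element it is handed, so it can apply position-dependent work (bias, masking) without knowing how the writer chunked the tile.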
GMemTile
comptime GMemTile[mut: Bool, //, dtype: DType, LayoutType: TensorLayout, origin: Origin[mut=mut]] = TileTensor[dtype, LayoutType, origin]
Global memory tile. Alias for TileTensor in default (GENERIC) address space.
Parameters
- mut (
Bool): - dtype (
DType): - LayoutType (
TensorLayout): - origin (
Origin[mut=mut]):
RegTile
comptime RegTile[mut: Bool, //, dtype: DType, LayoutType: TensorLayout, origin: Origin[mut=mut]] = TileTensor[dtype, LayoutType, origin, address_space=AddressSpace.LOCAL]
Register tile. Alias for TileTensor in LOCAL address space.
Parameters
- mut (Bool):
- dtype (DType):
- LayoutType (TensorLayout):
- origin (Origin[mut=mut]):
SMemTile
comptime SMemTile[mut: Bool, //, dtype: DType, LayoutType: TensorLayout, origin: Origin[mut=mut]] = TileTensor[dtype, LayoutType, origin, address_space=AddressSpace.SHARED]
Shared memory tile. Alias for TileTensor in SHARED address space.
Parameters
- mut (Bool):
- dtype (DType):
- LayoutType (TensorLayout):
- origin (Origin[mut=mut]):
Structs
- RegTileEpilogue: Per-lane MFMA epilogue writer with optional fused elementwise lambda.
- RegTileLoader: AMD buffer-resource load from DRAM to registers.
- RegTileWriter: AMD buffer-resource store for writing register tiles to DRAM.
- RegTileWriterLDS: Stateless register→LDS copy expert.
- SubTileLoaderLDS: DRAM→LDS DMA expert for single-sub-tile TileTensor-indexed loads.
- SubTileLoaderLDS_HK_st_8x32: DRAM→LDS DMA for HK kittens' st_8x32_s SMEM layout (V operand).
- TiledMmaLoader: SMEM→register loader expert for MFMA operand fragments.
- TileLoaderLDS: DRAM→LDS DMA expert for warp-group cooperative coord-indexed loads.
Traits
- TileLoader: DRAM→LDS DMA loader contract for tile_rows × tile_cols half-tiles.
Functions
- ds_read_b128_imm_u32x4: Issues ds_read_b128 with a comptime immediate offset and returns the loaded 128 bits as SIMD[DType.uint32, 4].
- ds_read_tr16_b64_row: 4x16 transposed LDS read via rocdl.ds.read.tr16.b64.
- ds_read_tr16_b64_warp: Warp-level transposed LDS read distributing across 16-lane rows.
- load_lds_fragment: Load MMA fragments from SMEM to registers using hardware access pattern.
- reg_alloc: Stack-allocate a register tile (LOCAL address space) with the given layout.
- smem_alloc: Stack-allocate a shared memory tile (SHARED address space) with the given layout.
- smem_mma_subtile: Creates a flat TileTensor for an MMA-sized sub-tile in blocked SMEM.
- smem_mma_subtile_offset: Element offset of an MMA sub-tile within a blocked (BN x BK) SMEM region.
- smem_subtile: Creates a flat TileTensor sub-view of a blocked SMEM layout.
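The offset math behind the blocked SMEM helpers can be sketched in Python. This page does not spell out the actual layout, so the sketch assumes one plausible blocking: the BN x BK region is split into sub_rows x sub_cols blocks, blocks are stored one after another in row-major block order, and elements are row-major inside each block. `subtile_offset` and `element_offset` are hypothetical names, not the module's functions.

```python
def subtile_offset(block_row, block_col, BK, sub_rows, sub_cols):
    # Each block occupies sub_rows * sub_cols contiguous elements.
    blocks_per_row = BK // sub_cols
    return (block_row * blocks_per_row + block_col) * sub_rows * sub_cols


def element_offset(r, c, BK, sub_rows, sub_cols):
    # Locate the block that holds (r, c), then index row-major inside it.
    base = subtile_offset(r // sub_rows, c // sub_cols, BK, sub_rows, sub_cols)
    return base + (r % sub_rows) * sub_cols + (c % sub_cols)


BN, BK = 32, 64
SUB_ROWS, SUB_COLS = 16, 32

print(subtile_offset(0, 1, BK, SUB_ROWS, SUB_COLS))  # 512
print(subtile_offset(1, 1, BK, SUB_ROWS, SUB_COLS))  # 1536
print(element_offset(17, 33, BK, SUB_ROWS, SUB_COLS))  # 1569

# Every element of the region maps to a distinct offset in [0, BN * BK).
all_offsets = sorted(
    element_offset(r, c, BK, SUB_ROWS, SUB_COLS)
    for r in range(BN) for c in range(BK)
)
```

Under this assumed layout each sub-tile is a contiguous run of elements, which is what allows a sub-tile to be viewed as a flat tile starting at its block offset.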