
Mojo module

amd_tile_io

TileTensor data movement and AMD GPU hardware operations.

Provides reusable building blocks for TileTensor-based DMA, LDS reads, and MMA operand loads on AMD CDNA GPUs (gfx950+).

Low-level LDS read primitives:
ds_read_tr16_b64_row - 4x16 transposed LDS read (raw rocdl intrinsic).
ds_read_tr16_b64_warp - Warp-level transposed LDS read.
load_lds_fragment - Generic MFMA-fragment LDS read with swizzle.

DRAM→LDS cooperative DMA loaders (expert objects, structurally composed):
TileLoaderLDS - Warp-group cooperative, coord-indexed tile iteration (half-tile BK-wide steps, per-iter swizzle). Matmul's pattern. Uses stdlib AMDBufferResource.load_to_lds.
SubTileLoaderLDS - Single sub-tile DMA, TileTensor-indexed. Attention's pattern. Uses rocdl.raw.ptr.buffer.load.lds with the amdgpu.AsyncCopies alias scope so consumers carrying noalias_scopes=_alias_scope_attr can skip s_waitcnt vmcnt(0) (PR #74537).

SMEM→register MMA-fragment loader (expert object, static methods):
TiledMmaLoader - Sibling to TiledMmaOp. Parameterized by operand dtype, MMA shape, and optional swizzle. The static load_b, load_b_tr, and load_v_fp8_strip methods cover the B-operand and V-operand MFMA-fragment load patterns (attention's QK / PV matmuls).

DRAM↔register loaders:
RegTileLoader - AMD buffer-resource load from DRAM to registers.
RegTileWriter - AMD buffer-resource store from registers to DRAM. Buffer-resource OOB clamping handles the M boundary cleanly but cannot distinguish a row straddle from a column straddle, so use RegTileWriter only when N is BN-aligned and no fused lambda is needed.
RegTileEpilogue - Per-lane epilogue writer with optional fused elementwise lambda. The caller passes (m_global, n_global) per call; the writer handles the fully in-bounds chunk store, the per-element fallback for a chunk that straddles N, and the lambda dispatch. Use RegTileEpilogue for any kernel that must support N-misaligned shapes or a fused epilogue.
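The M-versus-N asymmetry of buffer-resource clamping can be illustrated with a small host-side model. This is a sketch in Python, not the module's Mojo implementation: it assumes the bounds check is applied to the linear element offset only (real buffer resources count bytes, and the exact descriptor settings used by RegTileWriter are not shown on this page). The helper names `buffer_store` and `store_tile` are hypothetical.

```python
# Simplified model (assumption): the buffer resource bounds-checks only the
# *linear* element offset; it has no notion of rows or columns.

def buffer_store(buf, num_records, offset, value):
    """Store that is silently dropped when the offset is out of range."""
    if 0 <= offset < num_records:
        buf[offset] = value


def store_tile(buf, M, N, m0, n0, tile):
    """Write a 2D tile at (m0, n0) into a row-major M x N buffer."""
    for i, row in enumerate(tile):
        for j, v in enumerate(row):
            buffer_store(buf, M * N, (m0 + i) * N + (n0 + j), v)


M, N = 3, 4
tile = [[1, 1], [1, 1]]

# Tile hangs over the M boundary: offsets past M * N are dropped,
# so nothing inside the matrix is corrupted.
a = [0] * (M * N)
store_tile(a, M, N, m0=2, n0=0, tile=tile)

# Tile hangs over the N boundary: (row, N) has the same linear offset as
# (row + 1, 0), which is still in range, so the store lands in the next row.
b = [0] * (M * N)
store_tile(b, M, N, m0=0, n0=3, tile=tile)

print(a)  # row 2 written, overhang dropped
print(b)  # stray writes at offsets 4 and 8 (column 0 of rows 1 and 2)
```

In this model the M overhang is harmless, while the N overhang silently overwrites the start of the following rows, which is why a per-element path is needed when N is not BN-aligned.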

Register→LDS writer (expert object, static methods):
RegTileWriterLDS - Sibling to RegTileLoader / RegTileWriter. Stateless; parameterized by thread_layout + swizzle. .copy handles plain SMEM; .copy_blocked[block_cols] handles the blocked_product-mismatched-layout case.

SMEM layout helpers:
smem_subtile / smem_mma_subtile / smem_mma_subtile_offset - Blocked SMEM navigation (TileTensor views + offset math).

`comptime` values

`elementwise_epilogue_type`

comptime elementwise_epilogue_type = def[dtype: DType, width: SIMDSize, *, alignment: Int = Int(1)](IndexList[Int(2)], SIMD[dtype, width]) capturing thin -> None

Type alias for a fused elementwise epilogue lambda.

Local re-declaration of linalg.utils.elementwise_epilogue_type. structured_kernels is a dependency of linalg, so the canonical definition cannot be imported here without creating a cyclic Bazel dependency. Mojo function-pointer types are structural, so this duplicate alias is interchangeable with the canonical one at every call site that passes a lambda across the package boundary.
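To make the calling convention concrete, here is a minimal Python model of how a per-lane writer might dispatch to such a lambda. It is a sketch under assumptions, not the RegTileEpilogue source: the real lambda receives an IndexList of two coordinates and a SIMD vector, modelled here as a tuple and a list, and the function name `write_chunk` and its arguments are hypothetical.

```python
def write_chunk(out, M, N, m, n, values, epilogue=None):
    """Write `values` starting at (m, n) of a row-major M x N output.

    Hypothetical model: when `epilogue` is given it is called as
    epilogue((row, col), vector) instead of storing directly.
    """
    width = len(values)
    if m >= M or n >= N:
        return  # this lane lies entirely outside the output

    def emit(col, vec):
        if epilogue is not None:
            epilogue((m, col), vec)
        else:
            start = m * N + col
            out[start:start + len(vec)] = vec

    if n + width <= N:
        emit(n, values)  # fully in-bounds: one vector-wide call
    else:
        for k in range(N - n):  # chunk straddles N: one element at a time
            emit(n + k, values[k:k + 1])


M, N = 2, 6
out = [0] * (M * N)
calls = []


def add_ten(coords, vec):
    """Example fused epilogue: adds 10 and stores, recording each call."""
    row, col = coords
    for k, v in enumerate(vec):
        out[row * N + col + k] = v + 10
    calls.append((coords, len(vec)))


write_chunk(out, M, N, 0, 0, [1, 2, 3, 4], add_ten)  # in-bounds chunk
write_chunk(out, M, N, 1, 4, [5, 6, 7, 8], add_ten)  # straddles N = 6
print(calls)
print(out)
```

The first call reaches the lambda once with a width-4 vector; the second falls back to two width-1 calls and drops the two elements that lie past N, so the lambda never sees out-of-bounds coordinates.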

`GMemTile`

comptime GMemTile[mut: Bool, //, dtype: DType, LayoutType: TensorLayout, origin: Origin[mut=mut]] = TileTensor[dtype, LayoutType, origin]

Global memory tile. Alias for TileTensor in the default (GENERIC) address space.

Parameters

mut (Bool):
dtype (DType):
LayoutType (TensorLayout):
origin (Origin[mut=mut]):

`RegTile`

comptime RegTile[mut: Bool, //, dtype: DType, LayoutType: TensorLayout, origin: Origin[mut=mut]] = TileTensor[dtype, LayoutType, origin, address_space=AddressSpace.LOCAL]

Register tile. Alias for TileTensor in the LOCAL address space.

Parameters

mut (Bool):
dtype (DType):
LayoutType (TensorLayout):
origin (Origin[mut=mut]):

`SMemTile`

comptime SMemTile[mut: Bool, //, dtype: DType, LayoutType: TensorLayout, origin: Origin[mut=mut]] = TileTensor[dtype, LayoutType, origin, address_space=AddressSpace.SHARED]

Shared memory tile. Alias for TileTensor in the SHARED address space.

Parameters

mut (Bool):
dtype (DType):
LayoutType (TensorLayout):
origin (Origin[mut=mut]):

Structs

RegTileEpilogue: Per-lane MFMA epilogue writer with optional fused elementwise lambda.
RegTileLoader: AMD buffer-resource load from DRAM to registers.
RegTileWriter: AMD buffer-resource store for writing register tiles to DRAM.
RegTileWriterLDS: Stateless register→LDS copy expert.
SubTileLoaderLDS: DRAM→LDS DMA expert for single-sub-tile TileTensor-indexed loads.
SubTileLoaderLDS_st_8x32: DRAM→LDS DMA for the reference st_8x32_s SMEM layout (V operand).
TiledMmaLoader: SMEM→register loader expert for MFMA operand fragments.
TileLoaderLDS: DRAM→LDS DMA expert for warp-group cooperative coord-indexed loads.

Traits

TileLoader: DRAM→LDS DMA loader contract for tile_rows × tile_cols half-tiles.

Functions

ds_read_b128_imm_u32x4: Issues ds_read_b128 with a comptime immediate offset and returns the loaded 128 bits as SIMD[DType.uint32, 4].
ds_read_tr16_b64_row: 4x16 transposed LDS read via rocdl.ds.read.tr16.b64.
ds_read_tr16_b64_warp: Warp-level transposed LDS read distributing across 16-lane rows.
load_lds_fragment: Load MMA fragments from SMEM to registers using hardware access pattern.
reg_alloc: Stack-allocate a register tile (LOCAL address space) with the given layout.
smem_alloc: Stack-allocate a shared memory tile (SHARED address space) with the given layout.
smem_mma_subtile: Creates a flat TileTensor for an MMA-sized sub-tile in blocked SMEM.
smem_mma_subtile_offset: Element offset of an MMA sub-tile within a blocked (BN x BK) SMEM region.
smem_subtile: Creates a flat TileTensor sub-view of a blocked SMEM layout.
