IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /max/get-started.md). For the complete documentation index, see llms.txt.

Mojo module

amd_tile_io

TileTensor data movement and AMD GPU hardware operations.

Provides reusable building blocks for TileTensor-based DMA, LDS reads, and MMA operand loads on AMD CDNA GPUs (gfx950+).

Low-level LDS read primitives:

  • ds_read_tr16_b64_row: 4x16 transposed LDS read (raw rocdl intrinsic).
  • ds_read_tr16_b64_warp: Warp-level transposed LDS read.
  • load_lds_fragment: Generic MFMA-fragment LDS read with swizzle.

DRAM→LDS cooperative DMA loaders (expert objects, structurally composed):

  • TileLoaderLDS: Warp-group cooperative, coord-indexed tile iteration (half-tile BK-wide steps, per-iteration swizzle). Matmul's pattern. Uses stdlib AMDBufferResource.load_to_lds.
  • SubTileLoaderLDS: Single sub-tile DMA, TileTensor-indexed. Attention's pattern. Uses rocdl.raw.ptr.buffer.load.lds with the amdgpu.AsyncCopies alias scope so consumers carrying noalias_scopes=_alias_scope_attr can skip s_waitcnt vmcnt(0) (PR #74537).

SMEM→register MMA-fragment loader (expert object, static methods):

  • TiledMmaLoader: Sibling to TiledMmaOp. Parameterized by operand dtype, MMA shape, and optional swizzle. The static methods load_b, load_b_tr, and load_v_fp8_strip cover the B-operand and V-operand MFMA-fragment load patterns (attention's QK / PV matmuls).

DRAM↔register loaders:

  • RegTileLoader: AMD buffer-resource load from DRAM to registers.
  • RegTileWriter: AMD buffer-resource store from registers to DRAM. Buffer-resource OOB clamping handles the M boundary cleanly but cannot distinguish row/col straddle, so use this only when N is BN-aligned and no fused lambda is needed.
  • RegTileEpilogue: Per-lane epilogue writer with optional fused elementwise lambda. The caller passes (m_global, n_global) per call; the writer handles the fully-in-bounds chunk store, the per-element fallback for a partial chunk that straddles N, and the lambda dispatch. Use this for any kernel that needs to support N-misaligned shapes or a fused epilogue.
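The difference between the two writers can be sketched with a small host-side model. This is Python, not Mojo, and it is not the module's API: buffer_store and epilogue_store are hypothetical names, the output is modelled as a flat row-major list, and the range check on the linear offset is an assumption about how buffer-resource clamping behaves. The fused callback signature (m, n, values) is likewise illustrative; in this model the callback replaces the plain store.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical callback shape for a fused epilogue: (row, col, values).
Fused = Callable[[int, int, Sequence[float]], None]


def buffer_store(out: List[float], M: int, N: int,
                 m: int, n: int, chunk: Sequence[float]) -> None:
    """Model of a store guarded only by a range check on the flat offset.

    Assumption: the check sees the linear offset, not (row, col), so it
    drops writes past the end of the buffer (the M boundary) but lets a
    chunk that straddles N spill into the start of the next row.
    """
    for i, value in enumerate(chunk):
        offset = m * N + n + i
        if offset < M * N:  # only the total extent is known here
            out[offset] = value


def epilogue_store(out: List[float], M: int, N: int,
                   m: int, n: int, chunk: Sequence[float],
                   fused: Optional[Fused] = None) -> None:
    """Model of a per-lane epilogue that checks (m, n) explicitly."""
    if m >= M:
        return  # the whole chunk is below the M boundary
    width = len(chunk)
    base = m * N + n
    if n + width <= N:
        # Fully in-bounds chunk: one wide store, or one wide lambda call.
        if fused is not None:
            fused(m, n, chunk)
        else:
            out[base:base + width] = list(chunk)
        return
    # Chunk straddles N: fall back to per-element handling.
    for i, value in enumerate(chunk):
        if n + i < N:
            if fused is not None:
                fused(m, n + i, [value])
            else:
                out[base + i] = value
```

With M = 2, N = 6 and a 4-wide chunk at (0, 4), the buffer_store model writes two elements into row 1, while epilogue_store writes only columns 4 and 5 of row 0. That is the N-misaligned case the RegTileWriter description warns about.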

Register→LDS writer (expert object, static methods):

  • RegTileWriterLDS: Sibling to RegTileLoader / RegTileWriter. Stateless; parameterized by thread_layout + swizzle. .copy handles plain SMEM; .copy_blocked[block_cols] handles the blocked_product-mismatched-layout case.

SMEM layout helpers:

  • smem_subtile / smem_mma_subtile / smem_mma_subtile_offset: Blocked SMEM navigation (TileTensor views + offset math).

comptime values

elementwise_epilogue_type

comptime elementwise_epilogue_type = def[dtype: DType, width: Int, *, alignment: Int = 1](IndexList[2], SIMD[dtype, width]) capturing -> None

Type alias for a fused elementwise epilogue lambda.

Local re-declaration of linalg.utils.elementwise_epilogue_type. structured_kernels is a dependency of linalg, so the canonical definition cannot be imported here without creating a cyclic Bazel dependency. Mojo function-pointer types are structural, so this duplicate alias is interchangeable with the canonical one at every call site that hands a lambda across the package boundary.

GMemTile

comptime GMemTile[mut: Bool, //, dtype: DType, LayoutType: TensorLayout, origin: Origin[mut=mut]] = TileTensor[dtype, LayoutType, origin]

Global memory tile. Alias for TileTensor in the default (GENERIC) address space.

Parameters

RegTile

comptime RegTile[mut: Bool, //, dtype: DType, LayoutType: TensorLayout, origin: Origin[mut=mut]] = TileTensor[dtype, LayoutType, origin, address_space=AddressSpace.LOCAL]

Register tile. Alias for TileTensor in the LOCAL address space.

Parameters

SMemTile

comptime SMemTile[mut: Bool, //, dtype: DType, LayoutType: TensorLayout, origin: Origin[mut=mut]] = TileTensor[dtype, LayoutType, origin, address_space=AddressSpace.SHARED]

Shared memory tile. Alias for TileTensor in the SHARED address space.

Parameters

Structs

Traits

  • TileLoader: DRAM→LDS DMA loader contract for tile_rows × tile_cols half-tiles.

Functions