IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /max/get-started.md). For the complete documentation index, see llms.txt.
Skip to main content
For the complete documentation index, see llms.txt. Markdown versions of all pages are available by appending .md to any URL (e.g. /max/get-started.md).

Mojo module

amd_tile_io_conv

DRAM->LDS DMA loader for AMD implicit-GEMM convolution.

TileLoaderLDSIm2col is a sibling expert of structured_kernels.amd_tile_io.TileLoaderLDS under the same DRAM→LDS warp-cooperative direction. The split parallels the existing TileLoaderLDS (linear 2D source, coord-indexed) vs SubTileLoaderLDS (single sub-tile, TileTensor-indexed, async-copies alias scope) — same direction, different cooperation or source shape. Here the differentiator is the source layout type: this loader accepts a 4D NHWC input tensor and (m_offset, k_offset) coordinates in GEMM space (where M = N*H_out*W_out and K = R*S*C), translating them to NHWC linear offsets at load time. The per-lane vs uniform split of each address is preserved so each iteration still issues a single buffer_load_*_lds per lane, matching the vmcnt accounting amd_4wave_matmul's schedule relies on.

For each lane each iteration, load_tile computes

(n, h_out, w_out) = decompose(m_offset + lane_offset, H_out, W_out)
(kh, kw, c)       = decompose(k_offset + lane_offset, R, S, C)
h_in              = h_out * stride_h + kh * dilation_h - pad_h
w_in              = w_out * stride_w + kw * dilation_w - pad_w
addr              = ((n*H + h_in)*W + w_in)*C + c

and emits one buffer_load_*_lds per lane targeting addr. Three comptime sub-paths inside the body, picked at struct instantiation:

  • Pure-pointwise fast path (R=S=1, stride=1, dilation=1, pad=0): math collapses to m*C + k, identical to TileLoaderLDS(stride = C). Same instruction stream as the matmul loader; same VGPR usage.
  • Uniform-substrip path (general R×S, tile_cols ≤ C and C % tile_cols == 0): each load_tile call's lanes all sit in one (kh, kw) substrip; (kh, kw, c_base) are computed once per call from k_offset. Address-decomp adds the per-lane (n, h_out, w_out) divmods.
  • Per-lane substrip path (BK > C or non-aligned, e.g. ResNet stem with C_in = 64 and BK = 128): each lane independently decomposes its k_lane = k_offset + thread_col. Adds two more divmods per lane but unblocks shapes the fast path can't cover.

Halo (pad > 0) is handled by routing OOB lanes (h_in or w_in outside [0, H)/[0, W)) to the SRD's num_records sentinel — the AMD buffer-resource hardware bound check then clamps the read to 0. Gated by a _needs_halo_mask comptime flag so pad=0 callers keep their exact instruction stream.

The SRD covers the full NHWC tensor (N*H*W*C elements), not a per-block slice — so OOB reads (deliberate halo or otherwise) are bounded by the actual allocation, in contrast to TileLoaderLDS which sizes the SRD to the per-block .tile[BM, K]() view. The full-tensor pattern caps at 4 GiB (AMDBufferResource.num_records is a 32-bit byte count); fine for typical NHWC conv inputs but watch the cap when NHW*C grows.

K-padding is the caller's responsibility (the 4-wave schedule requires K_filter % (2*BK) == 0). Set C to the real input channel count and pre-pad the filter's trailing K columns with zeros; the address math naturally handles the padded K reads via SRD-OOB or filter-zero multiplies in the MMA.

Structs