For the complete documentation index, see llms.txt. Markdown versions of all pages are available by appending .md to any URL (e.g. /max/get-started.md).
Mojo module
amd_tile_io_conv
DRAM->LDS DMA loader for AMD implicit-GEMM convolution.
TileLoaderLDSIm2col is a sibling expert of
structured_kernels.amd_tile_io.TileLoaderLDS under the same
DRAM→LDS warp-cooperative direction. The split parallels the
existing TileLoaderLDS (linear 2D source, coord-indexed) vs
SubTileLoaderLDS (single sub-tile, TileTensor-indexed, async-copies
alias scope) — same direction, different cooperation or source
shape. Here the differentiator is the source layout type: this
loader accepts a 4D NHWC input tensor and (m_offset, k_offset)
coordinates in GEMM space (where M = N*H_out*W_out and
K = R*S*C), translating them to NHWC linear offsets at load time.
The per-lane vs uniform split of each address is preserved so each
iteration still issues a single buffer_load_*_lds per lane,
matching the vmcnt accounting amd_4wave_matmul's schedule relies on.
For each lane each iteration, load_tile computes
(n, h_out, w_out) = decompose(m_offset + lane_offset, H_out, W_out)
(kh, kw, c) = decompose(k_offset + lane_offset, R, S, C)
h_in = h_out * stride_h + kh * dilation_h - pad_h
w_in = w_out * stride_w + kw * dilation_w - pad_w
addr = ((n*H + h_in)*W + w_in)*C + cand emits one buffer_load_*_lds per lane targeting addr. Three
comptime sub-paths inside the body, picked at struct instantiation:
- Pure-pointwise fast path (R=S=1, stride=1, dilation=1,
pad=0): math collapses to
m*C + k, identical toTileLoaderLDS(stride = C). Same instruction stream as the matmul loader; same VGPR usage. - Uniform-substrip path (general R×S,
tile_cols ≤ CandC % tile_cols == 0): eachload_tilecall's lanes all sit in one(kh, kw)substrip;(kh, kw, c_base)are computed once per call fromk_offset. Address-decomp adds the per-lane (n, h_out, w_out) divmods. - Per-lane substrip path (BK > C or non-aligned, e.g. ResNet
stem with C_in = 64 and BK = 128): each lane independently
decomposes its
k_lane = k_offset + thread_col. Adds two more divmods per lane but unblocks shapes the fast path can't cover.
Halo (pad > 0) is handled by routing OOB lanes (h_in or w_in outside
[0, H)/[0, W)) to the SRD's num_records sentinel — the AMD
buffer-resource hardware bound check then clamps the read to 0.
Gated by a _needs_halo_mask comptime flag so pad=0 callers keep
their exact instruction stream.
The SRD covers the full NHWC tensor (N*H*W*C elements), not a
per-block slice — so OOB reads (deliberate halo or otherwise) are
bounded by the actual allocation, in contrast to TileLoaderLDS
which sizes the SRD to the per-block .tile[BM, K]() view. The
full-tensor pattern caps at 4 GiB (AMDBufferResource.num_records is
a 32-bit byte count); fine for typical NHWC conv inputs but watch
the cap when NHW*C grows.
K-padding is the caller's responsibility (the 4-wave schedule
requires K_filter % (2*BK) == 0). Set C to the real input
channel count and pre-pad the filter's trailing K columns with
zeros; the address math naturally handles the padded K reads via
SRD-OOB or filter-zero multiplies in the MMA.
Structs
-
TileLoaderLDSIm2col: DRAM->LDS DMA expert for implicit-GEMM convolution, NHWC inputs.
Was this page helpful?
Thank you! We'll create more content like this.
Thank you for helping us improve!