Mojo struct

TileLoaderLDSIm2col

struct TileLoaderLDSIm2col[dtype_: DType, tile_rows_: Int, tile_cols_: Int, C: Int, num_loading_warps: Int, H: Int = Int(1), W: Int = Int(1), H_out: Int = Int(1), W_out: Int = Int(1), R: Int = Int(1), S: Int = Int(1), stride_h: Int = Int(1), stride_w: Int = Int(1), dilation_h: Int = Int(1), dilation_w: Int = Int(1), pad_h: Int = Int(0), pad_w: Int = Int(0), Q: Int = Int(1), D: Int = Int(1), D_out: Int = Int(1), stride_d: Int = Int(1), dilation_d: Int = Int(1), pad_d: Int = Int(0), swizzle: Optional[Swizzle] = Optional(), load_width: Int = simd_width_of[dtype_](), use_full_tile_width: Bool = False, use_runtime_hw: Bool = False]

DRAM->LDS DMA expert for implicit-GEMM convolution, NHWC inputs.

Sibling of TileLoaderLDS (linear GEMM source). Each iteration issues one buffer_load_*_lds per lane, same vmcnt cost as the matmul loader. The kernel's K-loop iterates flat k_offset ∈ [0, R*S*C) in steps of tile_cols; the loader internally decomposes k_offset → (kh, kw, c_offset) and per-lane m_lane → (n, h_out, w_out), then computes addr = ((n*H + h_in)*W + w_in)*C + c for each lane.
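The two index decompositions described above can be modeled on the host. The following is a minimal plain-Python sketch, not the Mojo kernel code: the function names are hypothetical, and it assumes c varies fastest within K and w_out varies fastest within M, which is what the flat (kh, kw, c) and N*H_out*W_out orderings suggest.

```python
def decompose_k(k, S, C):
    """Split a flat K index into (kh, kw, c), with c varying fastest."""
    kh, rem = divmod(k, S * C)
    kw, c = divmod(rem, C)
    return kh, kw, c


def decompose_m(m, H_out, W_out):
    """Split a flat M index into (n, h_out, w_out), with w_out varying fastest."""
    n, rem = divmod(m, H_out * W_out)
    h_out, w_out = divmod(rem, W_out)
    return n, h_out, w_out


# Hypothetical configuration: 3x3 filter, 8 input channels, K tile of 4 columns.
R, S, C, tile_cols = 3, 3, 8, 4

# The K-loop walks k_offset over [0, R*S*C) in steps of tile_cols.
k_steps = [decompose_k(k_offset, S, C) for k_offset in range(0, R * S * C, tile_cols)]
```

With tile_cols dividing C, every step starts at a channel offset whose whole tile stays inside one (kh, kw) pair, which is the property the uniform-substrip path relies on.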

The body picks one of three comptime sub-paths at instantiation: pure-pointwise (R=S=1, no pad — math collapses to m*C + k); uniform-substrip (general R×S with tile_cols ≤ C and C % tile_cols == 0 — one (kh, kw) per call); per-lane substrip (otherwise — each lane decomposes its own k_lane). Pad > 0 additionally routes halo lanes (h_in or w_in outside [0, H)/[0, W)) to the SRD-OOB sentinel.
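As a rough illustration of the selection logic, the sketch below mirrors the three conditions in plain Python. It is an assumption-laden model rather than the struct's actual comptime branch: `select_subpath` is a hypothetical name, only the 2D case is covered, and the pointwise test also requires stride 1, following the fast-path condition given in the load_tile description.

```python
def select_subpath(R, S, C, tile_cols, stride=(1, 1), pad=(0, 0)):
    """Name the sub-path a given 2D configuration would fall into (model only)."""
    if R == 1 and S == 1 and stride == (1, 1) and pad == (0, 0):
        # The address math collapses to m*C + k.
        return "pure-pointwise"
    if tile_cols <= C and C % tile_cols == 0:
        # Every load_tile call stays inside a single (kh, kw) substrip.
        return "uniform-substrip"
    # Otherwise each lane has to decompose its own k_lane.
    return "per-lane-substrip"
```

For example, a 3x3 filter with C = 64 and tile_cols = 32 maps to the uniform-substrip path, while the same filter with C = 3 falls through to the per-lane path.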

Parameters

dtype_ (DType): Element data type. Re-bound to dtype at body scope to match the TileLoader trait alias.
tile_rows_ (Int): Height of each half-tile to load (in M = N*H_out*W_out space). Re-bound to tile_rows at body scope.
tile_cols_ (Int): Width of each half-tile (in K = R*S*C space). Re-bound to tile_cols at body scope. Must satisfy tile_cols ≤ C and C % tile_cols == 0 so each load_tile call lives inside one (kh, kw) substrip.
C (Int): Input channel count.
num_loading_warps (Int): Warps cooperating on each load.
H (Int): Input spatial height.
W (Int): Input spatial width.
H_out (Int): Output spatial height (with stride=1, dilation=1, no pad: H - R + 1).
W_out (Int): Output spatial width.
R (Int): Filter height.
S (Int): Filter width.
stride_h (Int): Vertical conv stride (>= 1).
stride_w (Int): Horizontal conv stride (>= 1).
dilation_h (Int): Vertical conv dilation (>= 1).
dilation_w (Int): Horizontal conv dilation (>= 1).
pad_h (Int): Vertical pad (>= 0). Halo lanes route to the SRD-OOB sentinel when pad > 0.
pad_w (Int): Horizontal pad (>= 0).
Q (Int): Filter temporal extent (3D-only). Q == 1 (default) keeps the loader in 2D mode (4D NHWC input). Q > 1 activates 3D mode (5D NDHWC input, K = Q*R*S*C).
D (Int): Input temporal depth (3D-only; unused when Q == 1).
D_out (Int): Output temporal depth (3D-only).
stride_d (Int): Temporal conv stride (3D-only, >= 1).
dilation_d (Int): Temporal conv dilation (3D-only, >= 1).
pad_d (Int): Temporal pad (3D-only, >= 0). Halo lanes route to the SRD-OOB sentinel when pad_d > 0.
swizzle (Optional[Swizzle]): Optional byte-space swizzle for LDS bank conflicts.
load_width (Int): Elements per load (SIMD width).
use_full_tile_width (Bool): FP8 row-major mode (matches TileLoaderLDS.use_full_tile_width).
use_runtime_hw (Bool): When True, H/W/H_out/W_out (and D/D_out in 3D mode) come from runtime constructor args instead of the comptime template params above. Used for graph-compiled callers with dynamic image resolution (e.g. FLUX VAE). The K-decomposition and conv params (Q, R, S, stride, dilation, pad) stay comptime.

Fields

buffer (AMDBufferResource):
thread_row (Int):
thread_col (Int):
warp_id (Int):
lane_id (Int):
num_records (Int):
m_anchor (Int):
k_anchor (Int):
rt_h (Int):
rt_w (Int):
rt_h_out (Int):
rt_w_out (Int):
rt_spatial (Int):
rt_d (Int):
rt_d_out (Int):
rt_spatial_dhw (Int):

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDeletable, Movable, RegisterPassable, TileLoader, TrivialRegisterPassable

`comptime` members

`dtype`

comptime dtype = dtype_

`elements_per_warp`

comptime elements_per_warp = (_resolve_warp_size() * load_width)

`lane_load_bytes`

comptime lane_load_bytes = (load_width * size_of[dtype_]())

`loading_threads`

comptime loading_threads = (num_loading_warps * _resolve_warp_size())

`num_iterations`

comptime num_iterations = ceildiv(tile_rows_, (Int((mul _resolve_warp_size(), num_loading_warps)) // (tile_cols_ // load_width)))

`num_warp_cols`

comptime num_warp_cols = (tile_cols_ // tile_cols_ if use_full_tile_width else Int(32))

`num_warp_rows`

comptime num_warp_rows = (num_loading_warps // (tile_cols_ // tile_cols_ if use_full_tile_width else Int(32)))

`row_bytes`

comptime row_bytes = (tile_cols_ * size_of[dtype_]())

`rows_per_iteration`

comptime rows_per_iteration = (Int((mul _resolve_warp_size(), num_loading_warps)) // (tile_cols_ // load_width))

`rows_per_warp`

comptime rows_per_warp = (Int((mul _resolve_warp_size(), load_width)) // tile_cols_)

`subtile_cols`

comptime subtile_cols = TileLoaderLDSIm2col[dtype_, tile_rows_, tile_cols_, C, num_loading_warps, H, W, H_out, W_out, R, S, stride_h, stride_w, dilation_h, dilation_w, pad_h, pad_w, Q, D, D_out, stride_d, dilation_d, pad_d, swizzle, load_width, use_full_tile_width, use_runtime_hw].tile_cols if use_full_tile_width else Int(32)

`thread_rows`

comptime thread_rows = (_resolve_warp_size() // (tile_cols_ if use_full_tile_width else Int(32) // load_width))

`threads_per_row`

comptime threads_per_row = (tile_cols_ if use_full_tile_width else Int(32) // load_width)

`tile_cols`

comptime tile_cols = tile_cols_

`tile_rows`

comptime tile_rows = tile_rows_

`total_warp_rows`

comptime total_warp_rows = ceildiv(tile_rows_, (Int((mul _resolve_warp_size(), load_width)) // tile_cols_))

`warp_subtile_bytes`

comptime warp_subtile_bytes = (Int((mul (Int((mul _resolve_warp_size(), load_width)) // tile_cols_), tile_cols_)) * size_of[dtype_]())
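To make the derived constants concrete, the sketch below evaluates the unambiguous ones for one example configuration in plain Python. The warp size of 64 stands in for `_resolve_warp_size()` and is an assumption, as are the example tile shape and the 2-byte element size. Members whose rendered expression involves the `use_full_tile_width` conditional are left out.

```python
def ceildiv(a, b):
    """Integer ceiling division for positive operands."""
    return -(-a // b)


def derived_constants(tile_rows, tile_cols, load_width, num_loading_warps,
                      elem_bytes, warp_size=64):
    """Evaluate the comptime members that depend only on these inputs."""
    rows_per_iteration = (warp_size * num_loading_warps) // (tile_cols // load_width)
    rows_per_warp = (warp_size * load_width) // tile_cols
    return {
        "elements_per_warp": warp_size * load_width,
        "lane_load_bytes": load_width * elem_bytes,
        "loading_threads": num_loading_warps * warp_size,
        "rows_per_iteration": rows_per_iteration,
        "num_iterations": ceildiv(tile_rows, rows_per_iteration),
        "rows_per_warp": rows_per_warp,
        "total_warp_rows": ceildiv(tile_rows, rows_per_warp),
        "row_bytes": tile_cols * elem_bytes,
        "warp_subtile_bytes": rows_per_warp * tile_cols * elem_bytes,
    }


# Hypothetical configuration: 64x32 half-tile of 2-byte elements, 8 elements per load, 4 warps.
example = derived_constants(tile_rows=64, tile_cols=32, load_width=8,
                            num_loading_warps=4, elem_bytes=2)
```

In this configuration the four warps cover all 64 rows in a single iteration, with each lane moving 16 bytes per load.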

Methods

`__init__`

def __init__[InLayout: TensorLayout](src_nhwc: TileTensor[Self.dtype, InLayout], warp_id: Int, lane_id: Int, *, m_anchor: Int = Int(0), k_anchor: Int = Int(0)) -> Self

Builds the loader from a 4D NHWC input TileTensor.

The SRD covers the entire NHWC tensor (N*H*W*C elements). Per-block addressing is split between m_anchor/k_anchor (per-block origin in GEMM space, set at construction) and the m_offset/k_offset args of load_tile (within-block).

This overload is for the comptime-HW path (the default); the runtime conv geometry fields are populated with zeros and the loader uses the comptime template params instead.

Args:

src_nhwc (TileTensor[Self.dtype, InLayout]): 4D NHWC input tensor of shape (N, H, W, C).
warp_id (Int): Warp identifier within the loading warp group.
lane_id (Int): Lane identifier within the warp.
m_anchor (Int): M-coordinate (= flat N*H_out*W_out index) of the block origin. Added to m_offset at load time. Defaults to 0; pass the per-block origin from the kernel.
k_anchor (Int): K-coordinate (= flat (kh, kw, c) index) of the block origin. Added to k_offset at load time. Defaults to 0 — conv split-K is not yet supported, so callers typically leave this at the default.

def __init__[InLayout: TensorLayout](src_nhwc: TileTensor[Self.dtype, InLayout], warp_id: Int, lane_id: Int, *, runtime_h: Int, runtime_w: Int, runtime_h_out: Int, runtime_w_out: Int, m_anchor: Int = Int(0), k_anchor: Int = Int(0)) -> Self

Runtime-HW overload: H/W/H_out/W_out from runtime args.

Equivalent to the comptime-HW overload except the conv input / output spatial dims are runtime values (typically read from input.dim() / output.dim() by the launcher). Use when the graph compiler can't pin the resolution.

Args:

src_nhwc (TileTensor[Self.dtype, InLayout]): 4D NHWC input tensor of shape (N, H, W, C).
warp_id (Int): Warp identifier within the loading warp group.
lane_id (Int): Lane identifier within the warp.
runtime_h (Int): Runtime input height.
runtime_w (Int): Runtime input width.
runtime_h_out (Int): Runtime output height (must equal (runtime_h + 2*pad_h - dilation_h*(R-1) - 1) // stride_h + 1).
runtime_w_out (Int): Runtime output width.
m_anchor (Int): M-coordinate of the block origin. Added to m_offset at load time. Defaults to 0.
k_anchor (Int): K-coordinate of the block origin. Added to k_offset at load time. Defaults to 0.
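The constraint on the runtime output dims is the usual convolution output-size relation. A launcher could check its arguments with something like the plain-Python helper below. `conv_out_size` is a hypothetical name used only for illustration, and the same expression is assumed to apply along the width with pad_w, dilation_w, S and stride_w.

```python
def conv_out_size(in_size, kernel, stride=1, dilation=1, pad=0):
    """Output extent of a convolution along one axis."""
    return (in_size + 2 * pad - dilation * (kernel - 1) - 1) // stride + 1


# Hypothetical check before constructing the loader: 32x32 input, 3x3 filter, pad 1, stride 2.
runtime_h, runtime_w = 32, 32
runtime_h_out = conv_out_size(runtime_h, kernel=3, stride=2, pad=1)
runtime_w_out = conv_out_size(runtime_w, kernel=3, stride=2, pad=1)
```

With stride 1, dilation 1 and no padding the expression reduces to H - R + 1, matching the note on the H_out parameter.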

def __init__[InLayout: TensorLayout](src_ndhwc: TileTensor[Self.dtype, InLayout], warp_id: Int, lane_id: Int, *, runtime_d: Int, runtime_h: Int, runtime_w: Int, runtime_d_out: Int, runtime_h_out: Int, runtime_w_out: Int, m_anchor: Int = Int(0), k_anchor: Int = Int(0)) -> Self

3D runtime-HW overload: D/H/W/D_out/H_out/W_out from runtime args.

Equivalent to the 2D runtime-HW overload but for Q > 1 mode: accepts a rank-5 NDHWC TileTensor and runtime D / D_out args in addition to the spatial H/W ones. The K-decomposition and conv params (Q, R, S, stride_d, stride_h, stride_w, dilation_*, pad_*, C) stay comptime.

Args:

src_ndhwc (TileTensor[Self.dtype, InLayout]): 5D NDHWC input tensor of shape (N, D, H, W, C).
warp_id (Int): Warp identifier within the loading warp group.
lane_id (Int): Lane identifier within the warp.
runtime_d (Int): Runtime input depth.
runtime_h (Int): Runtime input height.
runtime_w (Int): Runtime input width.
runtime_d_out (Int): Runtime output depth (must equal (runtime_d + 2*pad_d - dilation_d*(Q-1) - 1) // stride_d + 1).
runtime_h_out (Int): Runtime output height.
runtime_w_out (Int): Runtime output width.
m_anchor (Int): M-coordinate of the block origin in GEMM space (= flat N*D_out*H_out*W_out index). Added to m_offset at load time. Defaults to 0.
k_anchor (Int): K-coordinate of the block origin in GEMM space (= flat Q*R*S*C index). Added to k_offset at load time. Defaults to 0.
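For 3D mode, the same decomposition gains one more level on each side. The plain-Python sketch below is a guess at the index order rather than a statement of the implementation: it assumes the temporal index is the slowest-varying component of both the flat Q*R*S*C and N*D_out*H_out*W_out indices, and that the NDHWC address nests the way the 2D formula does. All function names are hypothetical.

```python
def decompose_k_3d(k, R, S, C):
    """Split a flat K index into (kd, kh, kw, c), with c varying fastest."""
    kd, rem = divmod(k, R * S * C)
    kh, rem = divmod(rem, S * C)
    kw, c = divmod(rem, C)
    return kd, kh, kw, c


def decompose_m_3d(m, D_out, H_out, W_out):
    """Split a flat M index into (n, d_out, h_out, w_out), with w_out varying fastest."""
    n, rem = divmod(m, D_out * H_out * W_out)
    d_out, rem = divmod(rem, H_out * W_out)
    h_out, w_out = divmod(rem, W_out)
    return n, d_out, h_out, w_out


def ndhwc_addr(n, d, h, w, c, D, H, W, C):
    """Flat element offset into a contiguous NDHWC tensor."""
    return (((n * D + d) * H + h) * W + w) * C + c
```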

`load_tile`

def load_tile(self, dst: TileTensor[Self.dtype, address_space=AddressSpace.SHARED], m_offset: Int, k_offset: Int)

Loads a half-tile from NHWC global memory into the SMEM dst.

Two paths:

Pure-pointwise fast path (R=S=1, stride=1, dilation=1, pad=0): GEMM address addr = ((n*H + h)*W + w)*C + c collapses to addr = m * C + k. Identical to TileLoaderLDS.load_tile with stride = C; the per-lane vs uniform offset split is preserved so each iteration issues one buffer_load_*_lds per lane, matching the matmul's vmcnt accounting.
General R×S path (M2): the lane's (m_lane, k_lane) are decomposed at runtime — k_lane → (kh, kw, c) via comptime R, S, C divisors (constant-folded to multiply-by-magic); m_lane → (n, h_out, w_out) via comptime H_out, W_out divisors. Then h_in = h_out * stride_h + kh * dilation_h - pad_h (similarly for w_in) and addr = ((n*H + h_in)*W + w_in)*C + c. The full per-lane address goes into vector_offset; scalar_offset = 0. Costs more VGPRs per load than the fast path because the address decomposition can't be cleanly split into a uniform + per-lane pair (the m → (n, h_out, w_out) decomposition is non-linear).
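Both paths can be checked against a small host-side model. The plain-Python sketch below computes the per-lane address of the general path and returns None for halo lanes. It is a reference model under assumptions, not the kernel code: `im2col_addr` is a hypothetical name, and None merely stands in for the SRD-OOB sentinel.

```python
def im2col_addr(m, k, *, H, W, C, H_out, W_out, S,
                stride=(1, 1), dilation=(1, 1), pad=(0, 0)):
    """Flat NHWC element offset for GEMM coordinate (m, k), or None for a halo lane."""
    # k -> (kh, kw, c), with c varying fastest.
    kh, rem = divmod(k, S * C)
    kw, c = divmod(rem, C)
    # m -> (n, h_out, w_out), with w_out varying fastest.
    n, rem = divmod(m, H_out * W_out)
    h_out, w_out = divmod(rem, W_out)
    h_in = h_out * stride[0] + kh * dilation[0] - pad[0]
    w_in = w_out * stride[1] + kw * dilation[1] - pad[1]
    if not (0 <= h_in < H and 0 <= w_in < W):
        return None  # halo lane; the real loader routes it to the SRD-OOB sentinel
    return ((n * H + h_in) * W + w_in) * C + c


# Pointwise case (R = S = 1, stride 1, no pad): the address should equal m*C + k.
pointwise_ok = all(
    im2col_addr(m, k, H=4, W=4, C=8, H_out=4, W_out=4, S=1) == m * 8 + k
    for m in range(2 * 4 * 4)
    for k in range(8)
)
```

The `pointwise_ok` check exercises the collapse claimed for the fast path: with a 1x1 filter, h_in = h_out and w_in = w_out, so the nested expression flattens to m*C + k.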

Args:

dst (TileTensor[Self.dtype, address_space=AddressSpace.SHARED]): Destination half-tile in SHARED address space.
m_offset (Int): M-coordinate within the block (added to self.m_anchor to form the absolute GEMM M coord = flat N*H_out*W_out index).
k_offset (Int): K-coordinate within the block (added to self.k_anchor to form the absolute GEMM K coord = flat (kh, kw, c) index ∈ [0, R*S*C)). Must be a multiple of tile_cols.
