
Mojo module

amd_4wave_matmul

4-wave matmul for AMD MI355X (CDNA4).

Entry points: AMD4WaveMatmul.run() (matmul) and AMD4WaveMatmul.run_conv2d() (conv). Host launcher: structured_4wave_matmul() (matmul-only); the conv launcher lives in nn/conv/gpu/amd/amd_4wave_conv.mojo.

4-warp 2x2 quadrant layout with cross-stage register rotation, adapted from HipKittens FP8_4wave's matmul_device_*:

4 mini-iters per loop iter, each with G_load + frag_load + mma_ABt.
Cross-stage register rotation: a[0]/b[0] are reloaded mid-iter from the next stage so iter k+1's first MMA can fire without waiting on LDS.
2-iter epilogue drain.
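
The loop structure listed above can be pictured with a small trace generator. This is a toy model in Python, not the kernel's schedule: the names `build_trace`, `count`, and `rotate_a0_b0` are hypothetical. It also rests on three assumptions: the rotation falls after the second mini-iter, the two drain iterations issue no global loads, and the final iteration has no next stage to rotate from.

```python
MINI_ITERS = 4               # mini-iterations per loop iteration (from the list above)
DRAIN_ITERS = 2              # epilogue iterations; assumed to issue no G_load
ROTATE_AFTER_MINI_ITER = 1   # assumed position of the a[0]/b[0] reload


def build_trace(num_iters):
    """Return (iter, mini_iter, op) tuples for a toy version of the loop."""
    if num_iters < DRAIN_ITERS:
        raise ValueError("need at least DRAIN_ITERS iterations")
    trace = []
    for it in range(num_iters):
        draining = it >= num_iters - DRAIN_ITERS
        has_next = it + 1 < num_iters
        for mi in range(MINI_ITERS):
            if not draining:
                # Global loads only run while there is still data to prefetch.
                trace.append((it, mi, "G_load"))
            trace.append((it, mi, "frag_load"))
            trace.append((it, mi, "mma_ABt"))
            if has_next and mi == ROTATE_AFTER_MINI_ITER:
                # Reload a[0]/b[0] from the next stage so the next
                # iteration's first MMA already has its operands.
                trace.append((it, mi, "rotate_a0_b0"))
    return trace


def count(trace, op):
    """Count how many times `op` appears in the trace."""
    return sum(1 for _, _, name in trace if name == op)


if __name__ == "__main__":
    for entry in build_trace(3):
        print(entry)
```

In this model every iteration still runs all four MMAs, while global loads stop two iterations before the end. That is one plausible reading of "2-iter epilogue drain".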

The kernel body is driven by the framework schedule (Pipeline4Wave in amd_4wave_schedule.mojo) under SchedulingStrategy.IDENTITY + minimal_barriers + omit_mma_set_prio. The schedule consumes the logical 24-op cross-stage-rotation body and derives wait counts and barriers from KernelGeometry. FP8 (E4M3FN), BF16, and FP16 are supported through a single body; the MMA shape and BK are selected by dtype.

`comptime` values

`KernelConfig`

comptime KernelConfig = MatmulKernelConfig

Structs

AMD4WaveMatmul: Hand-written 4-warp 2x2 inline-MMA matmul for AMD MI355X.
Conv2DKernelConfig: Conv-specific geometry for AMD4WaveMatmul's conv2d entry point.
MatmulKernelConfig: Block/warp/MMA shape configuration for 4-wave kernels.
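
The 4-warp 2x2 quadrant layout can be illustrated with a plain Python model in which each of four "warps" computes one quadrant of C = A·Bᵀ (the `mma_ABt` form). This is a minimal sketch, not the kernel's code: the function names are hypothetical, and the row-major warp-to-quadrant mapping is an assumption.

```python
def matmul_abt(a, b):
    """Reference product: C[i][j] = sum_k a[i][k] * b[j][k], i.e. C = A @ B^T."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]


def quadrant_of(warp_id):
    """Map warp 0..3 to (quad_row, quad_col); row-major order is assumed."""
    return divmod(warp_id, 2)


def matmul_by_quadrants(a, b):
    """Compute C = A @ B^T one quadrant per warp, then stitch the tiles."""
    m, n = len(a), len(b)
    if m % 2 or n % 2:
        raise ValueError("M and N must be even to split into 2x2 quadrants")
    half_m, half_n = m // 2, n // 2
    c = [[0] * n for _ in range(m)]
    for warp_id in range(4):
        qr, qc = quadrant_of(warp_id)
        # Each warp needs only its half of A's rows and its half of B's rows.
        a_rows = a[qr * half_m:(qr + 1) * half_m]
        b_rows = b[qc * half_n:(qc + 1) * half_n]
        tile = matmul_abt(a_rows, b_rows)
        for i, row in enumerate(tile):
            c[qr * half_m + i][qc * half_n:(qc + 1) * half_n] = row
    return c


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6], [7, 8]]
    B = [[1, 0], [0, 1], [1, 1], [2, -1]]
    print(matmul_by_quadrants(A, B) == matmul_abt(A, B))
```

Each quadrant reads only half of A's rows and half of B's rows, so the four warps split the output tile without overlapping writes.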

Functions

s_barrier:
s_setprio:
structured_4wave_matmul: Canonical 4-wave matmul launcher (mirror of amd_ping_pong_matmul).
