
Mojo module

amd_4wave_matmul

4-wave matmul for AMD MI355X (CDNA4).

Entry points: AMD4WaveMatmul.run() (matmul) and AMD4WaveMatmul.run_conv2d() (conv). Host launcher: structured_4wave_matmul() (matmul only); the conv launcher lives in nn/conv/gpu/amd/amd_4wave_conv.mojo.

4-warp 2x2 quadrant layout with cross-stage register rotation, adapted from HipKittens FP8_4wave's matmul_device_*:

  • 4 mini-iters per loop iter, each with G_load + frag_load + mma_ABt.
  • Cross-stage register rotation: a[0]/b[0] are reloaded mid-iter from the next stage so iter k+1's first MMA can fire without waiting on LDS.
  • 2-iter epilogue drain.
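The rotation described above can be modeled in plain Python. This is a minimal sketch, not the kernel: it assumes each warp holds two A fragments (a[0], a[1]) and two B fragments (b[0], b[1]), that the four mini-iters correspond to the four fragment pairs, and that the final iteration simply skips the reload (a stand-in for the epilogue drain). The helper names load_frag, rotated_matmul, and reference are hypothetical; only mma_ABt is named in the source, and it is taken here to mean C += A * B^T.

```python
def mma_ABt(c, a, b):
    """Accumulate c += a @ b^T for row-major lists of lists."""
    for i, a_row in enumerate(a):
        for j, b_row in enumerate(b):
            c[i][j] += sum(x * y for x, y in zip(a_row, b_row))


def load_frag(mat, row0, rows, k0, bk):
    """Copy a rows x bk fragment at (row0, k0); hypothetical stand-in for frag_load."""
    return [mat[r][k0:k0 + bk] for r in range(row0, row0 + rows)]


def rotated_matmul(A, B, bk):
    """Compute A @ B^T, prefetching a[0]/b[0] from the next K stage mid-iteration."""
    m, n, k = len(A), len(B), len(A[0])
    assert m % 2 == 0 and n % 2 == 0 and k % bk == 0
    hm, hn, stages = m // 2, n // 2, k // bk
    # One hm x hn accumulator per (A half, B half) pair.
    acc = [[[[0] * hn for _ in range(hm)] for _ in range(2)] for _ in range(2)]
    # Prologue: a[0] and b[0] for stage 0 are resident before the loop starts.
    a = [load_frag(A, 0, hm, 0, bk), None]
    b = [load_frag(B, 0, hn, 0, bk), None]
    for s in range(stages):
        k0, k1 = s * bk, (s + 1) * bk
        last = s == stages - 1
        # Mini-iter 0: uses only fragments rotated in during the previous iteration.
        mma_ABt(acc[0][0], a[0], b[0])
        # Mini-iter 1: bring in b[1] for the current stage.
        b[1] = load_frag(B, hn, hn, k0, bk)
        mma_ABt(acc[0][1], a[0], b[1])
        # a[0] is no longer needed for this stage: rotate in the next stage's copy.
        if not last:
            a[0] = load_frag(A, 0, hm, k1, bk)
        # Mini-iter 2: bring in a[1] for the current stage.
        a[1] = load_frag(A, hm, hm, k0, bk)
        mma_ABt(acc[1][0], a[1], b[0])
        # b[0] is no longer needed for this stage: rotate in the next stage's copy.
        if not last:
            b[0] = load_frag(B, 0, hn, k1, bk)
        # Mini-iter 3.
        mma_ABt(acc[1][1], a[1], b[1])
    # Stitch the four accumulators back into one m x n result.
    top = [acc[0][0][i] + acc[0][1][i] for i in range(hm)]
    bottom = [acc[1][0][i] + acc[1][1][i] for i in range(hm)]
    return top + bottom


def reference(A, B):
    """Straightforward A @ B^T for comparison."""
    return [[sum(x * y for x, y in zip(ar, br)) for br in B] for ar in A]
```

Comparing rotated_matmul(A, B, bk) against reference(A, B) on small integer matrices checks that reloading a[0] and b[0] early does not disturb the MMAs that still need the current stage's a[1] and b[1].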

The body is driven by the framework schedule (Pipeline4Wave in amd_4wave_schedule.mojo) under SchedulingStrategy.IDENTITY + minimal_barriers + omit_mma_set_prio. The schedule consumes the logical 24-op cross-stage-rotation body and derives wait counts and barriers from KernelGeometry. FP8 (E4M3FN), BF16, and FP16 are all supported through a single body; the MMA shape and BK are selected based on dtype.
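As an illustration of selecting BK on dtype, the sketch below picks BK so that every dtype covers the same number of bytes along K. The rule itself, the k_bytes parameter, and the select_bk name are assumptions made for this example; the source does not state how the kernel chooses BK or which MMA shapes it uses. Only the element sizes (1 byte for FP8, 2 bytes for BF16 and FP16) are facts.

```python
# Element sizes in bytes for the dtypes the module lists.
ELEMENT_BYTES = {"float8_e4m3fn": 1, "bfloat16": 2, "float16": 2}


def select_bk(dtype, k_bytes):
    """Hypothetical rule: BK is the number of elements that fit in k_bytes."""
    if dtype not in ELEMENT_BYTES:
        raise ValueError(f"unsupported dtype: {dtype}")
    size = ELEMENT_BYTES[dtype]
    if k_bytes % size != 0:
        raise ValueError("k_bytes must be a multiple of the element size")
    return k_bytes // size
```

Under this assumed rule, an FP8 tile spans twice as many K elements as a BF16 or FP16 tile for the same k_bytes, which is one way a single loop body could stay unchanged across dtypes.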

comptime values

KernelConfig

comptime KernelConfig = MatmulKernelConfig

Structs

Functions