Mojo module
amd_4wave_matmul
4-wave matmul for AMD MI355X (CDNA4).
Entry points: AMD4WaveMatmul.run() (matmul) and .run_conv2d() (conv).
Host launcher: structured_4wave_matmul() (matmul-only); the conv
launcher lives in nn/conv/gpu/amd/amd_4wave_conv.mojo.
4-warp 2x2 quadrant layout with cross-stage register rotation,
adapted from HipKittens FP8_4wave's matmul_device_*:
- 4 mini-iters per loop iter, each with G_load + frag_load + mma_ABt.
- Cross-stage register rotation: a[0]/b[0] are reloaded mid-iter from
  the next stage so iter k+1's first MMA can fire without waiting on LDS.
- 2-iter epilogue drain.
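The loop structure in the list above can be modeled in plain Python. This is only a sketch of the idea, not the Mojo kernel: every name in it (`mma_abt`, `k_slice`, `pipelined_quadrant_matmul`) is hypothetical, the four waves run sequentially, and it uses a single one-tile lookahead instead of the kernel's 4 mini-iters and 2-iter drain. It shows each of four waves owning one quadrant of C = A·Bᵀ, fetching the next K tile before issuing the current MMA, and rotating registers afterwards.

```python
# Plain-Python model of the loop structure; NOT the Mojo kernel.
# All function names here are hypothetical and chosen for illustration.

def mma_abt(acc, a_rows, b_rows):
    """acc += a_rows @ b_rows^T for row-major lists of lists."""
    for i, a in enumerate(a_rows):
        for j, b in enumerate(b_rows):
            acc[i][j] += sum(x * y for x, y in zip(a, b))


def k_slice(rows, k0, bk):
    """Stand-in for a tile load: columns [k0, k0 + bk) of every row."""
    return [row[k0:k0 + bk] for row in rows]


def pipelined_quadrant_matmul(A, B, bk):
    """C = A @ B^T using a 2x2 quadrant split and a one-tile lookahead."""
    m, n, k = len(A), len(B), len(A[0])
    assert m % 2 == 0 and n % 2 == 0 and k % bk == 0
    hm, hn, steps = m // 2, n // 2, k // bk
    C = [[0] * n for _ in range(m)]

    for wave in range(4):                  # one wave per quadrant
        wr, wc = divmod(wave, 2)           # quadrant row / column
        a_rows = A[wr * hm:(wr + 1) * hm]
        b_rows = B[wc * hn:(wc + 1) * hn]
        acc = [[0] * hn for _ in range(hm)]

        # Prologue: fill the "current" registers with the first K tile.
        a_cur, b_cur = k_slice(a_rows, 0, bk), k_slice(b_rows, 0, bk)

        for step in range(steps):
            has_next = step + 1 < steps
            if has_next:
                # Fetch the next stage before this stage's MMA is issued.
                a_nxt = k_slice(a_rows, (step + 1) * bk, bk)
                b_nxt = k_slice(b_rows, (step + 1) * bk, bk)
            mma_abt(acc, a_cur, b_cur)
            if has_next:
                a_cur, b_cur = a_nxt, b_nxt   # rotate registers
            # Otherwise this is the drain iteration: compute only.

        # Write the quadrant's accumulator back into C.
        for i in range(hm):
            C[wr * hm + i][wc * hn:(wc + 1) * hn] = acc[i]
    return C


A = [[1, 2, 3, 4], [5, 6, 7, 8]]
B = [[1, 0, 1, 0], [0, 1, 0, 1]]
print(pipelined_quadrant_matmul(A, B, bk=2))  # [[4, 6], [12, 14]]
```

In this model the rotation only changes when operands are fetched, never which products are accumulated, so the result matches a naive A·Bᵀ.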
Body is driven by the framework schedule (Pipeline4Wave in
amd_4wave_schedule.mojo) under SchedulingStrategy.IDENTITY +
minimal_barriers + omit_mma_set_prio. The schedule consumes the
logical 24-op cross-stage-rotation body and derives wait counts /
barriers from KernelGeometry. Supports FP8 (E4M3FN), BF16, and FP16
through a single body: MMA shape and BK select on dtype.
comptime values
KernelConfig
comptime KernelConfig = MatmulKernelConfig
Structs
- AMD4WaveMatmul: Hand-written 4-warp 2x2 inline-MMA matmul for AMD MI355X.
- Conv2DKernelConfig: Conv-specific geometry for AMD4WaveMatmul's conv2d entry point.
- MatmulKernelConfig: Block/warp/MMA shape configuration for 4-wave kernels.
Functions
- s_barrier
- s_setprio
- structured_4wave_matmul: Canonical 4-wave matmul launcher (mirror of amd_ping_pong_matmul).