Mojo package

amd

Provides the AMD GPU backend implementations for matmuls.

Modules

amd_4wave_matmul: 4-wave matmul for AMD MI355X (CDNA4).
amd_4wave_schedule: Inline 4-wave schedule for AMD GPU matmul / implicit-GEMM conv kernels.
amd_4wave_split_k_matmul: Single-launch split-K wrapper for the 4-wave FP8 matmul.
amd_matmul: Pure TileTensor structured AMD matmul kernel.
amd_matmul_schedule: Declarative software pipeline schedule for the default AMD matmul kernel.
amd_ping_pong_matmul: Structured ping-pong matmul for AMD MI355X (CDNA4).
amd_ping_pong_schedule: Ping-pong schedule for AMD GPU matmul kernels.
amd_target: AMD GPU target definitions for the pipeline scheduling framework.
matmul_mma: MMA operators for AMD matmul kernels.
mxfp4_dequant_matmul_amd: MXFP4 matmul on AMD CDNA GPUs via dequant-to-FP8 + FP8 GEMM.
mxfp4_grouped_matmul_amd:
mxfp4_matmul_amd: Native MXFP4 block-scaled matmul on AMD CDNA4 via f8f6f4 MFMA.
mxfp4_matmul_amd_preb: MXFP4 block-scaled matmul on AMD CDNA4 with preshuffled B + scales + direct VGPR loads.
mxfp4_moe_matmul_amd: MXFP4 x MXFP4 routed MoE matmul kernel for AMD CDNA4.
mxfp4_preshuffle_layouts: Host-side MXFP4 preshuffle layouts for AMD CDNA4 grouped MoE matmul.
mxfp4_preshuffle_loaders: Per-lane DRAM->VGPR loaders for the preshuffled MXFP4 MoE matmul.
pipeline_body: Builder for declarative pipeline body specifications.
ring_buffer: Ring buffer implementation for producer-consumer synchronization in GPU kernels.
ring_buffer_traits: Trait definitions and utilities for ring buffer synchronization strategies.
structured:
warp_spec_matmul: AMD Warp-Specialized Matrix Multiplication.
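
Several modules above refer to split-K matmul. As a rough illustration of the general idea only, the sketch below splits the reduction (K) dimension into chunks, computes a partial product per chunk, and sums the partials. It is plain Python, not the package's Mojo API: the names `matmul_k_range`, `split_k_matmul`, and `num_splits` are hypothetical, and nothing here reflects how `amd_4wave_split_k_matmul` is actually implemented.

```python
def matmul_k_range(a, b, k_start, k_end):
    """Naive matmul restricted to the reduction range [k_start, k_end)."""
    m, n = len(a), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k_start, k_end):
                acc += a[i][p] * b[p][j]
            out[i][j] = acc
    return out


def split_k_matmul(a, b, num_splits):
    """Sum partial products over num_splits chunks of the K dimension.

    On a GPU each chunk would typically go to a different group of
    workers; here the chunks simply run one after another.
    """
    m, k, n = len(a), len(b), len(b[0])
    chunk = (k + num_splits - 1) // num_splits  # ceiling division
    out = [[0.0] * n for _ in range(m)]
    for s in range(num_splits):
        lo = s * chunk
        hi = min(lo + chunk, k)
        if lo >= hi:
            continue  # more splits than K elements: nothing left to do
        partial = matmul_k_range(a, b, lo, hi)
        for i in range(m):
            for j in range(n):
                out[i][j] += partial[i][j]  # reduce partial results
    return out


if __name__ == "__main__":
    a = [[1, 2, 3, 4], [5, 6, 7, 8]]
    b = [[1, 0], [0, 1], [2, 3], [4, 5]]
    print(split_k_matmul(a, b, num_splits=2))
```

Splitting along K is generally useful when M and N are too small to occupy the whole device on their own, at the cost of an extra reduction over the partial results.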
