
Mojo module

matmul_kernel

Structured Apple M5 simdgroup-tiled matmul (Metal 4 hardware MMA).

Everything lives in the AppleM5MatMul struct (mirroring the AMD/NVIDIA structured kernels): the comptime config, the Morton tile scheduler, the B-layout helper, the single-pass GPU kernel (run), and the split-K kernels (run_split_k_partial / run_split_k_reduce). The enqueue_apple_matmul / enqueue_apple_matmul_split_k free functions are the host-side launchers (kept standalone so callers and tests dispatch without naming the struct).
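
As an illustration of what a Morton (Z-order) tile scheduler does, the sketch below maps a linear threadgroup index to a (tile_row, tile_col) pair by de-interleaving its bits. It is plain Python, not the Mojo kernel code. The names morton_decode and morton_tile_order are hypothetical, and the bit order (even bits to column, odd bits to row) and the skip-out-of-range handling of non-power-of-two grids are assumptions; the actual scheduler in AppleM5MatMul may differ on both.

```python
def morton_decode(index: int) -> tuple[int, int]:
    """De-interleave the bits of index: even bits -> column, odd bits -> row."""
    row = col = 0
    bit = 0
    while index >> (2 * bit):
        col |= ((index >> (2 * bit)) & 1) << bit
        row |= ((index >> (2 * bit + 1)) & 1) << bit
        bit += 1
    return row, col


def morton_tile_order(tiles_m: int, tiles_n: int) -> list[tuple[int, int]]:
    """Visit every (tile_row, tile_col) of a tiles_m x tiles_n grid in Z-order.

    Indices that decode outside the grid are skipped (an assumed policy for
    grids that are not square powers of two).
    """
    side = 1
    while side < max(tiles_m, tiles_n):
        side *= 2
    order = []
    for index in range(side * side):
        row, col = morton_decode(index)
        if row < tiles_m and col < tiles_n:
            order.append((row, col))
    return order


if __name__ == "__main__":
    # Neighbouring indices stay close in both dimensions, which is the
    # usual motivation for Z-order: tiles scheduled together share A/B data.
    print(morton_tile_order(3, 2))
```

The point of such an ordering is locality: threadgroups launched close together in time touch nearby rows of A and columns of B.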

Each threadgroup computes a 64x64 output tile; its four simdgroups (128 threads) each own a 32x32 subtile (a 2x2 grid of MmaOpApple tiles). A per-simdgroup runtime branch picks between an unbounded fast path and a bounded path that handles ragged M/N edges and partial K tails. Operands load directly from DRAM into registers, because staging through threadgroup memory degrades matmul performance on Apple Silicon. See kernels/apple-m5-matmul in the KB.
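
The tile geometry and the fast/bounded decision can be sketched in a few lines of Python. The 64x64 and 32x32 sizes come from the description above. Everything else is an assumption made for illustration: the row-major placement of the four simdgroups inside the tile, the BK strip width of 16, the rule that a K tail sends the whole simdgroup down the bounded path, and the function name simdgroup_path.

```python
TILE_M = TILE_N = 64  # output tile per threadgroup (from the description)
SG_M = SG_N = 32      # subtile per simdgroup (from the description)
BK = 16               # hypothetical K-strip width, for illustration only


def simdgroup_path(tile_row, tile_col, sg_id, m, n, k, bk=BK):
    """Return ((row0, col0), path) for one simdgroup of one threadgroup.

    path is "fast" when the 32x32 subtile lies fully inside the M x N
    output and K is a whole number of strips, otherwise "bounded".
    """
    # Assumed layout: simdgroups 0..3 arranged row-major in a 2x2 grid.
    sg_row, sg_col = divmod(sg_id, TILE_N // SG_N)
    row0 = tile_row * TILE_M + sg_row * SG_M
    col0 = tile_col * TILE_N + sg_col * SG_N
    inside = row0 + SG_M <= m and col0 + SG_N <= n
    whole_strips = k % bk == 0
    path = "fast" if inside and whole_strips else "bounded"
    return (row0, col0), path


if __name__ == "__main__":
    # M = 100 is ragged: the second tile row covers rows 64..127.
    for sg in range(4):
        print(sg, simdgroup_path(1, 0, sg, m=100, n=64, k=32))
```

Because the branch is taken per simdgroup rather than per threadgroup, only the subtiles that actually straddle an edge pay for bounds checks in this model.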

Structs

AppleM5MatMul: Apple M5 simdgroup-tiled GEMM (Metal 4 hardware MMA).
DenseALoader: Plain-GEMM A loader: holds the pre-tiled (SG_M, K) slab.
DenseWeightLoader: Direct-DRAM B policy: dense-bf16 and FP8-W8A16.
Im2colALoader: Fused-conv A loader: holds input_ptr plus this simdgroup's prebaked pixel anchors and carried K-state. The gather reads the A MMA fragment from NHWC via MmaOpApple.mma_im2col. Because the im2col matrix is non-affine, it cannot be expressed as a distribute-style TileTensor (see KB exceptions/apple-mma-fragment-is-not-distribute-expressible). For the anchor prebake and K-state strength-reduction design, see KB kernels/apple-conv2d-im2col.
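
To make the idea of an online im2col gather concrete, here is a minimal Python model of the index arithmetic. It is a sketch under stated assumptions, not the Im2colALoader implementation: the function names are hypothetical, the K axis is assumed to be ordered (kh, kw, c), and padding is assumed to read as zero. It also recomputes the K decomposition with divmod on every access, whereas the loader described above carries that state across strips.

```python
def pixel_anchor(m, out_h, out_w, stride, pad):
    """Row m of the virtual im2col matrix -> (batch, top-left y, top-left x).

    This part depends only on the output pixel, so it can be computed once
    up front (the "prebake" idea) instead of per K element.
    """
    n, rem = divmod(m, out_h * out_w)
    oh, ow = divmod(rem, out_w)
    return n, oh * stride - pad, ow * stride - pad


def im2col_element(inp, anchor, k, kernel_w, channels):
    """Column k -> (kh, kw, c), then gather from an NHWC nested list."""
    n, y0, x0 = anchor
    kh, rem = divmod(k, kernel_w * channels)  # assumed K order: (kh, kw, c)
    kw, c = divmod(rem, channels)
    y, x = y0 + kh, x0 + kw
    height, width = len(inp[n]), len(inp[n][0])
    if 0 <= y < height and 0 <= x < width:
        return inp[n][y][x][c]
    return 0  # assumed zero padding outside the image


if __name__ == "__main__":
    # One 3x3 single-channel image, 3x3 kernel, stride 1, pad 1.
    image = [[[[1], [2], [3]], [[4], [5], [6]], [[7], [8], [9]]]]
    centre = pixel_anchor(4, out_h=3, out_w=3, stride=1, pad=1)
    print([im2col_element(image, centre, k, 3, 1) for k in range(9)])
```

The im2col matrix is never materialised in this model; each A element is fetched straight from the NHWC input when the GEMM body asks for it.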

Traits

AOperandLoader: One BK-strip A contribution for the shared Apple GEMM body.
WeightLoader: B/weight-side comptime policy for the shared Apple GEMM body.

Functions

enqueue_apple_conv2d: Enqueue the fused online-im2col conv2d (AppleM5MatMul.run_conv).
enqueue_apple_matmul: Enqueue AppleM5MatMul.run on the given device context.
enqueue_apple_matmul_clamp_chain: Enqueue the clamp_v2 ragged-edge, 2-pass chained-accumulate dense GEMM (NN only). Pass 0 zero-seeds and covers the first half of K's strips; pass 1 seeds from c and covers the rest, overwriting c with the final result. No partials buffer, no separate reduce kernel.
enqueue_apple_matmul_split_k: Split-K Apple M5 matmul: partition K, accumulate partials, reduce.
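
The two K-partitioning strategies listed above can be contrasted with a small reference model in Python. This is only a sketch of the data flow, with hypothetical function names: split_k models "partition K, accumulate partials, reduce" with a separate partials list, and clamp_chain models the 2-pass chained accumulate, where pass 1 is seeded from the output of pass 0. Splitting K by element count rather than by BK strips is a simplifying assumption.

```python
def zeros(m, n):
    return [[0] * n for _ in range(m)]


def gemm_range(a, b, seed, k_begin, k_end):
    """Return seed + A[:, k_begin:k_end] @ B[k_begin:k_end, :]."""
    out = [row[:] for row in seed]
    for i in range(len(a)):
        for j in range(len(b[0])):
            for k in range(k_begin, k_end):
                out[i][j] += a[i][k] * b[k][j]
    return out


def split_k(a, b, splits):
    """Partition K, compute one zero-seeded partial per slice, then reduce."""
    m, n, k = len(a), len(b[0]), len(b)
    bounds = [k * s // splits for s in range(splits + 1)]
    partials = [
        gemm_range(a, b, zeros(m, n), bounds[s], bounds[s + 1])
        for s in range(splits)
    ]
    c = zeros(m, n)
    for partial in partials:  # the separate reduce step
        for i in range(m):
            for j in range(n):
                c[i][j] += partial[i][j]
    return c


def clamp_chain(a, b):
    """Two chained passes over K; no partials buffer and no reduce step."""
    m, n, k = len(a), len(b[0]), len(b)
    half = k // 2
    c = gemm_range(a, b, zeros(m, n), 0, half)  # pass 0: zero-seeded
    c = gemm_range(a, b, c, half, k)            # pass 1: seeded from c
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[1, 0], [0, 1], [1, 1]]
    print(split_k(a, b, 2), clamp_chain(a, b))
```

Both routes produce the same product; the difference the sketch highlights is that split-K needs storage for the partials and a reduce pass, while the chained form reuses c as its only accumulator.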
