
Mojo module

matmul_kernel

Simdgroup-tiled Apple M5 matmul kernel built on MmaOpApple.

Each threadgroup computes one 64x64 output tile. Its four simdgroups (128 threads) each own a 32x32 subtile, covered by a 2x2 arrangement of MmaOpApple tiles. At runtime, each simdgroup branches independently between a fast path with no bounds checks and a bounded path that handles ragged M/N edges and partial K tails.
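The tile decomposition above can be modeled on the host with plain arithmetic. The following is a minimal Python sketch, not the kernel's Mojo code. The row-major numbering of simdgroups inside a tile and the exact predicate in `needs_bounded_path` are assumptions made for illustration; the page only states that the bounded path covers ragged M/N edges and partial K tails.

```python
# Tile sizes taken from the comptime values on this page.
BM, BN, BK = 64, 64, 16      # threadgroup tile (BM x BN) and K step
SG_M, SG_N = 32, 32          # subtile owned by one simdgroup
NUM_SG_M, NUM_SG_N = 2, 2    # simdgroup grid inside one threadgroup tile


def subtile_origin(tile_m, tile_n, sg_id):
    """Return the top-left (row, col) of the subtile owned by simdgroup sg_id.

    Assumption: simdgroups are numbered row-major inside the tile.
    """
    sg_row, sg_col = divmod(sg_id, NUM_SG_N)
    return tile_m * BM + sg_row * SG_M, tile_n * BN + sg_col * SG_N


def needs_bounded_path(row0, col0, M, N, K):
    """Hypothetical predicate for the per-simdgroup branch.

    True when the subtile crosses the M or N edge, or when K is not a
    multiple of BK (a partial K tail).
    """
    return row0 + SG_M > M or col0 + SG_N > N or K % BK != 0


if __name__ == "__main__":
    M, N, K = 100, 64, 32
    for sg_id in range(NUM_SG_M * NUM_SG_N):
        row0, col0 = subtile_origin(1, 0, sg_id)
        path = "bounded" if needs_bounded_path(row0, col0, M, N, K) else "fast"
        print(sg_id, (row0, col0), path)
```

With M = 100, the second row of tiles is ragged: under these assumptions, the simdgroups whose subtiles start at row 64 still fit and take the fast path, while those starting at row 96 cross the M edge and take the bounded path. This shows why the branch is made per simdgroup rather than per threadgroup.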

Block-to-tile mapping: each threadgroup decodes block_idx.x with morton_decode_2d_rect over a side_m x side_n grid, where each axis is padded to the next power of two. Threadgroups whose decoded coordinates fall outside (grid_m, grid_n) return early.
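To make the block-to-tile mapping concrete, here is a minimal Python sketch of a rectangular Morton decode with power-of-two padding and an early return. `morton_decode_rect` is a hypothetical stand-in: the page does not document the bit layout used by morton_decode_2d_rect, so the interleaving order (n in the lower bit of each pair, leftover high bits assigned to the longer axis) is an assumption.

```python
def next_pow2(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def morton_decode_rect(idx, side_m, side_n):
    """Hypothetical rectangular Morton decode.

    side_m and side_n must be powers of two. Bits of idx are dealt out
    alternately to n and m while both axes still need bits; once the
    shorter axis is full, the remaining bits go to the longer axis.
    """
    bits_m = side_m.bit_length() - 1
    bits_n = side_n.bit_length() - 1
    m = n = 0
    pos_m = pos_n = 0
    while pos_m < bits_m or pos_n < bits_n:
        if pos_n < bits_n:
            n |= (idx & 1) << pos_n
            idx >>= 1
            pos_n += 1
        if pos_m < bits_m:
            m |= (idx & 1) << pos_m
            idx >>= 1
            pos_m += 1
    return m, n


def tile_for_block(block_idx_x, grid_m, grid_n):
    """Map a linear block index to a tile coordinate, or None if padded out."""
    side_m, side_n = next_pow2(grid_m), next_pow2(grid_n)
    m, n = morton_decode_rect(block_idx_x, side_m, side_n)
    if m >= grid_m or n >= grid_n:
        return None  # models the threadgroup's early return
    return m, n


if __name__ == "__main__":
    # A 3 x 2 tile grid is padded to 4 x 2, so 8 threadgroups are launched.
    for i in range(8):
        print(i, tile_for_block(i, 3, 2))
```

In this model a 3 x 2 tile grid launches 4 x 2 = 8 threadgroups. Six of them map to distinct tiles and the two that decode to m = 3 return without doing work.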

comptime values

BK

comptime BK = 16

BM

comptime BM = 64

BN

comptime BN = 64

NUM_SG

comptime NUM_SG = (NUM_SG_M * NUM_SG_N)

NUM_SG_M

comptime NUM_SG_M = 2

NUM_SG_N

comptime NUM_SG_N = 2

SG_M

comptime SG_M = 32

SG_N

comptime SG_N = 32

THREADS_PER_BLOCK

comptime THREADS_PER_BLOCK = (NUM_SG * Int[Int](WARP_SIZE))
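The derived constants can be checked against the module description with a few lines of arithmetic. This Python sketch mirrors the comptime values listed above. `WARP_SIZE = 32` is an assumption: it is not defined on this page, but it is the value consistent with "four simdgroups (128 threads)".

```python
WARP_SIZE = 32  # assumed simdgroup width; implied by 4 simdgroups = 128 threads

BM, BN, BK = 64, 64, 16
SG_M, SG_N = 32, 32
NUM_SG_M, NUM_SG_N = 2, 2

# Derived values, mirroring the comptime expressions on this page.
NUM_SG = NUM_SG_M * NUM_SG_N
THREADS_PER_BLOCK = NUM_SG * WARP_SIZE


def check_tiling():
    """Verify that the simdgroup subtiles exactly cover the threadgroup tile."""
    assert NUM_SG_M * SG_M == BM
    assert NUM_SG_N * SG_N == BN
    return NUM_SG, THREADS_PER_BLOCK


if __name__ == "__main__":
    print(check_tiling())
```

The checks in `check_tiling` express the invariant that a 2x2 grid of 32x32 subtiles exactly tiles the 64x64 threadgroup tile.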

Functions