
Mojo module

matmul_kernel

Simdgroup-tiled Apple M5 matmul kernel built on MmaOpApple.

Each threadgroup computes one 64x64 output tile. Its four simdgroups (128 threads) each own a 32x32 subtile (2x2 MmaOpApple). A per-simdgroup runtime branch picks between an unbounded fast path, which skips bounds checks, and a bounded path that handles ragged M/N edges and partial K tails.
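The subtile ownership and path selection described above can be modeled on the host as follows. This is a minimal Python sketch, not the kernel itself: the row-major ordering of simdgroups inside the tile and the exact fast-path predicate are assumptions, and `subtile_origin` and `use_fast_path` are hypothetical helper names.

```python
BM, BN, BK = 64, 64, 16        # threadgroup tile sizes from this module
SG_M, SG_N = 32, 32            # per-simdgroup subtile
NUM_SG_M, NUM_SG_N = 2, 2      # simdgroups along M and N


def subtile_origin(tile_m, tile_n, sg_id):
    """Top-left (row, col) of the 32x32 subtile owned by simdgroup sg_id."""
    # Assumption: simdgroups are numbered row-major within the 2x2 layout.
    sg_row, sg_col = divmod(sg_id, NUM_SG_N)
    return tile_m * BM + sg_row * SG_M, tile_n * BN + sg_col * SG_N


def use_fast_path(row0, col0, M, N, K):
    """Assumed predicate: the subtile lies fully inside the output and K has no tail."""
    return row0 + SG_M <= M and col0 + SG_N <= N and K % BK == 0


if __name__ == "__main__":
    M, N, K = 100, 64, 48
    for sg_id in range(NUM_SG_M * NUM_SG_N):
        row0, col0 = subtile_origin(1, 0, sg_id)
        path = "fast" if use_fast_path(row0, col0, M, N, K) else "bounded"
        print(sg_id, (row0, col0), path)
```

With M = 100, the second tile row covers rows 64 to 127, so under these assumptions the two upper simdgroups stay on the fast path while the two lower ones fall back to the bounded path.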

Block-to-tile mapping: each threadgroup decodes block_idx.x via morton_decode_2d_rect over a side_m * side_n grid, where each axis is padded to the next power of two. Threadgroups whose decoded tile falls outside (grid_m, grid_n) return early.
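One way to picture the block-to-tile mapping is the Python model below. It is a sketch under stated assumptions, not the module's implementation: the bit assignment (even bits to the N axis, odd bits to the M axis, leftover high bits to the longer axis) is a common rectangular Z-order scheme chosen for illustration, and `next_pow2_log2`, `morton_decode_rect`, and `active_tiles` are hypothetical names.

```python
def next_pow2_log2(n):
    """Smallest e such that (1 << e) >= n, for n >= 1."""
    return (n - 1).bit_length()


def morton_decode_rect(flat_idx, log2_m, log2_n):
    """Decode flat_idx to (tile_m, tile_n) on a (1<<log2_m) x (1<<log2_n) grid."""
    shared = min(log2_m, log2_n)
    tile_m = tile_n = 0
    # De-interleave the low 2*shared bits (assumed: even -> N, odd -> M).
    for b in range(shared):
        tile_n |= ((flat_idx >> (2 * b)) & 1) << b
        tile_m |= ((flat_idx >> (2 * b + 1)) & 1) << b
    # Remaining high bits extend whichever axis is longer.
    rest = flat_idx >> (2 * shared)
    if log2_m > log2_n:
        tile_m |= rest << shared
    else:
        tile_n |= rest << shared
    return tile_m, tile_n


def active_tiles(M, N, BM=64, BN=64):
    """Tiles that survive the early return, in launch order."""
    grid_m = -(-M // BM)  # ceiling division
    grid_n = -(-N // BN)
    log2_m, log2_n = next_pow2_log2(grid_m), next_pow2_log2(grid_n)
    tiles = []
    for block_idx in range(1 << (log2_m + log2_n)):
        tile_m, tile_n = morton_decode_rect(block_idx, log2_m, log2_n)
        if tile_m >= grid_m or tile_n >= grid_n:
            continue  # models the threadgroup returning early
        tiles.append((tile_m, tile_n))
    return tiles


if __name__ == "__main__":
    print(active_tiles(192, 128))
```

For a 192x128 output the tile grid is 3x2, padded to 4x2, so eight threadgroups are launched in this model and the two that decode to tile_m = 3 do no work.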

`comptime` values

`BK`

comptime BK = 16

`BM`

comptime BM = 64

`BN`

comptime BN = 64

`NUM_SG`

comptime NUM_SG = (NUM_SG_M * NUM_SG_N)

`NUM_SG_M`

comptime NUM_SG_M = 2

`NUM_SG_N`

comptime NUM_SG_N = 2

`SG_M`

comptime SG_M = 32

`SG_N`

comptime SG_N = 32

`THREADS_PER_BLOCK`

comptime THREADS_PER_BLOCK = (NUM_SG * Int[Int](WARP_SIZE))
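As a quick consistency check, the derived values can be recomputed from the base constants. The Python sketch below assumes WARP_SIZE is 32, which is what the module description's "four simdgroups (128 threads)" implies; it mirrors the arithmetic only and is not Mojo code.

```python
WARP_SIZE = 32                  # assumed simdgroup width (128 threads / 4 simdgroups)
NUM_SG_M, NUM_SG_N = 2, 2
SG_M, SG_N = 32, 32
BM, BN = 64, 64

NUM_SG = NUM_SG_M * NUM_SG_N            # simdgroups per threadgroup
THREADS_PER_BLOCK = NUM_SG * WARP_SIZE  # threads per threadgroup

# The simdgroup subtiles exactly cover the threadgroup tile.
assert NUM_SG_M * SG_M == BM
assert NUM_SG_N * SG_N == BN

print(NUM_SG, THREADS_PER_BLOCK)
```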

Functions

apple_matmul_kernel: Apple M5 simdgroup-tiled GEMM: one 64x64 tile per threadgroup.
enqueue_apple_matmul: Enqueue the Apple M5 matmul kernel on the given device context.
morton_decode_2d: Decode a linear index to (tile_m, tile_n) via Morton Z-order.
morton_decode_2d_rect: Decode flat_idx to (tile_m, tile_n) over a (1<<log2_m) x (1<<log2_n) grid.
