Mojo module

matmul_kernel

Simdgroup-tiled Apple M5 matmul kernel built on MmaOpApple.

Each threadgroup computes a 64x64 output tile. Its four simdgroups (128 threads) each own a 32x32 subtile, covered by a 2x2 grid of MmaOpApple operations. At runtime, each simdgroup branches between an unbounded fast path and a bounded path that handles ragged M/N edges and partial K tails.
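The subtile ownership and path selection described above can be modeled on the host in a few lines of Python. This is a sketch, not the kernel's code: the function name `simdgroup_plan`, the row-major ordering of simdgroups inside the tile, and the exact fast-path condition (subtile fully in bounds and K a multiple of BK) are assumptions made for illustration.

```python
BM, BN, BK = 64, 64, 16   # threadgroup tile and K step, from the comptime values
SG_M, SG_N = 32, 32       # per-simdgroup subtile, from the comptime values


def simdgroup_plan(block_m, block_n, sg_id, M, N, K):
    """Return (row0, col0, use_fast_path) for one simdgroup (hypothetical helper)."""
    num_sg_n = BN // SG_N
    # Assumption: simdgroups are numbered row-major within the 64x64 tile.
    sg_row, sg_col = divmod(sg_id, num_sg_n)
    row0 = block_m * BM + sg_row * SG_M
    col0 = block_n * BN + sg_col * SG_N
    # Assumption: the unbounded path needs the whole 32x32 subtile in range
    # and no partial K tail; anything else takes the bounded path.
    full_tile = row0 + SG_M <= M and col0 + SG_N <= N
    full_k = K % BK == 0
    return row0, col0, full_tile and full_k
```

For example, with M = 100, the second row of threadgroup tiles starts at row 64: the simdgroup at rows 64 to 95 is fully in range, while the one starting at row 96 crosses the M edge and would take the bounded path under these assumptions.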

Block-to-tile mapping: each threadgroup decodes block_idx.x with morton_decode_2d_rect over a side_m * side_n grid, where each axis is padded to the next power of two. Threadgroups that decode to a position outside (grid_m, grid_n) return early.
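The block-to-tile mapping can be illustrated with a small host-side Python model. The real `morton_decode_2d_rect` signature and bit ordering are not shown on this page, so `morton_decode_2d_rect_model` below is a hypothetical stand-in: it interleaves the low bits of the two axes (even bits to N, odd bits to M, an assumed convention) and gives any leftover high bits to the longer axis. Computing `grid_m` and `grid_n` by ceiling division of M and N by the tile size is likewise an assumption.

```python
BM, BN = 64, 64  # threadgroup tile size, from the comptime values


def next_pow2(x):
    """Smallest power of two >= x (for x >= 1)."""
    return 1 if x <= 1 else 1 << (x - 1).bit_length()


def morton_decode_2d_rect_model(idx, side_m, side_n):
    """Hypothetical model of a rectangular Morton decode; sides are powers of two."""
    bits_m = side_m.bit_length() - 1
    bits_n = side_n.bit_length() - 1
    shared = min(bits_m, bits_n)
    m = n = 0
    for b in range(shared):
        n |= ((idx >> (2 * b)) & 1) << b       # even bits -> N (assumed order)
        m |= ((idx >> (2 * b + 1)) & 1) << b   # odd bits  -> M (assumed order)
    rest = idx >> (2 * shared)                 # leftover bits go to the longer axis
    if bits_m > bits_n:
        m |= rest << shared
    else:
        n |= rest << shared
    return m, n


def active_tiles(M, N):
    """List the (tile_m, tile_n) pairs that survive the early-return check."""
    grid_m = -(-M // BM)  # ceiling division
    grid_n = -(-N // BN)
    side_m, side_n = next_pow2(grid_m), next_pow2(grid_n)
    tiles = []
    for block_idx_x in range(side_m * side_n):
        tile_m, tile_n = morton_decode_2d_rect_model(block_idx_x, side_m, side_n)
        if tile_m >= grid_m or tile_n >= grid_n:
            continue  # models the early return for padded positions
        tiles.append((tile_m, tile_n))
    return tiles
```

With M = 200 and N = 130, this model gives a 4 x 3 tile grid padded to 4 x 4: 16 threadgroups are launched and 4 of them return early. The cost of padding is a few idle threadgroups, in exchange for a decode that needs only shifts and masks.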

comptime values

BK

comptime BK = 16

BM

comptime BM = 64

BN

comptime BN = 64

NUM_SG

comptime NUM_SG = (NUM_SG_M * NUM_SG_N)

NUM_SG_M

comptime NUM_SG_M = (BM / SG_M)

NUM_SG_N

comptime NUM_SG_N = (BN / SG_N)

SG_M

comptime SG_M = 32

SG_N

comptime SG_N = 32

THREADS_PER_BLOCK

comptime THREADS_PER_BLOCK = (NUM_SG * Int[Int](WARP_SIZE))
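
As a quick check of how the derived values above resolve, the same arithmetic can be evaluated in Python. `WARP_SIZE = 32` is an assumption here; it is the value consistent with the module description's "four simdgroups (128 threads)".

```python
BM, BN = 64, 64
SG_M, SG_N = 32, 32
WARP_SIZE = 32  # assumption: simdgroup width implied by 4 simdgroups = 128 threads

NUM_SG_M = BM // SG_M            # simdgroups along M
NUM_SG_N = BN // SG_N            # simdgroups along N
NUM_SG = NUM_SG_M * NUM_SG_N     # simdgroups per threadgroup
THREADS_PER_BLOCK = NUM_SG * WARP_SIZE

print(NUM_SG_M, NUM_SG_N, NUM_SG, THREADS_PER_BLOCK)  # 2 2 4 128
```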

Functions