
Mojo function

apple_matmul_kernel

def apple_matmul_kernel[in_type: DType, c_type: DType = DType.float32, transpose_b: Bool = False, elementwise_lambda_fn: Optional[def[dtype: DType, width: Int, *, alignment: Int = 1](IndexList[2], SIMD[dtype, width]) capturing -> None] = None](d_ptr: UnsafePointer[Scalar[c_type], MutAnyOrigin], a_ptr: UnsafePointer[Scalar[in_type], ImmutAnyOrigin], b_ptr: UnsafePointer[Scalar[in_type], ImmutAnyOrigin], m: Int, n: Int, k: Int, log2_grid_m: UInt32, log2_grid_n: UInt32)

Apple M5 simdgroup-tiled GEMM: one 64x64 tile per threadgroup.
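To make the "one 64x64 tile per threadgroup" decomposition concrete, here is a minimal host-side reference model in Python. It is not the kernel's code. The helper names `tile_ranges` and `tiled_matmul` are hypothetical, A and B are assumed to be plain row-major nested lists, and clipping the last tile on each axis is an assumption, since this page does not say how partial tiles are handled.

```python
TILE = 64  # tile edge, taken from the kernel summary


def tile_ranges(extent, tile=TILE):
    """Return (start, stop) for each tile along one axis.

    ASSUMPTION: the last tile is clipped when extent is not a multiple of tile.
    """
    return [(start, min(start + tile, extent)) for start in range(0, extent, tile)]


def tiled_matmul(a, b, m, n, k, tile=TILE):
    """Compute D = A @ B one output tile at a time.

    a is m x k and b is k x n, both as nested lists. Each (r0, c0) iteration
    stands in for the work a single threadgroup would own.
    """
    d = [[0.0] * n for _ in range(m)]
    for r0, r1 in tile_ranges(m, tile):
        for c0, c1 in tile_ranges(n, tile):
            # Work of one hypothetical threadgroup: a tile of at most tile x tile outputs.
            for i in range(r0, r1):
                for j in range(c0, c1):
                    acc = 0.0
                    for p in range(k):
                        acc += a[i][p] * b[p][j]
                    d[i][j] = acc
    return d
```

Because every output tile depends only on a row band of A and a column band of B, the tiles can be computed independently, which is what lets each threadgroup own exactly one of them.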

The grid is (1<<log2_grid_m) * (1<<log2_grid_n) threadgroups of 128 threads each. Threadgroups that are out of bounds (OOB) return early after the Morton decode. For transpose_b=True, B is the (N, K) row-major buffer reinterpreted as col_major(K, N).
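The grid shape, the early return, and the transpose_b layout can be modeled on the host with a short Python sketch. Several parts are assumptions rather than documented behavior: that each grid axis is the tile count rounded up to a power of two, that the Morton decode takes the lowest bit for the N axis and alternates from there, and that B is row-major (K, N) when transpose_b=False. The kernel's actual bit order and launch logic may differ, and all function names here are hypothetical.

```python
TILE = 64  # tile edge, taken from the kernel summary


def ceil_log2(x):
    """Smallest e such that (1 << e) >= x, for x >= 1."""
    return (x - 1).bit_length()


def grid_log2(m, n, tile=TILE):
    """ASSUMPTION: each grid axis is the tile count rounded up to a power of two."""
    tiles_m = -(-m // tile)  # ceiling division
    tiles_n = -(-n // tile)
    return ceil_log2(tiles_m), ceil_log2(tiles_n)


def morton_decode(group_id, log2_grid_m, log2_grid_n):
    """Hypothetical decode: alternate bits between N and M while each axis has bits left."""
    tile_m = tile_n = 0
    bit_m = bit_n = 0
    src = 0  # next bit of group_id to consume
    while bit_m < log2_grid_m or bit_n < log2_grid_n:
        if bit_n < log2_grid_n:
            tile_n |= ((group_id >> src) & 1) << bit_n
            bit_n += 1
            src += 1
        if bit_m < log2_grid_m:
            tile_m |= ((group_id >> src) & 1) << bit_m
            bit_m += 1
            src += 1
    return tile_m, tile_n


def active_tiles(m, n, tile=TILE):
    """Return the launched threadgroup count and the tiles that survive the OOB check."""
    log2_grid_m, log2_grid_n = grid_log2(m, n, tile)
    num_groups = (1 << log2_grid_m) * (1 << log2_grid_n)
    active = []
    for group_id in range(num_groups):
        tile_m, tile_n = morton_decode(group_id, log2_grid_m, log2_grid_n)
        if tile_m * tile >= m or tile_n * tile >= n:
            continue  # models the OOB early return
        active.append((tile_m, tile_n))
    return num_groups, active


def b_offset(k_idx, n_idx, k, n, transpose_b):
    """Flat offset of logical element B[k_idx, n_idx] in the B buffer."""
    if transpose_b:
        # Row-major (N, K) and col_major(K, N) address the same memory.
        return n_idx * k + k_idx
    # ASSUMPTION: without transpose_b, B is stored row-major as (K, N).
    return k_idx * n + n_idx
```

For example, with m=130 and n=200 the model launches 16 threadgroups for a 3x4 grid of real tiles, so four threadgroups take the early return. The power-of-two grid is what makes a bit-interleaved decode possible, at the cost of those idle threadgroups.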