Mojo function

apple_matmul_kernel

def apple_matmul_kernel[
    in_type: DType,
    c_type: DType = DType.float32,
    transpose_b: Bool = False,
    elementwise_lambda_fn: Optional[def[dtype: DType, width: Int, *, alignment: Int = 1](IndexList[2], SIMD[dtype, width]) capturing -> None] = None
](
    d_ptr: UnsafePointer[Scalar[c_type], MutAnyOrigin],
    a_ptr: UnsafePointer[Scalar[in_type], ImmutAnyOrigin],
    b_ptr: UnsafePointer[Scalar[in_type], ImmutAnyOrigin],
    m: Int,
    n: Int,
    k: Int,
    log2_grid_m: UInt32,
    log2_grid_n: UInt32
)

Simdgroup-tiled GEMM kernel for Apple M5 GPUs. Each threadgroup computes one 64x64 tile.

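The grid consists of (1 << log2_grid_m) * (1 << log2_grid_n) threadgroups of 128 threads each. Each threadgroup recovers its tile coordinates through a Morton (Z-order) decode; threadgroups whose tile is out of bounds (OOB) return early after the decode. When transpose_b=True, B is an (N, K) row-major buffer that is reinterpreted as a column-major (K, N) matrix.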
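
The grid arithmetic described above can be illustrated on the host side. The following Python sketch is not the kernel's code. The helper names (`grid_log2`, `morton_decode`, `active_tiles`, `b_offset_transposed`) are hypothetical, and several details are assumptions made for illustration: that tile counts are rounded up to the next power of two, which bits of the Morton index map to which axis, and how a non-square grid is decoded. Only the 64x64 tile size, the power-of-two grid, the early return for OOB tiles, and the transpose_b layout come from the description.

```python
TILE = 64  # one 64x64 tile per threadgroup (from the description above)


def grid_log2(m, n, tile=TILE):
    """Assumed sizing: smallest power-of-two grid covering all tiles."""
    tiles_m = -(-m // tile)  # ceiling division
    tiles_n = -(-n // tile)
    # (x - 1).bit_length() is ceil(log2(x)) for x >= 1
    return (tiles_m - 1).bit_length(), (tiles_n - 1).bit_length()


def morton_decode(idx, log2_grid_m, log2_grid_n):
    """Map a linear threadgroup index to (tile_m, tile_n).

    Assumed bit layout: even bits go to the n axis and odd bits to the
    m axis for the bits both axes share; leftover high bits go to the
    longer axis. The real kernel may use a different order.
    """
    shared = min(log2_grid_m, log2_grid_n)
    tile_m = tile_n = 0
    for b in range(shared):
        tile_n |= ((idx >> (2 * b)) & 1) << b
        tile_m |= ((idx >> (2 * b + 1)) & 1) << b
    rest = idx >> (2 * shared)
    if log2_grid_m > log2_grid_n:
        tile_m |= rest << shared
    else:
        tile_n |= rest << shared
    return tile_m, tile_n


def active_tiles(m, n, tile=TILE):
    """List the tiles that do real work; OOB threadgroups are skipped."""
    log2_m, log2_n = grid_log2(m, n, tile)
    tiles_m = -(-m // tile)
    tiles_n = -(-n // tile)
    tiles = []
    for idx in range(1 << (log2_m + log2_n)):
        tile_m, tile_n = morton_decode(idx, log2_m, log2_n)
        if tile_m >= tiles_m or tile_n >= tiles_n:
            continue  # mirrors the early return for OOB threadgroups
        tiles.append((tile_m, tile_n))
    return tiles


def b_offset_transposed(k_idx, n_idx, k):
    """Linear offset of element (k_idx, n_idx) when transpose_b=True.

    Row-major (N, K) places element (n_idx, k_idx) at n_idx * k + k_idx,
    which is the same address as element (k_idx, n_idx) of a
    column-major (K, N) matrix with leading dimension k.
    """
    return n_idx * k + k_idx
```

Under these assumptions, m = 130 and n = 200 need 3 x 4 tiles, so `grid_log2` returns (2, 2) and the grid holds 16 threadgroups, 4 of which fall outside the matrix and would return early.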