Mojo function

matmul_kernel

def matmul_kernel[c_type: DType, a_type: DType, b_type: DType, tile_size: Int, elementwise_lambda_fn: Optional[def[dtype: DType, width: Int, *, alignment: Int = 1](IndexList[2], SIMD[dtype, width]) capturing -> None] = None, s_type: DType = get_accum_type[c_type]()](c_ptr: UnsafePointer[Scalar[c_type], MutAnyOrigin], a_ptr: UnsafePointer[Scalar[a_type], ImmutAnyOrigin], b_ptr: UnsafePointer[Scalar[b_type], ImmutAnyOrigin], m: Int, n: Int, k: Int)

Matrix multiplication using shared memory. This version loads tiles of size tile_size x tile_size from A and B and uses them to update a tile_size x tile_size tile of C. The thread block should have shape (tile_size, tile_size, 1), with each thread mapped to one element of C. The grid should have shape (N/tile_size, M/tile_size, 1). N is the first grid dimension so that memory accesses along it can be coalesced.
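The tiling scheme described above can be illustrated with a small CPU-side emulation. This is a minimal sketch in plain Python, not the Mojo kernel itself: the helper names (ceildiv, launch_shape, tiled_matmul) are hypothetical. It also makes three assumptions that the description does not state: the matrices are row-major, the grid size is rounded up when M or N is not a multiple of tile_size, and out-of-range tile entries are zero-padded.

```python
def ceildiv(a, b):
    """Integer division rounding up (assumption: partial tiles get their own block)."""
    return -(-a // b)


def launch_shape(m, n, tile_size):
    """Grid and block shapes from the description: grid x covers N, grid y covers M."""
    grid = (ceildiv(n, tile_size), ceildiv(m, tile_size), 1)
    block = (tile_size, tile_size, 1)
    return grid, block


def tiled_matmul(a, b, m, n, k, tile_size):
    """Emulate the tiled kernel on the CPU.

    a is a flat row-major m x k list, b is a flat row-major k x n list
    (row-major layout is an assumption). Returns c as a flat m x n list.
    """
    c = [0] * (m * n)
    grid, _ = launch_shape(m, n, tile_size)
    for block_y in range(grid[1]):          # tiles along M
        for block_x in range(grid[0]):      # tiles along N
            # One accumulator per thread (ty, tx) in the block.
            acc = [[0] * tile_size for _ in range(tile_size)]
            for k0 in range(0, k, tile_size):
                # Stand-ins for the shared-memory tiles of A and B.
                a_tile = [[0] * tile_size for _ in range(tile_size)]
                b_tile = [[0] * tile_size for _ in range(tile_size)]
                for ty in range(tile_size):
                    for tx in range(tile_size):
                        row = block_y * tile_size + ty
                        col = block_x * tile_size + tx
                        # Each thread loads one element of each tile,
                        # leaving zeros where the index is out of range.
                        if row < m and k0 + tx < k:
                            a_tile[ty][tx] = a[row * k + k0 + tx]
                        if k0 + ty < k and col < n:
                            b_tile[ty][tx] = b[(k0 + ty) * n + col]
                # On a GPU, a barrier would separate the loads from the math.
                for ty in range(tile_size):
                    for tx in range(tile_size):
                        for kk in range(tile_size):
                            acc[ty][tx] += a_tile[ty][kk] * b_tile[kk][tx]
            # Each thread writes its single element of C.
            for ty in range(tile_size):
                for tx in range(tile_size):
                    row = block_y * tile_size + ty
                    col = block_x * tile_size + tx
                    if row < m and col < n:
                        c[row * n + col] = acc[ty][tx]
    return c
```

In this sketch, tx indexes columns of C (the N direction), which mirrors the description's choice of N as the first grid dimension: threads that are adjacent in x touch adjacent elements of a row-major C and B.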