Mojo function

enqueue_apple_matmul

def enqueue_apple_matmul[in_type: DType, c_type: DType = DType.float32, transpose_b: Bool = False, elementwise_lambda_fn: Optional[def[dtype: DType, width: SIMDLength, *, alignment: Int = Int(1)](IndexList[Int(2)], SIMD[dtype, width]) capturing thin -> None] = None](c: TileTensor[c_type, Storage=c.Storage, address_space=c.address_space, linear_idx_type=c.linear_idx_type], a: TileTensor[in_type, Storage=a.Storage, address_space=a.address_space, linear_idx_type=a.linear_idx_type], b: TileTensor[in_type, Storage=b.Storage, address_space=b.address_space, linear_idx_type=b.linear_idx_type], ctx: DeviceContext, force_split_k: Optional[Bool] = None)

Enqueue AppleM5MatMul.run on the given device context.

Accepts row-major TileTensor operands. When transpose_b=True, B is expected to have shape (N, K); otherwise it has shape (K, N).

force_split_k picks the K-reduction strategy: None (default) auto-routes under-occupied shapes (few output tiles, deep K) to split-K; True always uses split-K; False always uses the single-pass kernel.
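To make the two K-reduction strategies concrete, here is a minimal pure-Python reference model, not the Mojo kernel. The helper names (`matmul_single_pass`, `matmul_split_k`, `use_split_k`) and the `num_splits` and `auto_choice` parameters are hypothetical and exist only for illustration; the real auto-routing heuristic is not described here, so it is represented by a plain boolean.

```python
def matmul_single_pass(a, b):
    """Compute C = A @ B in one pass over K (row-major lists of lists)."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


def matmul_split_k(a, b, num_splits):
    """Compute C = A @ B by summing partial products over chunks of K."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    chunk = (k + num_splits - 1) // num_splits  # ceil(k / num_splits)
    for start in range(0, k, chunk):
        stop = min(start + chunk, k)
        for i in range(m):
            for j in range(n):
                # Each K chunk contributes one partial dot product.
                c[i][j] += sum(a[i][p] * b[p][j] for p in range(start, stop))
    return c


def use_split_k(force_split_k, auto_choice):
    """Model the three-state override: None defers to the automatic choice."""
    return auto_choice if force_split_k is None else force_split_k


a = [[1.0, 2.0, 3.0, 4.0]]        # M=1, K=4
b = [[1.0], [2.0], [3.0], [4.0]]  # K=4, N=1
assert matmul_single_pass(a, b) == matmul_split_k(a, b, num_splits=2)
```

Both strategies compute the same mathematical result. With general floating-point inputs, splitting K changes the summation order, so the two paths may differ by rounding.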

Parameters:

in_type (DType): A/B element type (fp16, bf16, fp32).
c_type (DType): Output element type (fp16, bf16, fp32); defaults to float32. Accumulation is in fp32.
transpose_b (Bool): If True, B is (N, K) row-major (viewed as col_major(K, N)); otherwise B is (K, N) row-major (defaults to False).
elementwise_lambda_fn (Optional[def[dtype: DType, width: SIMDLength, *, alignment: Int = Int(1)](IndexList[Int(2)], SIMD[dtype, width]) capturing thin -> None]): Optional fused epilogue; it receives a SIMD[c_type, width] vector of output values together with its absolute (row, col) coordinates, following AMD's contract (defaults to None).
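The layout rule for transpose_b and the shape of the epilogue callback can be sketched with a small pure-Python reference model. This is not the Mojo API: `reference_matmul` is a hypothetical helper, the callback is modeled at width 1 (one scalar per call), and the choice that the callback, rather than the kernel, stores the result when an epilogue is supplied is an assumption of this sketch.

```python
def reference_matmul(a, b, transpose_b=False, epilogue=None):
    """Reference C = A @ B (or A @ B^T when B is stored as (N, K))."""
    m, k = len(a), len(a[0])
    n = len(b) if transpose_b else len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                # With transpose_b, row j of B holds column j of the logical B.
                acc += a[i][p] * (b[j][p] if transpose_b else b[p][j])
            if epilogue is None:
                c[i][j] = acc
            else:
                # Assumption: the callback receives absolute (row, col)
                # coordinates and is responsible for storing the value.
                epilogue((i, j), acc)
    return c


a = [[1.0, 2.0], [3.0, 4.0]]    # (M, K)
b = [[5.0, 6.0], [7.0, 8.0]]    # (K, N)
b_t = [[5.0, 7.0], [6.0, 8.0]]  # same matrix stored as (N, K)

plain = reference_matmul(a, b)
transposed = reference_matmul(a, b_t, transpose_b=True)

# Fused epilogue example: add 1.0 to every output element before storing it.
out = [[0.0, 0.0], [0.0, 0.0]]

def add_one_and_store(coords, value):
    row, col = coords
    out[row][col] = value + 1.0

reference_matmul(a, b, epilogue=add_one_and_store)
```

Here `plain` and `transposed` hold the same values, since `b_t` is just `b` stored in (N, K) layout.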

Args:

c (TileTensor[c_type, Storage=c.Storage, address_space=c.address_space, linear_idx_type=c.linear_idx_type]): Output matrix (M, N) row-major; M and N derive from its dims.
a (TileTensor[in_type, Storage=a.Storage, address_space=a.address_space, linear_idx_type=a.linear_idx_type]): A operand matrix (M, K) row-major; K derives from a.dim[1].
b (TileTensor[in_type, Storage=b.Storage, address_space=b.address_space, linear_idx_type=b.linear_idx_type]): B operand matrix, (K, N) for transpose_b=False or (N, K) for transpose_b=True.
ctx (DeviceContext): Device context to enqueue the kernel on; must be Apple M5 (compute_capability == 5).
force_split_k (Optional[Bool]): K-reduction strategy override; None auto-routes under-occupied shapes to split-K, True always uses split-K, False always uses the single-pass kernel (defaults to None).
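The shape relationships among c, a, and b can be checked before a call with a small helper. The following is a hedged pure-Python sketch: `derive_mnk` is a hypothetical function, and raising `ValueError` on a mismatch is this sketch's own choice, not documented behavior of enqueue_apple_matmul.

```python
def derive_mnk(c_shape, a_shape, b_shape, transpose_b=False):
    """Derive (M, N, K) from operand shapes and check that they agree."""
    m, n = c_shape        # M and N come from the output matrix
    k = a_shape[1]        # K comes from the second dimension of A
    expected_b = (n, k) if transpose_b else (k, n)
    if a_shape[0] != m:
        raise ValueError(f"A has {a_shape[0]} rows, expected M={m}")
    if tuple(b_shape) != expected_b:
        raise ValueError(f"B has shape {tuple(b_shape)}, expected {expected_b}")
    return m, n, k


# B stored as (K, N)
dims = derive_mnk((64, 128), (64, 256), (256, 128))
# B stored as (N, K) with transpose_b=True
dims_t = derive_mnk((64, 128), (64, 256), (128, 256), transpose_b=True)
```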

Raises:

If the attached GPU is not Apple M5 (compute_capability != 5). M1 through M4 lack the GPU neural accelerator, and future generations require re-validation.