
Mojo function

mma

mma[block_size: Int = 1](mut d: SIMD[dtype, size], a: SIMD[dtype, size], b: SIMD[dtype, size], c: SIMD[dtype, size])

Performs a warp-synchronous, Tensor Core based matrix-multiply-accumulate (MMA) operation.

This function executes a matrix multiply-accumulate operation using GPU Tensor Cores, synchronizing across the warp. It dispatches to architecture-specific implementations for NVIDIA and AMD GPUs.

The operation performed is: d = (a * b) + c

Supported configurations depend on the GPU architecture:

  • NVIDIA: Various combinations of FP32, FP16, BF16, and FP8 formats
  • AMD: Limited subset of FP32 and FP16 operations

Note:

  • All threads in a warp must execute this operation together
  • Input matrices must be properly loaded and formatted for Tensor Core operations
  • Matrix dimensions and data types must match hardware requirements

Parameters:

  • block_size (Int): The block size of the MMA operation (e.g., 4x4x4_16B). Applies to AMD GPUs only.

Args:

  • d (SIMD[dtype, size]): Output SIMD vector that stores the result.
  • a (SIMD[dtype, size]): First input matrix, as a SIMD vector.
  • b (SIMD[dtype, size]): Second input matrix, as a SIMD vector.
  • c (SIMD[dtype, size]): Accumulator matrix, as a SIMD vector.
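
Example:

A minimal sketch of calling mma from device code, assuming an NVIDIA shape with FP16 inputs and an FP32 accumulator (one of the combinations listed above). The per-thread fragment sizes and the kernel function name are illustrative assumptions, not part of the API.

```mojo
from gpu.mma import mma

fn mma_f16_f32_fragment():
    # Hypothetical per-thread fragments for an NVIDIA MMA with FP16 inputs
    # and an FP32 accumulator. The fragment sizes (4, 2, and 4 elements) are
    # assumptions for illustration; the valid combinations depend on the GPU
    # architecture and the matrix shape being targeted.
    var a = SIMD[DType.float16, 4](1.0)  # this thread's slice of matrix A
    var b = SIMD[DType.float16, 2](1.0)  # this thread's slice of matrix B
    var c = SIMD[DType.float32, 4](0.0)  # this thread's slice of the accumulator
    var d = SIMD[DType.float32, 4](0.0)  # receives this thread's slice of (a * b) + c

    # Every thread in the warp must reach this call together; it dispatches to
    # the architecture-specific Tensor Core instruction.
    mma(d, a, b, c)
```

In practice, the a and b fragments are loaded from memory in the layout the Tensor Core expects rather than splatted constants, and the enclosing kernel is launched so that a full warp executes the call.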
