Mojo function
mma
mma[block_size: Int = 1](mut d: SIMD[dtype, size], a: SIMD[dtype, size], b: SIMD[dtype, size], c: SIMD[dtype, size])
Performs a warp-synchronous Tensor Core matrix multiply-accumulate (MMA) operation.
This function executes a matrix multiply-accumulate operation using GPU Tensor Cores, synchronizing across the warp. It dispatches to architecture-specific implementations for NVIDIA and AMD GPUs.
The operation performed is: d = (a * b) + c
Supported configurations depend on the GPU architecture:
- NVIDIA: Various combinations of FP32, FP16, BF16, and FP8 formats
- AMD: Limited subset of FP32 and FP16 operations
Note:
- All threads in a warp must execute this operation together
- Input matrices must be properly loaded and formatted for Tensor Core operations
- Matrix dimensions and data types must match hardware requirements
Parameters:
- block_size (Int): The size of the block of the MMA operation (e.g., 4x4x4_16B). Applies to AMD GPUs only.
Args:
- d (SIMD[dtype, size]): Output SIMD vector to store the result.
- a (SIMD[dtype, size]): First input matrix as a SIMD vector.
- b (SIMD[dtype, size]): Second input matrix as a SIMD vector.
- c (SIMD[dtype, size]): Accumulator matrix as a SIMD vector.
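As an illustrative sketch, a device kernel might invoke mma as below. The fragment sizes shown (4-, 2-, and 4-element SIMD vectors) and the FP16-in/FP32-accumulate combination are assumptions for illustration; the actual per-thread fragment shapes and supported dtype pairings depend on the target GPU architecture, as described above.

```mojo
from gpu.mma import mma

fn mma_kernel():
    # Hypothetical per-thread fragments of an FP16 tile with an FP32
    # accumulator; real fragment sizes/layouts are architecture-specific.
    var a = SIMD[DType.float16, 4](1.0)  # this thread's slice of matrix A
    var b = SIMD[DType.float16, 2](2.0)  # this thread's slice of matrix B
    var c = SIMD[DType.float32, 4](0.0)  # this thread's accumulator slice
    var d = SIMD[DType.float32, 4]()     # receives d = (a * b) + c

    # All threads in the warp must reach this call together.
    mma(d, a, b, c)
```

Note that the result is written into `d` via the `mut` argument rather than returned, so the same accumulator fragment can be reused across iterations of a tiled loop.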