Mojo module
mma_util
Matrix multiply-accumulate (MMA) utilities for GPU tensor cores.
This module provides functions for loading matrix tiles from memory into registers and storing results back to memory when using tensor cores for matrix multiplication. It supports both NVIDIA and AMD GPUs with functions specialized for different data types (FP32, FP16, BF16).
The key functions are:
- load_matrix_a: Loads tiles from the first input matrix A
- load_matrix_b: Loads tiles from the second input matrix B
- store_matrix_d: Stores result tiles to the output matrix D
Each function handles the specific memory access patterns required by the tensor core instructions on each GPU architecture. The tile sizes and data layouts match the hardware requirements documented in:
- NVIDIA PTX: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-fragment-mma-1688
- AMD Matrix Cores: https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/
Functions
- `load_matrix_a`: Loads a tile of matrix A from memory to registers for TF32 tensor core operations.
- `load_matrix_a_amd`: Loads a tile of matrix A from memory to registers for AMD FP32 tensor core operations.
- `load_matrix_b`: Loads a tile of matrix B from memory to registers for TF32 tensor core operations.
- `load_matrix_b_amd`: Loads a tile of matrix B from memory to registers for AMD FP32 tensor core operations.
- `store_matrix_d`: Stores a tile of matrix D from registers to memory after a tensor core operation.
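Taken together, the typical warp-level pattern is: each thread loads its fragment of the A and B tiles, issues the MMA instruction, and stores its fragment of the D tile. The sketch below illustrates that flow for an NVIDIA 16x8x8 TF32 tile. The import paths, signatures (shape parameters plus a pointer, tile coordinates, and leading dimension), and per-thread fragment widths are assumptions inferred from the descriptions above, not a confirmed API; consult the module source for the exact declarations.

```mojo
from gpu.mma import mma
from gpu.mma_util import load_matrix_a, load_matrix_b, store_matrix_d

fn mma_tile_16x8x8(
    d: UnsafePointer[Float32],
    a: UnsafePointer[Float32],
    b: UnsafePointer[Float32],
    lda: Int, ldb: Int, ldd: Int,
):
    # Each thread of the warp loads its share of the 16x8x8 tile;
    # the fragment layout follows the PTX mma.m16n8k8 spec linked above.
    var a_frag = load_matrix_a[16, 8, 8](a, 0, 0, lda)
    var b_frag = load_matrix_b[16, 8, 8](b, 0, 0, ldb)

    # Accumulator fragment, initialized to zero (assumed 4 floats per thread).
    var d_frag = SIMD[DType.float32, 4](0)

    # Warp-level multiply-accumulate: d_frag += a_frag * b_frag.
    mma(d_frag, a_frag, b_frag, d_frag)

    # Scatter each thread's portion of the result tile back to memory.
    store_matrix_d[16, 8, 8](d, d_frag, 0, 0, ldd)
```

In a full kernel this body would be wrapped in a loop over the K dimension, reusing `d_frag` as the running accumulator and advancing the A and B tile coordinates each iteration.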