Mojo module

mma_util

This module provides abstractions for performing matrix-multiply-accumulate (MMA) operations using tensor cores.

PTX documentation: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-fragment-mma-1688
AMD documentation: https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/

Functions

  • load_matrix_a: For shape m16n8k8 and type tf32, loads a matrix A tile from memory into registers in the specific order that tensor cores require to perform a warp-synchronous MMA operation.
  • load_matrix_a_amd:
  • load_matrix_b: For shape m16n8k8 and type tf32, loads a matrix B tile from memory into registers in the specific order that tensor cores require to perform a warp-synchronous MMA operation.
  • load_matrix_b_amd:
  • store_matrix_d: Stores a matrix D tile from registers to memory in the specific order in which a tensor-core warp-synchronous MMA operation leaves its results.
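
To make the "specific order" in the descriptions above concrete, here is a minimal host-side sketch in plain Python that models how the PTX documentation distributes an m16n8k8 tile across the 32 lanes of a warp. It is not the mma_util implementation: the function names (`load_a_fragment`, `load_b_fragment`, `d_fragment_coords`, `store_d_fragment`, `simulate_warp_mma`) are hypothetical, the index formulas are assumed from the PTX fragment tables for `mma.m16n8k8` with tf32 inputs, and the AMD variants are not modeled.

```python
# Host-side model of the m16n8k8 fragment layout (assumed from the PTX docs).
# All names here are illustrative and are not part of mma_util.

WARP_SIZE = 32
M, N, K = 16, 8, 8  # A is M x K, B is K x N, D is M x N


def load_a_fragment(a, lane):
    """Return the four A elements (a0..a3) that one lane holds."""
    group_id, tid_in_group = lane // 4, lane % 4
    return [
        a[group_id][tid_in_group],          # a0
        a[group_id + 8][tid_in_group],      # a1
        a[group_id][tid_in_group + 4],      # a2
        a[group_id + 8][tid_in_group + 4],  # a3
    ]


def load_b_fragment(b, lane):
    """Return the two B elements (b0, b1) that one lane holds."""
    group_id, tid_in_group = lane // 4, lane % 4
    return [
        b[tid_in_group][group_id],      # b0
        b[tid_in_group + 4][group_id],  # b1
    ]


def d_fragment_coords(lane):
    """Return the (row, col) positions of the four D elements of one lane."""
    group_id, tid_in_group = lane // 4, lane % 4
    return [
        (group_id + 8 * (i // 2), tid_in_group * 2 + (i & 1))
        for i in range(4)
    ]


def store_d_fragment(d, lane, regs):
    """Scatter one lane's four accumulator values into the D tile."""
    for (row, col), value in zip(d_fragment_coords(lane), regs):
        d[row][col] = value


def simulate_warp_mma(a, b):
    """Emulate D = A @ B for one warp using the fragment layout above."""
    # Rebuild the tiles from per-lane fragments. This stands in for the
    # hardware, which consumes the registers of all lanes collectively.
    a_tile = [[None] * K for _ in range(M)]
    b_tile = [[None] * N for _ in range(K)]
    for lane in range(WARP_SIZE):
        g, t = lane // 4, lane % 4
        a0, a1, a2, a3 = load_a_fragment(a, lane)
        a_tile[g][t], a_tile[g + 8][t] = a0, a1
        a_tile[g][t + 4], a_tile[g + 8][t + 4] = a2, a3
        b0, b1 = load_b_fragment(b, lane)
        b_tile[t][g], b_tile[t + 4][g] = b0, b1

    # Each lane ends up owning four D elements, which it writes back.
    d = [[None] * N for _ in range(M)]
    for lane in range(WARP_SIZE):
        regs = [
            sum(a_tile[r][k] * b_tile[k][c] for k in range(K))
            for r, c in d_fragment_coords(lane)
        ]
        store_d_fragment(d, lane, regs)
    return d


if __name__ == "__main__":
    a = [[r * K + c for c in range(K)] for r in range(M)]
    b = [[(r + 2 * c) % 5 for c in range(N)] for r in range(K)]
    reference = [
        [sum(a[r][k] * b[k][c] for k in range(K)) for c in range(N)]
        for r in range(M)
    ]
    print(simulate_warp_mma(a, b) == reference)
```

Under these assumptions every lane holds four elements of A, two of B, and four of D, so the 32 lanes together cover the 16x8, 8x8, and 16x8 tiles exactly once. That is why the load and store helpers must follow a fixed per-lane order and cannot copy rows arbitrarily.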