Mojo package

arch

Architecture-specific MMA implementations.

This package contains GPU architecture-specific implementations of matrix multiply-accumulate (MMA) operations:

  • mma_nvidia: NVIDIA Tensor Cores (SM70-SM90) - Volta through Hopper
  • mma_nvidia_sm100: NVIDIA Blackwell (SM100) - 5th-generation Tensor Cores
  • mma_amd: AMD Matrix Cores (CDNA2/3/4) - data-center GPUs
  • mma_amd_rdna: AMD WMMA (RDNA3/4) - consumer GPUs

Module Organization

Each architecture module contains:

  • Private implementation functions (prefixed with _)
  • Architecture-specific intrinsic calls
  • Data type conversions specific to that architecture
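
As an illustration of this pattern only (the function name, fragment shapes, and intrinsic string below are assumptions, not the actual module contents), a private implementation function in one of these modules typically packs operands into the layout the hardware expects and wraps an LLVM intrinsic:

```mojo
from sys import llvm_intrinsic


@always_inline
fn _mma_bf16_sketch(
    mut d: SIMD[DType.float32, 4],
    a: SIMD[DType.bfloat16, 4],
    b: SIMD[DType.bfloat16, 2],
    c: SIMD[DType.float32, 4],
):
    # Hypothetical sketch: convert/pack operands into the register layout the
    # intrinsic expects, then invoke the architecture-specific tensor-core
    # instruction. Real intrinsic names and shapes vary per architecture.
    d = llvm_intrinsic[
        "llvm.nvvm.mma.m16n8k8.row.col.bf16", SIMD[DType.float32, 4]
    ](a, b, c)
```

Keeping these functions private (underscore-prefixed) lets each module change its intrinsic selection and data conversions without affecting the public dispatch interface.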

Usage

These modules should not be imported directly by user code. Instead, use the unified interface in gpu.compute.mma, which automatically dispatches to the appropriate architecture-specific implementation at compile time:

from gpu.compute import mma

# Automatically dispatches to the correct architecture
result = mma(a, b, c)
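
A slightly more concrete sketch of the same call; the fragment element types and vector widths here are assumptions for illustration, since the real shapes depend on the target architecture and the instruction selected:

```mojo
from gpu.compute import mma


fn mma_example():
    # Hypothetical bf16 x bf16 -> f32 fragments; sizes are illustrative only.
    var a = SIMD[DType.bfloat16, 4](1.0)
    var b = SIMD[DType.bfloat16, 2](1.0)
    var c = SIMD[DType.float32, 4](0.0)
    # The same call works on NVIDIA and AMD targets; the matching
    # architecture-specific intrinsic is chosen at compile time.
    var result = mma(a, b, c)
```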

Internal Implementation Details

The main gpu.compute.mma module imports these implementations:

from .arch.mma_nvidia import _mma_nvidia
from .arch.mma_amd import _mma_amd

And dispatches based on compile-time architecture detection:

@parameter
if is_nvidia_gpu():
    _mma_nvidia(d, a, b, c)
elif is_amd_gpu():
    _mma_amd[block_size](d, a, b, c)
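
Extending the same compile-time dispatch to all four modules might look like the sketch below. The helpers is_nvidia_sm100_gpu() and is_amd_rdna_gpu() are assumed names for illustration, not confirmed APIs:

```mojo
@parameter
if is_nvidia_sm100_gpu():  # hypothetical check for Blackwell (SM100)
    _mma_nvidia_sm100(d, a, b, c)
elif is_nvidia_gpu():
    _mma_nvidia(d, a, b, c)
elif is_amd_rdna_gpu():  # hypothetical check for consumer RDNA GPUs
    _mma_amd_rdna(d, a, b, c)
elif is_amd_gpu():
    _mma_amd[block_size](d, a, b, c)
```

Because @parameter if is resolved at compile time, only the branch for the actual target architecture is compiled into the kernel; the other paths incur no runtime cost.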

Modules

  • mma_amd: AMD CDNA Matrix Cores implementation for matrix multiply-accumulate operations.
  • mma_amd_rdna: AMD RDNA3/4 WMMA implementation for matrix multiply-accumulate operations.
  • mma_nvidia: NVIDIA Tensor Cores implementation for matrix multiply-accumulate operations.
  • mma_nvidia_sm100: Utilities for working with NVIDIA SM100 (Blackwell) MMA instructions.
