Mojo package

arch

Architecture-specific MMA implementations.

This package contains GPU architecture-specific implementations of matrix multiply-accumulate (MMA) operations:

  • mma_nvidia: NVIDIA Tensor Cores (SM70-SM90) - Volta through Hopper
  • mma_nvidia_sm100: NVIDIA Blackwell (SM100) - 5th-generation Tensor Cores
  • mma_amd: AMD Matrix Cores (CDNA2/3/4) - data-center GPUs
  • mma_amd_rdna: AMD WMMA (RDNA3/4) - consumer GPUs

Module Organization

Each architecture module contains:

  • Private implementation functions (prefixed with _)
  • Architecture-specific intrinsic calls
  • Data type conversions specific to that architecture
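
As an illustration of this pattern only (the function name, fragment shapes, and intrinsic string below are assumptions, not the actual module contents), a private implementation function in one of these modules typically packs operands into the layout the hardware expects and wraps an LLVM intrinsic:

```mojo
from sys import llvm_intrinsic


@always_inline
fn _mma_bf16_sketch(
    mut d: SIMD[DType.float32, 4],
    a: SIMD[DType.bfloat16, 4],
    b: SIMD[DType.bfloat16, 2],
    c: SIMD[DType.float32, 4],
):
    # Hypothetical sketch: convert/pack operands into the register layout the
    # intrinsic expects, then invoke the architecture-specific tensor-core
    # instruction. Real intrinsic names and shapes vary per architecture.
    d = llvm_intrinsic[
        "llvm.nvvm.mma.m16n8k8.row.col.bf16", SIMD[DType.float32, 4]
    ](a, b, c)
```

Keeping these functions private (underscore-prefixed) lets each module change its intrinsic selection and data conversions without affecting the public dispatch interface.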

Usage

These modules should not be imported directly by user code. Instead, use the unified interface in gpu.compute.mma, which automatically dispatches to the appropriate architecture-specific implementation at compile time:

from gpu.compute import mma

# Automatically dispatches to the correct architecture
result = mma(a, b, c)
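
A slightly more concrete sketch of the same call; the fragment element types and vector widths here are assumptions for illustration, since the real shapes depend on the target architecture and the instruction selected:

```mojo
from gpu.compute import mma


fn mma_example():
    # Hypothetical bf16 x bf16 -> f32 fragments; sizes are illustrative only.
    var a = SIMD[DType.bfloat16, 4](1.0)
    var b = SIMD[DType.bfloat16, 2](1.0)
    var c = SIMD[DType.float32, 4](0.0)
    # The same call works on NVIDIA and AMD targets; the matching
    # architecture-specific intrinsic is chosen at compile time.
    var result = mma(a, b, c)
```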

Internal Implementation Details

The main gpu.compute.mma module imports these implementations:

from .arch.mma_nvidia import _mma_nvidia
from .arch.mma_amd import _mma_amd

And dispatches based on compile-time architecture detection:

@parameter
if is_nvidia_gpu():
    _mma_nvidia(d, a, b, c)
elif is_amd_gpu():
    _mma_amd[block_size](d, a, b, c)
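
Extending the same compile-time dispatch to all four modules might look like the sketch below. The helpers is_nvidia_sm100_gpu() and is_amd_rdna_gpu() are assumed names for illustration, not confirmed APIs:

```mojo
@parameter
if is_nvidia_sm100_gpu():  # hypothetical check for Blackwell (SM100)
    _mma_nvidia_sm100(d, a, b, c)
elif is_nvidia_gpu():
    _mma_nvidia(d, a, b, c)
elif is_amd_rdna_gpu():  # hypothetical check for consumer RDNA GPUs
    _mma_amd_rdna(d, a, b, c)
elif is_amd_gpu():
    _mma_amd[block_size](d, a, b, c)
```

Because @parameter if is resolved at compile time, only the branch for the actual target architecture is compiled into the kernel; the other paths incur no runtime cost.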

Modules

  • mma_amd: AMD CDNA Matrix Cores implementation for matrix multiply-accumulate operations.
  • mma_amd_rdna: AMD RDNA3/4 WMMA implementation for matrix multiply-accumulate operations.
  • mma_nvidia: NVIDIA Tensor Cores implementation for matrix multiply-accumulate operations.
  • mma_nvidia_sm100: Utilities for working with NVIDIA SM100 (Blackwell) MMA instructions.
