Mojo module

tensor_core

Tensor Core Module for High-Performance Matrix Operations

Provides abstractions for using GPU Tensor Cores to perform optimized matrix operations. It supports both NVIDIA and AMD GPU architectures with hardware-specific optimizations.

Key Components:

  • TensorCore: Core struct that encapsulates tensor core operations with support for various data types and matrix shapes. It handles loading matrix fragments, performing matrix multiply-accumulate operations, and storing results.

  • Matrix Fragment Management: Functions for loading and storing matrix fragments to/from shared memory with hardware-specific optimizations.

  • Matrix Multiply-Accumulate (MMA): Optimized implementations of matrix multiplication operations using tensor cores.

Supported Operations:

  • Matrix loading with various layouts and swizzling patterns
  • Matrix multiply-accumulate (D = A * B + C)
  • Matrix storing with hardware-specific optimizations
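
To make the multiply-accumulate semantics concrete, here is a minimal reference sketch in plain Python. It is not the module's API. It assumes the usual M×N×K reading of an MMA shape (A is M×K, B is K×N, C and D are M×N), and the helper name `mma_reference` is hypothetical.

```python
def mma_reference(a, b, c):
    """Return D = A * B + C for row-major nested lists.

    Assumed shapes: a is M x K, b is K x N, c is M x N; the result is M x N.
    This is a scalar reference for checking results, not a tensor core kernel.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "A must be M x K"
    assert len(c) == m and all(len(row) == n for row in c), "C must be M x N"
    return [
        [c[i][j] + sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


# One tile with the 16x8x8 shape: M=16, N=8, K=8.
M, N, K = 16, 8, 8
a = [[1.0] * K for _ in range(M)]
b = [[2.0] * N for _ in range(K)]
c = [[0.5] * N for _ in range(M)]
d = mma_reference(a, b, c)
```

A reference like this is mainly useful for validating the output of a hardware MMA on small tiles; the tensor core performs the same computation for a whole tile in one instruction, cooperatively across the threads of a warp.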

Supported Data Types:

  • NVIDIA: float32, bfloat16, float16, float8_e4m3fn, float8_e5m2
  • AMD: float32, bfloat16, float16

Supported Matrix Shapes:

  • NVIDIA: 16×8×8, 16×8×4, 16×8×16, 8×8×4, 16×8×32
  • AMD: 16×16×4, 16×16×16

Aliases

  • shape_16x16x16 = IndexList(16, 16, 16)
  • shape_16x16x4 = IndexList(16, 16, 4)
  • shape_16x8x16 = IndexList(16, 8, 16)
  • shape_16x8x32 = IndexList(16, 8, 32)
  • shape_16x8x4 = IndexList(16, 8, 4)
  • shape_16x8x8 = IndexList(16, 8, 8)
  • shape_8x8x4 = IndexList(8, 8, 4)
  • shape_null = IndexList(0, 0, 0)

Structs

  • TensorCore: An abstraction over GPU tensor core hardware for performing optimized matrix operations.

Functions

  • get_fragment_size: Calculates the fragment size per thread for a given MMA shape.
  • get_mma_shape: Returns the appropriate matrix multiply-accumulate (MMA) shape for tensor core operations.
  • num_matrix_reg: Calculates the number of matrix registers required per thread.
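
As a rough illustration of what a per-thread fragment size means, the sketch below divides the elements of each operand tile evenly across the threads of a warp. This is an assumption about how the work is distributed, not the implementation of get_fragment_size. The function name `fragment_size` is hypothetical, and so are the warp sizes used here (32 threads for NVIDIA, 64 for AMD).

```python
def fragment_size(shape, warp_size):
    """Return per-thread element counts (A, B, C) for an M x N x K MMA shape.

    Assumes A is M x K, B is K x N, C is M x N, and that each tile is split
    evenly across all threads in the warp.
    """
    m, n, k = shape
    return (m * k // warp_size, n * k // warp_size, m * n // warp_size)


# NVIDIA-style shape with an assumed warp size of 32.
nvidia_frag = fragment_size((16, 8, 8), 32)

# AMD-style shape with an assumed wavefront size of 64.
amd_frag = fragment_size((16, 16, 16), 64)
```

Under these assumptions, each thread of a 32-wide warp holds 4 elements of A, 2 of B, and 4 of C for the 16×8×8 shape.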