Mojo module
tensor_core
Tensor Core Module for High-Performance Matrix Operations
Provides abstractions for using GPU Tensor Cores to perform optimized matrix operations. It supports both NVIDIA and AMD GPU architectures with hardware-specific optimizations.
Key Components:
- TensorCore: Core struct that encapsulates tensor core operations with support for various data types and matrix shapes. It handles loading matrix fragments, performing matrix multiply-accumulate operations, and storing results.
- Matrix Fragment Management: Functions for loading and storing matrix fragments to/from shared memory with hardware-specific optimizations.
- Matrix Multiply-Accumulate (MMA): Optimized implementations of matrix multiplication operations using tensor cores.
Supported Operations:
- Matrix loading with various layouts and swizzling patterns
- Matrix multiply-accumulate (D = A * B + C)
- Matrix storing with hardware-specific optimizations
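To make the multiply-accumulate operation concrete, here is a minimal reference model in plain Python. It is a sketch of the arithmetic D = A * B + C only, not the module's API: the function name `mma_reference` is hypothetical, and the shape is assumed to be given in M×N×K order, so A is M×K, B is K×N, and C and D are M×N.

```python
def mma_reference(a, b, c):
    """Return D = A * B + C for one tile, using plain nested lists.

    a is M x K, b is K x N, c is M x N. This models the arithmetic only,
    not how fragments are distributed across GPU threads.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "A must be M x K"
    assert len(c) == m and all(len(row) == n for row in c), "C must be M x N"
    return [
        [c[i][j] + sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


# One 16x8x8 tile (assumed to mean M=16, N=8, K=8).
M, N, K = 16, 8, 8
A = [[1.0] * K for _ in range(M)]
B = [[2.0] * N for _ in range(K)]
C = [[0.5] * N for _ in range(M)]
D = mma_reference(A, B, C)
```

On a GPU the same tile is computed cooperatively by the threads of a warp, each holding only a fragment of A, B, and C in registers. The model above ignores that distribution.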
Supported Data Types:
- NVIDIA: float32, bfloat16, float16, float8_e4m3fn, float8_e5m2
- AMD: float32, bfloat16, float16
Supported Matrix Shapes (M×N×K):
- NVIDIA: 16×8×8, 16×8×4, 16×8×16, 8×8×4, 16×8×32
- AMD: 16×16×4, 16×16×16
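The two lists above can be captured as a small lookup table, which may be handy for validating a configuration before launching a kernel. This is an illustrative sketch: the names `SUPPORTED_DTYPES`, `SUPPORTED_SHAPES`, and `is_listed` are hypothetical and not part of the module. The lists do not say which data type pairs with which shape, so the check treats the two independently.

```python
# Data types and shapes exactly as listed above, keyed by vendor.
SUPPORTED_DTYPES = {
    "nvidia": {"float32", "bfloat16", "float16", "float8_e4m3fn", "float8_e5m2"},
    "amd": {"float32", "bfloat16", "float16"},
}
SUPPORTED_SHAPES = {
    "nvidia": {(16, 8, 8), (16, 8, 4), (16, 8, 16), (8, 8, 4), (16, 8, 32)},
    "amd": {(16, 16, 4), (16, 16, 16)},
}


def is_listed(vendor, dtype, shape):
    """True if both dtype and shape appear in the vendor's lists.

    This does not check whether a particular dtype/shape pairing is
    valid; the lists above do not state that.
    """
    dtypes = SUPPORTED_DTYPES.get(vendor, set())
    shapes = SUPPORTED_SHAPES.get(vendor, set())
    return dtype in dtypes and tuple(shape) in shapes
```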
Aliases
- shape_16x16x16 = IndexList(16, 16, 16)
- shape_16x16x4 = IndexList(16, 16, 4)
- shape_16x8x16 = IndexList(16, 8, 16)
- shape_16x8x32 = IndexList(16, 8, 32)
- shape_16x8x4 = IndexList(16, 8, 4)
- shape_16x8x8 = IndexList(16, 8, 8)
- shape_8x8x4 = IndexList(8, 8, 4)
- shape_null = IndexList(0, 0, 0)
Structs
- TensorCore: TensorCore provides an abstraction for GPU tensor core hardware to perform optimized matrix operations.
Functions
- get_fragment_size: Calculates the fragment size per thread for a given MMA shape.
- get_mma_shape: Returns the appropriate matrix multiply-accumulate (MMA) shape for tensor core operations.
- num_matrix_reg: Calculates the number of matrix registers required per thread.
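As a rough illustration of what a per-thread fragment size means, the sketch below divides each operand tile evenly across the threads of one warp. It is a model under stated assumptions, not the implementation of `get_fragment_size`: the helper name `fragment_sizes` is hypothetical, the shape is assumed to be in M×N×K order, and the warp width is assumed to be 32 threads by default (AMD wavefronts are commonly 64).

```python
def fragment_sizes(shape, threads_per_warp=32):
    """Per-thread element counts for the A, B, and C/D fragments.

    Assumes shape is (M, N, K) and that each operand tile is split
    evenly across the threads of one warp.
    """
    m, n, k = shape
    totals = (m * k, n * k, m * n)  # elements in A, B, and C/D
    for total in totals:
        if total % threads_per_warp != 0:
            raise ValueError("tile does not divide evenly across the warp")
    return tuple(total // threads_per_warp for total in totals)


# 16x8x16 tile on a 32-thread warp.
nvidia_fragments = fragment_sizes((16, 8, 16))
# 16x16x4 tile on an assumed 64-thread wavefront.
amd_fragments = fragment_sizes((16, 16, 4), threads_per_warp=64)
```

Under these assumptions a 16×8×16 tile gives each thread 8 elements of A, 4 of B, and 4 of the accumulator. How many registers those elements occupy depends on the element width, which this sketch does not model.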