Mojo module

tma_async

Tensor Memory Accelerator (TMA) Asynchronous Operations Module

Provides high-performance abstractions for NVIDIA's Tensor Memory Accelerator (TMA), enabling efficient asynchronous data movement between global and shared memory in GPU kernels. It is designed for use with NVIDIA Hopper architecture and newer GPUs that support TMA instructions.

Key Components:

TMATensorTile: Core struct that encapsulates a TMA descriptor for efficient data transfers between global and shared memory with various access patterns and optimizations.
SharedMemBarrier: Synchronization primitive for coordinating asynchronous TMA operations, ensuring data transfers complete before dependent operations begin.
PipelineState: Helper struct for managing multi-stage pipeline execution with circular buffer semantics, enabling efficient double or triple buffering techniques.
create_tma_tile: Factory functions for creating optimized TMATensorTile instances with various configurations for different tensor shapes and memory access patterns.
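
The circular buffer semantics mentioned for PipelineState can be illustrated with a small model. The sketch below is plain Python, not the Mojo API: the class name `PipelineStateModel`, its fields, and the `step` and `trace` helpers are hypothetical names introduced only for illustration. It also assumes a phase bit that flips each time the stage index wraps, a convention common in multi-stage GPU pipelines that this page does not confirm.

```python
class PipelineStateModel:
    """Hypothetical model of a circular-buffer pipeline state (not the Mojo API)."""

    def __init__(self, num_stages):
        if num_stages < 1:
            raise ValueError("num_stages must be at least 1")
        self.num_stages = num_stages
        self.index = 0  # buffer slot currently in use
        self.phase = 0  # assumed convention: flips whenever the index wraps
        self.count = 0  # total number of steps taken so far

    def step(self):
        """Advance to the next slot, wrapping around at num_stages."""
        self.count += 1
        self.index += 1
        if self.index == self.num_stages:
            self.index = 0
            self.phase ^= 1


def trace(num_stages, steps):
    """Return the (index, phase) pair observed before each of `steps` steps."""
    state = PipelineStateModel(num_stages)
    seen = []
    for _ in range(steps):
        seen.append((state.index, state.phase))
        state.step()
    return seen
```

With `num_stages = 2` the model alternates between two slots (double buffering), and with `num_stages = 3` it cycles through three (triple buffering). In this model, the phase bit lets a consumer tell the current pass over a slot apart from the previous one.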

`comptime` values

`SplitLastDimTMATensorTile`

comptime SplitLastDimTMATensorTile[rank: Int, //, dtype: DType, smem_shape: IndexList[rank], swizzle_mode: TensorMapSwizzle] = TMATensorTile[dtype, _split_last_layout[dtype](smem_shape, swizzle_mode, True), _ragged_desc_layout[dtype](smem_shape, swizzle_mode)]

A specialized TMA tensor tile type alias for layouts in which the last dimension is split according to the swizzle granularity for optimal memory access patterns. Note that the current implementation does not actually split the last dimension.

Parameters

rank (Int): The number of dimensions of the tensor.
dtype (DType): The data type of the tensor elements.
smem_shape (IndexList): The shape of the tile in shared memory. The last dimension will be padded if necessary to align with the swizzle granularity.
swizzle_mode (TensorMapSwizzle): The swizzling mode for memory access optimization. Determines the granularity at which the last dimension is split or padded.
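
As a rough illustration of padding the last dimension to a swizzle granularity, the sketch below rounds a row up to the next multiple of the swizzle width in bytes. It is a plain Python model, not the Mojo implementation: the helper `padded_last_dim`, the `SWIZZLE_BYTES` table, and the round-up rule itself are assumptions made for illustration. The 32, 64, and 128 byte widths correspond to the swizzle modes that TMA hardware supports.

```python
# Assumed mapping from swizzle mode name to its width in bytes.
SWIZZLE_BYTES = {"none": 0, "32B": 32, "64B": 64, "128B": 128}


def padded_last_dim(last_dim, elem_bytes, swizzle):
    """Return the last dimension (in elements) after rounding the row up
    to a multiple of the swizzle width. Hypothetical helper."""
    granule = SWIZZLE_BYTES[swizzle]
    if granule == 0:
        return last_dim  # no swizzling, so no padding in this model
    row_bytes = last_dim * elem_bytes
    padded_bytes = -(-row_bytes // granule) * granule  # ceiling to a multiple
    return padded_bytes // elem_bytes
```

For example, under these assumptions a row of 48 two-byte elements (96 bytes) with a 128-byte swizzle would be padded to 64 elements, while a row of 64 such elements already fills 128 bytes and stays unchanged.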

Structs

PipelineState: Manages state for a multi-stage pipeline with circular buffer semantics.
RaggedTensorMap: Creates a TMA descriptor that can handle stores with varying lengths. This struct is mainly used for MHA, where sequence lengths may vary between samples.
RaggedTMA3DTile: Creates a TMA descriptor for loading from and storing to ragged 3D arrays with a ragged leading dimension. It loads 2D tiles, indexing into the middle dimension. When using these loads, at least BM * stride of space must be allocated in front of the gmem pointer; otherwise CUDA_ERROR_ILLEGAL_ADDRESS may result.
SharedMemBarrier: A hardware-accelerated synchronization primitive for GPU shared memory operations.
TMATensorTile: A hardware-accelerated tensor memory access (TMA) tile for efficient asynchronous data movement.
TMATensorTileArray: An array of TMA descriptors.
TMATensorTileIm2col: TMA tensor tile with im2col coordinate transformation for convolution.

Functions

create_split_tma: Creates a TMA tensor tile assuming that the first dimension in global memory has UNKNOWN_VALUE.
create_tensor_tile: Creates a TMATensorTile with advanced configuration options for 2D, 3D, 4D, or 5D tensors.
create_tensor_tile_im2col: Creates a TMA tensor tile with im2col transformation for 2D convolution.
create_tma_tile: Creates a TMATensorTile with specified tile dimensions and swizzle mode.
create_tma_tile_template: Same as create_tma_tile, except that the descriptor is only a placeholder or template for later replacement.
