Mojo module
tma_async
Tensor Memory Accelerator (TMA) Asynchronous Operations Module
Provides high-performance abstractions for NVIDIA's Tensor Memory Accelerator (TMA), enabling efficient asynchronous data movement between global and shared memory in GPU kernels. It is designed for use with NVIDIA Hopper architecture and newer GPUs that support TMA instructions.
Key Components:
- TMATensorTile: Core struct that encapsulates a TMA descriptor for efficient data transfers between global and shared memory with various access patterns and optimizations.
- SharedMemBarrier: Synchronization primitive for coordinating asynchronous TMA operations, ensuring data transfers complete before dependent operations begin.
- PipelineState: Helper struct for managing multi-stage pipeline execution with circular buffer semantics, enabling efficient double or triple buffering techniques.
- create_tma_tile: Factory functions for creating optimized TMATensorTile instances with various configurations for different tensor shapes and memory access patterns.
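The circular buffer semantics mentioned for PipelineState can be illustrated with a small model. The sketch below is written in Python purely as a language-neutral illustration; it is not the Mojo API. The class name `ToyPipelineState`, its fields, and the `trace` helper are hypothetical. The phase bit that flips on every wrap-around is an assumption borrowed from common barrier-based pipeline designs, not something this page states.

```python
class ToyPipelineState:
    """Hypothetical model of a multi-stage pipeline cursor (not the Mojo struct)."""

    def __init__(self, num_stages: int) -> None:
        if num_stages < 1:
            raise ValueError("num_stages must be at least 1")
        self.num_stages = num_stages
        self.index = 0  # which buffer slot the pipeline currently points at
        self.phase = 0  # assumed parity bit, flipped each time index wraps

    def step(self) -> None:
        """Advance to the next slot, wrapping around like a circular buffer."""
        self.index += 1
        if self.index == self.num_stages:
            self.index = 0
            self.phase ^= 1


def trace(num_stages: int, steps: int) -> list:
    """Record (index, phase) before each of `steps` advances."""
    state = ToyPipelineState(num_stages)
    seen = []
    for _ in range(steps):
        seen.append((state.index, state.phase))
        state.step()
    return seen


if __name__ == "__main__":
    print(trace(2, 5))  # double buffering: slots alternate 0, 1, 0, 1, 0
    print(trace(3, 4))  # triple buffering: slots cycle 0, 1, 2, 0
```

With two stages the slot index alternates between 0 and 1, which is the double-buffering pattern: one slot can be filled by an asynchronous copy while the other is consumed.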
comptime values
SplitLastDimTMATensorTile
comptime SplitLastDimTMATensorTile[rank: Int, //, dtype: DType, smem_shape: IndexList[rank], swizzle_mode: TensorMapSwizzle] = TMATensorTile[dtype, _split_last_layout[dtype](smem_shape, swizzle_mode, True), _ragged_desc_layout[dtype](smem_shape, swizzle_mode)]
A specialized TMA tensor tile type alias for layouts where the last dimension is split based on swizzle granularity for optimal memory access patterns. Currently, the last dimension is not actually split.
Parameters
- rank (Int): The number of dimensions of the tensor.
- dtype (DType): The data type of the tensor elements.
- smem_shape (IndexList): The shape of the tile in shared memory. The last dimension will be padded if necessary to align with the swizzle granularity.
- swizzle_mode (TensorMapSwizzle): The swizzling mode for memory access optimization. Determines the granularity at which the last dimension is split or padded.
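One way to read "padded to align with the swizzle granularity" is that the last dimension is rounded up to a whole number of swizzle-width spans. The Python sketch below shows that arithmetic only. The function `pad_last_dim` and its parameters are hypothetical, and the rounding rule is an assumption; the library's actual layout logic may differ.

```python
def pad_last_dim(shape, elem_bytes, swizzle_bytes):
    """Round the last dimension up to a multiple of the swizzle span.

    shape:         tile shape as a sequence of ints
    elem_bytes:    size of one element in bytes (e.g. 2 for a 16-bit type)
    swizzle_bytes: swizzle width in bytes (e.g. 32, 64, 128), or 0 for none
    """
    if swizzle_bytes == 0:
        # No swizzling: assume there is nothing to align to.
        return tuple(shape)
    if swizzle_bytes % elem_bytes != 0:
        raise ValueError("swizzle width must be a multiple of the element size")
    granule = swizzle_bytes // elem_bytes  # elements per swizzle span
    last = shape[-1]
    padded = -(-last // granule) * granule  # ceiling to a multiple of granule
    return tuple(shape[:-1]) + (padded,)


if __name__ == "__main__":
    # 16-bit elements with a 128-byte swizzle span: 64 elements per span.
    print(pad_last_dim((64, 100), 2, 128))
    print(pad_last_dim((64, 64), 2, 128))
```

Under this assumption, a last dimension that is already a multiple of the span is left unchanged, and only ragged widths grow.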
Structs
- PipelineState: Manages state for a multi-stage pipeline with circular buffer semantics.
- RaggedTensorMap: Creates a TMA descriptor that can handle stores with varying lengths. This struct is mainly used for MHA, where sequence lengths may vary between samples.
- SharedMemBarrier: A hardware-accelerated synchronization primitive for GPU shared memory operations.
- TMATensorTile: A hardware-accelerated tensor memory access (TMA) tile for efficient asynchronous data movement.
- TMATensorTileArray: An array of TMA descriptors.
Functions
- create_split_tma: Creates a TMA tensor tile assuming that the first dimension in global memory has UNKNOWN_VALUE.
- create_tma_tile: Creates a TMATensorTile with specified tile dimensions and swizzle mode.
- create_tma_tile_template: Same as create_tma_tile, except the descriptor is only a placeholder or a template for later replacement.