Mojo module
tma_async
Tensor Memory Accelerator (TMA) Asynchronous Operations Module
Provides high-performance abstractions for NVIDIA's Tensor Memory Accelerator (TMA), enabling efficient asynchronous data movement between global and shared memory in GPU kernels. It is designed for use with NVIDIA Hopper architecture and newer GPUs that support TMA instructions.
Key Components:

- `TMATensorTile`: Core struct that encapsulates a TMA descriptor for efficient data transfers between global and shared memory, with various access patterns and optimizations.
- `SharedMemBarrier`: Synchronization primitive for coordinating asynchronous TMA operations, ensuring data transfers complete before dependent operations begin.
- `PipelineState`: Helper struct for managing multi-stage pipeline execution with circular buffer semantics, enabling efficient double- or triple-buffering techniques.
- `create_tma_tile`: Factory functions for creating optimized `TMATensorTile` instances with various configurations for different tensor shapes and memory access patterns.
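The circular-buffer bookkeeping behind a multi-stage pipeline can be sketched with a small host-side model. This is an illustrative Python mock, not the Mojo `PipelineState` API: the names `index`, `phase`, and `step` are assumptions chosen to show the idea that the stage index wraps around a fixed number of buffers while a parity bit flips on each wraparound, which is what lets waiters distinguish iteration *k* from iteration *k + num_stages* on the same buffer.

```python
# Hypothetical model of circular pipeline-state bookkeeping (illustrative
# names; not the actual Mojo PipelineState interface).
class PipelineStateModel:
    def __init__(self, num_stages: int) -> None:
        self.num_stages = num_stages
        self.index = 0  # which buffer stage to use next
        self.phase = 0  # parity bit, flips each time the index wraps around

    def step(self) -> None:
        self.index += 1
        if self.index == self.num_stages:
            self.index = 0
            self.phase ^= 1  # wraparound toggles the phase

# With two stages (double buffering), the (index, phase) pairs cycle so that
# consecutive visits to the same stage carry opposite phases.
state = PipelineStateModel(num_stages=2)
trace = []
for _ in range(4):
    trace.append((state.index, state.phase))
    state.step()
# trace == [(0, 0), (1, 0), (0, 1), (1, 1)]
```

In a real kernel the phase bit is what gets passed to the barrier wait, so a consumer re-entering stage 0 does not mistake last iteration's completed transfer for this iteration's.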
Structs

- `PipelineState`: Manages state for a multi-stage pipeline with circular buffer semantics.
- `SharedMemBarrier`: A hardware-accelerated synchronization primitive for GPU shared memory operations.
- `TMATensorTile`: A hardware-accelerated tensor memory access (TMA) tile for efficient asynchronous data movement.
- `TMATensorTileArray`: An array of TMA descriptors.
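The barrier's role in gating consumers on transfer completion can be modeled on the host. The sketch below is an assumption-laden Python mock, not the `SharedMemBarrier` API: the real primitive is a hardware mbarrier in shared memory, and the method names (`expect_bytes`, `arrive`, `ready`) are illustrative. It shows the typical arrive/wait pattern: a producer announces the expected transaction size, completed transfers decrement it, and the barrier's phase flips only once all expected bytes have landed.

```python
# Hypothetical model of an arrive/wait barrier with phase parity
# (illustrative only; the real primitive lives in GPU shared memory).
class BarrierModel:
    def __init__(self) -> None:
        self.phase = 0          # parity bit observed by waiters
        self.pending_bytes = 0  # bytes still in flight for this phase

    def expect_bytes(self, n: int) -> None:
        """Producer announces how many bytes the pending transfers will deposit."""
        self.pending_bytes += n

    def arrive(self, n: int) -> None:
        """A transfer of n bytes completed; the phase flips once all data landed."""
        self.pending_bytes -= n
        if self.pending_bytes == 0:
            self.phase ^= 1

    def ready(self, expected_phase: int) -> bool:
        """Consumer polls: True once the barrier has moved past expected_phase."""
        return self.phase != expected_phase

barrier = BarrierModel()
barrier.expect_bytes(1024)   # e.g. one 16x16 tile of float32
barrier.arrive(512)          # first half of the transfer completes
first = barrier.ready(0)     # False: consumer must keep waiting
barrier.arrive(512)          # second half completes, phase flips
second = barrier.ready(0)    # True: dependent work may proceed
```

This is why the phase bit from the pipeline state is threaded into the wait: the consumer blocks until the barrier's phase has advanced past the phase it captured when the transfer was issued.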
Functions

- `create_tma_tile`: Creates a `TMATensorTile` with the specified tile dimensions and swizzle mode.