Mojo module
tile_loader
TileLoader module for efficient tile loading in GPU matrix multiplication.
This module provides utilities for loading matrix tiles from global memory to shared memory using two different mechanisms:
- TMA (Tensor Memory Accelerator): Hardware-accelerated loads that can efficiently transfer 2D tiles with multicast support for multi-block clusters.
- cp.async: Software-based asynchronous copy instructions with manual bounds checking and swizzling for optimal shared memory access patterns.
The TileLoader struct abstracts these loading mechanisms to provide a unified interface for the matmul kernel's producer threads.
Structs
- CPAsyncBarrierHandler: The cp.async barrier handler: noop on prepare, arrives on complete.
- TileLoaderCPAsync: Software-based tile loader using cp.async instructions.
- TileLoaderTMA: TMA-based tile loader for hardware-accelerated memory transfers.
- TMABarrierHandler: TMA barrier handler: sets expected bytes on prepare, noop on complete.
Traits
- BarrierHandler: Handles barrier lifecycle for different transfer mechanisms.
- TileLoader: Base trait for tile loading mechanisms in matrix multiplication.
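
The prepare/complete split described for the two barrier handlers can be illustrated with a small model. The following is a conceptual Python sketch, not the Mojo API: `ToyBarrier`, `TMAHandlerSketch`, `CPAsyncHandlerSketch`, and `produce_stage` are hypothetical names, and the method signatures are assumptions, since this page does not show the real ones. It only mirrors the documented behavior: the TMA handler sets expected bytes on prepare and does nothing on complete, while the cp.async handler does nothing on prepare and arrives on complete.

```python
from dataclasses import dataclass


@dataclass
class ToyBarrier:
    """Toy stand-in for a shared-memory barrier (hypothetical, not the Mojo type)."""

    expected_bytes: int = 0
    arrivals: int = 0

    def expect_bytes(self, nbytes: int) -> None:
        # Record how many bytes the barrier should wait for.
        self.expected_bytes += nbytes

    def arrive(self) -> None:
        # Record one explicit arrival from a producer.
        self.arrivals += 1


class TMAHandlerSketch:
    """Sets expected bytes on prepare; no-op on complete."""

    def prepare(self, barrier: ToyBarrier, nbytes: int) -> None:
        barrier.expect_bytes(nbytes)

    def complete(self, barrier: ToyBarrier) -> None:
        pass  # nothing to do in this model


class CPAsyncHandlerSketch:
    """No-op on prepare; arrives on complete."""

    def prepare(self, barrier: ToyBarrier, nbytes: int) -> None:
        pass  # nothing to announce up front in this model

    def complete(self, barrier: ToyBarrier) -> None:
        barrier.arrive()


def produce_stage(handler, barrier: ToyBarrier, nbytes: int) -> ToyBarrier:
    """One producer step: prepare, (issue copies), complete."""
    handler.prepare(barrier, nbytes)
    # ... the tile copies would be issued here ...
    handler.complete(barrier)
    return barrier
```

Because both handlers expose the same two calls, a producer loop like `produce_stage` does not need to know which transfer mechanism is in use, which is presumably the point of the BarrierHandler trait.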
Functions
- async_copy_with_bound_check: Helper function for cp.async with boundary checking.
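
To make the idea of a bounds-checked tile copy concrete, here is a minimal Python sketch. It is not the signature or implementation of `async_copy_with_bound_check`, which this page does not show; the function name `copy_tile_with_bound_check`, its parameters, and the choice to fill out-of-range elements with zero are all assumptions made for illustration.

```python
def copy_tile_with_bound_check(src, rows, cols, row0, col0,
                               tile_rows, tile_cols, fill=0):
    """Copy a tile out of a row-major matrix stored in the flat list `src`.

    Elements that fall outside the `rows` x `cols` matrix are replaced by
    `fill` (zero fill is an assumption of this sketch).
    """
    tile = []
    for r in range(tile_rows):
        out_row = []
        for c in range(tile_cols):
            gr, gc = row0 + r, col0 + c  # global row and column
            if gr < rows and gc < cols:
                out_row.append(src[gr * cols + gc])
            else:
                out_row.append(fill)  # out of bounds: do not read memory
        tile.append(out_row)
    return tile


# A 3x3 matrix holding 1..9; a 2x2 tile at (2, 2) overhangs the edge.
matrix = list(range(1, 10))
edge_tile = copy_tile_with_bound_check(matrix, 3, 3, 2, 2, 2, 2)
# edge_tile == [[9, 0], [0, 0]]
```

The check matters for tiles on the last row or column of a matrix whose dimensions are not a multiple of the tile size, where an unchecked copy would read past the end of the data.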