Mojo module
tile_loader
TileLoader module for efficient tile loading in GPU matrix multiplication.
This module provides utilities for loading matrix tiles from global memory to shared memory using two different mechanisms:
- TMA (Tensor Memory Accelerator): Hardware-accelerated loads that can efficiently transfer 2D tiles with multicast support for multi-block clusters.
- cp.async: Software-based asynchronous copy instructions with manual bounds checking and swizzling for optimal shared memory access patterns.
The TileLoader trait abstracts these loading mechanisms to provide a unified interface for the matmul kernel's producer threads.
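To make the swizzling idea concrete, here is a minimal Python sketch of one common scheme, XOR-based column swizzling. This page does not state which swizzle pattern the module uses, so the function name `swizzle_column`, the 8-column width, and the XOR rule are all assumptions chosen for illustration, not the module's API.

```python
def swizzle_column(row: int, col: int, num_cols: int = 8) -> int:
    """Hypothetical XOR swizzle: remap a column index based on its row.

    num_cols is assumed to be a power of two, so the XOR result
    stays inside the range [0, num_cols).
    """
    return col ^ (row % num_cols)


# Without swizzling, column 0 of every row would map to the same slot.
# With the XOR remap, column 0 of rows 0..7 lands in 8 distinct slots.
slots = [swizzle_column(row, 0) for row in range(8)]
print(slots)
```

Within any single row the remap is a permutation of the column indices, so no data is lost; only the placement changes, which is what lets column-wise accesses spread across different shared memory banks.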
Structs
- TileLoaderCPAsync: Software-based tile loader using cp.async instructions.
- TileLoaderTMA: TMA-based tile loader for hardware-accelerated memory transfers.
Traits
- TileLoader: Base trait for tile loading mechanisms in matrix multiplication.
Functions
- async_copy_with_bound_check: Helper function for cp.async with boundary checking.
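The boundary-checking idea can be illustrated with a small CPU-side Python model. This is a sketch of the general technique only: the name `copy_tile_with_bound_check`, its parameters, and the choice to zero-fill out-of-bounds elements are assumptions, and it does not reflect the actual signature or behavior of `async_copy_with_bound_check`.

```python
def copy_tile_with_bound_check(src, row0, col0, tile_rows, tile_cols, fill=0):
    """Hypothetical model of a bounds-checked tile copy.

    Copies a tile_rows x tile_cols tile starting at (row0, col0) from a
    2D list. Elements past the matrix edge are replaced with `fill`
    (zero-fill is an assumption). row0 and col0 are assumed non-negative.
    """
    n_rows = len(src)
    n_cols = len(src[0]) if src else 0
    tile = []
    for r in range(row0, row0 + tile_rows):
        row = []
        for c in range(col0, col0 + tile_cols):
            in_bounds = r < n_rows and c < n_cols
            row.append(src[r][c] if in_bounds else fill)
        tile.append(row)
    return tile


matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]

# A 2x2 tile anchored at the bottom-right corner overhangs the matrix.
edge_tile = copy_tile_with_bound_check(matrix, 2, 2, 2, 2)
print(edge_tile)  # [[9, 0], [0, 0]]
```

Padding with a neutral value keeps the tile shape fixed, so a matmul over the padded tile adds nothing for the out-of-range positions when matrix dimensions are not multiples of the tile size.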