Mojo module

tile_loader

TileLoader module for efficient tile loading in GPU matrix multiplication.

This module provides utilities for loading matrix tiles from global memory to shared memory using two different mechanisms:

  1. TMA (Tensor Memory Accelerator): Hardware-accelerated loads that can efficiently transfer 2D tiles with multicast support for multi-block clusters.

  2. cp.async: Software-based asynchronous copy instructions with manual bounds checking and swizzling for optimal shared memory access patterns.

The TileLoader trait abstracts these loading mechanisms to provide a unified interface for the matmul kernel's producer threads.
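To make the idea of one interface over two loading mechanisms concrete, here is a minimal CPU-side sketch in Python. It is a conceptual model only and not this module's Mojo API: the names `TileLoaderModel`, `BulkTileLoader`, `ElementwiseTileLoader`, `swizzled_col`, and `logical_view` are hypothetical. Zero-filling out-of-range elements and the XOR swizzle pattern are assumptions of this sketch.

```python
from abc import ABC, abstractmethod


def swizzled_col(row, col, tile_cols):
    """XOR swizzle of the column index (assumes tile_cols is a power of two)."""
    return col ^ (row % tile_cols)


class TileLoaderModel(ABC):
    """Hypothetical stand-in for a common tile-loading interface."""

    @abstractmethod
    def load_tile(self, matrix, row0, col0, tile_rows, tile_cols):
        """Copy a tile starting at (row0, col0) and return its storage rows."""

    def read(self, tile, row, col):
        """Map a logical (row, col) to the stored element."""
        return tile[row][col]


class BulkTileLoader(TileLoaderModel):
    """Models a TMA-style load: one request describes the whole 2D tile.

    Out-of-range elements are zero-filled (an assumption of this sketch).
    """

    def load_tile(self, matrix, row0, col0, tile_rows, tile_cols):
        n_rows = len(matrix)
        n_cols = len(matrix[0]) if matrix else 0
        return [
            [
                matrix[r][c] if r < n_rows and c < n_cols else 0
                for c in range(col0, col0 + tile_cols)
            ]
            for r in range(row0, row0 + tile_rows)
        ]


class ElementwiseTileLoader(TileLoaderModel):
    """Models a cp.async-style load: per-element copies with manual bounds
    checks, stored at a swizzled column position."""

    def load_tile(self, matrix, row0, col0, tile_rows, tile_cols):
        n_rows = len(matrix)
        n_cols = len(matrix[0]) if matrix else 0
        tile = [[0] * tile_cols for _ in range(tile_rows)]
        for r in range(tile_rows):
            for c in range(tile_cols):
                src_r, src_c = row0 + r, col0 + c
                # Manual bounds check: skip elements outside the matrix.
                if src_r < n_rows and src_c < n_cols:
                    tile[r][swizzled_col(r, c, tile_cols)] = matrix[src_r][src_c]
        return tile

    def read(self, tile, row, col):
        # Apply the same swizzle on reads to recover the logical element.
        return tile[row][swizzled_col(row, col, len(tile[row]))]


def logical_view(loader, tile):
    """Read a tile back in logical order through the loader's interface."""
    return [[loader.read(tile, r, c) for c in range(len(tile[r]))]
            for r in range(len(tile))]
```

In this model a consumer only calls `load_tile` and `read`, so it does not need to know whether the tile arrived through the bulk path or the element-wise path. Both loaders give the same logical tile even though the element-wise loader stores it in a permuted layout.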

Structs

Traits

  • TileLoader: Base trait for tile loading mechanisms in matrix multiplication.

Functions
