Mojo module

tensor_core_async

Tensor Core Async Module

This module provides high-performance abstractions for utilizing NVIDIA's Tensor Cores to perform asynchronous matrix multiplication operations. It implements optimized memory layouts and access patterns for efficient tensor core computations.

Key components:

  • Layout creation functions for K-major and MN-major memory arrangements
  • Swizzling support for improved memory access patterns
  • WGMMA (Warp Group Matrix Multiply-Accumulate) descriptor generation
  • TensorCoreAsync struct with methods for asynchronous matrix multiplication

The module supports various data types, matrix dimensions, and memory configurations, enabling efficient implementation of deep learning primitives and other tensor operations that can leverage hardware acceleration.
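
As a rough illustration of how the components listed above fit together, the sketch below builds K-major shared-memory layouts for the operand tiles and parameterizes a TensorCoreAsync instance for a bf16 block tile. This is a minimal sketch only: the import paths and the exact parameter lists of tile_layout_k_major and TensorCoreAsync shown here are assumptions, so treat the Aliases, Structs, and Functions listings below as the authoritative reference.

```mojo
from layout.tensor_core_async import TensorCoreAsync, tile_layout_k_major
from utils.index import Index

# Block tile sizes chosen for illustration.
alias BM = 64    # rows of the A / C tile
alias BN = 128   # columns of the B / C tile
alias BK = 64    # shared K extent per iteration

# K-major shared-memory layouts for the A and B operand tiles; a swizzle
# mode could be supplied as an additional parameter to improve shared-memory
# bank utilization.
alias a_smem_layout = tile_layout_k_major[DType.bfloat16, BM, BK]()
alias b_smem_layout = tile_layout_k_major[DType.bfloat16, BN, BK]()

# TensorCoreAsync bundles the accumulator/operand types and the WGMMA
# instruction shape; its methods build WGMMA descriptors over the
# shared-memory tiles and issue the asynchronous multiply-accumulates.
# Parameter order and types here are assumptions for illustration.
alias MmaOp = TensorCoreAsync[
    DType.float32,       # accumulator (C) type
    DType.bfloat16,      # A operand type
    DType.bfloat16,      # B operand type
    Index(64, BN, 16),   # WGMMA instruction shape (M, N, K) for bf16 inputs
]
```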

Performance features:

  • Asynchronous execution model to overlap computation and memory access (sketched after this list)
  • Support for different swizzling modes to optimize memory bandwidth
  • Efficient register and shared memory utilization
  • Support for multi-warp group execution
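
The overlap mentioned in the first point follows the usual WGMMA commit/wait pattern: issue asynchronous MMAs against shared-memory tiles, batch them into a group, and only block when the accumulator values are actually needed. The loop below is a minimal sketch under assumed names: mma, commit_group, and wait_group mirror that model but are not verbatim signatures, and prefetch_tile, a_smem_tile, b_smem_tile, and c_reg_tile are hypothetical placeholders.

```mojo
# Inside a GPU kernel's main loop over K tiles (surrounding setup omitted).
for k_tile in range(num_k_tiles):
    # Kick off the async copy of the next K tile into shared memory so the
    # load runs while the tensor cores consume the current tile.
    prefetch_tile(k_tile + 1)  # hypothetical helper

    # Issue asynchronous MMAs over the current shared-memory tiles; the
    # call returns without blocking and accumulates into registers.
    mma_op.mma(a_smem_tile, b_smem_tile, c_reg_tile)

    # Batch the issued MMAs into a group and wait only when their results
    # are needed, keeping compute and memory traffic in flight together.
    mma_op.commit_group()
    mma_op.wait_group()
```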

This implementation is specifically optimized for NVIDIA GPUs with Tensor Core support.

Aliases

  • WGMMA_K_BYTES = 32
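
Assuming, as the name suggests, that WGMMA_K_BYTES is the K extent of a single WGMMA operand tile measured in bytes, the K extent in elements follows by dividing by the element size:

```mojo
from sys import sizeof

# Assumption: WGMMA_K_BYTES is the per-instruction K extent in bytes, so
# dividing by the element size gives the K extent in elements.
alias WGMMA_K_BYTES = 32
alias K_BF16 = WGMMA_K_BYTES // sizeof[DType.bfloat16]()  # 32 / 2 = 16 elements
alias K_FP32 = WGMMA_K_BYTES // sizeof[DType.float32]()   # 32 / 4 = 8 elements
```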

Structs

  • TensorCoreAsync: High-performance asynchronous tensor core operations for matrix multiplication.

Functions