Mojo function

cp_async_k_major

cp_async_k_major[type: DType, eviction_policy: CacheEviction = CacheEviction(0)](dst: LayoutTensor[type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])

Asynchronously copy data from DRAM to SRAM using TMA (Tensor Memory Accelerator) with K-major layout.

This function performs an asynchronous copy operation from global memory (DRAM) to shared memory (SRAM) using NVIDIA's Tensor Memory Accelerator (TMA) hardware. It optimizes for K-major memory access patterns, which are particularly beneficial for tensor operations such as matrix multiplication, where the inner (K) dimension is accessed contiguously.

The function automatically determines the optimal tile size and thread distribution based on the tensor shapes and hardware capabilities, leveraging TMA's efficient memory transfer mechanisms.

Example:

from gpu.memory import AddressSpace, async_copy_wait_all
from layout import Layout, LayoutTensor
from layout.layout_tensor import cp_async_k_major

# Source tile in global memory (DRAM).
var global_data = LayoutTensor[DType.float32, Layout.row_major(128, 128),
    address_space = AddressSpace.GLOBAL]()
# Destination tile in shared memory (SRAM).
var shared_data = LayoutTensor[DType.float32, Layout.row_major(32, 32),
    address_space = AddressSpace.SHARED]()

# Copy data with K-major layout optimization
cp_async_k_major[DType.float32](shared_data, global_data)

# Wait for the asynchronous copy to complete before reading shared_data
async_copy_wait_all()
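
In practice the copy is issued from inside a GPU kernel, where the shared-memory destination is typically stack-allocated. The following is a minimal sketch under that assumption; the kernel name, tile shape, and the use of stack_allocation() with MutableAnyOrigin are illustrative choices, not requirements stated by this API:

from gpu.memory import AddressSpace, async_copy_wait_all
from gpu.sync import barrier
from layout import Layout, LayoutTensor
from layout.layout_tensor import cp_async_k_major

alias dtype = DType.float32
alias tile_layout = Layout.row_major(32, 32)

fn load_tile_kernel(
    global_tile: LayoutTensor[dtype, tile_layout, MutableAnyOrigin],
):
    # Allocate the destination tile in shared memory (SRAM).
    var shared_tile = LayoutTensor[
        dtype,
        tile_layout,
        MutableAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # Issue the asynchronous DRAM -> SRAM copy with K-major layout.
    cp_async_k_major[dtype](shared_tile, global_tile)

    # Block until the copy has landed, then make the tile visible to
    # every thread in the block before it is read.
    async_copy_wait_all()
    barrier()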

Performance:

  • Uses TMA hardware acceleration for optimal memory transfer performance.
  • Optimizes for K-major access patterns, which can significantly improve performance for certain tensor operations like matrix multiplications.
  • Performs asynchronous transfers, allowing computation to overlap with memory operations (see the sketch after this list).
  • Automatically determines optimal tile sizes based on tensor dimensions.
  • Uses hardware-accelerated swizzling to reduce shared memory bank conflicts.
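
As a rough illustration of the overlap point above, independent work can be placed between issuing the copy and waiting on it. shared_tile, global_tile, and compute_on_previous_tile() are placeholder names for this sketch only:

from gpu.memory import async_copy_wait_all
from gpu.sync import barrier

# Issue the asynchronous copy; it proceeds in the background.
cp_async_k_major[DType.float32](shared_tile, global_tile)

# Hypothetical work that does not touch shared_tile overlaps the transfer.
compute_on_previous_tile()

# Only synchronize once the copied data is actually needed.
async_copy_wait_all()
barrier()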

Notes:

  • This function requires NVIDIA GPUs with TMA support (compute capability 9.0+).
  • The source tensor must be in GENERIC or GLOBAL address space (DRAM).
  • The destination tensor must be in SHARED address space (SRAM).
  • Both tensors must have the same data type.
  • This function is asynchronous, so you must call async_copy_wait_all() or async_copy_wait_group() to ensure the copy has completed before using the data (see the sketch after these notes).
  • K-major layout is particularly beneficial for matrix multiplication operations where the inner dimension (K) is accessed contiguously.
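
For the wait requirement noted above, async_copy_wait_all() is the simplest option; async_copy_wait_group() lets a later copy stay in flight while earlier ones are consumed. The double-buffered loop below is only a hedged sketch: smem_tiles, gmem_tile(), mma_on(), and the use of async_copy_commit_group() to delimit copy groups are assumptions for illustration, not behavior guaranteed by this page:

from gpu.memory import async_copy_commit_group, async_copy_wait_group
from gpu.sync import barrier

# Prefetch the first K tile before the main loop (placeholder names).
cp_async_k_major[DType.float32](smem_tiles[0], gmem_tile(0))
async_copy_commit_group()

for k in range(num_k_tiles):
    # Prefetch the next tile into the other buffer while tile k is used.
    if k + 1 < num_k_tiles:
        cp_async_k_major[DType.float32](
            smem_tiles[(k + 1) % 2], gmem_tile(k + 1)
        )
        async_copy_commit_group()

    # Allow at most the newest group (the prefetch) to remain in flight;
    # this guarantees the copy for tile k has completed.
    async_copy_wait_group(1)
    barrier()

    mma_on(smem_tiles[k % 2])  # placeholder compute on the ready tile
    barrier()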

Constraints:

  • Requires NVIDIA GPUs with TMA support (compute capability 9.0+).
  • Source tensor must be in GENERIC or GLOBAL address space.
  • Destination tensor must be in SHARED address space.
  • Both tensors must have the same data type.
  • Source and destination tensors must be 2D.

Parameters:

  • type (DType): The data type of the tensor elements.
  • eviction_policy (CacheEviction): The cache eviction policy to use. Default is CacheEviction.EVICT_NORMAL (a sketch of passing a non-default policy appears after the Args section below).

Args:

  • dst (LayoutTensor[type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]): The destination tensor, which must be in shared memory (SRAM).
  • src (LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]): The source tensor, which must be in global or generic memory (DRAM).
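
For the eviction_policy parameter, a non-default policy can be passed as a keyword parameter. A minimal sketch, assuming CacheEviction is imported from gpu.memory and reusing the tensors from the example above; EVICT_FIRST is shown purely to illustrate a non-default choice:

from gpu.memory import CacheEviction

# Hint that the source data will not be reused soon, so its cache lines
# may be evicted first (sketch; actual behavior depends on the hardware).
cp_async_k_major[DType.float32, eviction_policy = CacheEviction.EVICT_FIRST](
    shared_data, global_data
)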