
Mojo function

load_AB_cuda_core

def load_AB_cuda_core[a_type: DType, b_type: DType, a_dim0: Int, a_dim1: Int, a_num_tiles: Int, a_swizzle_bytes: Int, b_dim0: Int, b_dim1: Int, b_num_tiles: Int, b_swizzle_bytes: Int, num_pipeline_stages: Int, //, *, K_actual: Int, cta_group: Int = Int(1), a_swizzle: TensorMapSwizzle = TensorMapSwizzle.SWIZZLE_32B, b_swizzle: TensorMapSwizzle = TensorMapSwizzle.SWIZZLE_32B, a_gmem_layout: Layout = Layout.row_major(Int(1), Int(1)), b_gmem_layout: Layout = Layout.row_major(Int(1), Int(1))](a_gmem: LayoutTensor[a_type, a_gmem_layout, ImmutAnyOrigin], b_gmem: LayoutTensor[b_type, b_gmem_layout, ImmutAnyOrigin], expert_ids: UnsafePointer[Int32], a_smem_tiles: SMemTileArray2D[a_type, a_dim0, a_dim1, a_num_tiles, a_swizzle_bytes], b_smem_tiles: SMemTileArray2D[b_type, b_dim0, b_dim1, b_num_tiles, b_swizzle_bytes], mma_mbar: UnsafePointer[SharedMemBarrier, address_space=AddressSpace.SHARED], tma_mbar: UnsafePointer[SharedMemBarrier, address_space=AddressSpace.SHARED], producer_phase: PipelineState[num_pipeline_stages], peer_cta_coord: Tuple[Int, Int, Int], work_tile_coord: Tuple[Int, Int], iter_idx: UInt32, scheduler: TileScheduler[static_MN=scheduler.static_MN, tile_shape=scheduler.tile_shape, cluster=scheduler.cluster, cta_group=scheduler.cta_group, swizzle=scheduler.swizzle, swapAB=scheduler.swapAB])

CUDA core fallback for load_AB, used when a row along K spans fewer than 16 bytes (K * sizeof(element) < 16).

Copies [BM, BK] and [BN, BK] tiles from gmem LayoutTensors into swizzled smem, zero-filling columns where k >= K_actual.
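
The zero-filling copy described above can be modeled on the host as a minimal sketch. This is plain Python, not the Mojo kernel: the helper names `needs_cuda_core_fallback` and `copy_tile_zero_fill` are hypothetical, the source is assumed to be a row-major 2D list, all rows are assumed in bounds, and the shared-memory swizzle is deliberately left out so that only the `k >= K_actual` padding is shown.

```python
def needs_cuda_core_fallback(k_actual: int, elem_bytes: int) -> bool:
    """Hypothetical check: True when one row of K elements is under 16 bytes."""
    return k_actual * elem_bytes < 16


def copy_tile_zero_fill(src, row0, rows, bk, k0, k_actual):
    """Copy a [rows, bk] tile whose top-left corner is (row0, k0).

    `src` is a row-major 2D list standing in for a gmem tensor.
    Any column whose global index k is >= k_actual is written as 0,
    so the destination tile is always fully populated.
    Swizzling of the destination is NOT modeled here.
    """
    tile = []
    for r in range(rows):
        out_row = []
        for c in range(bk):
            k = k0 + c  # global column index along K
            out_row.append(src[row0 + r][k] if k < k_actual else 0)
        tile.append(out_row)
    return tile


# A has K_actual = 3 columns, but the tile width BK is 8.
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
a_tile = copy_tile_zero_fill(a, row0=0, rows=2, bk=8, k0=0, k_actual=3)
```

With 2-byte elements, `needs_cuda_core_fallback(3, 2)` is true (6 bytes per row), while `needs_cuda_core_fallback(8, 2)` is false (exactly 16 bytes). The same copy would be applied to the B operand with BN rows in place of BM.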