Mojo function
load_AB_cuda_core
```mojo
load_AB_cuda_core[
    a_type: DType,
    b_type: DType,
    a_dim0: Int,
    a_dim1: Int,
    a_num_tiles: Int,
    a_swizzle_bytes: Int,
    b_dim0: Int,
    b_dim1: Int,
    b_num_tiles: Int,
    b_swizzle_bytes: Int,
    num_pipeline_stages: UInt, //,
    *,
    K_actual: Int,
    cta_group: Int = 1,
    a_swizzle: TensorMapSwizzle = TensorMapSwizzle.SWIZZLE_32B,
    b_swizzle: TensorMapSwizzle = TensorMapSwizzle.SWIZZLE_32B,
    a_gmem_layout: Layout = Layout.row_major(VariadicList(1, 1)),
    b_gmem_layout: Layout = Layout.row_major(VariadicList(1, 1)),
](
    a_gmem: LayoutTensor[a_type, a_gmem_layout, ImmutAnyOrigin],
    b_gmem: LayoutTensor[b_type, b_gmem_layout, ImmutAnyOrigin],
    expert_ids: UnsafePointer[Int32, ImmutAnyOrigin],
    a_smem_tiles: SMemTileArray2D[a_type, a_dim0, a_dim1, a_num_tiles, a_swizzle_bytes],
    b_smem_tiles: SMemTileArray2D[b_type, b_dim0, b_dim1, b_num_tiles, b_swizzle_bytes],
    mma_mbar: UnsafePointer[SharedMemBarrier, MutAnyOrigin, address_space=AddressSpace.SHARED],
    tma_mbar: UnsafePointer[SharedMemBarrier, MutAnyOrigin, address_space=AddressSpace.SHARED],
    producer_phase: PipelineState[Int.__init__[UInt](num_pipeline_stages)],
    peer_cta_coord: Tuple[UInt, UInt, UInt],
    work_tile_coord: Tuple[UInt, UInt],
    iter_idx: UInt32,
    scheduler: TileScheduler[static_MN=scheduler.static_MN, tile_shape=scheduler.tile_shape, cluster=scheduler.cluster, cta_group=scheduler.cta_group, swizzle=scheduler.swizzle, swapAB=scheduler.swapAB],
)
```
CUDA core fallback for `load_AB`, used when `K * sizeof(dtype)` is less than 16 bytes (too small for a TMA transfer).

Copies `[BM, BK]` and `[BN, BK]` tiles from the `a_gmem` and `b_gmem` `LayoutTensor`s into swizzled shared memory, zero-filling columns where `k >= K_actual`.
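The zero-filling behavior can be illustrated with a minimal sketch in plain Python (not the Mojo API: `load_tile`, `gmem`, `row0`, and `k0` are hypothetical names introduced here; swizzling, pipelining, and the shared-memory barriers are omitted for clarity):

```python
# Conceptual illustration of the tile copy: read a [BM, BK] tile from a
# global matrix starting at (row0, k0), zero-filling any column whose
# global k index is at or beyond K_actual.

def load_tile(gmem, row0, k0, BM, BK, K_actual):
    """Return a BM x BK tile, zero-filled where the global k >= K_actual."""
    tile = [[0] * BK for _ in range(BM)]
    for i in range(BM):
        for j in range(BK):
            k = k0 + j
            if k < K_actual:          # valid column: copy from global memory
                tile[i][j] = gmem[row0 + i][k]
            # else: leave as zero (out of bounds along K)
    return tile

# Example: a 4x6 global matrix with K_actual = 6; a 2x4 tile at k0 = 4
# straddles the K boundary, so its last two columns are zero-filled.
gmem = [[r * 10 + c for c in range(6)] for r in range(4)]
tile = load_tile(gmem, row0=0, k0=4, BM=2, BK=4, K_actual=6)
# tile == [[4, 5, 0, 0], [14, 15, 0, 0]]
```

Zero-filling the out-of-range columns lets the subsequent MMA operate on full `[BM, BK]` tiles without affecting the accumulated result.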