Mojo function
blackwell_matmul_tma_umma_warp_specialized
blackwell_matmul_tma_umma_warp_specialized[transpose_b: Bool, *, config: MatmulConfig[config.a_type, config.b_type, config.c_type, transpose_b], elementwise_lambda_fn: Optional[def[dtype: DType, width: Int, *, alignment: Int = 1](IndexList[2], SIMD[dtype, width]) capturing -> None] = None, elementwise_compute_lambda_fn: Optional[def[dtype: DType, width: Int, *, alignment: Int = 1](IndexList[2], SIMD[dtype, width]) capturing -> SIMD[dtype, width]] = None, pdl_level: PDLLevel = PDLLevel(), max_profiled_tiles_per_SM: Optional[UInt32] = None](c_device: TileTensor[c_device.dtype, c_device.LayoutType, c_device.origin, address_space=c_device.address_space, linear_idx_type=c_device.linear_idx_type, element_size=c_device.element_size], a_device: TileTensor[a_device.dtype, a_device.LayoutType, a_device.origin, address_space=a_device.address_space, linear_idx_type=a_device.linear_idx_type, element_size=a_device.element_size], b_device: TileTensor[b_device.dtype, b_device.LayoutType, b_device.origin, address_space=b_device.address_space, linear_idx_type=b_device.linear_idx_type, element_size=b_device.element_size], ctx: DeviceContext)
Public entry point for SM100 matmul (non-batched, rank-2 inputs).
Split-K uses a separate 2D path. The non-split-K case delegates to blackwell_batched_matmul_tma_umma_warp_specialized, which handles _to_batched_3d wrapping and AB_swapped dispatch.
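A hypothetical call sketch, assuming `c`, `a`, and `b` are pre-constructed device `TileTensor` views and `ctx` is an active `DeviceContext` (their construction is omitted; the dtypes and the elided `MatmulConfig` arguments are illustrative, not verified defaults):

```mojo
# Sketch only: `c`, `a`, `b`, and `ctx` are assumed to exist already,
# and the MatmulConfig constructor arguments are elided.
alias config = MatmulConfig[
    DType.bfloat16, DType.bfloat16, DType.bfloat16, True
](...)

# C = A @ B^T on an SM100 (Blackwell) device, rank-2 inputs.
blackwell_matmul_tma_umma_warp_specialized[
    transpose_b=True,
    config=config,
](c, a, b, ctx)
```

The optional `elementwise_lambda_fn` and `elementwise_compute_lambda_fn` parameters allow an epilogue to be fused into the kernel: the former consumes each output element (e.g. to write it elsewhere), while the latter transforms the value before it is stored.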