Mojo function
blackwell_block_scaled_matmul_tma_umma_warp_specialized
blackwell_block_scaled_matmul_tma_umma_warp_specialized[sfa_dtype: DType, sfb_dtype: DType, transpose_b: Bool, *, K: Int, config: BlockScaledMatmulConfig[config.a_type, config.b_type, config.c_type, sfa_dtype, sfb_dtype, transpose_b], elementwise_compute_lambda_fn: Optional[def[dtype: DType, width: Int, *, alignment: Int = 1](IndexList[2], SIMD[dtype, width]) capturing -> SIMD[dtype, width]] = None, elementwise_lambda_fn: Optional[def[dtype: DType, width: Int, *, alignment: Int = 1](IndexList[2], SIMD[dtype, width]) capturing -> None] = None, pdl_level: PDLLevel = PDLLevel(1), max_profiled_tiles_per_SM: Optional[UInt32] = None](c_tensor: TileTensor[c_tensor.dtype, c_tensor.LayoutType, c_tensor.origin, address_space=c_tensor.address_space, linear_idx_type=c_tensor.linear_idx_type, element_size=c_tensor.element_size], a_tensor: TileTensor[a_tensor.dtype, a_tensor.LayoutType, a_tensor.origin, address_space=a_tensor.address_space, linear_idx_type=a_tensor.linear_idx_type, element_size=a_tensor.element_size], b_tensor: TileTensor[b_tensor.dtype, b_tensor.LayoutType, b_tensor.origin, address_space=b_tensor.address_space, linear_idx_type=b_tensor.linear_idx_type, element_size=b_tensor.element_size], a_scales_tensor: TileTensor[sfa_dtype, a_scales_tensor.LayoutType, a_scales_tensor.origin, address_space=a_scales_tensor.address_space, linear_idx_type=a_scales_tensor.linear_idx_type, element_size=a_scales_tensor.element_size], b_scales_tensor: TileTensor[sfb_dtype, b_scales_tensor.LayoutType, b_scales_tensor.origin, address_space=b_scales_tensor.address_space, linear_idx_type=b_scales_tensor.linear_idx_type, element_size=b_scales_tensor.element_size], ctx: DeviceContext, alpha: Float32 = 1)
Launches a small-BN block-scaled FP8 matmul kernel on SM100.
A, B, C, and the scale factors are all passed as TileTensors. A, B, and C are 2D (non-batched) or 3D (batched); the scale factors are 5D (non-batched) or 6D (batched).
When config.AB_swapped is True, the kernel internally swaps the A and B operands (along with their scale factors) and transposes the output.
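The shape of a call follows directly from the signature above. The sketch below is purely illustrative and is not runnable as written: the construction of the BlockScaledMatmulConfig, the TileTensor operands, and the chosen dtypes, K value, and transpose_b setting are all assumptions, and the actual construction APIs may differ.

```mojo
# Illustrative call shape only. `config`, `c`, `a`, `b`, `a_scales`,
# `b_scales`, and `ctx` are assumed to have been constructed elsewhere
# with shapes consistent with the docstring above (A/B/C 2D or 3D,
# scale factors 5D or 6D). Parameter values here are hypothetical.
blackwell_block_scaled_matmul_tma_umma_warp_specialized[
    sfa_dtype = DType.float8_e8m0fnu,  # assumed scale-factor dtype
    sfb_dtype = DType.float8_e8m0fnu,  # assumed scale-factor dtype
    transpose_b=True,                  # assumed B layout
    K=4096,                            # assumed reduction dimension
    config=config,                     # BlockScaledMatmulConfig, elided
](c, a, b, a_scales, b_scales, ctx, alpha=1.0)
```

Because sfa_dtype, sfb_dtype, transpose_b, K, and config are compile-time parameters, they go in the square-bracket parameter list; the tensors, the DeviceContext, and alpha are runtime arguments.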