Mojo function
blackwell_block_scaled_matmul_tma_umma_warp_specialized
blackwell_block_scaled_matmul_tma_umma_warp_specialized[c_type: DType, c_layout: Layout, a_type: DType, a_layout: Layout, group_offsets_layout: Layout, group_scale_offsets_layout: Layout, b_type: DType, b_layout: Layout, expert_ids_layout: Layout, sfa_dtype: DType, sfa_layout: Layout, sfb_dtype: DType, sfb_layout: Layout, expert_scale_layout: Layout, transpose_b: Bool, *, config: BlockScaledMatmulConfig[a_type, b_type, c_type, sfa_dtype, sfb_dtype, transpose_b], elementwise_compute_lambda_fn: Optional[elementwise_compute_lambda_type] = None, register_based_epilogue: Bool = True, pdl_level: PDLLevel = PDLLevel(), max_profiled_tiles_per_SM: Optional[UInt32] = None](c_device: LayoutTensor[c_type, c_layout, c_device.origin, address_space=c_device.address_space, element_layout=c_device.element_layout, layout_int_type=c_device.layout_int_type, linear_idx_type=c_device.linear_idx_type, masked=c_device.masked, alignment=c_device.alignment], a_device: LayoutTensor[a_type, a_layout, a_device.origin, address_space=a_device.address_space, element_layout=a_device.element_layout, layout_int_type=a_device.layout_int_type, linear_idx_type=a_device.linear_idx_type, masked=a_device.masked, alignment=a_device.alignment], group_offsets: LayoutTensor[DType.uint32, group_offsets_layout, group_offsets.origin, address_space=group_offsets.address_space, element_layout=group_offsets.element_layout, layout_int_type=group_offsets.layout_int_type, linear_idx_type=group_offsets.linear_idx_type, masked=group_offsets.masked, alignment=group_offsets.alignment], group_scale_offsets: LayoutTensor[DType.uint32, group_scale_offsets_layout, group_scale_offsets.origin, address_space=group_scale_offsets.address_space, element_layout=group_scale_offsets.element_layout, layout_int_type=group_scale_offsets.layout_int_type, linear_idx_type=group_scale_offsets.linear_idx_type, masked=group_scale_offsets.masked, alignment=group_scale_offsets.alignment], b_device: LayoutTensor[b_type, b_layout, b_device.origin, address_space=b_device.address_space, element_layout=b_device.element_layout, layout_int_type=b_device.layout_int_type, linear_idx_type=b_device.linear_idx_type, masked=b_device.masked, alignment=b_device.alignment], expert_ids: LayoutTensor[DType.int32, expert_ids_layout, expert_ids.origin, address_space=expert_ids.address_space, element_layout=expert_ids.element_layout, layout_int_type=expert_ids.layout_int_type, linear_idx_type=expert_ids.linear_idx_type, masked=expert_ids.masked, alignment=expert_ids.alignment], a_scales: LayoutTensor[sfa_dtype, sfa_layout, MutAnyOrigin], b_scales: LayoutTensor[sfb_dtype, sfb_layout, MutAnyOrigin], expert_scales: LayoutTensor[DType.float32, expert_scale_layout, MutAnyOrigin], num_active_experts: Int, ctx: DeviceContext)
Launches a grouped block-scaled matmul kernel on SM100 (Blackwell) GPUs.
When config.AB_swapped is True, the kernel internally swaps the A and B operands (along with their scale factors) and transposes the output, which improves performance when M is small.
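As a rough illustration of what "grouped block-scaled matmul" computes, the sketch below is a plain-Python reference model, not the kernel itself: the function name, the exact scale-factor layouts (per-row/per-K-block scales for A, per-expert/per-column/per-K-block scales for B), and the offset bookkeeping are simplifying assumptions chosen for clarity, and the real kernel's TMA/UMMA warp-specialized pipeline and tensor layouts are far more involved.

```python
def grouped_block_scaled_matmul(a, b, a_scales, b_scales,
                                group_offsets, expert_ids, expert_scales,
                                block_k):
    """Hypothetical reference semantics: rows a[group_offsets[g] :
    group_offsets[g + 1]] of group g are multiplied by the weight matrix
    of expert expert_ids[g]; each K-block of size block_k is dequantized
    by its per-block scale factor, and the result is scaled by the
    expert's output scale."""
    num_groups = len(group_offsets) - 1
    n = len(b[0][0])  # output columns
    out = []
    for g in range(num_groups):
        e = expert_ids[g]
        for row in range(group_offsets[g], group_offsets[g + 1]):
            out_row = [0.0] * n
            for kb in range(len(a[row]) // block_k):  # iterate K-blocks
                sa = a_scales[row][kb]                # A's per-block scale
                for kk in range(block_k):
                    k = kb * block_k + kk
                    a_val = a[row][k] * sa            # dequantize A element
                    for j in range(n):
                        sb = b_scales[e][j][kb]       # B's per-block scale
                        out_row[j] += a_val * b[e][k][j] * sb
            out.append([v * expert_scales[e] for v in out_row])
    return out
```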
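The swap is valid because of the transpose identity A @ B == (B^T @ A^T)^T: computing with the operands exchanged and transposing the result recovers the original product. A minimal pure-Python sketch of that identity (the helper names here are illustrative, not part of the kernel's API):

```python
def matmul(x, y):
    """Plain row-major matmul on nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]

def transpose(x):
    return [list(col) for col in zip(*x)]

# Small M (a single row of A) is the case the swap targets.
A = [[1.0, 2.0, 3.0]]
B = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
direct = matmul(A, B)
swapped = transpose(matmul(transpose(B), transpose(A)))
assert direct == swapped
```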