Mojo function
blockwise_scaled_fp8_with_epilogue
blockwise_scaled_fp8_with_epilogue[c_type: DType, a_type: DType, b_type: DType, a_scales_type: DType, b_scales_type: DType, //, *, scales_granularity_mnk: IndexList[3], BLOCK_DIM: Int = 16, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[dtype: DType, width: Int, *, alignment: Int = 1](IndexList[2], SIMD[dtype, width]) capturing -> None] = None, accum_type: DType = get_accum_type[c_type]()](c: LayoutTensor[c_type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], a: LayoutTensor[a_type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], b: LayoutTensor[b_type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], a_scales: LayoutTensor[a_scales_type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], b_scales: LayoutTensor[b_scales_type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ctx: DeviceContext)
Our SM100 blockwise scaled FP8 matmul kernel does not yet support fusion of elementwise operations. This is a temporary implementation that runs our SM100 blockwise scaled FP8 matmul kernel and then dispatches a separate epilogue kernel to apply the elementwise operations. On GPUs other than B200, we use the naive blockwise scaled FP8 matmul, which supports a normal epilogue natively.
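For illustration, below is a minimal sketch of wiring an elementwise epilogue into this call. Only the kernel's signature above is taken from this page; the epilogue body, the chosen scale granularity, the import path of the kernel, and the setup of the tensors and device context are assumptions, not part of the documented API.

```mojo
# Sketch only: assumed to live inside a host-side function where `c`, `a`,
# `b`, `a_scales`, `b_scales` (LayoutTensors with compatible layouts and
# dtypes) and `ctx` (a DeviceContext) are already in scope, and where
# blockwise_scaled_fp8_with_epilogue has been imported (its module path is
# not shown on this page).
from utils.index import IndexList


@parameter
@always_inline
fn epilogue[
    dtype: DType, width: Int, *, alignment: Int = 1
](idx: IndexList[2], val: SIMD[dtype, width]):
    # Hypothetical elementwise op: cast the accumulated values and write them
    # back to the output tensor at (idx[0], idx[1]). The exact store call on
    # LayoutTensor is illustrative.
    c.store[width=width](idx[0], idx[1], val.cast[c.dtype]())


blockwise_scaled_fp8_with_epilogue[
    # Illustrative granularity: one scale per (1, 128, 128) block of (M, N, K).
    scales_granularity_mnk = IndexList[3](1, 128, 128),
    transpose_b=True,
    elementwise_lambda_fn=epilogue,
](c, a, b, a_scales, b_scales, ctx)
```

Because the elementwise lambda captures `c`, the epilogue kernel (or, on non-B200 GPUs, the naive matmul) is responsible for writing the final results to the output tensor.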