Mojo function

matmul_dispatch_sm100_seperate_epilogue

matmul_dispatch_sm100_seperate_epilogue[c_type: DType, a_type: DType, b_type: DType, //, transpose_b: Bool, config: MatmulConfig[a_type, b_type, c_type, transpose_b], elementwise_lambda_fn: OptionalReg[fn[dtype: DType, width: Int, *, alignment: Int = 1](IndexList[2], SIMD[dtype, width]) capturing -> None] = OptionalReg[fn[dtype: DType, width: Int, *, alignment: Int = 1](IndexList[2], SIMD[dtype, width]) capturing -> None]({:i1 0, 1}), pdl_level: PDLLevel = PDLLevel(), block_swizzle_size: Int = 0, cta_group: Int = 2, num_pipeline_stages: Optional[UInt] = None](c: NDBuffer[c_type, 2, origin, shape], a: NDBuffer[a_type, 2, origin, shape], b: NDBuffer[b_type, 2, origin, shape], ctx: DeviceContext)

Our sm100 matmul kernel does not yet support fusing elementwise operations. This temporary implementation runs the sm100 matmul kernel and then dispatches a separate epilogue kernel to apply the elementwise operations, if any are provided via `elementwise_lambda_fn`.
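
The two-pass behavior described above can be illustrated with a small CPU sketch. This is plain Python, not the Mojo API: `matmul_then_epilogue` and `add_bias_relu` are hypothetical names, `b` is assumed not to be transposed, and the assumption that the epilogue pass reads back the values the matmul wrote to `c` is ours rather than something the source states. The callback mirrors the shape of `elementwise_lambda_fn` only loosely: it receives a 2D index and a value, returns nothing, and writes its result through a captured buffer.

```python
def matmul_then_epilogue(a, b, c, epilogue=None):
    """Hypothetical CPU sketch of an unfused matmul followed by a separate epilogue pass."""
    m, k, n = len(a), len(b), len(b[0])

    # Pass 1: plain matmul, results stored in c (stands in for the matmul kernel).
    for i in range(m):
        for j in range(n):
            c[i][j] = sum(a[i][p] * b[p][j] for p in range(k))

    # Pass 2: a separate sweep over c that hands each element to the callback
    # (stands in for the separately dispatched epilogue kernel).
    if epilogue is not None:
        for i in range(m):
            for j in range(n):
                epilogue((i, j), c[i][j])


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[0, 0], [0, 0]]
out = [[0, 0], [0, 0]]
bias = [-20, 0]


def add_bias_relu(idx, value):
    # Captures `out` and `bias`, like a capturing lambda; returns None.
    i, j = idx
    out[i][j] = max(value + bias[j], 0)


matmul_then_epilogue(a, b, c, epilogue=add_bias_relu)
```

Because the elementwise work runs as its own pass, every output element is visited twice, which is the extra cost a fused epilogue would avoid.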