Mojo module
distributed_matmul
Aliases
elementwise_epilogue_type
alias elementwise_epilogue_type = fn[input_index: Int, dtype: DType, rank: Int, width: Int, *, alignment: Int](IndexList[rank], SIMD[dtype, width]) capturing -> None
Functions
- `matmul_allreduce`: Performs C = matmul(A, B^T) followed by Out = allreduce(C) across multiple GPUs. Splits A or B, together with C, into `num_partitions` submatrices along dimension `partition_dim`. This yields `num_partitions` independent matmul + allreduce kernels, so some of the computation can be overlapped.
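The partitioning idea can be illustrated with a small CPU-only simulation in plain Python. This is a conceptual sketch, not the Mojo API. The function name `matmul_allreduce_sim`, its argument layout, and the helpers are hypothetical. It also assumes that `partition_dim == 0` splits A and C by rows, that `partition_dim == 1` splits B and C along the columns of C, and that the allreduce is an elementwise sum. The partitions run one after another here; on real hardware the allreduce for one partition could presumably proceed while the matmul for the next is still running.

```python
def matmul_bt(a, b):
    """Return a @ b^T for nested lists; a is M x K, b is N x K."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def allreduce_sum(per_device):
    """Elementwise sum of same-shaped matrices, one per simulated device."""
    rows, cols = len(per_device[0]), len(per_device[0][0])
    return [[sum(m[i][j] for m in per_device) for j in range(cols)]
            for i in range(rows)]


def split(n, parts):
    """Yield non-empty (start, stop) ranges covering range(n)."""
    base, extra = divmod(n, parts)
    start = 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        if stop > start:
            yield start, stop
        start = stop


def matmul_allreduce_sim(a_shards, b_shards, num_partitions, partition_dim):
    """Hypothetical simulation: device d holds a_shards[d] (M x K_d) and
    b_shards[d] (N x K_d); the result is sum_d a_shards[d] @ b_shards[d]^T."""
    m, n = len(a_shards[0]), len(b_shards[0])
    out = [[0] * n for _ in range(m)]
    if partition_dim == 0:
        # Assumption: split A and C along rows.
        for lo, hi in split(m, num_partitions):
            partial = [matmul_bt(a[lo:hi], b)
                       for a, b in zip(a_shards, b_shards)]
            out[lo:hi] = allreduce_sum(partial)
    elif partition_dim == 1:
        # Assumption: split B (rows of B, i.e. columns of B^T) and C columns.
        for lo, hi in split(n, num_partitions):
            partial = [matmul_bt(a, b[lo:hi])
                       for a, b in zip(a_shards, b_shards)]
            reduced = allreduce_sum(partial)
            for i in range(m):
                out[i][lo:hi] = reduced[i]
    else:
        raise ValueError("partition_dim must be 0 or 1")
    return out
```

In this sketch each partition's matmul and reduction touch a disjoint slice of the output, which is what makes the per-partition kernels independent of one another.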