Mojo function
allreduce
allreduce[dtype: DType, rank: Int, ngpus: Int, output_lambda: OptionalReg[fn[dtype: DType, rank: Int, width: Int, *, alignment: Int](IndexList[rank], SIMD[dtype, width]) capturing -> None] = None, pdl_level: PDLLevel = PDLLevel(), *, use_multimem: Bool = False, use_quickreduce: Bool = False](input_buffers: InlineArray[NDBuffer[dtype, rank, MutAnyOrigin], 1 if use_multimem else ngpus], output_buffer: NDBuffer[dtype, rank, MutAnyOrigin], rank_sigs: InlineArray[LegacyUnsafePointer[Signal], 8], ctx: DeviceContext, _max_num_blocks: Optional[Int] = None, iteration: Int = 0)
Per-GPU allreduce for use in multi-threaded contexts.
Currently requires a prior single-threaded call to init_comms, as a thread-safe version has not yet been implemented.
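The sketch below illustrates how a per-GPU worker might invoke this function once communication has been initialized. It is a minimal, hedged example: the import paths, the surrounding setup (buffer allocation, signal allocation, and the init_comms call), and the chosen dtype/rank/ngpus values are assumptions for illustration, not part of the documented API.

```mojo
# Sketch only: import paths and setup are assumptions; adjust to your
# MAX/Mojo version. Buffers and signals must already be allocated and
# init_comms must have been called from a single thread beforehand.
from gpu.comm.allreduce import allreduce, Signal  # assumed module path
from gpu.host import DeviceContext
from buffer import NDBuffer

alias dtype = DType.bfloat16  # example element type
alias rank = 1                # example tensor rank
alias ngpus = 2               # example GPU count

fn reduce_on_device(
    input_buffers: InlineArray[NDBuffer[dtype, rank, MutAnyOrigin], ngpus],
    output_buffer: NDBuffer[dtype, rank, MutAnyOrigin],
    rank_sigs: InlineArray[LegacyUnsafePointer[Signal], 8],
    ctx: DeviceContext,
) raises:
    # One call per GPU thread; remaining parameters (output_lambda,
    # pdl_level, use_multimem, use_quickreduce) keep their defaults.
    allreduce[dtype, rank, ngpus](
        input_buffers,
        output_buffer,
        rank_sigs,
        ctx,
    )
```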