Mojo function

allreduce

allreduce[
    dtype: DType,
    input_origin: ImmutOrigin,
    rank_sigs_origin: MutOrigin, //,
    rank: Int,
    ngpus: Int,
    output_lambda: Optional[elementwise_epilogue_type] = None,
    pdl_level: PDLLevel = PDLLevel(),
    *,
    use_multimem: Bool = False,
](
    input_buffers: InlineArray[NDBuffer[dtype, input_origin], 1 if use_multimem else ngpus],
    output_buffer: NDBuffer[dtype, MutAnyOrigin],
    rank_sigs: InlineArray[UnsafePointer[Signal, rank_sigs_origin], 8],
    ctx: DeviceContext,
    _max_num_blocks: Optional[Int] = None,
)

Per-GPU allreduce for use in multi-threaded contexts.

Currently requires a prior single-threaded call to init_comms, since a thread-safe version has not yet been implemented.
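The collective itself follows standard allreduce semantics: each of the ngpus ranks contributes one input buffer, and after the operation every rank's output holds the element-wise sum across all ranks. Since the Mojo API above needs multiple GPUs and device context setup to run, here is a minimal plain-Python sketch of just that reduction semantics (the function name allreduce_sum is hypothetical and not part of this API):

```python
def allreduce_sum(input_buffers):
    # Each entry of `input_buffers` is one rank's (GPU's) local buffer.
    ngpus = len(input_buffers)
    n = len(input_buffers[0])
    # Element-wise sum across all ranks' buffers.
    reduced = [sum(buf[i] for buf in input_buffers) for i in range(n)]
    # Every rank receives an identical copy of the reduced result.
    return [list(reduced) for _ in range(ngpus)]

# Two "GPUs", four elements each.
outputs = allreduce_sum([[1, 2, 3, 4], [10, 20, 30, 40]])
# Every rank ends up with [11, 22, 33, 44].
```

In the real API the per-rank result lands in output_buffer, optionally transformed by output_lambda, while rank_sigs carries the cross-GPU synchronization signals.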
