Mojo function

allreduce_2stage_kernel

allreduce_2stage_kernel[type: DType, rank: Int, ngpus: Int](result: UnsafePointer[SIMD[type, 1]], src_bufs: StaticTuple[NDBuffer[type, rank], ngpus], rank_sigs: StaticTuple[UnsafePointer[Signal], 8], my_rank: Int, num_elements: Int)

2-stage allreduce algorithm for bandwidth-bound transfers.

This kernel implements a reduce-scatter + all-gather algorithm that is bandwidth optimal.
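The data flow of a reduce-scatter followed by an all-gather can be illustrated with a small host-side simulation. This is a minimal sketch in plain Python, not the kernel's Mojo implementation: the function name `allreduce_2stage_sim`, the contiguous partitioning scheme (with the last rank absorbing any remainder), and the in-memory `scratch` lists are assumptions made for illustration. Here `scratch` loosely plays the role of the per-rank communication buffers.

```python
def allreduce_2stage_sim(src_bufs):
    """Simulate a 2-stage (reduce-scatter + all-gather) sum allreduce.

    src_bufs: one equal-length list of numbers per simulated GPU.
    Returns one result list per simulated GPU.
    """
    ngpus = len(src_bufs)
    num_elements = len(src_bufs[0])

    # Assumed partitioning: contiguous chunks, last rank takes the remainder.
    part = num_elements // ngpus
    starts = [r * part for r in range(ngpus)]
    ends = [(r + 1) * part if r < ngpus - 1 else num_elements
            for r in range(ngpus)]

    # Stage 1 (reduce-scatter): rank r reduces only its own partition,
    # reading that slice from every GPU's input buffer.
    scratch = []
    for r in range(ngpus):
        reduced = [sum(buf[i] for buf in src_bufs)
                   for i in range(starts[r], ends[r])]
        scratch.append(reduced)

    # Stage 2 (all-gather): every rank collects all reduced partitions,
    # in rank order, into its own result buffer.
    results = []
    for _ in range(ngpus):
        out = []
        for r in range(ngpus):
            out.extend(scratch[r])
        results.append(out)
    return results


if __name__ == "__main__":
    bufs = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [100, 200, 300, 400, 500]]
    for rank, res in enumerate(allreduce_2stage_sim(bufs)):
        print(rank, res)
```

In this scheme each rank reads roughly (ngpus - 1) / ngpus of the buffer from its peers in each stage, so the per-GPU traffic stays close to twice the buffer size regardless of the number of GPUs, which is what makes the approach attractive for bandwidth-bound transfers.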

Arguments:

result (UnsafePointer[SIMD[type, 1]]): Output buffer for reduced values.
src_bufs (StaticTuple[NDBuffer[type, rank], ngpus]): Input buffers from all GPUs.
rank_sigs (StaticTuple[UnsafePointer[Signal], 8]): Signal pointers for synchronization. IMPORTANT: the signal pointers have trailing buffers for communication, which must be at least ngpus * sizeof(payload).
    | -- sizeof(Signal) -- | ------ a few MB ----- |
my_rank (Int): Current GPU rank.
num_elements (Int): Number of elements to reduce.

Parameters:

type (DType): Data type of tensor elements.
rank (Int): Number of dimensions in tensors. Note that "rank" is overloaded in this kernel: this parameter is the number of tensor dimensions, whereas the my_rank argument is the device ID.
ngpus (Int): Number of GPUs participating.