Mojo function
allreduce_2stage_kernel
allreduce_2stage_kernel[type: DType, rank: Int, ngpus: Int](result: UnsafePointer[SIMD[type, 1]], src_bufs: StaticTuple[NDBuffer[type, rank], ngpus], rank_sigs: StaticTuple[UnsafePointer[Signal], 8], my_rank: Int, num_elements: Int)
2-stage allreduce algorithm for bandwidth-bound transfers.
This kernel implements a reduce-scatter + all-gather algorithm that is bandwidth optimal.
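The two stages can be illustrated with a small host-side simulation. This is a minimal sketch in plain Python, not the Mojo kernel itself: the helper names partition_bounds and allreduce_2stage, the sum reduction, and the choice to give the remainder elements to the last rank are all assumptions made for illustration. Lists stand in for per-GPU buffers, and the cross-rank synchronization done through rank_sigs is reduced to a comment.

```python
def partition_bounds(num_elements, ngpus, r):
    """Half-open [start, end) range owned by rank r (hypothetical layout)."""
    part = num_elements // ngpus
    start = r * part
    # Assumption: the last rank also takes any remainder elements.
    end = num_elements if r == ngpus - 1 else start + part
    return start, end


def allreduce_2stage(src_bufs):
    """Simulate a sum-allreduce over equal-length lists, one list per rank."""
    ngpus = len(src_bufs)
    num_elements = len(src_bufs[0])

    # Stage 1 (reduce-scatter): each rank reduces only its own partition,
    # reading that partition from every rank's input buffer.
    scratch = []
    for r in range(ngpus):
        start, end = partition_bounds(num_elements, ngpus, r)
        scratch.append(
            [sum(buf[i] for buf in src_bufs) for i in range(start, end)]
        )

    # A real kernel needs a cross-rank barrier here, so that every
    # partition is fully reduced before any rank starts gathering.

    # Stage 2 (all-gather): every rank copies all reduced partitions,
    # in rank order, into its own result buffer.
    results = []
    for _ in range(ngpus):
        result = []
        for r in range(ngpus):
            result.extend(scratch[r])
        results.append(result)
    return results


inputs = [
    [1, 2, 3, 4, 5],
    [10, 20, 30, 40, 50],
    [100, 200, 300, 400, 500],
]
outputs = allreduce_2stage(inputs)
```

In this sketch each rank reduces roughly 1/ngpus of the elements in stage 1 rather than the whole buffer, which is the property that makes the two-stage scheme attractive for large, bandwidth-bound transfers.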
Arguments:
result: Output buffer for reduced values.
src_bufs: Input buffers from all GPUs.
rank_sigs: Signal pointers for synchronization.
IMPORTANT: the signal pointers have trailing buffers for communication, which must be at least ngpus * sizeof(payload). Each allocation is laid out as:
| -- sizeof(Signal) -- | ------ a few MB ----- |
my_rank: Current GPU rank.
num_elements: Number of elements to reduce.
Parameters:
- type (DType): Data type of tensor elements.
- rank (Int): Number of dimensions in tensors. Note that rank is overloaded here to mean both device id and number of dimensions.
- ngpus (Int): Number of GPUs participating.