Mojo function
allgather
allgather[dtype: DType, rank: Int, ngpus: Int](input_buffers: InlineArray[NDBuffer[dtype, rank, MutableAnyOrigin], ngpus], output_buffers: InlineArray[NDBuffer[dtype, rank, MutableAnyOrigin], (ngpus * ngpus)], rank_sigs: InlineArray[UnsafePointer[Signal], 8], ctxs: List[DeviceContext], _max_num_blocks: Optional[Int] = None)
Performs all-gather across GPUs with variadic output.
Each device receives individual copies of all input buffers.
The implementation automatically selects between P2P and non-P2P paths based on hardware capabilities.
Parameters:
- dtype (DType): The data type of tensor elements.
- rank (Int): Number of dimensions in input tensors.
- ngpus (Int): Number of GPUs participating in all-gather.
Args:
- input_buffers (InlineArray): Input buffers from each GPU.
- output_buffers (InlineArray): Flat array of ngpus * ngpus output buffers. Layout: output_buffers[device_idx * ngpus + input_idx] contains device_idx's copy of input_idx's data.
- rank_sigs (InlineArray): Signal pointers for P2P synchronization.
- ctxs (List): List of device contexts for participating GPUs.
- _max_num_blocks (Optional): Maximum number of blocks for kernel launch (optional).
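The flat layout of output_buffers can be easier to follow with the index arithmetic written out. The sketch below is only an illustration of the documented rule output_buffers[device_idx * ngpus + input_idx]; the helper output_index and the value ngpus = 2 are assumptions made for this example and are not part of the allgather API. It does not allocate buffers or call allgather.

```mojo
# Hypothetical helper, not part of the allgather API: it only restates the
# documented layout rule for the flat output_buffers array.
fn output_index(device_idx: Int, input_idx: Int, ngpus: Int) -> Int:
    return device_idx * ngpus + input_idx


fn main():
    var ngpus = 2  # assumed GPU count, chosen for illustration
    for device_idx in range(ngpus):
        for input_idx in range(ngpus):
            # Flat slot that holds device_idx's copy of input_idx's data.
            print(device_idx, input_idx, output_index(device_idx, input_idx, ngpus))
```

With ngpus = 2, device 1's copy of input 0 sits at flat index 1 * 2 + 0 = 2, and each device owns a contiguous run of ngpus slots.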
allgather[dtype: DType, rank: Int, ngpus: Int](input_buffers: InlineArray[NDBuffer[dtype, rank, MutableAnyOrigin], ngpus], output_buffers: InlineArray[NDBuffer[dtype, rank, MutableAnyOrigin], (ngpus * ngpus)], ctxs: List[DeviceContext])
Backward-compatible version without the rank_sigs parameter.
This version uses the naive implementation because signal buffers cannot be allocated with a proper lifetime inside this function.
Deprecated:
Use the signal_buffers overload of allgather instead.