Mojo function
allgather
allgather[dtype: DType, ngpus: Int, in_layout: TensorLayout, in_origin: Origin[mut=in_origin.mut], out_layout: TensorLayout, out_origin: MutOrigin, pdl_level: PDLLevel = PDLLevel()](input_buffers: InlineArray[TileTensor[dtype, in_layout, in_origin], ngpus], output_buffers: InlineArray[TileTensor[dtype, out_layout, out_origin], ngpus], rank_sigs: InlineArray[UnsafePointer[Signal, MutAnyOrigin], 8], ctx: DeviceContext, my_rank: Int, _max_num_blocks: Optional[Int] = None)
Per-device all-gather: one instance per GPU builds its own outputs.
Each instance reads all input buffers and writes to its own ngpus output buffers. The caller is responsible for launching one instance per device in parallel (e.g. via _launch_device_collective).
The implementation automatically selects between P2P and non-P2P paths based on hardware capabilities.
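The per-device semantics can be illustrated with a short, library-agnostic Python sketch (this is not the Mojo API; buffers are plain lists rather than device memory): each of the `ngpus` instances reads every input buffer and fills its own array of `ngpus` output buffers, so `output_buffers[i]` on every rank ends up holding the data contributed by GPU i.

```python
def allgather_sim(input_buffers, ngpus):
    """Simulate one all-gather instance per 'GPU' (rank).

    Each rank reads all input buffers and builds its own outputs,
    mirroring how each allgather instance writes its own ngpus
    output buffers.
    """
    all_outputs = []
    for my_rank in range(ngpus):
        # This rank's ngpus output buffers: a copy of every input,
        # indexed by the source GPU.
        output_buffers = [list(input_buffers[i]) for i in range(ngpus)]
        all_outputs.append(output_buffers)
    return all_outputs

inputs = [[0, 0], [1, 1], [2, 2], [3, 3]]  # one input buffer per GPU
outs = allgather_sim(inputs, ngpus=4)

# After the collective, every rank sees the same gathered data:
# outs[rank][i] == the buffer contributed by GPU i.
assert all(outs[r][i] == inputs[i] for r in range(4) for i in range(4))
```

In the real kernel the copies happen via P2P loads (or a staged non-P2P path) and are synchronized through the per-GPU `Signal` pointers, but the resulting data layout is the same.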
Parameters:

- dtype (DType): Data type of the tensor elements.
- ngpus (Int): Number of GPUs participating in the all-gather.
- in_layout (TensorLayout): Layout of the input TileTensors.
- in_origin (Origin): Origin of the input TileTensors.
- out_layout (TensorLayout): Layout of the output TileTensors.
- out_origin (MutOrigin): Origin of the output TileTensors.
- pdl_level (PDLLevel): Controls PDL behavior for the P2P kernels.
Args:

- input_buffers (InlineArray): Input buffers from ALL GPUs, as TileTensors.
- output_buffers (InlineArray): Output buffers for THIS GPU (ngpus TileTensors); output_buffers[i] receives the data from GPU i.
- rank_sigs (InlineArray): Per-GPU Signal pointers used for P2P synchronization.
- ctx (DeviceContext): Device context for THIS GPU.
- my_rank (Int): Index of this GPU among the participants.
- _max_num_blocks (Optional): Maximum number of blocks for the kernel launch (optional).