
Mojo function

allgather

allgather[dtype: DType, rank: Int, ngpus: Int](input_buffers: InlineArray[NDBuffer[dtype, rank, MutableAnyOrigin], ngpus], output_buffers: InlineArray[NDBuffer[dtype, rank, MutableAnyOrigin], (ngpus * ngpus)], rank_sigs: InlineArray[UnsafePointer[Signal], 8], ctxs: List[DeviceContext], _max_num_blocks: Optional[Int] = None)

Performs an all-gather across GPUs with variadic output.

Each device receives individual copies of all input buffers.

The implementation automatically selects between P2P and non-P2P paths based on hardware capabilities.

Parameters:

  • dtype (DType): The data type of tensor elements.
  • rank (Int): Number of dimensions in the input tensors.
  • ngpus (Int): Number of GPUs participating in the all-gather.

Args:

  • input_buffers (InlineArray): Input buffers from each GPU.
  • output_buffers (InlineArray): Flat array of ngpus * ngpus output buffers, laid out so that output_buffers[device_idx * ngpus + input_idx] holds device_idx's copy of the data from GPU input_idx (see the sketch after this list).
  • rank_sigs (InlineArray): Signal pointers for P2P synchronization.
  • ctxs (List): List of device contexts for participating GPUs.
  • _max_num_blocks (Optional): Maximum number of blocks for the kernel launch; defaults to None.
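
A minimal sketch of the flat output-buffer layout described above. Only the device_idx * ngpus + input_idx formula comes from this page; the output_slot helper name and the example values are illustrative assumptions, not part of the library API.

```mojo
# Sketch only: demonstrates the flat indexing scheme, not the allgather call itself.
fn output_slot(device_idx: Int, input_idx: Int, ngpus: Int) -> Int:
    # Flat index of device_idx's copy of the buffer that originated on GPU input_idx.
    return device_idx * ngpus + input_idx


def main():
    alias ngpus = 4
    # After the all-gather, device 2 finds its copy of GPU 1's data at this slot:
    print(output_slot(2, 1, ngpus))  # 2 * 4 + 1 = 9
    # Every device receives a copy of every input, so each device owns ngpus consecutive slots:
    for input_idx in range(ngpus):
        print(output_slot(2, input_idx, ngpus))  # 8, 9, 10, 11
```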

allgather[dtype: DType, rank: Int, ngpus: Int](input_buffers: InlineArray[NDBuffer[dtype, rank, MutableAnyOrigin], ngpus], output_buffers: InlineArray[NDBuffer[dtype, rank, MutableAnyOrigin], (ngpus * ngpus)], ctxs: List[DeviceContext])

Backward compatible version without rank_sigs parameter.

This overload always uses the naive (non-P2P) implementation, because signal buffers with the required lifetime cannot be allocated inside this function.

Deprecated:

Use the overload of allgather that takes signal buffers (rank_sigs) instead.
