Mojo function
allgather
allgather[dtype: DType, ngpus: Int, in_layout: TensorLayout, in_origin: Origin[mut=in_origin.mut], out_layout: TensorLayout, out_origin: MutOrigin, pdl_level: PDLLevel = PDLLevel()](input_buffers: InlineArray[TileTensor[dtype, in_layout, in_origin], ngpus], output_buffers: InlineArray[TileTensor[dtype, out_layout, out_origin], ngpus], rank_sigs: InlineArray[UnsafePointer[Signal, MutAnyOrigin], 8], ctx: DeviceContext, my_rank: Int, _max_num_blocks: Optional[Int] = None)
Per-device all-gather: one instance per GPU builds its own outputs.
Each instance reads all input buffers and writes to its own ngpus output buffers. The caller is responsible for launching one instance per device in parallel (e.g. via _launch_device_collective).
The implementation automatically selects between P2P and non-P2P paths based on hardware capabilities.
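The per-device semantics can be illustrated with a short, library-agnostic Python sketch (this is not the Mojo API; buffers are plain lists rather than device memory): each of the `ngpus` instances reads every input buffer and fills its own array of `ngpus` output buffers, so `output_buffers[i]` on every rank ends up holding the data contributed by GPU i.

```python
def allgather_sim(input_buffers, ngpus):
    """Simulate one all-gather instance per 'GPU' (rank).

    Each rank reads all input buffers and builds its own outputs,
    mirroring how each allgather instance writes its own ngpus
    output buffers.
    """
    all_outputs = []
    for my_rank in range(ngpus):
        # This rank's ngpus output buffers: a copy of every input,
        # indexed by the source GPU.
        output_buffers = [list(input_buffers[i]) for i in range(ngpus)]
        all_outputs.append(output_buffers)
    return all_outputs

inputs = [[0, 0], [1, 1], [2, 2], [3, 3]]  # one input buffer per GPU
outs = allgather_sim(inputs, ngpus=4)

# After the collective, every rank sees the same gathered data:
# outs[rank][i] == the buffer contributed by GPU i.
assert all(outs[r][i] == inputs[i] for r in range(4) for i in range(4))
```

In the real kernel the copies happen via P2P loads (or a staged non-P2P path) and are synchronized through the per-GPU `Signal` pointers, but the resulting data layout is the same.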
Parameters:

- dtype (DType): Data type of the tensor elements.
- ngpus (Int): Number of GPUs participating in the all-gather.
- in_layout (TensorLayout): Layout of the input TileTensors.
- in_origin (Origin): Origin of the input TileTensors.
- out_layout (TensorLayout): Layout of the output TileTensors.
- out_origin (MutOrigin): Origin of the output TileTensors.
- pdl_level (PDLLevel): Controls PDL behavior for the P2P kernels.
Args:

- input_buffers (InlineArray): Input buffers from ALL GPUs, as TileTensors.
- output_buffers (InlineArray): Output buffers for THIS GPU (ngpus TileTensors); output_buffers[i] receives the data from GPU i.
- rank_sigs (InlineArray): Per-GPU Signal pointers used for P2P synchronization.
- ctx (DeviceContext): Device context for THIS GPU.
- my_rank (Int): Index of this GPU among the participants.
- _max_num_blocks (Optional): Maximum number of blocks for the kernel launch (optional).