Mojo function
allreduce
```mojo
allreduce[
    dtype: DType,
    ngpus: Int,
    in_layout: TensorLayout,
    in_origin: Origin[mut=in_origin.mut],
    out_layout: TensorLayout,
    output_lambda: Optional[elementwise_epilogue_type] = None,
    pdl_level: PDLLevel = PDLLevel(),
    *,
    use_multimem: Bool = False,
](
    input_tensors: InlineArray[TileTensor[dtype, in_layout, in_origin], 1 if use_multimem else ngpus],
    output_tensor: TileTensor[dtype, out_layout, output_tensor.origin, address_space=output_tensor.address_space, linear_idx_type=output_tensor.linear_idx_type, element_size=output_tensor.element_size],
    rank_sigs: InlineArray[UnsafePointer[Signal, MutAnyOrigin], 8],
    ctx: DeviceContext,
    _max_num_blocks: Optional[Int] = None,
)
```
Per-device allreduce: one instance per GPU builds its own output.
High-level model
- Each GPU runs one instance of this function in parallel with the others.
- Every instance reads all inputs but writes only its own output buffer.
- A Python-level fence is inserted across the outputs to prevent reordering.
Two execution paths:

- P2P fast path (when peer access is available):
  - 1-stage kernel (latency-bound): each thread vector-loads from all GPUs, accumulates in higher precision, and writes directly to the result.
  - 2-stage kernel (bandwidth-bound): reduce-scatter then all-gather, using each GPU's `rank_sigs[*]` payload as a staging area for partitions. Diagram (per GPU r, 2-stage):
    - Stage 1: write reduced partition r into the payload of `rank_sigs[r]`.
    - Stage 2: gather partitions from all peers' payloads into `out_r`.
- Naive fallback (no P2P): for GPU r, create a local accumulator `A_r`, allocate a temporary buffer `S_r`, copy each peer input into `S_r` and accumulate into `A_r`, then apply the epilogue into `out_r`. Diagram (per GPU r, naive): `in_r -> A_r += in_r; for i != r: in_i -> S_r -> A_r += S_r; A_r -> out_r`
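The two paths can be sketched on the host with NumPy standing in for device buffers. This is an illustrative sketch of the algorithm only, not the Mojo kernel: the `payloads` list models the `rank_sigs[*]` staging areas, and on real hardware each GPU builds only its own payload and reads peers' payloads over P2P.

```python
import numpy as np

def allreduce_2stage(inputs):
    """Host-side sketch of the bandwidth-bound P2P path:
    reduce-scatter into staging payloads, then all-gather."""
    ngpus, n = len(inputs), inputs[0].size
    bounds = [i * n // ngpus for i in range(ngpus + 1)]

    # Stage 1 (reduce-scatter): GPU r reduces partition r from every
    # input into its staging payload (models the rank_sigs[r] payload).
    # Here all payloads are built in one loop; on hardware each GPU
    # builds only its own.
    payloads = []
    for r in range(ngpus):
        lo, hi = bounds[r], bounds[r + 1]
        acc = np.zeros(hi - lo, dtype=np.float64)  # accumulate in higher precision
        for inp in inputs:
            acc += inp[lo:hi]
        payloads.append(acc)

    # Stage 2 (all-gather): each GPU copies every partition from the
    # peers' payloads into its own output buffer.
    out = np.empty(n, dtype=inputs[0].dtype)
    for r in range(ngpus):
        out[bounds[r]:bounds[r + 1]] = payloads[r]
    return out

def allreduce_naive(inputs, rank):
    """Sketch of the no-P2P fallback for GPU `rank`."""
    acc = inputs[rank].astype(np.float64)  # local accumulator A_r
    for i, inp in enumerate(inputs):
        if i != rank:
            s = inp.copy()  # staged copy through temporary buffer S_r
            acc += s
    return acc.astype(inputs[rank].dtype)  # epilogue cast into out_r
```

Both paths produce the same sum; the 2-stage version trades extra synchronization for moving each element across the interconnect only a constant number of times, which is why it wins for large (bandwidth-bound) tensors.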
Notes:
- Inputs must have identical shape and dtype across GPUs.
- Signal buffers must be sized at least `size_of(Signal) + payload_bytes` for the P2P 2-stage path, where `payload_bytes` equals the input tensor bytecount.
- The naive path is automatically selected if P2P cannot be enabled.
- The `use_multimem` parameter requires P2P access between GPUs.
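The buffer-size requirement in the note above is simple arithmetic; a sketch follows, with the Signal header size taken as a parameter because `size_of(Signal)` depends on the actual struct layout:

```python
def min_signal_buffer_bytes(num_elements: int, dtype_bytes: int,
                            signal_header_bytes: int) -> int:
    """Minimum per-GPU signal buffer for the P2P 2-stage path:
    size_of(Signal) plus a payload equal to the input bytecount."""
    payload_bytes = num_elements * dtype_bytes
    return signal_header_bytes + payload_bytes
```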
Parameters:
- dtype (`DType`): Data type of the tensor elements.
- ngpus (`Int`): Number of GPUs participating in the allreduce.
- in_layout (`TensorLayout`): Layout of the input TileTensors.
- in_origin (`Origin`): Origin of the input TileTensors.
- out_layout (`TensorLayout`): Layout of the output TileTensor.
- output_lambda (`Optional`): Elementwise epilogue applied on the device result.
- pdl_level (`PDLLevel`): Controls PDL behavior for P2P kernels.
- use_multimem (`Bool`): Whether to use multimem mode for improved performance.
Args:
- input_tensors (`InlineArray`): Inputs from ALL GPUs as TileTensors.
- output_tensor (`TileTensor`): Output for THIS GPU as a TileTensor.
- rank_sigs (`InlineArray`): Per-GPU Signal pointers.
- ctx (`DeviceContext`): Device context for THIS GPU.
- _max_num_blocks (`Optional`): Optional grid limit.