Mojo function

allreduce

allreduce[dtype: DType, ngpus: Int, in_layout: TensorLayout, in_origin: Origin[mut=in_origin.mut], out_layout: TensorLayout, output_lambda: Optional[elementwise_epilogue_type] = None, pdl_level: PDLLevel = PDLLevel(), *, use_multimem: Bool = False](input_tensors: InlineArray[TileTensor[dtype, in_layout, in_origin], 1 if use_multimem else ngpus], output_tensor: TileTensor[dtype, out_layout, output_tensor.origin, address_space=output_tensor.address_space, linear_idx_type=output_tensor.linear_idx_type, element_size=output_tensor.element_size], rank_sigs: InlineArray[UnsafePointer[Signal, MutAnyOrigin], 8], ctx: DeviceContext, _max_num_blocks: Optional[Int] = None)

Per-device allreduce: one instance per GPU builds its own output.

High-level model

  • Each GPU runs one instance of this function in parallel with the others.
  • Every instance reads all inputs but writes only its own output buffer.
  • A Python-level fence is inserted across the outputs to prevent reordering.

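The contract above can be modeled in plain Python. This is an illustrative sketch of the per-device semantics, not the Mojo API: each "instance" reads every GPU's input but writes only its own output buffer.

```python
# Illustrative model of the per-device allreduce contract (not the Mojo API):
# each of the instances reads ALL inputs, but writes only outputs[rank].

def allreduce_instance(rank, inputs, outputs):
    """One GPU's instance: reduce all inputs into this rank's output."""
    n = len(inputs[0])
    acc = [0.0] * n
    for buf in inputs:          # reads every GPU's input
        for i in range(n):
            acc[i] += buf[i]
    outputs[rank][:] = acc      # writes only its OWN output

inputs = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]   # 3 "GPUs"
outputs = [[0.0, 0.0] for _ in range(3)]
for r in range(3):              # on hardware these run in parallel
    allreduce_instance(r, inputs, outputs)
# every rank now holds the same elementwise sum: [111.0, 222.0]
```

On real hardware the three instances execute concurrently; since each writes a disjoint output buffer, no write conflicts arise.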
Two execution paths

  1. P2P fast path (when peer access is available)

    • 1-stage kernel (latency-bound): each thread vector-loads from all GPUs, accumulates in higher precision, and writes directly to the result.

    • 2-stage kernel (bandwidth-bound): reduce-scatter then all-gather. Uses each GPU's rank_sigs[*] payload as a staging area for partitions.

      Diagram (per GPU r, 2-stage):

      • Stage 1: write reduced partition r into payload of rank_sigs[r].
      • Stage 2: gather partitions from all peers' payloads into out_r.

  2. Naive fallback (no P2P)

    • For GPU r: create a local accumulator A_r and a temporary buffer tmp_r; accumulate in_r into A_r, copy each peer input into tmp_r and accumulate it into A_r, then apply the epilogue and write A_r into out_r.

      Diagram (per GPU r, naive): in_r -> A_r += in_r; for i!=r: in_i -> tmp_r -> A_r += tmp_r; A_r -> out_r
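The 2-stage path above can be modeled in pure Python. In this sketch (hypothetical variable names; not the Mojo API), `payload[r]` stands in for the staging area behind rank_sigs[r]:

```python
# Pure-Python model of the 2-stage P2P path: reduce-scatter, then all-gather.
# payload[r] models the staging area behind rank_sigs[r].

NGPUS = 4
N = 8                                  # elements per input; assumes N % NGPUS == 0
PART = N // NGPUS                      # partition size owned by each rank

inputs = [[float(g * 100 + i) for i in range(N)] for g in range(NGPUS)]
payload = [[0.0] * PART for _ in range(NGPUS)]
outputs = [[0.0] * N for _ in range(NGPUS)]

# Stage 1 (reduce-scatter): rank r reduces partition r of ALL inputs
# into its own staging payload.
for r in range(NGPUS):
    lo = r * PART
    for i in range(PART):
        payload[r][i] = sum(inputs[g][lo + i] for g in range(NGPUS))
# (a barrier separates the stages on real hardware)

# Stage 2 (all-gather): every rank copies each peer's reduced partition
# into the matching slice of its own output.
for r in range(NGPUS):
    for peer in range(NGPUS):
        lo = peer * PART
        outputs[r][lo:lo + PART] = payload[peer]

# All outputs now equal the full elementwise sum across the GPUs.
```

The bandwidth advantage comes from each rank reducing only 1/NGPUS of the data in stage 1, then reading only already-reduced partitions in stage 2, instead of every rank reading every full input as in the 1-stage kernel.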

Notes:

  • Inputs must have identical shape/dtype across GPUs.
  • Signal buffers must be sized at least size_of(Signal) + payload_bytes for the P2P 2-stage path, where payload_bytes equals the input tensor bytecount.
  • The naive path is automatically selected if P2P cannot be enabled.
  • The use_multimem parameter requires P2P access between GPUs.
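As a concrete instance of the sizing rule in the notes, the arithmetic can be sketched as follows. SIGNAL_HEADER_BYTES is a placeholder value; the real figure is size_of(Signal) in Mojo.

```python
# Sketch of the Signal-buffer sizing rule for the P2P 2-stage path.
# SIGNAL_HEADER_BYTES is hypothetical; use size_of(Signal) in Mojo.

SIGNAL_HEADER_BYTES = 1024            # placeholder header size
num_elements = 1 << 20                # elements per input tensor
dtype_bytes = 2                       # e.g. a 16-bit float dtype

payload_bytes = num_elements * dtype_bytes          # input tensor bytecount
required = SIGNAL_HEADER_BYTES + payload_bytes
# Each entry of rank_sigs must point at a buffer of at least `required` bytes.
```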

Parameters:

  • dtype (DType): Data type of the tensor elements.
  • ngpus (Int): Number of GPUs participating in the allreduce.
  • in_layout (TensorLayout): Layout of the input TileTensors.
  • in_origin (Origin): Origin of the input TileTensors.
  • out_layout (TensorLayout): Layout of the output TileTensor.
  • output_lambda (Optional): Elementwise epilogue applied on the device result.
  • pdl_level (PDLLevel): Controls PDL behavior for P2P kernels.
  • use_multimem (Bool): Whether to use multimem mode for improved performance.

Args:

  • input_tensors (InlineArray): Inputs from ALL GPUs as TileTensors.
  • output_tensor (TileTensor): Output for THIS GPU as a TileTensor.
  • rank_sigs (InlineArray): Per-GPU Signal pointers.
  • ctx (DeviceContext): Device context for THIS GPU.
  • _max_num_blocks (Optional): Optional cap on the number of thread blocks (grid size) the kernels launch.
