Mojo function

allreduce

allreduce[dtype: DType, ngpus: Int, in_layout: TensorLayout, in_origin: Origin[mut=in_origin.mut], out_layout: TensorLayout, output_lambda: Optional[elementwise_epilogue_type] = None, pdl_level: PDLLevel = PDLLevel(), *, use_multimem: Bool = False](input_tensors: InlineArray[TileTensor[dtype, in_layout, in_origin], 1 if use_multimem else ngpus], output_tensor: TileTensor[dtype, out_layout, output_tensor.origin, address_space=output_tensor.address_space, linear_idx_type=output_tensor.linear_idx_type, element_size=output_tensor.element_size], rank_sigs: InlineArray[UnsafePointer[Signal, MutAnyOrigin], 8], ctx: DeviceContext, _max_num_blocks: Optional[Int] = None)

Per-device allreduce: one instance per GPU builds its own output.

High-level model

  • Each GPU runs one instance of this function in parallel with the others.
  • Every instance reads all inputs but writes only its own output buffer.
  • A Python-level fence is inserted across the outputs to prevent reordering.

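The contract above can be modeled in plain Python. This is an illustrative sketch of the per-device semantics, not the Mojo API: each "instance" reads every GPU's input but writes only its own output buffer.

```python
# Illustrative model of the per-device allreduce contract (not the Mojo API):
# each of the instances reads ALL inputs, but writes only outputs[rank].

def allreduce_instance(rank, inputs, outputs):
    """One GPU's instance: reduce all inputs into this rank's output."""
    n = len(inputs[0])
    acc = [0.0] * n
    for buf in inputs:          # reads every GPU's input
        for i in range(n):
            acc[i] += buf[i]
    outputs[rank][:] = acc      # writes only its OWN output

inputs = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]   # 3 "GPUs"
outputs = [[0.0, 0.0] for _ in range(3)]
for r in range(3):              # on hardware these run in parallel
    allreduce_instance(r, inputs, outputs)
# every rank now holds the same elementwise sum: [111.0, 222.0]
```

On real hardware the three instances execute concurrently; since each writes a disjoint output buffer, no write conflicts arise.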
Two execution paths

  1. P2P fast path (when peer access is available)

    • 1-stage kernel (latency-bound): each thread vector-loads from all GPUs, accumulates in higher precision, and writes directly to the result.

    • 2-stage kernel (bandwidth-bound): reduce-scatter then all-gather. Uses each GPU's rank_sigs[*] payload as a staging area for partitions.

      Diagram (per GPU r, 2-stage):

      • Stage 1: write reduced partition r into payload of rank_sigs[r].
      • Stage 2: gather partitions from all peers' payloads into out_r.

  2. Naive fallback (no P2P)

    • For GPU r: create a local accumulator A_r and a temporary buffer tmp_r; accumulate in_r into A_r, copy each peer input into tmp_r and accumulate it into A_r, then apply the epilogue and write A_r into out_r.

      Diagram (per GPU r, naive): in_r -> A_r += in_r; for i!=r: in_i -> tmp_r -> A_r += tmp_r; A_r -> out_r
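The 2-stage path above can be modeled in pure Python. In this sketch (hypothetical variable names; not the Mojo API), `payload[r]` stands in for the staging area behind rank_sigs[r]:

```python
# Pure-Python model of the 2-stage P2P path: reduce-scatter, then all-gather.
# payload[r] models the staging area behind rank_sigs[r].

NGPUS = 4
N = 8                                  # elements per input; assumes N % NGPUS == 0
PART = N // NGPUS                      # partition size owned by each rank

inputs = [[float(g * 100 + i) for i in range(N)] for g in range(NGPUS)]
payload = [[0.0] * PART for _ in range(NGPUS)]
outputs = [[0.0] * N for _ in range(NGPUS)]

# Stage 1 (reduce-scatter): rank r reduces partition r of ALL inputs
# into its own staging payload.
for r in range(NGPUS):
    lo = r * PART
    for i in range(PART):
        payload[r][i] = sum(inputs[g][lo + i] for g in range(NGPUS))
# (a barrier separates the stages on real hardware)

# Stage 2 (all-gather): every rank copies each peer's reduced partition
# into the matching slice of its own output.
for r in range(NGPUS):
    for peer in range(NGPUS):
        lo = peer * PART
        outputs[r][lo:lo + PART] = payload[peer]

# All outputs now equal the full elementwise sum across the GPUs.
```

The bandwidth advantage comes from each rank reducing only 1/NGPUS of the data in stage 1, then reading only already-reduced partitions in stage 2, instead of every rank reading every full input as in the 1-stage kernel.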

Notes:

  • Inputs must have identical shape/dtype across GPUs.
  • Signal buffers must be sized at least size_of(Signal) + payload_bytes for the P2P 2-stage path, where payload_bytes equals the input tensor bytecount.
  • The naive path is automatically selected if P2P cannot be enabled.
  • The use_multimem parameter requires P2P access between GPUs.
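As a concrete instance of the sizing rule in the notes, the arithmetic can be sketched as follows. SIGNAL_HEADER_BYTES is a placeholder value; the real figure is size_of(Signal) in Mojo.

```python
# Sketch of the Signal-buffer sizing rule for the P2P 2-stage path.
# SIGNAL_HEADER_BYTES is hypothetical; use size_of(Signal) in Mojo.

SIGNAL_HEADER_BYTES = 1024            # placeholder header size
num_elements = 1 << 20                # elements per input tensor
dtype_bytes = 2                       # e.g. a 16-bit float dtype

payload_bytes = num_elements * dtype_bytes          # input tensor bytecount
required = SIGNAL_HEADER_BYTES + payload_bytes
# Each entry of rank_sigs must point at a buffer of at least `required` bytes.
```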

Parameters:

  • dtype (DType): Data type of the tensor elements.
  • ngpus (Int): Number of GPUs participating in the allreduce.
  • in_layout (TensorLayout): Layout of the input TileTensors.
  • in_origin (Origin): Origin of the input TileTensors.
  • out_layout (TensorLayout): Layout of the output TileTensor.
  • output_lambda (Optional): Elementwise epilogue applied on the device result.
  • pdl_level (PDLLevel): Controls PDL behavior for P2P kernels.
  • use_multimem (Bool): Whether to use multimem mode for improved performance.

Args:

  • input_tensors (InlineArray): Inputs from ALL GPUs as TileTensors.
  • output_tensor (TileTensor): Output for THIS GPU as a TileTensor.
  • rank_sigs (InlineArray): Per-GPU Signal pointers.
  • ctx (DeviceContext): Device context for THIS GPU.
  • _max_num_blocks (Optional): Optional cap on the number of thread blocks (grid size) the kernels launch.
