Mojo function

reducescatter

reducescatter[dtype: DType, rank: Int, ngpus: Int, pdl_level: PDLLevel = PDLLevel(), *, use_multimem: Bool = False](input_buffers: InlineArray[NDBuffer[dtype, rank, MutAnyOrigin], 1 if use_multimem else ngpus], output_buffer: NDBuffer[dtype, rank, MutAnyOrigin], rank_sigs: InlineArray[UnsafePointer[Signal, MutAnyOrigin], 8], ctx: DeviceContext, _max_num_blocks: Optional[Int] = None)

Per-device reducescatter operation.

Performs a reduce-scatter across multiple GPUs: each GPU reduces its assigned partition from all input buffers and writes the result to its output buffer.

This is equivalent to the reduce-scatter phase of the 2-stage allreduce algorithm.
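To make the semantics concrete, here is a minimal, illustrative Python sketch (not the Mojo API itself; the function and variable names are hypothetical) that simulates what each GPU computes: it sums its assigned 1/ngpus partition across all input buffers.

```python
def reduce_scatter(input_buffers, this_rank):
    """Simulate one GPU's reduce-scatter result: the element-wise sum
    of this rank's partition taken from every GPU's input buffer."""
    ngpus = len(input_buffers)
    n = len(input_buffers[0])
    assert n % ngpus == 0, "input size must divide evenly across GPUs"
    chunk = n // ngpus
    start, end = this_rank * chunk, (this_rank + 1) * chunk
    # Reduce (sum) the assigned partition across all GPUs' buffers.
    return [sum(buf[i] for buf in input_buffers) for i in range(start, end)]

# Example: 2 GPUs, each holding a 4-element input buffer.
bufs = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(reduce_scatter(bufs, 0))  # GPU 0's partition: [11, 22]
print(reduce_scatter(bufs, 1))  # GPU 1's partition: [33, 44]
```

Note that each GPU's output is 1/ngpus the size of an input buffer, matching the size requirement on `output_buffer` below.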

Parameters:

  • dtype (DType): Data type of the tensor elements.
  • rank (Int): Number of dimensions in tensors.
  • ngpus (Int): Number of GPUs participating.
  • pdl_level (PDLLevel): Control PDL behavior for the kernel.
  • use_multimem (Bool): If True, use multimem optimization (reserved for future use).

Args:

  • input_buffers (InlineArray): Input buffers from ALL GPUs (peer access required). When use_multimem is False (default), expects ngpus buffers. When use_multimem is True, expects a single buffer.
  • output_buffer (NDBuffer): Output buffer for THIS GPU's partition of reduced data. Size should be approximately 1/ngpus of the input size.
  • rank_sigs (InlineArray): Signal pointers for synchronization between GPUs.
  • ctx (DeviceContext): Device context for THIS GPU.
  • _max_num_blocks (Optional): Optional maximum number of thread blocks to launch. If not specified, uses MAX_NUM_BLOCKS_UPPER_BOUND.

Raises:

  • Error: If P2P access is not available between GPUs.
  • Error: If the input buffer size is not a multiple of the SIMD width.
