Mojo function

reducescatter

reducescatter[dtype: DType, ngpus: Int, in_layout: TensorLayout, in_origin: Origin[mut=in_origin.mut], output_lambda: Optional[def[dtype: DType, width: Int, *, alignment: Int, ?, .element_types`0x3: KGENParamList[CoordLike]](Coord[element_types], SIMD[dtype, width]) capturing -> None] = None, *, axis: Int = 0, use_multimem: Bool = False](input_buffers: InlineArray[TileTensor[dtype, in_layout, in_origin], 1 if use_multimem else ngpus], output_buffer: TileTensor[dtype, output_buffer.LayoutType, output_buffer.origin, address_space=output_buffer.address_space, linear_idx_type=output_buffer.linear_idx_type, element_size=output_buffer.element_size], rank_sigs: InlineArray[UnsafePointer[Signal, MutAnyOrigin], 8], ctx: DeviceContext, _max_num_blocks: Optional[Int] = None)

Per-device reducescatter operation with axis-aware scatter.

Performs a reduce-scatter across multiple GPUs: each GPU reduces its assigned partition from all input buffers and writes the result to its output buffer.
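The collective's data movement can be sketched in plain Python (this is an illustration of reduce-scatter semantics, not the Mojo API; the function name and sum reduction are assumptions for the sketch):

```python
def reduce_scatter(inputs, ngpus):
    """Illustrative reduce-scatter: `inputs` is a list of `ngpus` flat
    buffers of equal length (divisible by ngpus). After the collective,
    rank r holds the element-wise sum of partition r from every input."""
    n = len(inputs[0])
    part = n // ngpus
    outputs = []
    for rank in range(ngpus):
        lo, hi = rank * part, (rank + 1) * part
        # Rank `rank` reduces only its assigned partition across all inputs.
        outputs.append([sum(buf[i] for buf in inputs) for i in range(lo, hi)])
    return outputs
```

For example, with two ranks holding `[1, 2, 3, 4]` and `[10, 20, 30, 40]`, rank 0 ends up with `[11, 22]` and rank 1 with `[33, 44]`: each rank stores one reduced partition rather than the full reduced buffer.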

Parameters:

  • dtype (DType): Data type of the tensor elements.
  • ngpus (Int): Number of GPUs participating.
  • in_layout (TensorLayout): Layout of the input TileTensors.
  • in_origin (Origin): Origin of the input TileTensors.
  • output_lambda (Optional): Optional elementwise epilogue function. If not provided, reduced values are stored directly to output_buffer.
  • axis (Int): Scatter axis. 0 to scatter along rows (default), 1 to scatter along columns. Requires 2D row-major inputs when axis >= 0.
  • use_multimem (Bool): If True, use hardware-accelerated multimem reduction. Currently only valid with 1D input. TODO(KERN-2526): generalize.
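How the `axis` parameter divides a 2D row-major tensor among ranks can be sketched as follows (a hypothetical helper for illustration only; the name `scatter_slice` is not part of the Mojo API):

```python
def scatter_slice(rows, cols, ngpus, rank, axis):
    """Return the (row_range, col_range) that `rank` owns after the
    scatter, assuming the scattered dimension divides evenly by ngpus."""
    if axis == 0:
        part = rows // ngpus  # scatter along rows (default)
        return (range(rank * part, (rank + 1) * part), range(cols))
    else:
        part = cols // ngpus  # scatter along columns
        return (range(rows), range(rank * part, (rank + 1) * part))
```

With `axis=0`, each rank's output is a contiguous band of rows of the reduced tensor; with `axis=1`, each rank owns a contiguous band of columns, which is why 2D row-major inputs are required.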

Args:

  • input_buffers (InlineArray): Input TileTensors from all GPUs (peer access required). When use_multimem is True, a single multimem-mapped TileTensor.
  • output_buffer (TileTensor): Output TileTensor for this GPU's partition of the reduced data.
  • rank_sigs (InlineArray): Signal pointers for synchronization between GPUs.
  • ctx (DeviceContext): Device context for this GPU.
  • _max_num_blocks (Optional): Optional maximum number of thread blocks to launch. If not specified, uses MAX_NUM_BLOCKS_UPPER_BOUND.

Raises:

  • Error: If P2P access is not available between GPUs.
  • Error: If the input buffer size is not a multiple of the SIMD width.