Mojo function
reducescatter
reducescatter[dtype: DType, rank: Int, ngpus: Int, pdl_level: PDLLevel = PDLLevel(), *, use_multimem: Bool = False](input_buffers: InlineArray[NDBuffer[dtype, rank, MutAnyOrigin], 1 if use_multimem else ngpus], output_buffer: NDBuffer[dtype, rank, MutAnyOrigin], rank_sigs: InlineArray[UnsafePointer[Signal, MutAnyOrigin], 8], ctx: DeviceContext, _max_num_blocks: Optional[Int] = None)
Per-device reducescatter operation.
Performs a reduce-scatter across multiple GPUs: each GPU reduces its assigned partition from all input buffers and writes the result to its output buffer.
This is equivalent to the reduce-scatter phase of the 2-stage allreduce algorithm.
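To make the partitioning concrete, here is a hypothetical host-side reference (not the GPU kernel itself), assuming the conventional reduce-scatter layout of ngpus equal, contiguous partitions: rank i's output is the elementwise sum of every GPU's input, restricted to the i-th partition.

```mojo
# Hypothetical CPU reference for reduce-scatter semantics; assumes ngpus
# equal, contiguous partitions of a 1-D input.
fn reference_reducescatter(
    inputs: List[List[Float32]], my_rank: Int, ngpus: Int
) -> List[Float32]:
    var part = len(inputs[0]) // ngpus  # elements per partition
    var start = my_rank * part          # this rank's slice of every input
    var out = List[Float32]()
    for j in range(start, start + part):
        var acc: Float32 = 0
        # Reduce element j across the inputs contributed by all GPUs.
        for g in range(ngpus):
            acc += inputs[g][j]
        out.append(acc)
    return out^
```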
Parameters:
- dtype (DType): Data type of the tensor elements.
- rank (Int): Number of dimensions in the tensors.
- ngpus (Int): Number of GPUs participating.
- pdl_level (PDLLevel): Controls PDL behavior for the kernel.
- use_multimem (Bool): If True, use the multimem optimization (reserved for future use).
Args:
- input_buffers (InlineArray): Input buffers from ALL GPUs (peer access required). When use_multimem is False (default), expects ngpus buffers. When use_multimem is True, expects a single buffer.
- output_buffer (NDBuffer): Output buffer for THIS GPU's partition of the reduced data. Size should be approximately 1/ngpus of the input size.
- rank_sigs (InlineArray): Signal pointers for synchronization between GPUs.
- ctx (DeviceContext): Device context for THIS GPU.
- _max_num_blocks (Optional): Optional maximum number of thread blocks to launch. If not specified, uses MAX_NUM_BLOCKS_UPPER_BOUND.
Raises:
- Error: If P2P access is not available between GPUs.
- Error: If input buffer size is not a multiple of SIMD width.
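Example:

The following is a minimal call-shape sketch based on the signature above. The import paths, the surrounding buffer and signal setup, and the wrapper name scatter_partition are assumptions for illustration rather than verified library code; in practice each GPU runs this with its own DeviceContext and output buffer after P2P access has been enabled and the Signal buffers have been allocated and shared across ranks.

```mojo
# Call-shape sketch only. Import paths below are assumptions; consult the
# gpu.comm module index for the actual locations.
from buffer import NDBuffer
from collections import InlineArray
from gpu.host import DeviceContext
from memory import UnsafePointer
from gpu.comm.allreduce import Signal              # assumed location of Signal
from gpu.comm.reducescatter import reducescatter   # assumed module path

alias dtype = DType.bfloat16
alias rank = 1
alias ngpus = 4

fn scatter_partition(
    inputs: InlineArray[NDBuffer[dtype, rank, MutAnyOrigin], ngpus],
    output: NDBuffer[dtype, rank, MutAnyOrigin],
    signals: InlineArray[UnsafePointer[Signal, MutAnyOrigin], 8],
    ctx: DeviceContext,
) raises:
    # Each participating GPU invokes this with its own `ctx` and `output`,
    # while `inputs` and `signals` reference every peer's buffers
    # (P2P access between the GPUs is required).
    reducescatter[dtype, rank, ngpus](inputs, output, signals, ctx)
```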