Mojo function
reducescatter
reducescatter[dtype: DType, ngpus: Int, in_layout: TensorLayout, in_origin: Origin[mut=in_origin.mut], output_lambda: Optional[def[dtype: DType, width: Int, *, alignment: Int, ?, .element_types`0x3: KGENParamList[CoordLike]](Coord[element_types], SIMD[dtype, width]) capturing -> None] = None, *, axis: Int = 0, use_multimem: Bool = False](input_buffers: InlineArray[TileTensor[dtype, in_layout, in_origin], 1 if use_multimem else ngpus], output_buffer: TileTensor[dtype, output_buffer.LayoutType, output_buffer.origin, address_space=output_buffer.address_space, linear_idx_type=output_buffer.linear_idx_type, element_size=output_buffer.element_size], rank_sigs: InlineArray[UnsafePointer[Signal, MutAnyOrigin], 8], ctx: DeviceContext, _max_num_blocks: Optional[Int] = None)
Per-device reducescatter operation with axis-aware scatter.
Performs a reduce-scatter across multiple GPUs: each GPU reduces its assigned partition from all input buffers and writes the result to its output buffer.
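To make the data movement concrete, here is a minimal NumPy sketch of reduce-scatter semantics (not the Mojo API): all per-GPU inputs are summed elementwise, and the result is split along the scatter axis, one partition per GPU. The function name `reducescatter_ref` is hypothetical and for illustration only.

```python
import numpy as np

def reducescatter_ref(inputs, axis=0):
    """Reference semantics for reduce-scatter: elementwise-sum all
    per-GPU inputs, then split the result along `axis`, giving each
    GPU one partition. Illustrative only; not the Mojo kernel."""
    ngpus = len(inputs)
    reduced = np.sum(inputs, axis=0)  # elementwise reduction across GPUs
    return np.split(reduced, ngpus, axis=axis)

# Two "GPUs", each holding an identical 4x2 tensor.
inputs = [np.arange(8).reshape(4, 2) for _ in range(2)]
# axis=0: each GPU receives a 2x2 slice of rows of the summed tensor.
outs = reducescatter_ref(inputs, axis=0)
```

With `axis=1` the same summed tensor would instead be split by columns, matching the axis-aware scatter described above.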
Parameters:
- dtype (DType): Data type of tensor elements.
- ngpus (Int): Number of GPUs participating.
- in_layout (TensorLayout): Layout of the input TileTensors.
- in_origin (Origin): Origin of the input TileTensors.
- output_lambda (Optional): Optional elementwise epilogue function. If not provided, reduced values are stored directly to output_buffer.
- axis (Int): Scatter axis: 0 to scatter along rows (default), 1 to scatter along columns. Requires 2D row-major inputs when axis >= 0.
- use_multimem (Bool): If True, use hardware-accelerated multimem reduction. Currently only valid with 1D input. TODO(KERN-2526): generalize.
Args:
- input_buffers (InlineArray): Input TileTensors from all GPUs (peer access required). When use_multimem is True, a single multimem-mapped TileTensor.
- output_buffer (TileTensor): Output TileTensor for this GPU's partition of the reduced data.
- rank_sigs (InlineArray): Signal pointers for synchronization between GPUs.
- ctx (DeviceContext): Device context for this GPU.
- _max_num_blocks (Optional): Optional maximum number of thread blocks to launch. If not specified, uses MAX_NUM_BLOCKS_UPPER_BOUND.
Raises:
- Error: If P2P access is not available between GPUs.
- Error: If the input buffer size is not a multiple of the SIMD width.