Mojo function

reducescatter

reducescatter[dtype: DType, ngpus: Int, in_layout: TensorLayout, in_origin: Origin[mut=in_origin.mut], output_lambda: Optional[def[dtype: DType, width: Int, *, alignment: Int, ?, .element_types`0x3: KGENParamList[CoordLike]](Coord[element_types], SIMD[dtype, width]) capturing -> None] = None, *, axis: Int = 0, use_multimem: Bool = False](input_buffers: InlineArray[TileTensor[dtype, in_layout, in_origin], 1 if use_multimem else ngpus], output_buffer: TileTensor[dtype, output_buffer.LayoutType, output_buffer.origin, address_space=output_buffer.address_space, linear_idx_type=output_buffer.linear_idx_type, element_size=output_buffer.element_size], rank_sigs: InlineArray[UnsafePointer[Signal, MutAnyOrigin], 8], ctx: DeviceContext, _max_num_blocks: Optional[Int] = None)

Per-device reducescatter operation with axis-aware scatter.

Performs a reduce-scatter across multiple GPUs: each GPU reduces its assigned partition from all input buffers and writes the result to its output buffer.
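The collective's data movement can be sketched in plain Python (this is an illustration of reduce-scatter semantics, not the Mojo API; the function name and sum reduction are assumptions for the sketch):

```python
def reduce_scatter(inputs, ngpus):
    """Illustrative reduce-scatter: `inputs` is a list of `ngpus` flat
    buffers of equal length (divisible by ngpus). After the collective,
    rank r holds the element-wise sum of partition r from every input."""
    n = len(inputs[0])
    part = n // ngpus
    outputs = []
    for rank in range(ngpus):
        lo, hi = rank * part, (rank + 1) * part
        # Rank `rank` reduces only its assigned partition across all inputs.
        outputs.append([sum(buf[i] for buf in inputs) for i in range(lo, hi)])
    return outputs
```

For example, with two ranks holding `[1, 2, 3, 4]` and `[10, 20, 30, 40]`, rank 0 ends up with `[11, 22]` and rank 1 with `[33, 44]`: each rank stores one reduced partition rather than the full reduced buffer.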

Parameters:

  • dtype (DType): Data type of the tensor elements.
  • ngpus (Int): Number of GPUs participating.
  • in_layout (TensorLayout): Layout of the input TileTensors.
  • in_origin (Origin): Origin of the input TileTensors.
  • output_lambda (Optional): Optional elementwise epilogue function. If not provided, reduced values are stored directly to output_buffer.
  • axis (Int): Scatter axis. 0 to scatter along rows (default), 1 to scatter along columns. Requires 2D row-major inputs when axis >= 0.
  • use_multimem (Bool): If True, use hardware-accelerated multimem reduction. Currently only valid with 1D input. TODO(KERN-2526): generalize.
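How the `axis` parameter divides a 2D row-major tensor among ranks can be sketched as follows (a hypothetical helper for illustration only; the name `scatter_slice` is not part of the Mojo API):

```python
def scatter_slice(rows, cols, ngpus, rank, axis):
    """Return the (row_range, col_range) that `rank` owns after the
    scatter, assuming the scattered dimension divides evenly by ngpus."""
    if axis == 0:
        part = rows // ngpus  # scatter along rows (default)
        return (range(rank * part, (rank + 1) * part), range(cols))
    else:
        part = cols // ngpus  # scatter along columns
        return (range(rows), range(rank * part, (rank + 1) * part))
```

With `axis=0`, each rank's output is a contiguous band of rows of the reduced tensor; with `axis=1`, each rank owns a contiguous band of columns, which is why 2D row-major inputs are required.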

Args:

  • input_buffers (InlineArray): Input TileTensors from all GPUs (peer access required). When use_multimem is True, a single multimem-mapped TileTensor.
  • output_buffer (TileTensor): Output TileTensor for this GPU's partition of the reduced data.
  • rank_sigs (InlineArray): Signal pointers for synchronization between GPUs.
  • ctx (DeviceContext): Device context for this GPU.
  • _max_num_blocks (Optional): Optional maximum number of thread blocks to launch. If not specified, uses MAX_NUM_BLOCKS_UPPER_BOUND.

Raises:

  • Error: If P2P access is not available between GPUs.
  • Error: If the input buffer size is not a multiple of the SIMD width.