Mojo function

allreduce

allreduce[dtype: DType, in_layout: TensorLayout, in_origin: ImmutOrigin, rank_sigs_origin: MutOrigin, //, ngpus: Int, output_lambda: Optional[def[dtype: DType, width: Int, *, alignment: Int, ?, .element_types`0x2: KGENParamList[CoordLike]](Coord[element_types], SIMD[dtype, width]) capturing -> None] = None, pdl_level: PDLLevel = PDLLevel(), *, use_multimem: Bool = False](input_tensors: InlineArray[TileTensor[dtype, in_layout, in_origin], 1 if use_multimem else ngpus], output_tensor: TileTensor[dtype, output_tensor.LayoutType, output_tensor.origin, address_space=output_tensor.address_space, linear_idx_type=output_tensor.linear_idx_type, element_size=output_tensor.element_size], rank_sigs: InlineArray[UnsafePointer[Signal, rank_sigs_origin], 8], ctx: DeviceContext, _max_num_blocks: Optional[Int] = None)

Per-GPU allreduce for use in multi-threaded contexts.

Currently requires a prior single-threaded call to `init_comms`, as a thread-safe version is not yet implemented.
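A minimal call-shape sketch inferred solely from the signature above; the argument construction (tensor allocation, `init_comms` setup, signal buffers) is assumed and not part of the documented API shown here:

```mojo
# Hypothetical usage sketch: only the call shape is taken from the
# signature; how these arguments are constructed is an assumption.
allreduce[ngpus=4](
    input_tensors,  # InlineArray of per-GPU TileTensors
                    # (length ngpus, or 1 when use_multimem=True)
    output_tensor,  # destination TileTensor for the reduced result
    rank_sigs,      # InlineArray of 8 UnsafePointer[Signal] for
                    # cross-GPU synchronization
    ctx,            # DeviceContext for the calling GPU
)
```

Per the note above, `init_comms` must have completed on a single thread before any thread invokes this function.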