
Mojo function

allreduce

def allreduce[dtype: DType, in_layout: TensorLayout, in_origin: ImmutOrigin, rank_sigs_origin: MutOrigin, //, ngpus: Int, output_lambda: Optional[def[dtype: DType, width: SIMDSize, *, alignment: Int](Coord[_], SIMD[dtype, width]) capturing -> None] = None, pdl_level: PDLLevel = PDLLevel(), *, use_multimem: Bool = False](input_tensors: InlineArray[TileTensor[dtype, in_layout, in_origin], Int(1) if use_multimem else ngpus], output_tensor: TileTensor[dtype, Storage=output_tensor.Storage, address_space=output_tensor.address_space, linear_idx_type=output_tensor.linear_idx_type, element_size=output_tensor.element_size], rank_sigs: InlineArray[UnsafePointer[Signal, rank_sigs_origin], Int(8)], ctx: DeviceContext, _max_num_blocks: Optional[Int] = None)

Per-GPU allreduce for use in multi-threaded contexts.

Currently requires a prior single-threaded call to init_comms, because a thread-safe version is not yet implemented.
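The calling pattern described above (one single-threaded setup step, then one allreduce call per GPU from its own thread) can be illustrated with a small CPU-only simulation. This is a hedged sketch in plain Python, not the Mojo API: `Comms`, `init_comms_sim`, `allreduce_sim`, and `run_demo` are hypothetical names invented for illustration, host threads stand in for GPUs, and an elementwise sum is assumed as the reduction.

```python
import threading


class Comms:
    """Hypothetical shared state standing in for what a setup step would create."""

    def __init__(self, ngpus):
        self.ngpus = ngpus
        self.barrier = threading.Barrier(ngpus)  # synchronizes all ranks
        self.inputs = [None] * ngpus             # one input slot per rank


def init_comms_sim(ngpus):
    # Assumption: runs exactly once, on a single thread, before any worker starts.
    return Comms(ngpus)


def allreduce_sim(comms, rank, local_values):
    # Publish this rank's input so every other rank can read it.
    comms.inputs[rank] = list(local_values)
    comms.barrier.wait()  # all inputs are now visible

    # Each rank reduces independently (assumed reduction: elementwise sum).
    result = [sum(column) for column in zip(*comms.inputs)]

    comms.barrier.wait()  # no rank may reuse the input slots until all have read
    return result


def run_demo(ngpus=4, length=3):
    comms = init_comms_sim(ngpus)  # single-threaded setup first
    outputs = [None] * ngpus

    def worker(rank):
        local = [float(rank + 1)] * length
        outputs[rank] = allreduce_sim(comms, rank, local)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(ngpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs
```

Calling `run_demo()` leaves every simulated rank holding the same reduced buffer. The ordering is the point of the sketch: the shared state is built once before any worker thread exists, and each thread then makes its own per-rank call.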