
Mojo function

broadcast_2stage

def broadcast_2stage[dtype: DType, in_layout: TensorLayout, in_origin: Origin[mut=in_origin.mut], //, ngpus: Int, pdl_level: PDLLevel](input_tensor: TileTensor[dtype, in_layout, in_origin], output_tensor: TileTensor[dtype, in_layout], rank_sigs: InlineArray[UnsafePointer[Signal, MutAnyOrigin], Int(8)], ctx: DeviceContext, root: Int, _max_num_blocks: Optional[Int] = None)

Two-stage broadcast: scatter from root, then allgather among all GPUs.

Note: This path is only used with 3+ GPUs. With 2 GPUs, broadcast uses the simpler 1-stage path for better performance.

This algorithm achieves better bandwidth than a simple pull broadcast by splitting the transfer into two stages:

  1. Stage 1 (Scatter): Each GPU reads 1/ngpus of the data from root and writes to its payload buffer, utilizing root's outbound GPU link bandwidth.
  2. Stage 2 (Allgather): All GPUs gather from each other in parallel, with each GPU reading (ngpus-1) chunks from other GPUs' payloads.

All GPUs (including root) participate uniformly in both stages, which better utilizes root's GPU link bandwidth and simplifies partitioning.
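The two stages can be illustrated with a small host-side model. This is a sketch in plain Python, not the Mojo kernel: lists stand in for GPU buffers, the function name `simulate_broadcast_2stage` is hypothetical, and the ceiling-division partitioning is an assumption about how uneven sizes might be split.

```python
def simulate_broadcast_2stage(data, ngpus, root):
    """Model scatter-then-allgather with Python lists standing in for GPU buffers."""
    n = len(data)
    # Only the root holds valid input before the broadcast.
    inputs = [None] * ngpus
    inputs[root] = list(data)

    # Assumed partitioning: contiguous chunks of ceil(n / ngpus) elements.
    chunk = -(-n // ngpus)
    bounds = [(min(r * chunk, n), min((r + 1) * chunk, n)) for r in range(ngpus)]

    # Stage 1 (scatter): every rank, root included, reads its own slice
    # of the root's input and stores it in its payload buffer.
    payloads = [inputs[root][lo:hi] for lo, hi in bounds]

    # Stage 2 (allgather): every rank assembles the full output from its
    # own payload plus the (ngpus - 1) payloads held by its peers.
    outputs = []
    for _rank in range(ngpus):
        out = [None] * n
        for peer in range(ngpus):
            lo, hi = bounds[peer]
            out[lo:hi] = payloads[peer]
        outputs.append(out)
    return payloads, outputs


payloads, outputs = simulate_broadcast_2stage(list(range(10)), ngpus=4, root=1)
print([len(p) for p in payloads])
print(all(out == list(range(10)) for out in outputs))
```

In this model the root's data is read exactly once in total during stage 1 (each rank reads a disjoint slice), and the remaining traffic in stage 2 is spread across all peers.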

IMPORTANT: Each signal buffer must be sized to hold at least the following number of bytes:

  size_of(Signal) + (num_elements / ngpus) * size_of(dtype)

The second term is the payload space needed for each GPU's chunk.
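As a worked example of the sizing rule, the following Python sketch computes the minimum buffer size. The helper name `min_signal_buffer_bytes` is hypothetical, and the `signal_bytes` value used below is a placeholder: the real value comes from `size_of(Signal)` in Mojo. Rounding the per-GPU chunk up is a conservative assumption for element counts that do not divide evenly by `ngpus`.

```python
def min_signal_buffer_bytes(num_elements, ngpus, dtype_bytes, signal_bytes):
    """Minimum bytes per signal buffer: Signal header plus one GPU's chunk."""
    # Assumption: round the chunk up so an uneven split still fits.
    per_gpu_elements = -(-num_elements // ngpus)
    return signal_bytes + per_gpu_elements * dtype_bytes


# 1 Mi elements of a 2-byte dtype across 4 GPUs, with a placeholder
# signal_bytes of 1024 (NOT the real size_of(Signal)).
print(min_signal_buffer_bytes(1 << 20, ngpus=4, dtype_bytes=2, signal_bytes=1024))
# → 525312
```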

Parameters:

  • dtype (DType): Data type of the tensor elements.
  • in_layout (TensorLayout): Layout of the input TileTensor.
  • in_origin (Origin[mut=in_origin.mut]): Origin of the input TileTensor.
  • ngpus (Int): Number of GPUs participating.
  • pdl_level (PDLLevel): Controls PDL behavior for the kernel.

Args: