Mojo function

broadcast_multimem_kernel

broadcast_multimem_kernel[dtype: DType, Layout: TensorLayout, BLOCK_SIZE: Int, ngpus: Int, simd_width: Int = simd_width_of[dtype, get_gpu_target()](), pdl_level: PDLLevel = PDLLevel()](output: TileTensor[dtype, Layout, MutAnyOrigin], input: TileTensor[dtype, Layout, ImmutAnyOrigin], rank_sigs: InlineArray[UnsafePointer[Signal, MutAnyOrigin], 8], my_rank: Int, root: Int)

Broadcast kernel using multimem.st for multicast writes.

The root GPU writes to a multicast address, and the data becomes visible on all GPUs. Only the root performs the stores; the other GPUs participate solely in the barriers that order the stores against the reads.
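The control flow above can be modeled on the CPU. The sketch below is a conceptual illustration only, not the actual Mojo/PTX kernel: Python threads stand in for GPU ranks, a shared list stands in for the multicast address that `multimem.st` writes through, and `threading.Barrier` stands in for the cross-GPU signal barrier driven by `rank_sigs`. All names (`broadcast`, `multicast_buf`, `NGPUS`, `ROOT`) are hypothetical.

```python
import threading

NGPUS = 4   # models the `ngpus` parameter
ROOT = 0    # models the `root` argument

# In the real kernel, one multimem.st store lands on every GPU's copy;
# here a single shared buffer models that multicast visibility.
multicast_buf = [0] * 8
barrier = threading.Barrier(NGPUS)  # models the Signal-based barrier
results = [None] * NGPUS

def broadcast(rank, payload):
    # Only the root performs stores; the multicast write is seen by all ranks.
    if rank == ROOT:
        for i, v in enumerate(payload):
            multicast_buf[i] = v
    # Every rank, root included, joins the barrier so non-root ranks
    # do not read before the root's stores have completed.
    barrier.wait()
    results[rank] = list(multicast_buf)

payload = list(range(8))
threads = [threading.Thread(target=broadcast, args=(r, payload))
           for r in range(NGPUS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every rank now observes the root's data.
print(all(r == payload for r in results))
```

The key property the sketch preserves is that the barrier is collective: skipping it on non-root ranks would let them read the buffer before the root's stores are visible.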
