Mojo function

all_reduce_p2p_kernel

all_reduce_p2p_kernel[type: DType, rank: Int, ngpus: Int](result: UnsafePointer[SIMD[type, 1]], src_bufs: StaticTuple[NDBuffer[type, rank], ngpus], rank_sigs: StaticTuple[UnsafePointer[Signal], 8], my_rank: Int, num_elements: Int)

Kernel implementing all-reduce using peer-to-peer access between GPUs.

Arguments:

  • result: Output buffer for reduced values.
  • src_bufs: Input buffers from all GPUs.
  • rank_sigs: Signal pointers for synchronization.
  • my_rank: Current GPU rank.
  • num_elements: Number of elements to reduce.

The kernel uses P2P access to read directly from the other GPUs' buffers and perform the reduction. It synchronizes with multi_gpu_barrier before and after the reduction.
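
The barrier, peer-read, barrier pattern described above can be modeled on the host. The following is a minimal Python sketch, not the Mojo kernel: the name all_reduce_p2p_model is hypothetical, threading.Barrier stands in for multi_gpu_barrier, one thread stands in for each GPU, and the reduction is assumed to be a sum (the source does not name the operator).

```python
import threading


def all_reduce_p2p_model(src_bufs):
    """Host-side model: every rank sums the input buffers of all ranks.

    Hypothetical helper for illustration only; a sum reduction is assumed.
    """
    ngpus = len(src_bufs)
    num_elements = len(src_bufs[0])
    results = [[0] * num_elements for _ in range(ngpus)]
    barrier = threading.Barrier(ngpus)  # stand-in for multi_gpu_barrier

    def kernel(my_rank):
        # First barrier: all inputs are in place before any peer read.
        barrier.wait()
        for i in range(num_elements):
            acc = 0
            for peer in range(ngpus):
                acc += src_bufs[peer][i]  # direct read of a peer's buffer
            results[my_rank][i] = acc
        # Second barrier: no rank exits while peers may still read its buffer.
        barrier.wait()

    threads = [threading.Thread(target=kernel, args=(r,)) for r in range(ngpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling all_reduce_p2p_model([[1, 2, 3], [10, 20, 30]]) gives every rank the same reduced buffer. The second barrier matters because each rank's input stays readable by its peers until all of them have finished.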

Parameters:

  • type (DType): Data type of tensor elements.
  • rank (Int): Number of dimensions in tensors.
  • ngpus (Int): Number of GPUs participating.