Mojo function
all_reduce_p2p_kernel
all_reduce_p2p_kernel[type: DType, rank: Int, ngpus: Int](result: UnsafePointer[SIMD[type, 1]], src_bufs: StaticTuple[NDBuffer[type, rank], ngpus], rank_sigs: StaticTuple[UnsafePointer[Signal], 8], my_rank: Int, num_elements: Int)
Kernel implementing all-reduce using peer-to-peer access between GPUs.
Arguments:
- result: Output buffer for reduced values.
- src_bufs: Input buffers from all GPUs.
- rank_sigs: Signal pointers for synchronization.
- my_rank: Current GPU rank.
- num_elements: Number of elements to reduce.
Uses P2P access to directly read from other GPU buffers and perform reduction. Synchronizes using multi_gpu_barrier before and after reduction.
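The data flow described above can be modeled on the host. The following is a minimal sketch in Python, not the Mojo kernel itself. It assumes each rank sums every element across all ranks' buffers into its own result, and it uses `threading.Barrier` as a stand-in for `multi_gpu_barrier`. The name `all_reduce_p2p_model` and the use of one thread per rank are hypothetical and chosen only for illustration.

```python
import threading


def all_reduce_p2p_model(src_bufs):
    """Host-side model of a P2P all-reduce (sum); not the actual GPU kernel.

    src_bufs[r] plays the role of rank r's input buffer. Every rank can read
    every other rank's buffer directly, which models peer-to-peer access.
    """
    ngpus = len(src_bufs)
    num_elements = len(src_bufs[0])
    # One result buffer per rank, as each GPU would own its own output.
    results = [[0.0] * num_elements for _ in range(ngpus)]
    # Assumption: a plain thread barrier stands in for multi_gpu_barrier.
    barrier = threading.Barrier(ngpus)

    def kernel(my_rank):
        # Barrier before the reduction: all inputs are ready before any peer read.
        barrier.wait()
        for i in range(num_elements):
            acc = 0.0
            for peer in range(ngpus):
                acc += src_bufs[peer][i]  # direct read from a peer's buffer
            results[my_rank][i] = acc
        # Barrier after the reduction: no rank finishes while peers may
        # still be reading its input buffer.
        barrier.wait()

    threads = [threading.Thread(target=kernel, args=(r,)) for r in range(ngpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    out = all_reduce_p2p_model([[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]])
    print(out)  # every rank ends up with the same element-wise sum
```

In this model every rank finishes with an identical copy of the reduced data. The second barrier matters as much as the first: without it, a rank could return and have its input buffer reused while a slower peer is still reading from it.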
Parameters:
- type (DType): Data type of tensor elements.
- rank (Int): Number of dimensions in tensors.
- ngpus (Int): Number of GPUs participating.