Mojo module

all_reduce

Multi-GPU allreduce implementation for efficient tensor reduction across GPUs.

This module provides an optimized implementation of allreduce operations across multiple GPUs, supporting both peer-to-peer (P2P) and non-P2P communication patterns. The implementation automatically selects between two approaches based on hardware capabilities:

  1. P2P-based implementation (when P2P access is available):

    • Uses direct GPU-to-GPU memory access for better performance
    • Implements both single-stage and two-stage algorithms:
      • Single-stage for latency-bound transfers (small tensors)
      • Two-stage (reduce-scatter + all-gather) for bandwidth-bound transfers (large tensors)
    • Optimized for NVLink bandwidth utilization
    • Uses vectorized memory access and higher precision accumulation
  2. Non-P2P fallback implementation:

    • Copies data through host memory when direct GPU access isn't possible
    • Simple but functional approach for systems without P2P support
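
The two-stage (reduce-scatter + all-gather) pattern described above can be illustrated with a minimal host-side sketch. This is a conceptual simulation in Python, not the module's Mojo implementation: each "device" is just a list, and `two_stage_allreduce` is a hypothetical name. In stage one, device `i` reduces chunk `i` from all peers; in stage two, every device gathers the reduced chunks.

```python
# Conceptual sketch (NOT this module's API): simulate a two-stage
# allreduce on n "devices", each holding a full copy of the input.
def two_stage_allreduce(buffers):
    n = len(buffers)            # number of simulated GPUs
    length = len(buffers[0])
    assert length % n == 0, "length must divide evenly into chunks"
    chunk = length // n

    # Stage 1: reduce-scatter -- device i "owns" chunk i and sums the
    # corresponding slice from every peer's buffer.
    reduced = []
    for i in range(n):
        lo, hi = i * chunk, (i + 1) * chunk
        reduced.append([sum(buf[j] for buf in buffers) for j in range(lo, hi)])

    # Stage 2: all-gather -- every device collects all reduced chunks,
    # so each ends up with the full elementwise sum.
    result = [x for part in reduced for x in part]
    return [list(result) for _ in range(n)]

# Example: 2 devices, 4 elements each; both end with [11, 22, 33, 44].
outs = two_stage_allreduce([[1, 2, 3, 4], [10, 20, 30, 40]])
```

Splitting the work this way is what makes the two-stage variant bandwidth-efficient for large tensors: each device transfers roughly `2 * (n - 1) / n` of the data instead of broadcasting full buffers, at the cost of an extra synchronization point between stages.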

The implementation is tuned for common GPU architectures (A100, H100) and includes parameters that can be adjusted for different hardware configurations.

Limitations:

  • Number of elements must be a multiple of SIMD width
  • Maximum of 8 GPUs supported
  • All input/output buffers must have identical shapes
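
The listed limitations can be summarized as a pre-flight check. The sketch below is hypothetical Python, not part of the module; `MAX_GPUS = 8` comes from this page, while `SIMD_WIDTH` and the function name are illustrative assumptions (the actual SIMD width is hardware- and dtype-dependent).

```python
# Hypothetical validation mirroring the documented limitations.
MAX_GPUS = 8      # documented upper bound on participating GPUs
SIMD_WIDTH = 4    # illustrative, e.g. 4 x float32 in a 128-bit vector

def validate_allreduce_inputs(shapes):
    """shapes: one tensor shape (tuple of ints) per participating GPU."""
    if not 1 <= len(shapes) <= MAX_GPUS:
        raise ValueError(f"expected 1..{MAX_GPUS} GPUs, got {len(shapes)}")
    if any(s != shapes[0] for s in shapes):
        raise ValueError("all input/output buffers must have identical shapes")
    num_elements = 1
    for dim in shapes[0]:
        num_elements *= dim
    if num_elements % SIMD_WIDTH != 0:
        raise ValueError("element count must be a multiple of the SIMD width")
    return num_elements

# Example: 4 GPUs, each holding a (16, 8) tensor -> 128 elements, valid.
```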

Aliases

  • MAX_GPUS = 8: Maximum number of GPUs supported by the allreduce implementation. This constant sets the upper bound on the number of GPUs the algorithm supports.
  • MAX_NUM_BLOCKS_DEFAULT = 128: Maximum number of thread blocks to use for reduction kernels. This value has been empirically optimized through grid search across different GPU architectures. While this value is optimal for A100 GPUs, H100 GPUs may benefit from more blocks to fully saturate NVLink bandwidth.

Structs

  • Signal: A synchronization primitive for coordinating GPU thread blocks across multiple devices.

Functions

  • all_reduce: Performs an allreduce operation across multiple GPUs.
  • can_enable_p2p: Enables peer-to-peer access between all GPU pairs, if the hardware supports it.
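
The all-pairs logic behind a routine like `can_enable_p2p` can be sketched as follows. This is a hypothetical Python simulation: `can_access` and `enable_access` stand in for driver-level capability queries and enablement calls (such as CUDA's peer-access APIs) and are not this module's interface. The key property shown is that P2P is only enabled when every ordered pair supports it, so the allreduce can rely on a uniform communication pattern.

```python
# Hypothetical sketch of all-pairs P2P enablement (not the module's code).
def can_enable_p2p(devices, can_access, enable_access):
    # First verify that every ordered pair of distinct GPUs supports P2P;
    # a single unsupported pair means the fallback path must be used.
    for a in devices:
        for b in devices:
            if a != b and not can_access(a, b):
                return False
    # All pairs are capable: enable access in both directions.
    for a in devices:
        for b in devices:
            if a != b:
                enable_access(a, b)
    return True

# Example: a 2-GPU system where both directions support P2P.
caps = {(0, 1): True, (1, 0): True}
enabled = []
ok = can_enable_p2p([0, 1], lambda a, b: caps[(a, b)],
                    lambda a, b: enabled.append((a, b)))
```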