Mojo module
allreduce
Multi-GPU allreduce implementation for efficient tensor reduction across GPUs.
This module provides an optimized implementation of allreduce operations across multiple GPUs, supporting both peer-to-peer (P2P) and non-P2P communication patterns. The implementation automatically selects between two approaches based on hardware capabilities:
- P2P-based implementation (when P2P access is available):
  - Uses direct GPU-to-GPU memory access for better performance
  - Implements both single-stage and two-stage algorithms:
    - Single-stage for latency-bound transfers (small tensors)
    - Two-stage (reduce-scatter + all-gather) for bandwidth-bound transfers (large tensors)
  - Optimized for NVLink bandwidth utilization
  - Uses vectorized memory access and higher-precision accumulation
- Non-P2P fallback implementation:
  - Copies data through host memory when direct GPU access isn't possible
  - Simple but functional approach for systems without P2P support
The implementation is tuned for common GPU architectures (A100, H100) and includes parameters that can be adjusted for different hardware configurations.
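As a rough illustration of this selection logic, the sketch below (in Python, with a hypothetical size threshold that is not this module's actual heuristic) shows how a dispatcher might choose among the three paths:

```python
# Illustrative sketch only: the threshold and path names below are
# hypothetical stand-ins for the selection described above; they are
# not part of this module's API.
def select_allreduce_path(num_bytes: int, p2p_available: bool,
                          two_stage_threshold_bytes: int = 8 * 1024 * 1024) -> str:
    if not p2p_available:
        # Fallback: stage the data through host memory.
        return "naive"
    if num_bytes < two_stage_threshold_bytes:
        # Latency-bound: single-stage, each GPU reads all peers directly.
        return "1-stage"
    # Bandwidth-bound: reduce-scatter into per-GPU payloads, then all-gather.
    return "2-stage"
```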
Per-Device Architecture
The allreduce operation follows a per-device execution model:
- Single-Device Instances: Each GPU runs its own instance of the allreduce operation.
- Parallel Execution: The Python/Graph API layer is responsible for:
  - Creating one allreduce op instance per participating GPU.
  - Ensuring all instances execute in parallel.
  - Ensuring correctness by staging mo.fence.
- Device Affinity: Each allreduce instance (see the sketch after the Requirements list below):
  - Executes on its assigned GPU (specified via device context).
  - Reads from all GPUs' input buffers (requires P2P access).
  - Writes only to its own output buffer.
  - Uses the same synchronization signals as other instances.
- Requirements:
  - Peer-to-peer access must be enabled between all participating GPUs.
  - All instances must launch before any can complete (for synchronization).
  - The device context determines which GPU executes each instance.
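To make the read-all/write-own device-affinity rules concrete, here is a minimal Python sketch in which NumPy arrays stand in for GPU buffers; synchronization signals are omitted, and none of these names come from the module itself:

```python
import numpy as np

def allreduce_instance(rank: int, inputs: list[np.ndarray],
                       outputs: list[np.ndarray]) -> None:
    """Illustrative stand-in for one per-device allreduce instance.

    Instance `rank` reads every GPU's input buffer (the P2P reads) but
    writes only its own output buffer, mirroring the device-affinity
    rules above. Synchronization is omitted.
    """
    # Accumulate in a wider dtype, mirroring the high-precision accumulation.
    acc = np.zeros(inputs[0].shape, dtype=np.float32)
    for buf in inputs:                     # read from all GPUs' input buffers
        acc += buf.astype(np.float32)
    outputs[rank][:] = acc.astype(outputs[rank].dtype)  # write only own output

# One instance per participating GPU; the graph layer launches all of them.
ngpus = 4
inputs = [np.ones(8, dtype=np.float16) * (r + 1) for r in range(ngpus)]
outputs = [np.empty(8, dtype=np.float16) for _ in range(ngpus)]
for r in range(ngpus):
    allreduce_instance(r, inputs, outputs)
```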
Limitations:
- Maximum of 8 GPUs supported.
- Multimem mode still requires the element count to be a multiple of SIMD width.
- All input/output buffers must have identical shapes.
Non-multimem 1-stage P2P and naive epilogue accept arbitrary N: when
N is a multiple of device SIMD width, the 1-stage kernel uses the same
vectorized _load_reduce grid loop as before; otherwise it runs that loop on
the SIMD-aligned prefix and finishes the last < simd_width elements with a
grid-strided scalar reduce-store. The naive epilogue kernel uses the same
SIMD-prefix + scalar-tail pattern for accum → out.
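The prefix/tail split can be illustrated with a small Python sketch; the loops below stand in for the vectorized grid loop and the grid-strided scalar tail, and the names are illustrative rather than the module's own:

```python
def reduce_store_with_tail(srcs, out, simd_width: int) -> None:
    """Illustrative prefix/tail split over plain Python lists.

    The aligned prefix stands in for the vectorized _load_reduce grid
    loop; the remainder stands in for the grid-strided scalar
    reduce-store over the last < simd_width elements.
    """
    n = len(out)
    aligned = (n // simd_width) * simd_width
    # "Vectorized" prefix: process simd_width elements per step.
    for i in range(0, aligned, simd_width):
        for j in range(i, i + simd_width):
            out[j] = sum(src[j] for src in srcs)
    # Scalar tail: the last n - aligned elements, one at a time.
    for j in range(aligned, n):
        out[j] = sum(src[j] for src in srcs)
```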
Visual Overview
- 1-Stage P2P (latency-bound)

  Each GPU r reads its portion from every peer buffer directly (via P2P), accumulates, then writes to its result using the epilogue:

    src_tensors[0], src_tensors[1], ..., src_tensors[ngpus-1]
      ──► Σ (high-precision accum) ──► output_lambda ──► result_r   (on GPU r)

  Notes:
  - Non-multimem: SIMD-vector _load_reduce on the aligned prefix; optional scalar tail when N is not a multiple of SIMD width. Multimem: unchanged full-vector loads (N must be SIMD-aligned).
  - Good for small/latency-bound tensors.
- 2-Stage P2P (bandwidth-bound)

  Stage 1 (reduce-scatter): Each GPU r reduces its assigned partition and writes into its own signal payload (the bytes after the Signal header).

    src_tensors[*] ──► reduce(partition r) ──► rank_sigs[r].payload   (per-GPU)

  Stage 2 (all-gather): Each GPU r gathers all partitions from peers' payloads and writes them to its result using the epilogue.

    [payload_0], [payload_1], ..., [payload_{ngpus-1}] ──► result_r   (via output_lambda)
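A minimal Python sketch of the two stages for one GPU `rank` is shown below; the payload arrays stand in for the bytes after each GPU's Signal header, partitions are simplified to equal contiguous chunks, and the barrier and epilogue are elided:

```python
import numpy as np

def two_stage_allreduce(rank: int, src_tensors: list[np.ndarray],
                        payloads: list[np.ndarray], result: np.ndarray) -> None:
    """Illustrative 2-stage flow for one GPU `rank` (synchronization omitted).

    payloads[r] stands in for rank_sigs[r]'s payload region; partitions are
    assumed to be equal contiguous chunks for simplicity.
    """
    ngpus = len(src_tensors)
    chunk = src_tensors[0].shape[0] // ngpus
    lo, hi = rank * chunk, (rank + 1) * chunk

    # Stage 1 (reduce-scatter): reduce this GPU's partition from all inputs
    # and write it into this GPU's own payload.
    acc = np.zeros(hi - lo, dtype=np.float32)
    for src in src_tensors:
        acc += src[lo:hi].astype(np.float32)
    payloads[rank][:chunk] = acc.astype(result.dtype)

    # (A barrier belongs here: all payloads must be written before stage 2.)

    # Stage 2 (all-gather): gather every partition from the peers' payloads
    # into this GPU's result (the epilogue / output_lambda is elided).
    for r in range(ngpus):
        result[r * chunk:(r + 1) * chunk] = payloads[r][:chunk]
```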
For the naive allreduce (no P2P) per-device flow and staging details, see the
_allreduce_naive_single docstring in this file.
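As a rough mental model only (the authoritative description is the _allreduce_naive_single docstring), one possible shape of a host-staged fallback looks like this in Python, with NumPy arrays standing in for device buffers:

```python
import numpy as np

def naive_allreduce(device_buffers: list[np.ndarray]) -> list[np.ndarray]:
    """Hypothetical host-staged fallback sketch: the copies below stand in
    for device<->host transfers; this is not the module's actual code."""
    # "Copy" every device buffer to the host.
    host_copies = [np.asarray(buf, dtype=np.float32) for buf in device_buffers]
    # Reduce on the host.
    total = sum(host_copies)
    # "Copy" the reduced result back to every device.
    return [total.astype(buf.dtype) for buf in device_buffers]
```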
comptime values
allreduce_tuning_table
comptime allreduce_tuning_table = Table(
    List(
        CommTuningConfig(-1, -1, StringSlice("sm_90a"), 216),
        CommTuningConfig(4, 134217728, StringSlice("sm_90a"), 232),
        CommTuningConfig(-1, -1, StringSlice("sm_100a"), 512),
        CommTuningConfig(2, 8388608, StringSlice("sm_100a"), 512),
        CommTuningConfig(2, 16777216, StringSlice("sm_100a"), 512),
        CommTuningConfig(2, 33554432, StringSlice("sm_100a"), 512),
        CommTuningConfig(2, 67108864, StringSlice("sm_100a"), 512),
        CommTuningConfig(2, 134217728, StringSlice("sm_100a"), 512),
        CommTuningConfig(4, 8388608, StringSlice("sm_100a"), 512),
        CommTuningConfig(4, 16777216, StringSlice("sm_100a"), 512),
        CommTuningConfig(4, 33554432, StringSlice("sm_100a"), 512),
        CommTuningConfig(4, 67108864, StringSlice("sm_100a"), 512),
        CommTuningConfig(4, 134217728, StringSlice("sm_100a"), 512),
        CommTuningConfig(-1, -1, StringSlice("sm_103a"), 512),
        CommTuningConfig(-1, -1, StringSlice("CDNA3"), 32),
        CommTuningConfig(-1, -1, StringSlice("CDNA4"), 64),
        CommTuningConfig(8, 1048576, StringSlice("CDNA4"), 64),
        CommTuningConfig(8, 2147483648, StringSlice("CDNA4"), 44),
        CommTuningConfig(-1, -1, StringSlice("default"), 512),
        __list_literal__=NoneType(None),
    ),
    String("allreduce_table"),
)
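Reading the entries above, each CommTuningConfig appears to pair a GPU count and payload size (with -1 acting as a wildcard) with a target architecture and a tuned launch parameter; the following Python sketch shows one plausible lookup scheme, which may not match the module's actual selection logic:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CommTuningConfig:
    # Field meanings inferred from the table above; treat them as assumptions:
    # a GPU count and payload size in bytes (-1 meaning "any"), a target
    # architecture string, and a tuned launch parameter.
    ngpus: int
    num_bytes: int
    arch: str
    value: int

def lookup(table: list[CommTuningConfig], ngpus: int, num_bytes: int,
           arch: str, default: int) -> int:
    """Hypothetical lookup: prefer an exact (ngpus, num_bytes, arch) match,
    then the arch's wildcard (-1, -1) row, then the caller's default."""
    for cfg in table:
        if cfg.arch == arch and cfg.ngpus == ngpus and cfg.num_bytes == num_bytes:
            return cfg.value
    for cfg in table:
        if cfg.arch == arch and cfg.ngpus == -1 and cfg.num_bytes == -1:
            return cfg.value
    return default
```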
elementwise_epilogue_type
comptime elementwise_epilogue_type = def[dtype: DType, width: Int, *, alignment: Int, ?, .element_types.values0x2: KGENParamList[CoordLike], .element_types0x3: TypeList[values]](Coord[element_types], SIMD[dtype, width]) capturing -> None
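Conceptually, this is a capturing callback that receives output coordinates and a SIMD vector of reduced values and is responsible for storing them, optionally fusing a trailing elementwise op into the write. A rough Python analogue, with all names hypothetical:

```python
import numpy as np

def make_store_epilogue(out: np.ndarray, bias: float = 0.0):
    """Hypothetical analogue of an elementwise epilogue: a closure that
    captures its output buffer and fuses a trailing elementwise op."""
    def epilogue(coord: tuple[int, ...], values: np.ndarray) -> None:
        # Store `values` starting at `coord`, fusing `+ bias` into the write.
        start = coord[-1]
        out[start:start + values.shape[0]] = values + bias
    return epilogue
```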
Functions
- allreduce: Per-device allreduce: one instance per GPU builds its own output.