Mojo module
allreduce
Multi-GPU allreduce implementation for efficient tensor reduction across GPUs.
This module provides an optimized implementation of allreduce operations across multiple GPUs, supporting both peer-to-peer (P2P) and non-P2P communication patterns. The implementation automatically selects between two approaches based on hardware capabilities:
- P2P-based implementation (when P2P access is available):
  - Uses direct GPU-to-GPU memory access for better performance
  - Implements both single-stage and two-stage algorithms:
    - Single-stage for latency-bound transfers (small tensors)
    - Two-stage (reduce-scatter + all-gather) for bandwidth-bound transfers (large tensors)
  - Optimized for NVLink bandwidth utilization
  - Uses vectorized memory access and higher-precision accumulation
- Non-P2P fallback implementation:
  - Copies data through host memory when direct GPU access isn't possible
  - Simple but functional approach for systems without P2P support
The implementation is tuned for common GPU architectures (A100, H100) and includes parameters that can be adjusted for different hardware configurations.
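As a rough illustration of this selection logic, the sketch below (in Python, with a hypothetical size threshold that is not this module's actual heuristic) shows how a dispatcher might choose among the three paths:

```python
# Illustrative sketch only: the threshold and path names below are
# hypothetical stand-ins for the selection described above; they are
# not part of this module's API.
def select_allreduce_path(num_bytes: int, p2p_available: bool,
                          two_stage_threshold_bytes: int = 8 * 1024 * 1024) -> str:
    if not p2p_available:
        # Fallback: stage the data through host memory.
        return "naive"
    if num_bytes < two_stage_threshold_bytes:
        # Latency-bound: single-stage, each GPU reads all peers directly.
        return "1-stage"
    # Bandwidth-bound: reduce-scatter into per-GPU payloads, then all-gather.
    return "2-stage"
```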
Per-Device Architecture
The allreduce operation follows a per-device execution model:
- Single-Device Instances: Each GPU runs its own instance of the allreduce operation.
- Parallel Execution: The Python/Graph API layer is responsible for:
  - Creating one allreduce op instance per participating GPU.
  - Ensuring all instances execute in parallel.
  - Ensuring correctness by staging mo.fence.
- Device Affinity: Each allreduce instance (see the sketch after the Requirements list below):
  - Executes on its assigned GPU (specified via device context).
  - Reads from all GPUs' input buffers (requires P2P access).
  - Writes only to its own output buffer.
  - Uses the same synchronization signals as other instances.
- Requirements:
  - Peer-to-peer access must be enabled between all participating GPUs.
  - All instances must launch before any can complete (for synchronization).
  - The device context determines which GPU executes each instance.
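To make the read-all/write-own device-affinity rules concrete, here is a minimal Python sketch in which NumPy arrays stand in for GPU buffers; synchronization signals are omitted, and none of these names come from the module itself:

```python
import numpy as np

def allreduce_instance(rank: int, inputs: list[np.ndarray],
                       outputs: list[np.ndarray]) -> None:
    """Illustrative stand-in for one per-device allreduce instance.

    Instance `rank` reads every GPU's input buffer (the P2P reads) but
    writes only its own output buffer, mirroring the device-affinity
    rules above. Synchronization is omitted.
    """
    # Accumulate in a wider dtype, mirroring the high-precision accumulation.
    acc = np.zeros(inputs[0].shape, dtype=np.float32)
    for buf in inputs:                     # read from all GPUs' input buffers
        acc += buf.astype(np.float32)
    outputs[rank][:] = acc.astype(outputs[rank].dtype)  # write only own output

# One instance per participating GPU; the graph layer launches all of them.
ngpus = 4
inputs = [np.ones(8, dtype=np.float16) * (r + 1) for r in range(ngpus)]
outputs = [np.empty(8, dtype=np.float16) for _ in range(ngpus)]
for r in range(ngpus):
    allreduce_instance(r, inputs, outputs)
```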
Limitations:
- Maximum of 8 GPUs supported.
- Multimem mode still requires the element count to be a multiple of SIMD width.
- All input/output buffers must have identical shapes.
Non-multimem 1-stage P2P and naive epilogue accept arbitrary N: when
N is a multiple of device SIMD width, the 1-stage kernel uses the same
vectorized _load_reduce grid loop as before; otherwise it runs that loop on
the SIMD-aligned prefix and finishes the last < simd_width elements with a
grid-strided scalar reduce-store. The naive epilogue kernel uses the same
SIMD-prefix + scalar-tail pattern for accum → out.
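The prefix/tail split can be illustrated with a small Python sketch; the loops below stand in for the vectorized grid loop and the grid-strided scalar tail, and the names are illustrative rather than the module's own:

```python
def reduce_store_with_tail(srcs, out, simd_width: int) -> None:
    """Illustrative prefix/tail split over plain Python lists.

    The aligned prefix stands in for the vectorized _load_reduce grid
    loop; the remainder stands in for the grid-strided scalar
    reduce-store over the last < simd_width elements.
    """
    n = len(out)
    aligned = (n // simd_width) * simd_width
    # "Vectorized" prefix: process simd_width elements per step.
    for i in range(0, aligned, simd_width):
        for j in range(i, i + simd_width):
            out[j] = sum(src[j] for src in srcs)
    # Scalar tail: the last n - aligned elements, one at a time.
    for j in range(aligned, n):
        out[j] = sum(src[j] for src in srcs)
```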
Visual Overview
- 1-Stage P2P (latency-bound)

  Each GPU r reads its portion from every peer buffer directly (via P2P), accumulates, then writes to its result using the epilogue:

    src_tensors[0], src_tensors[1], ..., src_tensors[ngpus-1]
      ──► Σ (high-precision accum) ──► output_lambda ──► result_r   (on GPU r)

  Notes:
  - Non-multimem: SIMD-vector _load_reduce on the aligned prefix; optional scalar tail when N is not a multiple of SIMD width. Multimem: unchanged full-vector loads (N must be SIMD-aligned).
  - Good for small/latency-bound tensors.
- 2-Stage P2P (bandwidth-bound)

  Stage 1 (reduce-scatter): Each GPU r reduces its assigned partition and writes into its own signal payload (the bytes after the Signal header).

    src_tensors[*] ──► reduce(partition r) ──► rank_sigs[r].payload   (per-GPU)

  Stage 2 (all-gather): Each GPU r gathers all partitions from peers' payloads and writes them to its result using the epilogue.

    [payload_0], [payload_1], ..., [payload_{ngpus-1}] ──► result_r   (via output_lambda)
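A minimal Python sketch of the two stages for one GPU `rank` is shown below; the payload arrays stand in for the bytes after each GPU's Signal header, partitions are simplified to equal contiguous chunks, and the barrier and epilogue are elided:

```python
import numpy as np

def two_stage_allreduce(rank: int, src_tensors: list[np.ndarray],
                        payloads: list[np.ndarray], result: np.ndarray) -> None:
    """Illustrative 2-stage flow for one GPU `rank` (synchronization omitted).

    payloads[r] stands in for rank_sigs[r]'s payload region; partitions are
    assumed to be equal contiguous chunks for simplicity.
    """
    ngpus = len(src_tensors)
    chunk = src_tensors[0].shape[0] // ngpus
    lo, hi = rank * chunk, (rank + 1) * chunk

    # Stage 1 (reduce-scatter): reduce this GPU's partition from all inputs
    # and write it into this GPU's own payload.
    acc = np.zeros(hi - lo, dtype=np.float32)
    for src in src_tensors:
        acc += src[lo:hi].astype(np.float32)
    payloads[rank][:chunk] = acc.astype(result.dtype)

    # (A barrier belongs here: all payloads must be written before stage 2.)

    # Stage 2 (all-gather): gather every partition from the peers' payloads
    # into this GPU's result (the epilogue / output_lambda is elided).
    for r in range(ngpus):
        result[r * chunk:(r + 1) * chunk] = payloads[r][:chunk]
```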
For the naive allreduce (no P2P) per-device flow and staging details, see the
_allreduce_naive_single docstring in this file.
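As a rough mental model only (the authoritative description is the _allreduce_naive_single docstring), one possible shape of a host-staged fallback looks like this in Python, with NumPy arrays standing in for device buffers:

```python
import numpy as np

def naive_allreduce(device_buffers: list[np.ndarray]) -> list[np.ndarray]:
    """Hypothetical host-staged fallback sketch: the copies below stand in
    for device<->host transfers; this is not the module's actual code."""
    # "Copy" every device buffer to the host.
    host_copies = [np.asarray(buf, dtype=np.float32) for buf in device_buffers]
    # Reduce on the host.
    total = sum(host_copies)
    # "Copy" the reduced result back to every device.
    return [total.astype(buf.dtype) for buf in device_buffers]
```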
comptime values
allreduce_tuning_table
comptime allreduce_tuning_table = Table(
    List(
        CommTuningConfig(-1, -1, StringSlice("sm_90a"), 216),
        CommTuningConfig(4, 134217728, StringSlice("sm_90a"), 232),
        CommTuningConfig(-1, -1, StringSlice("sm_100a"), 512),
        CommTuningConfig(2, 8388608, StringSlice("sm_100a"), 512),
        CommTuningConfig(2, 16777216, StringSlice("sm_100a"), 512),
        CommTuningConfig(2, 33554432, StringSlice("sm_100a"), 512),
        CommTuningConfig(2, 67108864, StringSlice("sm_100a"), 512),
        CommTuningConfig(2, 134217728, StringSlice("sm_100a"), 512),
        CommTuningConfig(4, 8388608, StringSlice("sm_100a"), 512),
        CommTuningConfig(4, 16777216, StringSlice("sm_100a"), 512),
        CommTuningConfig(4, 33554432, StringSlice("sm_100a"), 512),
        CommTuningConfig(4, 67108864, StringSlice("sm_100a"), 512),
        CommTuningConfig(4, 134217728, StringSlice("sm_100a"), 512),
        CommTuningConfig(-1, -1, StringSlice("sm_103a"), 512),
        CommTuningConfig(-1, -1, StringSlice("CDNA3"), 32),
        CommTuningConfig(-1, -1, StringSlice("CDNA4"), 64),
        CommTuningConfig(8, 1048576, StringSlice("CDNA4"), 64),
        CommTuningConfig(8, 2147483648, StringSlice("CDNA4"), 44),
        CommTuningConfig(-1, -1, StringSlice("default"), 512),
        __list_literal__=NoneType(None),
    ),
    String("allreduce_table"),
)
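Reading the entries above, each CommTuningConfig appears to pair a GPU count and payload size (with -1 acting as a wildcard) with a target architecture and a tuned launch parameter; the following Python sketch shows one plausible lookup scheme, which may not match the module's actual selection logic:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CommTuningConfig:
    # Field meanings inferred from the table above; treat them as assumptions:
    # a GPU count and payload size in bytes (-1 meaning "any"), a target
    # architecture string, and a tuned launch parameter.
    ngpus: int
    num_bytes: int
    arch: str
    value: int

def lookup(table: list[CommTuningConfig], ngpus: int, num_bytes: int,
           arch: str, default: int) -> int:
    """Hypothetical lookup: prefer an exact (ngpus, num_bytes, arch) match,
    then the arch's wildcard (-1, -1) row, then the caller's default."""
    for cfg in table:
        if cfg.arch == arch and cfg.ngpus == ngpus and cfg.num_bytes == num_bytes:
            return cfg.value
    for cfg in table:
        if cfg.arch == arch and cfg.ngpus == -1 and cfg.num_bytes == -1:
            return cfg.value
    return default
```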
elementwise_epilogue_type
comptime elementwise_epilogue_type = def[dtype: DType, width: Int, *, alignment: Int, ?, .element_types.values0x2: KGENParamList[CoordLike], .element_types0x3: TypeList[values]](Coord[element_types], SIMD[dtype, width]) capturing -> None
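Conceptually, this is a capturing callback that receives output coordinates and a SIMD vector of reduced values and is responsible for storing them, optionally fusing a trailing elementwise op into the write. A rough Python analogue, with all names hypothetical:

```python
import numpy as np

def make_store_epilogue(out: np.ndarray, bias: float = 0.0):
    """Hypothetical analogue of an elementwise epilogue: a closure that
    captures its output buffer and fuses a trailing elementwise op."""
    def epilogue(coord: tuple[int, ...], values: np.ndarray) -> None:
        # Store `values` starting at `coord`, fusing `+ bias` into the write.
        start = coord[-1]
        out[start:start + values.shape[0]] = values + bias
    return epilogue
```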
Functions
- allreduce: Per-device allreduce: one instance per GPU builds its own output.