Mojo module
allgather
Multi-GPU allgather implementation that gathers values from multiple GPUs into an output buffer.
This module provides an optimized implementation of allgather operations across multiple GPUs, supporting both peer-to-peer (P2P) and non-P2P communication patterns. The implementation automatically selects between approaches based on hardware capabilities:
- P2P-based implementation (when P2P access is available):
  - Uses direct GPU-to-GPU memory access for better performance.
  - Optimized for NVLink and xGMI bandwidth utilization.
  - Uses vectorized memory access.
- Non-P2P fallback implementation:
  - Copies data through device memory when direct GPU access isn't possible.
  - A simple but functional approach for systems without P2P support.
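Both paths produce the same result: every device ends up with the concatenation of all devices' input buffers. A minimal Python sketch of the two strategies, with lists standing in for device buffers (the function names and the staging structure are illustrative, not the module's API):

```python
def allgather_p2p(peer_buffers):
    """P2P path: read every peer's buffer directly (modeling
    direct GPU-to-GPU loads) and concatenate into this device's output."""
    out = []
    for buf in peer_buffers:
        out.extend(buf)  # direct read from peer memory
    return out

def allgather_staged(peer_buffers, staging):
    """Non-P2P fallback: each peer first copies its buffer into a
    shared staging area (modeling copies through device memory),
    then the output is assembled from the staged copies."""
    for i, buf in enumerate(peer_buffers):
        staging[i] = list(buf)  # staged copy, no direct peer access
    out = []
    for chunk in staging:
        out.extend(chunk)
    return out
```

Either way, a device with peers holding `[0, 1]` and `[2, 3]` ends up with `[0, 1, 2, 3]`; the P2P path simply avoids the intermediate copy.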
comptime values
allgather_tuning_table
comptime allgather_tuning_table = Table(List(CommTuningConfig(-1, -1, StringSlice("sm_90a"), 216), CommTuningConfig(-1, -1, StringSlice("sm_100a"), 512), CommTuningConfig(-1, -1, StringSlice("sm_103a"), 512), CommTuningConfig(-1, -1, StringSlice("CDNA4"), 216), CommTuningConfig(-1, -1, StringSlice("default"), 512), __list_literal__=Tuple()), String("allgather_table"))
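The tuning table maps a GPU architecture string to a tuned launch parameter, with a `"default"` entry used when the architecture is not listed. A hedged Python sketch of that lookup (the meaning of the integer, e.g. a block count, is an assumption; only the arch-to-value pairs come from the table above):

```python
# Mirror of the allgather tuning table: arch string -> tuned value.
ALLGATHER_TUNING = {
    "sm_90a": 216,
    "sm_100a": 512,
    "sm_103a": 512,
    "CDNA4": 216,
    "default": 512,
}

def tuning_for(arch: str) -> int:
    """Return the tuned value for `arch`, falling back to "default"."""
    return ALLGATHER_TUNING.get(arch, ALLGATHER_TUNING["default"])
```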
Functions
- allgather: Per-device all-gather: one instance per GPU builds its own outputs.
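As the description suggests, the operation is launched once per device, and each instance assembles the full gathered result in its own output buffer. A hypothetical Python model of that per-device invocation pattern (names are illustrative, not the Mojo signature):

```python
def allgather_instance(rank, inputs, outputs):
    """One per-GPU instance: device `rank` fills its own output slot
    with the concatenation of every device's input, in device order."""
    outputs[rank] = [x for buf in inputs for x in buf]

inputs = [[10], [20], [30]]          # one input buffer per GPU
outputs = [None] * len(inputs)
for rank in range(len(inputs)):      # one instance runs per GPU
    allgather_instance(rank, inputs, outputs)
# every device's output is the same full result: [10, 20, 30]
```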