Mojo module

allgather

Multi-GPU allgather implementation that gathers values from multiple GPUs into an output buffer.

This module provides an optimized implementation of allgather operations across multiple GPUs, supporting both peer-to-peer (P2P) and non-P2P communication patterns. The implementation automatically selects between the two approaches based on whether P2P access is available:

  1. P2P-based implementation (when P2P access is available):

    • Uses direct GPU-to-GPU memory access for better performance.
    • Optimized for NVLink and xGMI bandwidth utilization.
    • Uses vectorized memory access.
  2. Non-P2P fallback implementation:

    • Copies data through device memory when direct GPU access isn't possible.
    • Simple but functional approach for systems without P2P support.
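The two paths described above can be illustrated with a small host-side simulation. This is only a sketch of the allgather semantics, not the module's implementation: Python lists stand in for device buffers, and the names `gather_direct`, `gather_staged`, `allgather_sketch`, and the `p2p_available` flag are hypothetical. The staging copy in the fallback path is an assumption about what "copies data through device memory" could look like.

```python
from typing import List, Sequence


def gather_direct(inputs: Sequence[List[float]]) -> List[List[float]]:
    """P2P-style path: read every peer's buffer straight into the output."""
    return [list(peer_buf) for peer_buf in inputs]


def gather_staged(inputs: Sequence[List[float]]) -> List[List[float]]:
    """Fallback-style path: copy each peer buffer through a local staging buffer."""
    outputs: List[List[float]] = []
    for peer_buf in inputs:
        staging = list(peer_buf)       # first copy: peer -> local staging buffer
        outputs.append(list(staging))  # second copy: staging -> output buffer
    return outputs


def allgather_sketch(inputs: Sequence[List[float]], p2p_available: bool) -> List[List[float]]:
    """Build one device's outputs: one buffer per participating GPU."""
    # Hypothetical flag: the real module detects P2P support itself.
    if p2p_available:
        return gather_direct(inputs)
    return gather_staged(inputs)


# One input buffer per simulated GPU; buffer lengths may differ.
inputs = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]

# Each simulated device runs its own instance and builds its own outputs.
per_device_outputs = [allgather_sketch(inputs, p2p_available=True) for _ in inputs]
```

Both paths produce the same gathered result; under these assumptions they differ only in how many copies the data takes to get there.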

comptime values

allgather_tuning_table

comptime allgather_tuning_table = Table(
    List(
        DefaultCommTuningConfig(Int(-1), Int(-1), StringSlice("sm_90a"), Int(216)),
        DefaultCommTuningConfig(Int(-1), Int(-1), StringSlice("sm_100a"), Int(512)),
        DefaultCommTuningConfig(Int(-1), Int(-1), StringSlice("sm_103a"), Int(512)),
        DefaultCommTuningConfig(Int(-1), Int(-1), StringSlice("CDNA4"), Int(216)),
        DefaultCommTuningConfig(Int(-1), Int(-1), StringSlice("default"), Int(512)),
        __list_literal__=NoneType(None)
    ),
    String("allgather_table")
)
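As a rough illustration of how a table like this might be consulted, the sketch below looks up an entry by architecture string and falls back to the `"default"` entry when no match exists. The entries mirror the positional values shown above, but the source does not name the fields or describe how `Table` resolves a query, so the fallback rule and the `lookup_entry` helper are assumptions.

```python
from typing import Dict, List, Tuple

# Positional values copied from the table above. The meaning of the two
# leading -1 values and of the final integer is not stated in the source.
Entry = Tuple[int, int, str, int]

ENTRIES: List[Entry] = [
    (-1, -1, "sm_90a", 216),
    (-1, -1, "sm_100a", 512),
    (-1, -1, "sm_103a", 512),
    (-1, -1, "CDNA4", 216),
    (-1, -1, "default", 512),
]


def lookup_entry(arch: str) -> Entry:
    """Hypothetical lookup: exact match on the architecture string,
    otherwise the "default" entry (assumed fallback behaviour)."""
    by_arch: Dict[str, Entry] = {entry[2]: entry for entry in ENTRIES}
    return by_arch.get(arch, by_arch["default"])


hopper_entry = lookup_entry("sm_90a")
unknown_entry = lookup_entry("some_other_arch")  # not in the table
```

Keeping a `"default"` row means a new architecture still resolves to a usable configuration before it has its own tuned entry.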

Functions

  • allgather: Per-device all-gather: one instance per GPU builds its own outputs.