Mojo module

scatter

Multi-GPU scatter+broadcast kernel implementation.

Distributes different data chunks from a root GPU to multiple device groups. Each group (DP replica) gets a different chunk, and all devices within a group (TP devices) get the same chunk.

Example with DP=4 replicas and TP=2 devices per replica (8 GPUs total):

  • Chunk 0 -> GPU 0 and GPU 1 (Replica A)
  • Chunk 1 -> GPU 2 and GPU 3 (Replica B)
  • Chunk 2 -> GPU 4 and GPU 5 (Replica C)
  • Chunk 3 -> GPU 6 and GPU 7 (Replica D)
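
The mapping in the example above can be written as simple index arithmetic. The following is a minimal Python sketch, not the module's Mojo code. It assumes the contiguous layout shown in the example, where the TP devices of one replica have consecutive GPU ids. The names `chunk_for_gpu`, `gpus_for_chunk`, and `tp_size` are hypothetical and do not come from the module.

```python
def chunk_for_gpu(gpu_id: int, tp_size: int) -> int:
    """Return the chunk index a GPU receives.

    Assumes replicas occupy consecutive GPU ids, as in the example
    (GPU 0 and 1 form replica A, GPU 2 and 3 form replica B, ...).
    """
    return gpu_id // tp_size


def gpus_for_chunk(chunk: int, tp_size: int) -> list[int]:
    """Return the GPU ids that all receive the same chunk."""
    start = chunk * tp_size
    return list(range(start, start + tp_size))


if __name__ == "__main__":
    # DP=4, TP=2: print the chunk assigned to each of the 8 GPUs.
    for gpu in range(8):
        print(f"GPU {gpu} <- chunk {chunk_for_gpu(gpu, tp_size=2)}")
```

With DP=4 and TP=2 this reproduces the table above: GPUs 4 and 5 both map to chunk 2.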

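The kernel uses a pull-based approach: rather than the root GPU writing to each destination, every GPU reads its own chunk from the root GPU via peer-to-peer (P2P) access.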
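
To illustrate the pull-based pattern, here is a host-side Python simulation in which each "device" computes its own offset into the root buffer and copies that slice. This is only a sketch of the data movement, not the actual kernel. It assumes equal-sized chunks, a root buffer length divisible by DP, and consecutive GPU ids within a replica. The function `pull_scatter` and its parameters are hypothetical names.

```python
def pull_scatter(root_buffer: list, dp_size: int, tp_size: int) -> list[list]:
    """Simulate a pull-based scatter+broadcast on the host.

    Each simulated GPU derives its chunk index from its own id and
    reads (pulls) that chunk from the root buffer. Returns one output
    list per GPU, indexed by GPU id.
    """
    if len(root_buffer) % dp_size != 0:
        raise ValueError("root buffer length must be divisible by dp_size")

    chunk_len = len(root_buffer) // dp_size
    outputs = []
    for gpu_id in range(dp_size * tp_size):
        # Every GPU in the same replica resolves to the same chunk.
        chunk = gpu_id // tp_size
        start = chunk * chunk_len
        # The read is initiated by the destination, not by the root.
        outputs.append(list(root_buffer[start:start + chunk_len]))
    return outputs


if __name__ == "__main__":
    result = pull_scatter(list(range(8)), dp_size=4, tp_size=2)
    for gpu_id, data in enumerate(result):
        print(f"GPU {gpu_id}: {data}")
```

In this model the root never iterates over destinations; each reader only needs its own id, the TP size, and the chunk length to find its data.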

Functions
