Mojo module

scatter

Multi-GPU scatter+broadcast kernel implementation.

Distributes different data chunks from a root GPU to multiple device groups. Each group (a data-parallel, or DP, replica) receives a different chunk, and all devices within a group (its tensor-parallel, or TP, devices) receive the same chunk.

Example with DP=4 and TP=2 (8 GPUs in total):

Chunk 0 -> GPU 0 and GPU 1 (Replica A)
Chunk 1 -> GPU 2 and GPU 3 (Replica B)
Chunk 2 -> GPU 4 and GPU 5 (Replica C)
Chunk 3 -> GPU 6 and GPU 7 (Replica D)
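
The mapping above is consistent with integer division of the GPU index by the TP size, assuming the TP devices of each replica are numbered contiguously. The following Python sketch illustrates that arithmetic only; `chunk_index_for_device` is a hypothetical helper and not part of this module.

```python
def chunk_index_for_device(device_id: int, tp_size: int) -> int:
    """Return the chunk (DP replica) index for a device.

    Assumption: devices in the same TP group have contiguous indices,
    as in the DP=4, TP=2 example.
    """
    return device_id // tp_size


dp_size, tp_size = 4, 2
# Build the GPU -> chunk table for all dp_size * tp_size devices.
mapping = {
    gpu: chunk_index_for_device(gpu, tp_size)
    for gpu in range(dp_size * tp_size)
}
print(mapping)
```

With these values, GPUs 0 and 1 map to chunk 0, GPUs 2 and 3 to chunk 1, and so on, matching the table above.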

Uses a pull-based approach: each GPU reads its chunk from root via P2P.
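
As a rough illustration of the pull-based pattern, the Python sketch below simulates it on the host with plain lists: each simulated device works out which slice of the root buffer belongs to it and copies that slice itself, rather than the root pushing data out. The function name `scatter_pull_sim`, the equal-sized chunks, and the contiguous TP grouping are all assumptions made for this example. The actual kernel performs the reads over P2P on the GPUs.

```python
def scatter_pull_sim(root_buffer, dp_size, tp_size):
    """Simulate a pull-based scatter+broadcast on the host.

    Assumptions (not taken from the module): the root buffer splits into
    dp_size equal chunks, and TP devices of a replica are contiguous.
    """
    if len(root_buffer) % dp_size != 0:
        raise ValueError("root buffer must split evenly into dp_size chunks")
    chunk_len = len(root_buffer) // dp_size

    outputs = []
    for device_id in range(dp_size * tp_size):
        # Each device derives its own chunk index and "pulls" that slice.
        chunk = device_id // tp_size
        start = chunk * chunk_len
        outputs.append(root_buffer[start:start + chunk_len])
    return outputs


outputs = scatter_pull_sim(list(range(8)), dp_size=4, tp_size=2)
for device_id, data in enumerate(outputs):
    print(device_id, data)
```

In this simulation, devices in the same replica end up with identical data, while different replicas hold different chunks.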

Functions

scatter: Pull-based scatter+broadcast.
scatter_pull_kernel: Pull-based scatter+broadcast: each GPU reads its chunk from root.
