Mojo module
scatter
Multi-GPU scatter+broadcast kernel implementation.
Distributes different data chunks from a root GPU to multiple device groups. Each group (a data-parallel, or DP, replica) gets a different chunk, and all devices within a group (its tensor-parallel, or TP, devices) get the same chunk.
Example with DP=4, TP=2, 8 GPUs:
- Chunk 0 -> GPU 0 and GPU 1 (Replica A)
- Chunk 1 -> GPU 2 and GPU 3 (Replica B)
- Chunk 2 -> GPU 4 and GPU 5 (Replica C)
- Chunk 3 -> GPU 6 and GPU 7 (Replica D)
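The mapping in the example above can be sketched as integer arithmetic on the device index. This is a host-side Python illustration, not the Mojo kernel; the function names are hypothetical, and it assumes that the devices of a group have contiguous IDs, as they do in the example.

```python
def chunk_index_for_device(device_id: int, tp_size: int) -> int:
    """Return the chunk a device receives.

    Assumption: devices in the same group (replica) have contiguous IDs,
    so integer division by the TP size gives the group index.
    """
    return device_id // tp_size


def scatter_layout(dp_size: int, tp_size: int) -> dict[int, list[int]]:
    """Map each chunk index to the list of device IDs that receive it."""
    return {
        chunk: [chunk * tp_size + rank for rank in range(tp_size)]
        for chunk in range(dp_size)
    }


if __name__ == "__main__":
    # DP=4, TP=2 reproduces the 8-GPU example from the text.
    for chunk, devices in scatter_layout(4, 2).items():
        print(f"Chunk {chunk} -> GPUs {devices}")
```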
Uses a pull-based approach: each GPU reads its chunk from root via P2P.
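To illustrate the pull-based pattern, the following Python sketch simulates it on the host with plain lists. It is not the module's implementation: the function name is hypothetical, and it assumes equal-sized chunks stored contiguously in the root buffer, which the page does not state. Each simulated device computes its own chunk offset and copies from the root buffer, rather than the root pushing data out.

```python
def simulate_pull_scatter(root_buffer: list, dp_size: int, tp_size: int) -> list[list]:
    """Simulate a pull-based scatter+broadcast with plain Python lists.

    Assumptions (not confirmed by the source): chunks are equal-sized and
    laid out contiguously in the root buffer, and device IDs within a
    group are contiguous.
    """
    if len(root_buffer) % dp_size != 0:
        raise ValueError("root buffer length must be divisible by dp_size")
    chunk_len = len(root_buffer) // dp_size

    outputs = []
    for device_id in range(dp_size * tp_size):
        # Each device works out which chunk belongs to its group ...
        chunk = device_id // tp_size
        start = chunk * chunk_len
        # ... and reads ("pulls") that slice from the root buffer.
        outputs.append(list(root_buffer[start:start + chunk_len]))
    return outputs


if __name__ == "__main__":
    per_device = simulate_pull_scatter(list(range(8)), dp_size=4, tp_size=2)
    for device_id, data in enumerate(per_device):
        print(f"GPU {device_id}: {data}")
```

In this model every device performs its own read, so devices in the same group end up with identical copies without any extra broadcast step.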
Functions
- scatter: Pull-based scatter+broadcast.
- scatter_pull_kernel: Pull-based scatter+broadcast: each GPU reads its chunk from root.