Mojo function
derive_safe_max_globals
derive_safe_max_globals(num_k_mmas: Int) -> Int
Derive safe max_globals for uniform global load distribution.
Returns 1 if uniform distribution is safe under warp stagger, 0 otherwise.
The safety condition depends on the number of K-dimension MMA tiles (num_k_mmas). With warp stagger, WG0 runs one MMA phase ahead of WG1. When globals are uniformly distributed, a prefetch buffer_load_*_lds in block b writes to LDS stage h asynchronously. If block b+1's fragment loads read from the same stage, the async LDS write must complete before the ds_read, so the MMA compute between the two must take at least as many cycles as the write latency.
With num_k_mmas >= 2, each MMA block has 2+ MMAs (~32 cycles on MI355X), providing sufficient latency for async LDS writes (~20 cycles). With num_k_mmas == 1, the single MMA (~16 cycles) is insufficient.
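The cycle-budget argument above can be sketched as a small Python model. This is an illustration only, not the library's Mojo implementation: the function name `safe_max_globals_model` and the two constants are assumptions, with the constants taken from the approximate figures quoted in the text (~16 cycles per MMA, ~20 cycles for an async LDS write on MI355X).

```python
# Approximate figures quoted in the text (MI355X). They are assumptions
# for this model, not measured values.
CYCLES_PER_MMA = 16          # ~16 cycles for a single MMA
ASYNC_LDS_WRITE_CYCLES = 20  # ~20 cycles for an async LDS write to land


def safe_max_globals_model(num_k_mmas: int) -> int:
    """Hypothetical model of the safety check described above.

    Returns 1 if the MMA compute in one block is long enough to cover
    the async LDS write latency, 0 otherwise.
    """
    if num_k_mmas < 1:
        raise ValueError("num_k_mmas must be at least 1")
    # Cycles of MMA work between the prefetch write and the next ds_read.
    mma_cycles = num_k_mmas * CYCLES_PER_MMA
    return 1 if mma_cycles >= ASYNC_LDS_WRITE_CYCLES else 0


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"num_k_mmas={n}: mma_cycles={n * CYCLES_PER_MMA}, "
              f"safe={safe_max_globals_model(n)}")
```

With these figures the model agrees with the rule stated in the text: a single MMA falls short of the write latency, while two or more MMAs cover it. The real function may simply test `num_k_mmas >= 2` rather than compute cycles.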
Returns:
Int: 1 if uniform distribution is safe under warp stagger, 0 otherwise.