Mojo function

derive_safe_max_globals

def derive_safe_max_globals(num_k_mmas: Int) -> Int

Derive safe max_globals for uniform global load distribution.

Returns 1 if uniform distribution is safe under warp stagger, 0 otherwise.

The safety condition depends on the number of K-dimension MMA tiles (num_k_mmas). With warp stagger, WG0 runs one MMA phase ahead of WG1. When globals are uniformly distributed, a prefetch buffer_load_*_lds issued in block b writes to LDS stage h asynchronously. If the fragment loads of block b+1 read from that same stage, the async LDS write must complete before the ds_read, so the MMA compute between the two must provide enough cycles to cover the write latency.

With num_k_mmas >= 2, each MMA block contains at least two MMAs (~32 cycles or more on MI355X), which is enough to cover the async LDS write latency (~20 cycles). With num_k_mmas == 1, the single MMA (~16 cycles) is too short to cover that latency, so uniform distribution is not safe.
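The cycle argument above can be sketched as a small comparison. This is an illustrative sketch, not the library's implementation: the function name sketch_safe_max_globals and the two constants are hypothetical, and the cycle counts are the approximate MI355X figures quoted above.

```mojo
def sketch_safe_max_globals(num_k_mmas: Int) -> Int:
    # Assumption: approximate figures from the description above (MI355X).
    var cycles_per_mma = 16      # ~16 cycles per MMA
    var lds_write_latency = 20   # ~20 cycles for an async LDS write

    # MMA compute available between the prefetch and the next ds_read.
    var compute_cycles = num_k_mmas * cycles_per_mma

    # Uniform distribution is safe only if compute covers the write latency.
    if compute_cycles >= lds_write_latency:
        return 1
    return 0


def main():
    print(sketch_safe_max_globals(1))  # 16 < 20, prints 0
    print(sketch_safe_max_globals(2))  # 32 >= 20, prints 1
```

Under these assumed figures the comparison reduces to num_k_mmas >= 2, which matches the condition described above.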

Returns:

Int