Mojo function
derive_waits_from_blocks
derive_waits_from_blocks(program: PipelineProgram, config: PipelineConfig, lgkm_per_a: Int = 0, lgkm_per_b: Int = 0) -> Tuple[Int, Int]
Derive wait counts from the finalized block structure.
Unlike the old derive_wait_counts, which operated on the flat LDG ordering before block construction, this function works on the final PipelineProgram, after both CSP ordering and post-construction redistribution. The counts therefore always reflect the actual block layout.
Per-channel lgkm cost is read from the config first (config.lgkm_per_channel(channel)). The legacy lgkm_per_a and lgkm_per_b parameters are consulted only when the config leaves them unset (= 0), which preserves backward compatibility for callers that still thread them through ScheduleConfig.
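The precedence rule above can be modeled in a few lines. This is a minimal sketch in Python, not the library's implementation: resolve_lgkm_cost and its parameter names are hypothetical, and it assumes only what the text states, namely that a config value of 0 means "unset".

```python
def resolve_lgkm_cost(config_cost: int, legacy_cost: int = 0) -> int:
    """Pick the per-channel lgkm cost (hypothetical helper).

    The config value wins whenever it is set; the legacy parameter is
    used only when the config value is 0, which is treated as unset.
    """
    if config_cost != 0:
        return config_cost
    return legacy_cost
```

Under this model, a caller that passes only the legacy parameters still gets their values, while a caller that sets the config never sees the legacy values take effect.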
wait_lgkm_first: lgkm ops in block 0's pre_ops (fragment loads issued before the first barrier/MMA). Ensures fragment loads complete before the MMA consumes their register values.
wait_vm_last: at the last block's pre_sync, all global loads from all blocks in this half have been issued (globals come before pre_sync in the block layout). Completion loads must have finished; prefetch loads may remain outstanding. wait_vm = total_vm_in_half - completion_vm.
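As a worked instance of the wait_vm formula, the sketch below computes the count from two totals. It is an illustrative Python model under stated assumptions: the function name and the range check are hypothetical and are not part of the documented API.

```python
def wait_vm_last(total_vm_in_half: int, completion_vm: int) -> int:
    """Apply wait_vm = total_vm_in_half - completion_vm.

    The result is the number of prefetch loads that may still be
    outstanding at the last block's pre_sync.
    """
    # Assumed sanity check: completion loads are a subset of all loads.
    if not 0 <= completion_vm <= total_vm_in_half:
        raise ValueError("completion_vm must be between 0 and total_vm_in_half")
    return total_vm_in_half - completion_vm


# Example: 6 global loads issued in this half, 4 of them completion loads.
print(wait_vm_last(6, 4))  # 2
```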
Completion detection uses stage-based logic, not the k_offset-based global_load_prefetch flag, which serves the prologue. A load is a completion load if its stage matches the other half's read stage (stage != half), because the other half's fragment loads will read from that LDS stage after the half-boundary barrier. A load to the same half's stage (stage == half) is a prefetch load: it won't be read until the next iteration of this half, so it can remain outstanding.
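The stage-based classification can be sketched as follows. This is a simplified Python model, not the actual Mojo code: the GlobalLoad type, its stage field, and split_loads are hypothetical names, and the example assumes two halves numbered 0 and 1 with matching stage numbers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GlobalLoad:
    stage: int  # LDS stage this load targets (hypothetical field)


def is_completion(load: GlobalLoad, half: int) -> bool:
    """A load is a completion load when its stage differs from this half."""
    return load.stage != half


def split_loads(loads: List[GlobalLoad], half: int) -> Tuple[int, int]:
    """Return (completion_vm, prefetch_vm) for the loads issued in one half."""
    completion_vm = sum(1 for load in loads if is_completion(load, half))
    prefetch_vm = len(loads) - completion_vm
    return completion_vm, prefetch_vm


# Half 0 issues two loads to stage 1 (read by the other half after the
# half-boundary barrier) and three loads to stage 0 (read next iteration).
loads = [GlobalLoad(1), GlobalLoad(1), GlobalLoad(0), GlobalLoad(0), GlobalLoad(0)]
print(split_loads(loads, half=0))  # (2, 3)
```

In this model the prefetch count equals total_vm_in_half - completion_vm, so it is the value that would be reported as wait_vm_last.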
Returns (wait_lgkm_first, wait_vm_last).
Returns:
Tuple[Int, Int]