Mojo function
greedy_schedule
greedy_schedule(body: LoopBody) -> List[Int]
Constrained list scheduler for MMA-centered block structure.
Derives a valid execution order from the dependency graph, respecting structural constraints of the interleaved ping-pong kernel:
- Half isolation: ops 0..num_ops/2 are scheduled first (blocks 0-3), then ops num_ops/2..num_ops (blocks 4-7). This preserves the two-half warp-group structure that build_program_from_ldg_ordered() requires (first 12 ops → blocks 0-3, last 12 → blocks 4-7).
- Data dependencies: all d=0 predecessors (FLOW and ANTI) must be scheduled before the consumer. FLOW edges enforce RAW (register and accumulator) deps. ANTI edges enforce WAR (LDS buffer) deps: all mma_loads reading from an LDS buffer must complete before any prefetch global_load writes to that buffer.
- MMA-centered blocks: each block terminates with exactly 1 MMA.
- Priority: lowest op index wins among ready ops. This reproduces the declaration order from define_interleaved_loop_body(), which is the correct execution order. The ScheduleConfig wait counts (vmcnt, lgkmcnt) are calibrated for this specific ordering, so any reordering requires recalculating those counts.
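The constraints above can be illustrated with a minimal sketch, written in Python rather than Mojo. It is not the library's implementation: the function name greedy_schedule_sketch and the preds mapping (op index to the set of its d=0 FLOW and ANTI predecessors) are hypothetical stand-ins for LoopBody and its dependency graph. The sketch models half isolation, the predecessor rule, and lowest-index priority. It does not model MMA block boundaries or wait counts, and it assumes that a first-half op never depends on a second-half op.

```python
from typing import Dict, List, Set


def greedy_schedule_sketch(num_ops: int, preds: Dict[int, Set[int]]) -> List[int]:
    """Greedy list scheduling over two isolated halves.

    preds[i] is the (hypothetical) set of ops that must be scheduled
    before op i, i.e. its d=0 FLOW and ANTI predecessors.
    """
    order: List[int] = []
    scheduled: Set[int] = set()
    half = num_ops // 2

    # Half isolation: finish ops [0, half) before starting [half, num_ops).
    for lo, hi in ((0, half), (half, num_ops)):
        remaining = set(range(lo, hi))
        while remaining:
            # An op is ready once all of its predecessors are scheduled.
            ready = [op for op in remaining if preds.get(op, set()) <= scheduled]
            if not ready:
                raise ValueError("no ready op: cycle or cross-half dependency")
            # Priority: lowest op index wins among ready ops.
            op = min(ready)
            order.append(op)
            scheduled.add(op)
            remaining.remove(op)
    return order


# Ops declared in execution order yield the identity permutation.
in_order = greedy_schedule_sketch(6, {1: {0}, 2: {1}, 4: {3}, 5: {4, 2}})
# Op 0 depends on op 2, so it is deferred within the first half.
deferred = greedy_schedule_sketch(6, {0: {2}})
print(in_order)  # [0, 1, 2, 3, 4, 5]
print(deferred)  # [1, 2, 0, 3, 4, 5]
```

Because the tie-break is always the lowest ready index, the result only departs from declaration order where a dependency forces it, which is why wait counts calibrated for declaration order stay valid in the common case.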
Returns a permutation of [0, num_ops) suitable for build_program_from_ldg_ordered(). When ops are defined in execution order, this produces the identity permutation, which validates that the dependency graph is sufficient to derive the schedule.
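One way to check a returned order against these properties is sketched below in Python. The helpers check_schedule and is_identity are hypothetical and not part of the Mojo API. The preds mapping is an assumed representation of the d=0 predecessor sets, with every op index lying in [0, len(order)).

```python
from typing import Dict, List, Set


def check_schedule(order: List[int], preds: Dict[int, Set[int]]) -> bool:
    """Return True if order is a permutation of [0, n) that places every
    predecessor before its consumer."""
    n = len(order)
    # Permutation check: each index in [0, n) appears exactly once.
    if sorted(order) != list(range(n)):
        return False
    # Position of each op in the schedule.
    position = {op: i for i, op in enumerate(order)}
    # Dependency check: every predecessor runs strictly earlier.
    return all(
        position[p] < position[op]
        for op, op_preds in preds.items()
        for p in op_preds
    )


def is_identity(order: List[int]) -> bool:
    """Return True if the schedule equals declaration order."""
    return order == list(range(len(order)))


print(check_schedule([0, 1, 2, 3], {1: {0}, 3: {2}}))  # True
print(check_schedule([1, 0, 2, 3], {1: {0}}))          # False
print(is_identity([0, 1, 2, 3]))                       # True
```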
Returns: List[Int]