IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /max/get-started.md). For the complete documentation index, see llms.txt.

Mojo struct

ScheduleConfig

struct ScheduleConfig

Tunable parameters for schedule generation.

Controls the structural decisions in program builders: scheduling strategy, barrier placement, and drain behavior.

The default configuration auto-derives wait counts from the program structure (Halide-inspired: declare intent, derive consequences). Manual wait overrides are available for testing and experimentation but should not be needed for correct operation.

Fields:

  • scheduling: Strategy for op ordering (IDENTITY, GREEDY, or CSP).
  • sched_barrier_mask: Bitmask of which blocks get trailing schedule_barriers. Default: 0b01010101 (blocks 0, 2, 4, 6).
  • auto_waits: Auto-derive wait counts from schedule order. Default True.
  • drain_lgkm_mask: Per-block bitmask for selective LDS drains.
  • auto_drain: Auto-derive the drain mask from channel analysis.
  • lds_contention_penalty: CSP solver penalty for LDS port overlap.
  • wait_lgkm_first: Manual wait_lgkm(N) override (used when auto_waits=False).
  • wait_vm_last: Manual wait_vm(N) override (used when auto_waits=False).
  • lgkm_per_load_a: lgkmcnt ops per load_a (for wait derivation).
  • lgkm_per_load_b: lgkmcnt ops per load_b (for wait derivation).
  • lgkm_after_last: Insert wait_lgkm(0) after the last block barrier.
  • minimal_barriers: Suppress per-block s_barriers and set_prio pairs; emit s_barrier only at TOP (block 0 of each half) and at the first cross-stage block (MID). Use for kernels (like 4-wave inline FP8) whose pipeline depth and cross-stage register rotation provide natural inter-block sync via register flow and lgkm waits, in which case the per-block sync is only overhead. Default False (preserves the ping-pong layout).
  • omit_mma_set_prio: When minimal_barriers=True is also set, drop the pre-MMA s_setprio[1] entirely. s_setprio acts as an LLVM scheduling barrier that prevents register-allocator reuse across it, raising VGPR pressure noticeably. The default ping-pong layout depends on the priority hint for warp-scheduler throughput; cross-stage rotation kernels with rocdl.waves_per_eu=1 already get max priority, so the hint is redundant. Default False.
  • global_before_frag: Swap the in-block emission order of global loads (DRAM→LDS prefetches) and fragment loads (LDS→register reads). The default (False) emits frags first, then prefetches. That is correct for ping-pong / simple, where frags read a different SMEM stage than the prefetches target, so order is irrelevant. Set True for kernels (like 4-wave inline FP8) where frag and prefetch hit the same SMEM region in the same iteration; issuing the prefetch first lets its address generation overlap with the frag's LDS read while the LDS-read port is free.
  • barrier_before_pre_ops: Move the per-block pre_sync + barrier section to before the frag/prefetch section (instead of after it, between prefetch and MMA). The default (False) gates barriers as "this MMA's input"; True gates them as "this half-boundary", so all frag/prefetch ops happen after the barrier commits the previous half's writes, matching the hand-tuned 4-wave layout.
  • inter_block_lgkm_drain: When True, populate entry_wait_lgkm on non-top, non-cross-stage blocks with wait_lgkm(0) so an inter-mini LDS drain fires between consecutive same-half MMAs. Hand-tuned 4-wave inline emits this between mini-1 and mini-2 (and between mini-3 and mini-4); the default ping-pong schedule does not. Default False.
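The sched_barrier_mask default can be checked by decoding the bitmask into block indices. The following is a minimal Python sketch, not part of the Mojo API: it assumes bit i of the mask corresponds to block i, which is consistent with the documented default (0b01010101 selecting blocks 0, 2, 4, 6). The helper name blocks_from_mask and the 8-block count are illustrative.

```python
def blocks_from_mask(mask: int, num_blocks: int = 8) -> list[int]:
    """Return the indices of blocks whose bit is set.

    Assumption: bit i of the mask selects block i.
    """
    return [i for i in range(num_blocks) if (mask >> i) & 1]


# Documented default for sched_barrier_mask; 0b01010101 is 85 in decimal,
# which is the value that appears in the constructor signature.
DEFAULT_SCHED_BARRIER_MASK = 0b01010101

print(blocks_from_mask(DEFAULT_SCHED_BARRIER_MASK))  # [0, 2, 4, 6]
```

Under the same bit-to-block assumption, the helper would decode drain_lgkm_mask as well, since it is also documented as a per-block bitmask.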

Fields​

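  • scheduling (SchedulingStrategy): Strategy for op ordering (IDENTITY, GREEDY, or CSP).
  • sched_barrier_mask (Int): Bitmask of which blocks get trailing schedule_barriers. Default is 0b01010101 (blocks 0, 2, 4, 6).
  • auto_waits (Bool): Auto-derives wait counts from schedule order. Default is True.
  • drain_lgkm_mask (Int): Per-block bitmask for selective LDS drains.
  • auto_drain (Bool): Auto-derives the drain mask from channel analysis.
  • lds_contention_penalty (Int): CSP solver penalty for LDS port overlap.
  • wait_lgkm_first (Int): Manual wait_lgkm(N) override, used when auto_waits=False.
  • wait_vm_last (Int): Manual wait_vm(N) override, used when auto_waits=False.
  • lgkm_per_load_a (Int): lgkmcnt ops per load_a, used for wait derivation.
  • lgkm_per_load_b (Int): lgkmcnt ops per load_b, used for wait derivation.
  • lgkm_after_last (Bool): Inserts wait_lgkm(0) after the last block barrier.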
  • ​minimal_barriers (Bool): Suppresses per-block s_barriers and set_prio pairs; emits s_barrier only at top-of-half and the first cross-stage block.
  • ​omit_mma_set_prio (Bool): When minimal_barriers=True, drops the pre-MMA s_setprio[1] entirely so the LLVM register allocator can reuse VGPRs across it.
  • ​max_vgpr (Int): Hint for the cost model on the kernel's VGPR budget. Default is effectively unlimited.
  • ​global_before_frag (Bool): Swaps the in-block emission order of global loads and fragment loads. Default emits frags first then prefetches.
  • ​barrier_before_pre_ops (Bool): Moves the per-block pre_sync + barrier section to before the frag/prefetch section instead of between prefetch and MMA.
  • ​inter_block_lgkm_drain (Bool): When True, populates entry_wait_lgkm on non-top, non-cross-stage blocks with wait_lgkm(0) so an inter-mini LDS drain fires between consecutive same-half MMAs.
  • ​wrap_waits_with_sched_barrier (Bool): Wraps each contiguous wait/barrier group with schedule_barrier() on both sides to fence the LLVM machine scheduler.
  • ​partial_prologue_drain (Bool): Skips the framework prologue's wait_vm(0) drains and inter-stage barrier so prefetches stay in flight on entry to the kernel.

Implemented traits​

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDestructible, Movable

Methods​

__init__​

__init__(out self, *, scheduling: SchedulingStrategy = SchedulingStrategy.IDENTITY, sched_barrier_mask: Int = 85, auto_waits: Bool = True, drain_lgkm_mask: Int = 0, auto_drain: Bool = False, lds_contention_penalty: Int = 0, wait_lgkm_first: Int = 8, wait_vm_last: Int = 6, lgkm_per_load_a: Int = 0, lgkm_per_load_b: Int = 0, lgkm_after_last: Bool = False, minimal_barriers: Bool = False, omit_mma_set_prio: Bool = False, max_vgpr: Int = 999999, global_before_frag: Bool = False, barrier_before_pre_ops: Bool = False, inter_block_lgkm_drain: Bool = False, partial_prologue_drain: Bool = False, wrap_waits_with_sched_barrier: Bool = False)

from_strategies​

static from_strategies(*, scheduling: SchedulingStrategy = SchedulingStrategy.IDENTITY, max_vgpr: Int = 999999, lds_contention_penalty: Int = 0, minimal_barriers: Bool = False, omit_mma_set_prio: Bool = False, sched_barrier_mask: Int = 85, wrap_waits_with_sched_barrier: Bool = False, barrier_before_pre_ops: Bool = False, auto_waits: Bool = True, drain_lgkm_mask: Int = 0, auto_drain: Bool = False, wait_lgkm_first: Int = 8, wait_vm_last: Int = 6, lgkm_after_last: Bool = False, inter_block_lgkm_drain: Bool = False, partial_prologue_drain: Bool = False, global_before_frag: Bool = False, lgkm_per_load_a: Int = 0, lgkm_per_load_b: Int = 0) -> Self

Constructs a ScheduleConfig from grouped strategy values.

Equivalent to the flat-field constructor but groups related flags by phase (barrier / wait / load). pipeline.strategies provides named factories (BarrierStrategy.minimal_no_set_prio etc.) that callers can spread into this constructor.

Existing flat-field callers continue to work unchanged.
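The "spread into this constructor" pattern can be sketched in Python, where keyword unpacking plays the role of spreading grouped values. This is an illustrative model only, not the Mojo API: from_strategies here is a plain stand-in function, the defaults are copied from the documented signature, and the assignment of fields to the barrier / wait / load groups is an assumption inferred from the argument order.

```python
# Defaults copied from the documented from_strategies signature.
DEFAULTS = {
    "scheduling": "IDENTITY",
    "max_vgpr": 999999,
    "lds_contention_penalty": 0,
    "minimal_barriers": False,
    "omit_mma_set_prio": False,
    "sched_barrier_mask": 85,
    "wrap_waits_with_sched_barrier": False,
    "barrier_before_pre_ops": False,
    "auto_waits": True,
    "drain_lgkm_mask": 0,
    "auto_drain": False,
    "wait_lgkm_first": 8,
    "wait_vm_last": 6,
    "lgkm_after_last": False,
    "inter_block_lgkm_drain": False,
    "partial_prologue_drain": False,
    "global_before_frag": False,
    "lgkm_per_load_a": 0,
    "lgkm_per_load_b": 0,
}


def from_strategies(**fields):
    """Stand-in: merge keyword overrides onto the defaults, rejecting unknown names."""
    unknown = sorted(set(fields) - set(DEFAULTS))
    if unknown:
        raise TypeError(f"unknown field(s): {unknown}")
    return {**DEFAULTS, **fields}


# Hypothetical grouped values, one dict per phase (grouping is assumed).
barrier = {"minimal_barriers": True, "omit_mma_set_prio": True}
wait = {"auto_waits": True, "inter_block_lgkm_drain": True}
load = {"global_before_frag": True}

config = from_strategies(**barrier, **wait, **load)
print(config["minimal_barriers"], config["sched_barrier_mask"])  # True 85
```

Fields that no group mentions keep their documented defaults, which is why flat-field and grouped callers can coexist.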

Args:

  • scheduling (SchedulingStrategy): Strategy for op ordering (IDENTITY, GREEDY, or CSP).
  • ​max_vgpr (Int): VGPR budget hint for the cost model.
  • ​lds_contention_penalty (Int): Penalty for LDS port overlap.
  • ​minimal_barriers (Bool): Suppress per-block s_barriers and set_prio pairs.
  • ​omit_mma_set_prio (Bool): Drop the pre-MMA s_setprio[1] when minimal_barriers=True.
  • ​sched_barrier_mask (Int): Bitmask of which blocks get trailing schedule_barrier fences.
  • ​wrap_waits_with_sched_barrier (Bool): Wrap each contiguous wait/barrier group with schedule_barrier.
  • ​barrier_before_pre_ops (Bool): Move pre_sync + barrier ahead of the frag/global section.
  • ​auto_waits (Bool): Auto-derive wait counts from program structure.
  • ​drain_lgkm_mask (Int): Per-block bitmask for selective LDS drains.
  • ​auto_drain (Bool): Auto-derive drain_lgkm_mask from channel analysis.
  • wait_lgkm_first (Int): Manual wait_lgkm(N) override (used when auto_waits=False).
  • wait_vm_last (Int): Manual wait_vm(N) override for the last block (used when auto_waits=False).
  • ​lgkm_after_last (Bool): Insert wait_lgkm(0) after the last block's barrier.
  • inter_block_lgkm_drain (Bool): Emit wait_lgkm(0) at the start of non-top, non-cross-stage blocks.
  • ​partial_prologue_drain (Bool): Skip wait_vm(0) drains in the framework prologue.
  • ​global_before_frag (Bool): Emit globals before frags in each block.
  • ​lgkm_per_load_a (Int): lgkmcnt entries per channel-A frag-load.
  • ​lgkm_per_load_b (Int): lgkmcnt entries per channel-B frag-load.

Returns:

Self: A fully populated ScheduleConfig.