Mojo struct
ScheduleConfig
struct ScheduleConfig
Tunable parameters for schedule generation.
Controls the structural decisions in program builders: scheduling strategy, barrier placement, and drain behavior.
The default configuration auto-derives wait counts from the program structure (Halide-inspired: declare intent, derive consequences). Manual wait overrides are available for testing and experimentation but should not be needed for correct operation.
Fields:
scheduling: Strategy for op ordering (IDENTITY, GREEDY, or CSP).
sched_barrier_mask: Bitmask of which blocks get trailing schedule_barriers. Default: 0b01010101 (blocks 0,2,4,6).
auto_waits: Auto-derive wait counts from schedule order (default: True).
drain_lgkm_mask: Per-block bitmask for selective LDS drains.
auto_drain: Auto-derive drain mask from channel analysis.
lds_contention_penalty: CSP solver penalty for LDS port overlap.
wait_lgkm_first: Manual wait_lgkm(N) override (used when auto_waits=False).
wait_vm_last: Manual wait_vm(N) override (used when auto_waits=False).
lgkm_per_load_a: lgkmcnt ops per load_a (for wait derivation).
lgkm_per_load_b: lgkmcnt ops per load_b (for wait derivation).
lgkm_after_last: Insert wait_lgkm(0) after last block barrier.
minimal_barriers: Suppress per-block s_barriers and set_prio pairs; emit s_barrier only at TOP (block 0 of each half) and at the first cross-stage block (MID). Use for kernels (such as 4-wave inline FP8) whose pipeline depth and cross-stage register rotation already provide inter-block sync through register flow and lgkm waits, which makes the per-block sync pure overhead. Default: False (preserves the ping-pong layout).
omit_mma_set_prio: When minimal_barriers=True is also set, drop the pre-MMA s_setprio[1] entirely. s_setprio acts as an LLVM scheduling barrier that prevents register-allocator reuse across it, raising VGPR pressure noticeably. The default ping-pong layout depends on the priority hint for warp-scheduler throughput; cross-stage rotation kernels with rocdl.waves_per_eu=1 already get max priority, so the hint is redundant there. Default: False.
global_before_frag: Swap the in-block emission order of global loads (DRAM→LDS prefetches) and fragment loads (LDS→register reads). The default emits frags first, then prefetches. That is correct for ping-pong / simple, where frags read a different SMEM stage than the prefetches target, so order is irrelevant. Set True for kernels (such as 4-wave inline FP8) where frag and prefetch hit the same SMEM region in the same iteration; issuing the prefetch first lets its address generation overlap with the frag's LDS read while the LDS-read port is free. Default: False.
barrier_before_pre_ops: Move the per-block pre_sync+barrier section to before the frag/prefetch section (instead of after, between prefetch and MMA). The default (False) gates barriers as "this MMA's input"; True gates them as "this half-boundary", so all frag/prefetch ops happen after the barrier commits the previous half's writes, matching the hand-tuned 4-wave layout. Default: False.
inter_block_lgkm_drain: When True, populate entry_wait_lgkm on non-top, non-cross-stage blocks with wait_lgkm(0) so an inter-mini LDS drain fires between consecutive same-half MMAs. Hand-tuned 4-wave inline emits this between mini-1 and mini-2 (and between mini-3 and mini-4); the default ping-pong schedule does not. Default: False.
Fields
- scheduling (SchedulingStrategy)
- sched_barrier_mask (Int)
- auto_waits (Bool)
- drain_lgkm_mask (Int)
- auto_drain (Bool)
- lds_contention_penalty (Int)
- wait_lgkm_first (Int)
- wait_vm_last (Int)
- lgkm_per_load_a (Int)
- lgkm_per_load_b (Int)
- lgkm_after_last (Bool)
- minimal_barriers (Bool): Suppresses per-block s_barriers and set_prio pairs; emits s_barrier only at top-of-half and the first cross-stage block.
- omit_mma_set_prio (Bool): When minimal_barriers=True, drops the pre-MMA s_setprio[1] entirely so the LLVM register allocator can reuse VGPRs across it.
- max_vgpr (Int): Hint for the cost model on the kernel's VGPR budget. Default is effectively unlimited.
- global_before_frag (Bool): Swaps the in-block emission order of global loads and fragment loads. Default emits frags first then prefetches.
- barrier_before_pre_ops (Bool): Moves the per-block pre_sync + barrier section to before the frag/prefetch section instead of between prefetch and MMA.
- inter_block_lgkm_drain (Bool): When True, populates entry_wait_lgkm on non-top, non-cross-stage blocks with wait_lgkm(0) so an inter-mini LDS drain fires between consecutive same-half MMAs.
- wrap_waits_with_sched_barrier (Bool): Wraps each contiguous wait/barrier group with schedule_barrier() on both sides to fence the LLVM machine scheduler.
- partial_prologue_drain (Bool): Skips the framework prologue's wait_vm(0) drains and inter-stage barrier so prefetches stay in flight on entry to the kernel.
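Several fields only matter in combination: the manual wait overrides are described as used when auto_waits=False, and omit_mma_set_prio as applying when minimal_barriers=True. The following Python sketch is one way to flag settings that, going by those descriptions, would have no effect. It is not part of the Mojo library: ineffective_settings is a hypothetical helper, the config is modeled as a plain dict of explicitly passed keyword arguments, and whether the real struct ignores or rejects such combinations is not stated in the documentation.

```python
def ineffective_settings(cfg: dict) -> list[str]:
    """List options that, per the field descriptions, would have no effect.

    `cfg` holds only the keyword arguments a caller passed explicitly;
    anything missing is assumed to take its documented default.
    """
    notes = []
    auto_waits = cfg.get("auto_waits", True)              # documented default: True
    minimal_barriers = cfg.get("minimal_barriers", False)  # documented default: False

    # Manual wait overrides are documented as used only when auto_waits=False.
    if auto_waits:
        for name in ("wait_lgkm_first", "wait_vm_last"):
            if name in cfg:
                notes.append(f"{name} is set but auto_waits=True")

    # omit_mma_set_prio is documented as applying only with minimal_barriers=True.
    if cfg.get("omit_mma_set_prio", False) and not minimal_barriers:
        notes.append("omit_mma_set_prio=True but minimal_barriers=False")

    return notes


print(ineffective_settings({"wait_lgkm_first": 4}))
print(ineffective_settings({"auto_waits": False, "wait_lgkm_first": 4}))
```

The first call reports the override as ineffective because auto_waits keeps its default of True; the second reports nothing.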
Implemented traits
AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDestructible, Movable
Methods
__init__
__init__(out self, *, scheduling: SchedulingStrategy = SchedulingStrategy.IDENTITY, sched_barrier_mask: Int = 85, auto_waits: Bool = True, drain_lgkm_mask: Int = 0, auto_drain: Bool = False, lds_contention_penalty: Int = 0, wait_lgkm_first: Int = 8, wait_vm_last: Int = 6, lgkm_per_load_a: Int = 0, lgkm_per_load_b: Int = 0, lgkm_after_last: Bool = False, minimal_barriers: Bool = False, omit_mma_set_prio: Bool = False, max_vgpr: Int = 999999, global_before_frag: Bool = False, barrier_before_pre_ops: Bool = False, inter_block_lgkm_drain: Bool = False, partial_prologue_drain: Bool = False, wrap_waits_with_sched_barrier: Bool = False)
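The signature gives the sched_barrier_mask default as the decimal 85, while the field description writes it as 0b01010101 (blocks 0,2,4,6). The two are the same value. The short Python sketch below checks that arithmetic and shows how such a mask maps to block indices, with bit i (least significant first) standing for block i. The helper names and the 8-block width are assumptions made for illustration; they are not part of the Mojo API.

```python
def blocks_from_mask(mask: int, num_blocks: int = 8) -> list[int]:
    """Return the block indices whose bit is set in `mask` (bit i -> block i).

    num_blocks=8 is an assumed width chosen to match the 8-bit default mask.
    """
    return [i for i in range(num_blocks) if (mask >> i) & 1]


def mask_from_blocks(blocks: list[int]) -> int:
    """Build a bitmask with one bit set per block index."""
    mask = 0
    for i in blocks:
        mask |= 1 << i
    return mask


default_mask = 0b01010101
print(default_mask)                    # 85
print(blocks_from_mask(default_mask))  # [0, 2, 4, 6]
print(mask_from_blocks([0, 2, 4, 6]))  # 85
```

The same decoding would apply to drain_lgkm_mask, which is also described as a per-block bitmask, assuming it uses the same bit order.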
from_strategies
static from_strategies(*, scheduling: SchedulingStrategy = SchedulingStrategy.IDENTITY, max_vgpr: Int = 999999, lds_contention_penalty: Int = 0, minimal_barriers: Bool = False, omit_mma_set_prio: Bool = False, sched_barrier_mask: Int = 85, wrap_waits_with_sched_barrier: Bool = False, barrier_before_pre_ops: Bool = False, auto_waits: Bool = True, drain_lgkm_mask: Int = 0, auto_drain: Bool = False, wait_lgkm_first: Int = 8, wait_vm_last: Int = 6, lgkm_after_last: Bool = False, inter_block_lgkm_drain: Bool = False, partial_prologue_drain: Bool = False, global_before_frag: Bool = False, lgkm_per_load_a: Int = 0, lgkm_per_load_b: Int = 0) -> Self
Constructs a ScheduleConfig from grouped strategy values.
Equivalent to the flat-field constructor, but groups related flags by phase (barrier / wait / load). pipeline.strategies provides named factories (BarrierStrategy.minimal_no_set_prio etc.) that callers can spread into this constructor. Existing flat-field callers continue to work unchanged.
Args:
- scheduling (SchedulingStrategy): CSP solver scheduling strategy.
- max_vgpr (Int): VGPR budget hint for the cost model.
- lds_contention_penalty (Int): Penalty for LDS port overlap.
- minimal_barriers (Bool): Suppress per-block s_barriers and set_prio pairs.
- omit_mma_set_prio (Bool): Drop the pre-MMA s_setprio[1] when minimal_barriers=True.
- sched_barrier_mask (Int): Bitmask of which blocks get trailing schedule_barrier fences.
- wrap_waits_with_sched_barrier (Bool): Wrap each contiguous wait/barrier group with schedule_barrier.
- barrier_before_pre_ops (Bool): Move pre_sync + barrier ahead of the frag/global section.
- auto_waits (Bool): Auto-derive wait counts from program structure.
- drain_lgkm_mask (Int): Per-block bitmask for selective LDS drains.
- auto_drain (Bool): Auto-derive drain_lgkm_mask from channel analysis.
- wait_lgkm_first (Int): Manual wait_lgkm override.
- wait_vm_last (Int): Manual wait_vm override for the last block.
- lgkm_after_last (Bool): Insert wait_lgkm(0) after the last block's barrier.
- inter_block_lgkm_drain (Bool): Emit wait_lgkm(0) at non-top, non-cross interior block starts.
- partial_prologue_drain (Bool): Skip wait_vm(0) drains in the framework prologue.
- global_before_frag (Bool): Emit globals before frags in each block.
- lgkm_per_load_a (Int): lgkmcnt entries per channel-A frag-load.
- lgkm_per_load_b (Int): lgkmcnt entries per channel-B frag-load.
Returns:
Self: A fully populated ScheduleConfig.