
Mojo struct

Pipeline4Wave

struct Pipeline4Wave[geometry: KernelGeometry]

4-wave pipeline schedule with cross-stage register rotation.

Returns the 24-op body in mini-iter order. The framework consumes that order verbatim under SchedulingStrategy.IDENTITY, so the final kernel emission matches the hand-written _run_iter body op-for-op (modulo wait-count derivation, which the framework handles via derive_waits_from_blocks).

Takes a KernelGeometry (kernel-shape-derived constants) as its only template parameter, replacing the previous [is_fp8, lgkm_a, lgkm_b] triple. lgkm_per_load_* is read directly from geometry rather than threaded through ScheduleConfig.

Parameters

geometry (KernelGeometry): Kernel-shape-derived constants (lgkm/vm costs, etc.).

Implemented traits

AnyType, ImplicitlyDeletable, PipelineSchedule

Methods

`__init__`

def __init__(out self, config: ScheduleConfig = Pipeline4Wave._default_schedule_config(), target: TargetProfile = mi355x_target(Int(4), Int(4), Int(1)))

Constructs a Pipeline4Wave schedule with optional overrides.

Args:

config (ScheduleConfig): Schedule-level knobs (wait counts, barrier policy). Cross-stage-rotation invariants are re-applied even if the caller mutates them.
target (TargetProfile): Target hardware profile (defaults to MI355X).

`config`

def config(self) -> PipelineConfig

Returns the underlying target PipelineConfig.

Returns:

PipelineConfig: The pipeline config from the target profile.

`declare_ops`

def declare_ops(self) -> List[OpDesc]

Declares the logical 24-op body across both K-partitions.

Returns:

List[OpDesc]: The full list of OpDescs in mini-iter order.

`build_body`

def build_body(self) -> List[OpDesc]

Annotates logical ops with target cost model.

Skips double_buffer_reorder — the body is already in mini-iter order and mma_block_interleave_list would break cross-stage frag placement (it matches frags to MMAs by subtile only, ignoring the frag's stage field, so it cannot distinguish a same-stage sub=0 frag from a cross-stage sub=0 frag).

Returns:

List[OpDesc]: The annotated list of OpDescs ready for compilation.
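
The ambiguity described above can be illustrated with a small Python model. This is only a sketch: `Frag`, `match_by_subtile`, and `match_by_stage_and_subtile` are hypothetical names invented for illustration and are not part of the framework. It shows why a matcher keyed on subtile alone cannot tell a same-stage sub=0 frag from a cross-stage sub=0 frag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frag:
    """Hypothetical stand-in for a frag-load op (not the framework's OpDesc)."""
    name: str
    stage: int    # pipeline stage the frag reads from
    subtile: int  # subtile index (sub=0, sub=1, ...)


def match_by_subtile(frags, mma_subtile):
    # Subtile-only matching: the stage field is never consulted.
    return [f.name for f in frags if f.subtile == mma_subtile]


def match_by_stage_and_subtile(frags, mma_stage, mma_subtile):
    # Matching on both fields, shown only for contrast.
    return [
        f.name
        for f in frags
        if f.stage == mma_stage and f.subtile == mma_subtile
    ]


frags = [
    Frag("same_stage_sub0", stage=0, subtile=0),
    Frag("cross_stage_sub0", stage=1, subtile=0),
]

# Both frags collide under the subtile-only key.
ambiguous = match_by_subtile(frags, 0)
# Adding the stage to the key separates them.
unambiguous = match_by_stage_and_subtile(frags, 0, 0)
```

Under these assumptions, `ambiguous` contains both frags while `unambiguous` contains only the same-stage one, which is why the schedule keeps its hand-ordered body instead of letting the interleave pass place the frags.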

`bootstrap_frags`

def bootstrap_frags(self) -> List[OpDesc]

Bootstraps A_quad[0] + B_quad[0] for the first main-loop iter.

The body's sub=0 frags read the cross stage as part of the cross-stage rotation pattern. For the very first main iter there's no previous half to have populated those quadrants, so we explicitly emit two same-stage sub=0 frag-loads here. The framework pairs each with a partial wait_vm drain (and a barrier) so each fires after exactly the prefetch it depends on completes — the remaining 6 prefetches stay in flight.

Returns:

List[OpDesc]: A 2-element list of A/B sub=0 frag-load OpDescs.
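
The partial-drain arithmetic can be sketched in Python. The sketch rests on several assumptions that the text above does not state: that 8 prefetches are in flight (the 2 the bootstrap frags depend on plus the 6 that remain), that they complete in issue order, that the A prefetch was issued first and the B prefetch second, and that a wait with count n blocks until at most n such operations are outstanding. `partial_wait_count` is a hypothetical helper, not a framework function.

```python
def partial_wait_count(total_in_flight: int, issue_index: int) -> int:
    """Largest outstanding count that still guarantees the prefetch at
    `issue_index` (0-based, in issue order) has completed.

    Assumes in-order completion, so waiting until at most
    total - (index + 1) ops remain drains exactly the first index + 1.
    """
    if not 0 <= issue_index < total_in_flight:
        raise ValueError("issue_index must be within the in-flight window")
    return total_in_flight - (issue_index + 1)


# Assumption: 2 bootstrap dependencies + 6 remaining prefetches.
TOTAL_PREFETCHES = 8

# Assumption: A's prefetch is issued first, B's second.
wait_for_a = partial_wait_count(TOTAL_PREFETCHES, 0)
wait_for_b = partial_wait_count(TOTAL_PREFETCHES, 1)
```

With these inputs the A frag-load waits until at most 7 prefetches are outstanding and the B frag-load until at most 6, so neither wait drains more than the one prefetch its frag needs.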

`derive_edges`

def derive_edges(self, body: List[OpDesc]) -> List[DepEdge]

Derives dependency edges with cross-stage rotation fixups.

Runs the framework's default edge derivation, filters out the spurious same-partition FLOW edges that Phase 1 emits for cross-stage frags, then appends the cross-partition FLOW + same-partition ANTI edges. Both helpers live in pipeline.phase_derivation and are reusable across cross-stage rotation schedules.

Args:

body (List[OpDesc]): The annotated op list returned by build_body().

Returns:

List[DepEdge]: The complete list of dependency edges for wait derivation.
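
The filter-then-append shape of this fixup can be modeled in a few lines of Python. This is a sketch only: `Edge`, `fix_cross_stage_edges`, and the op names are hypothetical, and it assumes the spurious edges are the same-partition FLOW edges whose consumer is a cross-stage frag. The real helpers in pipeline.phase_derivation may use different criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical stand-in for a dependency edge (not the real DepEdge)."""
    kind: str      # "FLOW" or "ANTI"
    src: str       # producer op name
    dst: str       # consumer op name
    src_part: int  # K-partition of the producer
    dst_part: int  # K-partition of the consumer


def fix_cross_stage_edges(default_edges, cross_stage_frags, extra_edges):
    # Step 1: drop same-partition FLOW edges that feed a cross-stage frag
    # (assumed definition of "spurious" for this sketch).
    kept = [
        e
        for e in default_edges
        if not (
            e.kind == "FLOW"
            and e.src_part == e.dst_part
            and e.dst in cross_stage_frags
        )
    ]
    # Step 2: append the cross-partition FLOW and same-partition ANTI edges.
    return kept + list(extra_edges)


default_edges = [
    Edge("FLOW", "load_a0", "frag_a0", 0, 0),  # spurious: frag_a0 is cross-stage
    Edge("FLOW", "frag_a1", "mma_1", 0, 0),    # unrelated edge, kept
]
extra_edges = [
    Edge("FLOW", "load_a0_p1", "frag_a0", 1, 0),  # cross-partition FLOW
    Edge("ANTI", "frag_a0", "load_a0", 0, 0),     # same-partition ANTI
]
fixed = fix_cross_stage_edges(default_edges, {"frag_a0"}, extra_edges)
```

In this model the result keeps the unrelated edge, drops the spurious one, and ends with the two appended edges in the order they were supplied.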

`schedule_config`

def schedule_config(self) -> ScheduleConfig

Returns the schedule-level configuration for this pipeline.

Returns:

ScheduleConfig: The ScheduleConfig set up in __init__.

`build_explicit_blocks`

def build_explicit_blocks(self, body: List[OpDesc], program: PipelineProgram) -> List[List[OpDesc]]

Emits each block via emit_minimal_barrier_block.

Same shape as the hand-tuned _run_iter's mini-iters: optional sched_barrier wrap + entry waits, frag/load section, optional sync-group wrap + pre_sync/barrier/post-barrier-lgkm, then the MMA.

Wait values, frag/load assignments, and barrier flags come from program.blocks[i] (populated by _construct_mma_blocks + auto-wait derivation). The schedule's only contribution is choosing the helper — the per-block ops are entirely framework-derived.

Args:

body (List[OpDesc]): The annotated op list (unused here; ops come from program.blocks).
program (PipelineProgram): The compiled pipeline program containing per-block wait counts and barrier flags.

Returns:

List[List[OpDesc]]: One inner list per block, each holding the ops emitted by emit_minimal_barrier_block.
