Mojo struct

Pipeline4Wave

struct Pipeline4Wave[geometry: KernelGeometry]

4-wave pipeline schedule with cross-stage register rotation.

Produces the 24-op body in mini-iter order. The framework consumes that order verbatim under SchedulingStrategy.IDENTITY, so the final kernel emission matches the hand-written _run_iter body op-for-op. The one exception is wait-count derivation, which the framework handles via derive_waits_from_blocks.

Takes a KernelGeometry (kernel-shape-derived constants) as its only template parameter; this replaces the previous [is_fp8, lgkm_a, lgkm_b] triple. lgkm_per_load_* is read directly from geometry rather than threaded through ScheduleConfig.

Parameters​

  • ​geometry (KernelGeometry): Kernel-shape-derived constants (lgkm/vm costs, etc.).

Implemented traits​

AnyType, ImplicitlyDestructible, PipelineSchedule

Methods​

__init__​

__init__(out self, config: ScheduleConfig = Pipeline4Wave._default_schedule_config(), target: TargetProfile = mi355x_target(4, 4, 1))

Constructs a Pipeline4Wave schedule with optional overrides.

Args:

  • ​config (ScheduleConfig): Schedule-level knobs (wait counts, barrier policy). Cross-stage-rotation invariants are re-applied even if the caller mutates them.
  • ​target (TargetProfile): Target hardware profile (defaults to MI355X).

config​

config(self) -> PipelineConfig

Returns the underlying target PipelineConfig.

Returns:

PipelineConfig: The pipeline config from the target profile.

declare_ops​

declare_ops(self) -> List[OpDesc]

Declares the logical 24-op body across both K-partitions.

Returns:

List[OpDesc]: The full list of OpDescs in mini-iter order.

build_body​

build_body(self) -> List[OpDesc]

Annotates the logical ops with the target's cost model.

Skips double_buffer_reorder: the body is already in mini-iter order, and mma_block_interleave_list would break cross-stage frag placement (it matches frags to MMAs by subtile only, ignoring the frag's stage field, so it cannot distinguish a same-stage sub=0 frag from a cross-stage sub=0 frag).

Returns:

List[OpDesc]: The annotated list of OpDescs ready for compilation.
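
To see why subtile-only matching is unsafe for this schedule, consider the minimal Python model below. It is an illustration, not the framework's code: the `Frag` record, its field names, and both lookup functions are hypothetical stand-ins for the kind of matching that mma_block_interleave_list is described as doing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frag:
    """Hypothetical stand-in for a frag-load op; not the framework's OpDesc."""
    name: str
    stage: int    # pipeline stage the frag reads from
    subtile: int  # subtile (sub=N) the frag feeds


def match_by_subtile(frags, subtile):
    """Model of subtile-only matching: the stage field is ignored."""
    return [f.name for f in frags if f.subtile == subtile]


def match_by_stage_and_subtile(frags, stage, subtile):
    """Matching that also keys on stage, so the two sub=0 frags stay distinct."""
    return [f.name for f in frags if f.stage == stage and f.subtile == subtile]


frags = [
    Frag("same_stage_sub0", stage=0, subtile=0),
    Frag("cross_stage_sub0", stage=1, subtile=0),
    Frag("same_stage_sub1", stage=0, subtile=1),
]

# Subtile-only lookup returns both sub=0 frags, so placement is ambiguous.
ambiguous = match_by_subtile(frags, subtile=0)
# Keying on stage as well picks out exactly the cross-stage frag.
exact = match_by_stage_and_subtile(frags, stage=1, subtile=0)
print(ambiguous)
print(exact)
```

In this model the subtile-only lookup cannot tell the two sub=0 frags apart, which is the ambiguity the schedule avoids by leaving the body in its original order.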

bootstrap_frags​

bootstrap_frags(self) -> List[OpDesc]

Bootstraps A_quad[0] + B_quad[0] for the first main-loop iter.

The body's sub=0 frags read the cross stage as part of the cross-stage rotation pattern. For the very first main iter there is no previous half to have populated those quadrants, so we explicitly emit two same-stage sub=0 frag-loads here. The framework pairs each with a partial wait_vm drain (and a barrier) so that each fires as soon as the one prefetch it depends on has completed; the remaining 6 prefetches stay in flight.

Returns:

List[OpDesc]: A 2-element list of A/B sub=0 frag-load OpDescs.
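
The partial drain can be pictured with a small counter model. This Python sketch assumes that vector-memory loads complete in issue order and that a wait of N blocks until at most N loads are still outstanding, as with a vmcnt-style counter. The helper name, the total of 8 prefetches (two bootstrap dependencies plus the 6 that stay in flight), and the issue positions of the A and B prefetches are assumptions for illustration, not values taken from the framework.

```python
def partial_wait_count(total_issued: int, needed_index: int) -> int:
    """Largest wait count that still guarantees load `needed_index` is done.

    Assumes loads retire in issue order (index 0 was issued first), so
    allowing N outstanding loads means the first total_issued - N are done.
    """
    if not 0 <= needed_index < total_issued:
        raise ValueError("needed_index must refer to an issued load")
    return total_issued - (needed_index + 1)


TOTAL_PREFETCHES = 8  # assumed: 2 bootstrap dependencies + 6 left in flight

# Assumed issue order: the A prefetch first, the B prefetch second.
wait_for_a = partial_wait_count(TOTAL_PREFETCHES, needed_index=0)
wait_for_b = partial_wait_count(TOTAL_PREFETCHES, needed_index=1)
print(wait_for_a, wait_for_b)
```

Under these assumptions the second wait leaves 6 loads outstanding, which is consistent with the count described above; a full drain would correspond to a wait count of 0.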

derive_edges​

derive_edges(self, body: List[OpDesc]) -> List[DepEdge]

Derives dependency edges with cross-stage rotation fixups.

Runs the framework's default edge derivation, filters out the spurious same-partition FLOW edges that Phase 1 emits for cross-stage frags, then appends the cross-partition FLOW and same-partition ANTI edges. The helpers for both steps live in pipeline.phase_derivation and are reusable across cross-stage rotation schedules.

Args:

  • ​body (List[OpDesc]): The annotated op list returned by build_body().

Returns:

List[DepEdge]: The complete list of dependency edges for wait derivation.
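
The filter-then-append flow can be sketched in Python as follows. The `Op` and `Edge` records, the `kind` strings, the function names, and the example edges are all hypothetical; in particular, which ops the real cross-partition FLOW and same-partition ANTI edges connect is an assumption here, and the signatures of the helpers in pipeline.phase_derivation are not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical op record; not the framework's OpDesc."""
    name: str
    partition: int
    cross_stage_frag: bool = False


@dataclass(frozen=True)
class Edge:
    """Hypothetical dependency edge; not the framework's DepEdge."""
    src: str
    dst: str
    kind: str  # "FLOW" or "ANTI"


def drop_spurious_flow(edges, ops):
    """Drop FLOW edges that reach a cross-stage frag from its own partition."""
    by_name = {op.name: op for op in ops}
    kept = []
    for e in edges:
        dst = by_name[e.dst]
        same_partition = by_name[e.src].partition == dst.partition
        if e.kind == "FLOW" and dst.cross_stage_frag and same_partition:
            continue  # the frag really depends on the other partition
        kept.append(e)
    return kept


def apply_rotation_fixups(default_edges, ops, extra_edges):
    """Filter the default edges, then append the rotation-specific ones."""
    return drop_spurious_flow(default_edges, ops) + list(extra_edges)


ops = [
    Op("load_p0", partition=0),
    Op("load_p1", partition=1),
    Op("frag_p0_sub0", partition=0, cross_stage_frag=True),
    Op("mma_p0", partition=0),
]
default_edges = [
    Edge("load_p0", "frag_p0_sub0", "FLOW"),  # spurious in this model
    Edge("frag_p0_sub0", "mma_p0", "FLOW"),   # kept: the MMA consumes the frag
]
extra_edges = [
    Edge("load_p1", "frag_p0_sub0", "FLOW"),  # assumed cross-partition producer
    Edge("frag_p0_sub0", "load_p0", "ANTI"),  # assumed read-before-overwrite
]
edges = apply_rotation_fixups(default_edges, ops, extra_edges)
for e in edges:
    print(e.src, "->", e.dst, e.kind)
```

The point of the model is the ordering of the two steps: filtering first keeps the appended edges from being mistaken for the spurious ones.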

schedule_config​

schedule_config(self) -> ScheduleConfig

Returns the schedule-level configuration for this pipeline.

Returns:

ScheduleConfig: The ScheduleConfig set up in __init__.

build_explicit_blocks​

build_explicit_blocks(self, body: List[OpDesc], program: PipelineProgram) -> List[List[OpDesc]]

Emits each block via emit_minimal_barrier_block.

Each block has the same shape as a mini-iter of the hand-tuned _run_iter: optional sched_barrier wrap + entry waits, frag/load section, optional sync-group wrap + pre_sync/barrier/post-barrier-lgkm, then the MMA.

Wait values, frag/load assignments, and barrier flags come from program.blocks[i] (populated by _construct_mma_blocks + auto-wait derivation). The schedule's only contribution is choosing the helper; the per-block ops are entirely framework-derived.

Args:

  • ​body (List[OpDesc]): The annotated op list (unused here; ops come from program.blocks).
  • ​program (PipelineProgram): The compiled pipeline program containing per-block wait counts and barrier flags.

Returns:

List[List[OpDesc]]: One inner list per block, each holding the ops emitted by emit_minimal_barrier_block.
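
The block shape described above can be modelled as a function that assembles op names in order. This is a Python illustration only: the `Block` fields and the emitted op strings are hypothetical names chosen to mirror the prose, not the fields of program.blocks[i] or the output of emit_minimal_barrier_block. Placing the sched_barrier wrap around the entry waits is one reading of the description and is an assumption, and the sync-group wrap is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    """Hypothetical per-block record with framework-derived values."""
    mma: str
    frags_and_loads: List[str] = field(default_factory=list)
    entry_wait_vm: Optional[int] = None
    sched_barrier: bool = False
    pre_sync: List[str] = field(default_factory=list)
    barrier: bool = False
    post_barrier_lgkm: Optional[int] = None


def emit_block(b: Block) -> List[str]:
    """Assemble one block: entry waits, frag/load section, sync, then MMA."""
    ops: List[str] = []
    if b.entry_wait_vm is not None:
        entry = [f"wait_vm({b.entry_wait_vm})"]
        if b.sched_barrier:  # assumed: the wrap brackets the entry waits
            entry = ["sched_barrier"] + entry + ["sched_barrier"]
        ops.extend(entry)
    ops.extend(b.frags_and_loads)
    ops.extend(b.pre_sync)
    if b.barrier:
        ops.append("barrier")
        if b.post_barrier_lgkm is not None:
            ops.append(f"wait_lgkm({b.post_barrier_lgkm})")
    ops.append(b.mma)  # the MMA always closes the block
    return ops


program_blocks = [
    Block(mma="mma[0]", frags_and_loads=["frag_a[0]", "load_b[1]"],
          entry_wait_vm=6, sched_barrier=True,
          barrier=True, post_barrier_lgkm=0),
    Block(mma="mma[1]", frags_and_loads=["frag_b[0]"]),
]
# One inner list per block, mirroring the List[List[...]] return shape.
blocks = [emit_block(b) for b in program_blocks]
for ops in blocks:
    print(ops)
```

As in the method itself, the emitter in this model makes no scheduling decisions; every value it uses is read from the per-block record.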