Mojo struct
PipelineConfig
struct PipelineConfig
Declarative pipeline strategy.
Captures all the knowledge needed to transform a logical loop body into a pipelined schedule: buffer depth, prefetch distance, loop-carried edges, MMA block sizing, and hardware model.
Platform-specific factories (e.g., mi355x_double_buffer() in amd_target.mojo) provide tuned configurations.
Fields
- depth (Int)
- prefetch (Int)
- drain_passes (Int)
- prologue_fill (Int)
- loop_carried (LoopCarriedSpec)
- block_sizing (BlockSizing)
- frag_order (FragOrder)
- m_mmas (Int)
- n_mmas (Int)
- num_partitions (Int)
- mma_serial (Bool)
- mma_latency (Int)
- vm_per_load_a (Int)
- vm_per_load_b (Int)
- lgkm_per_load_a (Int): Kernel-geometry-derived lgkmcnt entries per channel-A frag-load. 0 falls back to ScheduleConfig.lgkm_per_load_a.
- lgkm_per_load_b (Int): Kernel-geometry-derived lgkmcnt entries per channel-B frag-load. 0 falls back to ScheduleConfig.lgkm_per_load_b.
- ch0_match_field (Int)
- ch1_match_field (Int)
- warp_stagger (WarpStaggerRule)
- cross_stage_rotation (Bool): True when the schedule intentionally pre-loads the next K-partition's leading-quadrant fragments from the other SMEM stage (4-wave's mini-3/4 register rotation). Relaxes the "fragment loads in half h must use stage h" invariant in program_builder._verify_stage_consistency: same-stage and cross-stage frags coexist by design when this is True. Default False keeps the strict check active for ping-pong and other schedules that don't rotate.
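The effect of cross_stage_rotation on the stage invariant can be modeled in a few lines of Python. This is only a sketch: FragLoad, its half and stage fields, and this verify_stage_consistency function are hypothetical names chosen for illustration, not the actual check in program_builder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FragLoad:
    """Hypothetical record of one fragment load in a schedule."""
    half: int   # half of the double-buffered loop body that issues the load
    stage: int  # SMEM stage the load reads from


def verify_stage_consistency(loads, cross_stage_rotation=False):
    """Return True if every load satisfies the (possibly relaxed) invariant.

    Strict mode: a load issued in half h must read stage h.
    Relaxed mode: cross-stage loads are accepted alongside same-stage ones.
    """
    if cross_stage_rotation:
        return True  # same-stage and cross-stage loads coexist by design
    return all(load.stage == load.half for load in loads)


ping_pong = [FragLoad(half=0, stage=0), FragLoad(half=1, stage=1)]
# One extra load that reads from the other stage, as a rotating schedule would.
rotated = ping_pong + [FragLoad(half=0, stage=1)]

strict_ok = verify_stage_consistency(ping_pong)
strict_rejects_rotation = not verify_stage_consistency(rotated)
relaxed_ok = verify_stage_consistency(rotated, cross_stage_rotation=True)
```

In this model the ping-pong schedule passes the strict check, the rotated schedule fails it, and the same rotated schedule passes once the flag is set.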
Implemented traits
AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDestructible, Movable
Methods
__init__
__init__(out self, *, depth: Int, prefetch: Int, drain_passes: Int, prologue_fill: Int, loop_carried: LoopCarriedSpec, block_sizing: BlockSizing, frag_order: FragOrder, m_mmas: Int, n_mmas: Int, num_partitions: Int, mma_serial: Bool, mma_latency: Int, vm_per_load_a: Int, vm_per_load_b: Int, ch0_match_field: Int, ch1_match_field: Int, warp_stagger: WarpStaggerRule, lgkm_per_load_a: Int = 0, lgkm_per_load_b: Int = 0, cross_stage_rotation: Bool = False)
Constructs a PipelineConfig from individual fields.
lgkm_per_load_a / lgkm_per_load_b are optional kernel-geometry
defaults; pass 0 to fall back to ScheduleConfig.lgkm_per_load_*.
See the field-level docstrings on PipelineConfig for per-field
meanings.
Args:
- depth (Int): Pipeline buffer depth (1 = single, 2 = double).
- prefetch (Int): DRAM-prefetch distance, typically 1.
- drain_passes (Int): Epilogue drain iteration count.
- prologue_fill (Int): Extra load iterations in the prologue.
- loop_carried (LoopCarriedSpec): Ops crossing loop iteration boundaries.
- block_sizing (BlockSizing): MMA block op targets.
- frag_order (FragOrder): Fragment ordering within a block.
- m_mmas (Int): M-dimension MMA tile count.
- n_mmas (Int): N-dimension MMA tile count.
- num_partitions (Int): Number of warp groups.
- mma_serial (Bool): Whether the MMA unit is serial.
- mma_latency (Int): MMA latency in cycles.
- vm_per_load_a (Int): vmcnt ops per channel-A global load.
- vm_per_load_b (Int): vmcnt ops per channel-B global load.
- ch0_match_field (Int): Channel-0 register-flow match field.
- ch1_match_field (Int): Channel-1 register-flow match field.
- warp_stagger (WarpStaggerRule): Warp-group stagger configuration.
- lgkm_per_load_a (Int): lgkmcnt ops per channel-A frag-load (0 = fall back to ScheduleConfig).
- lgkm_per_load_b (Int): lgkmcnt ops per channel-B frag-load (0 = fall back to ScheduleConfig).
- cross_stage_rotation (Bool): Set to True for schedules that intentionally pre-load the next K-partition's leading-quadrant fragments from the cross stage (4-wave's mini-3/4 rotation). Relaxes the strict stage-consistency invariant in _verify_stage_consistency.
mmas_per_partition
globals_per_partition
globals_per_partition(self) -> Int
Global loads per warp group: m_mmas + n_mmas (A + B tiles).
Returns:
Int
frags_per_partition
frags_per_partition(self) -> Int
Fragment loads per warp group: m_mmas + n_mmas (A + B frags).
Returns:
Int
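Both per-partition counts above reduce to the same sum. The Python sketch below models that arithmetic with free functions; these are illustrative stand-ins, not the Mojo methods themselves, and they take m_mmas and n_mmas as plain arguments instead of reading them from a PipelineConfig.

```python
def globals_per_partition(m_mmas: int, n_mmas: int) -> int:
    # One global load per A tile plus one per B tile.
    return m_mmas + n_mmas


def frags_per_partition(m_mmas: int, n_mmas: int) -> int:
    # One fragment load per A frag plus one per B frag.
    return m_mmas + n_mmas


# Example: 2 MMA tiles along M and 4 along N.
example_globals = globals_per_partition(m_mmas=2, n_mmas=4)
example_frags = frags_per_partition(m_mmas=2, n_mmas=4)
```

For a 2 x 4 tiling, each warp group issues 6 global loads and 6 fragment loads.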
ops_per_partition
total_ops
blocks_per_partition
total_blocks
compute_match_key
compute_match_key(self, compute_op: OpDesc, channel: Int) -> Int
Extracts the compute-op field that a fragment on the given channel is matched against.
For channel 0 (A): returns compute.stage (row). For channel 1 (B): returns compute.subtile (col).
Returns:
Int
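The channel-to-field mapping described above can be sketched in Python as follows. The OpDesc here is a hypothetical stand-in that carries only the two fields named in the description, and treating every non-zero channel as channel B is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpDesc:
    """Hypothetical stand-in with only the two fields the description names."""
    stage: int    # row index of the compute op
    subtile: int  # column index of the compute op


def compute_match_key(compute_op: OpDesc, channel: int) -> int:
    # Channel 0 (A) fragments are matched by row, channel 1 (B) by column.
    # Assumption: any non-zero channel is treated as channel B.
    return compute_op.stage if channel == 0 else compute_op.subtile


op = OpDesc(stage=1, subtile=3)
key_a = compute_match_key(op, channel=0)
key_b = compute_match_key(op, channel=1)
```

For a compute op at row 1, column 3, an A fragment matches on key 1 and a B fragment matches on key 3.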
vm_per_channel
vm_per_channel(self, channel: Int) -> Int
Returns the vmcnt cost for a global load on the given channel.
Returns:
Int
lgkm_per_channel
lgkm_per_channel(self, channel: Int) -> Int
Returns the lgkmcnt cost for a fragment load on the given channel.
Reads from lgkm_per_load_a/b set on the config (typically
populated from KernelGeometry). Returns 0 if unset; callers
should fall back to ScheduleConfig.lgkm_per_load_* for
legacy schedules.
Args:
- βchannel (
Int): 0 for channel A, anything else for channel B.
Returns:
Int: lgkmcnt entries per fragment load on channel, or 0 if
unset.
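The channel selection and the caller-side fallback can be modeled in Python as below. This is a sketch under assumptions: resolve_lgkm is a hypothetical helper, and schedule_default stands in for the matching ScheduleConfig.lgkm_per_load_* value.

```python
def lgkm_per_channel(lgkm_per_load_a: int, lgkm_per_load_b: int, channel: int) -> int:
    # 0 selects channel A; any other value selects channel B.
    return lgkm_per_load_a if channel == 0 else lgkm_per_load_b


def resolve_lgkm(config_value: int, schedule_default: int) -> int:
    # Hypothetical caller-side helper: 0 means "unset", so use the
    # schedule-level default instead.
    return config_value if config_value != 0 else schedule_default


# Channel A is set on the config (4); channel B is left unset (0).
a_cost = resolve_lgkm(lgkm_per_channel(4, 0, channel=0), schedule_default=2)
b_cost = resolve_lgkm(lgkm_per_channel(4, 0, channel=1), schedule_default=2)
```

Here channel A keeps its configured cost of 4, while channel B falls back to the schedule default of 2.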
total_edges
total_edges(self) -> Int
Total dependency edges for a double-buffer pipeline.
Four phases of edges connect ops within and across iterations:
- reg_flow: fragment_load → compute (register FLOW). Each half has 2 channels (A, B), each channel's frag feeds m*n compute ops.
- accum: compute → compute (accumulator forwarding). m*n accumulator tiles forwarded between halves, twice (half0→half1 at d=0, half1→half0 at d=1).
- lds_flow: global_load → fragment_load (LDS FLOW). Each half has g global loads, each feeds one frag per buffer stage (×2 for double-buffering).
- lds_anti: fragment_load → global_load (LDS ANTI). Prevents a prefetch write from overwriting data a frag still needs. 2*g frags total minus 1 (last frag has no successor load), doubled for both channels.
Returns:
Int
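One possible reading of the four edge categories is worked through in the Python sketch below. Several multiplicities are assumptions of this sketch, not confirmed by the description: that g equals m + n, that reg_flow counts 2 halves x 2 channels x m*n, and that lds_flow counts 2 halves x g x 2 stages. It does not claim to reproduce the value total_edges returns.

```python
def edge_breakdown(m: int, n: int) -> dict:
    """Per-category edge counts under the assumptions stated in the lead-in."""
    g = m + n  # assumed: global loads per half equal m_mmas + n_mmas
    return {
        # Assumed: 2 halves x 2 channels x m*n compute consumers per channel.
        "reg_flow": 2 * 2 * m * n,
        # m*n accumulator tiles, forwarded half0->half1 and half1->half0.
        "accum": 2 * m * n,
        # Assumed: 2 halves x g global loads x 2 buffer stages.
        "lds_flow": 2 * g * 2,
        # (2*g - 1) frag->load pairs, doubled for both channels.
        "lds_anti": 2 * (2 * g - 1),
    }


counts = edge_breakdown(m=2, n=2)
total = sum(counts.values())
```

With m = n = 2, this reading gives 16 reg_flow, 8 accum, 16 lds_flow, and 14 lds_anti edges, 54 in total.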