
Mojo struct

OutputTilePipeline

struct OutputTilePipeline[opc: OutputPipelineConfig]

Pipeline for MMA→Epilogue TMEM stage synchronization.

Fields

  • pipeline (OutputTilePipeline[opc].Pipeline):
  • tmem (OutputTilePipeline[opc].Tmem):
  • mma_complete_mask (UInt16):

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDeletable, Movable, RegisterPassable, TrivialRegisterPassable

comptime members

BarrierArray

comptime BarrierArray = SMemArray[SharedMemBarrier, (OutputTilePipeline[opc].num_stages * 2)]

cta_group

comptime cta_group = opc.cta_group

num_stages

comptime num_stages = opc.num_stages

Pipeline

comptime Pipeline = ProducerConsumerPipeline[OutputTilePipeline[opc].num_stages]

Stage

comptime Stage = OutputStage[opc]

stage_stride_cols

comptime stage_stride_cols = opc.stage_stride_cols

Tmem

comptime Tmem = TmemAllocation[OutputTilePipeline[opc].cta_group]

Methods

__init__

def __init__(barriers_ptr: UnsafePointer[SharedMemBarrier, MutUntrackedOrigin, address_space=AddressSpace.SHARED], tmem: TmemAllocation[Self.cta_group], mma_complete_mask: UInt16) -> Self

Initialize from barrier pointer, TMEM allocation, and multicast mask.

init_barriers

static def init_barriers(storage_ptr: UnsafePointer[SharedMemBarrier, MutUntrackedOrigin, address_space=AddressSpace.SHARED], producer_arv_count: Int32, consumer_arv_count: Int32)

Initialize pipeline barriers. Call this once, from the elect_one thread.
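As a rough host-side illustration of what barrier initialization sets up, the following Python sketch models arrival-counted barriers. It is not Mojo and not the library's API: `ToyBarrier` and `init_barriers_model` are hypothetical names, and the layout (all "full" barriers first, then all "empty" barriers, giving the `num_stages * 2` entries of BarrierArray) is an assumption made for the example. It also assumes the "full" barriers take the producer arrival count and the "empty" barriers take the consumer arrival count.

```python
class ToyBarrier:
    """Toy arrival-counted barrier: the phase bit flips once the
    expected number of arrivals has been observed."""

    def __init__(self, arrive_count):
        self.arrive_count = arrive_count
        self.pending = arrive_count
        self.phase = 0

    def arrive(self):
        self.pending -= 1
        if self.pending == 0:
            # Phase complete: flip parity and re-arm for the next phase.
            self.phase ^= 1
            self.pending = self.arrive_count


def init_barriers_model(num_stages, producer_arv_count, consumer_arv_count):
    # Assumed layout: indices [0, num_stages) are "full" barriers,
    # indices [num_stages, 2 * num_stages) are "empty" barriers.
    full = [ToyBarrier(producer_arv_count) for _ in range(num_stages)]
    empty = [ToyBarrier(consumer_arv_count) for _ in range(num_stages)]
    return full + empty
```

In this model, a stage's "empty" barrier only completes its phase after every expected consumer arrival, which is why the two arrival counts are supplied separately.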

acquire_for_mma

def acquire_for_mma(self) -> Self.Stage

Acquire stage for MMA, waiting for epilogue to finish.

Returns:

Self.Stage

release_from_mma

def release_from_mma(mut self, stage: OutputStage[opc])

Signal MMA completion using mma_arrive (1-SM) or multicast (2-SM).

acquire_for_epilogue

def acquire_for_epilogue(self) -> Self.Stage

Acquire stage for epilogue, waiting for MMA to complete.

Returns:

Self.Stage

release_from_epilogue

def release_from_epilogue(mut self)

Signal epilogue completion, freeing stage for MMA reuse.
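The four acquire/release methods above form a handshake between the MMA side and the epilogue side. The Python sketch below is a host-side toy model of that handshake using threads. It is not the library's implementation: `StagePipelineModel` and `run_model` are hypothetical names, and a boolean "full" flag per stage stands in for the hardware barriers (an assumption made to keep the model small).

```python
import threading


class StagePipelineModel:
    """Toy model of a multi-stage MMA -> epilogue handshake."""

    def __init__(self, num_stages):
        self.num_stages = num_stages
        self.full = [False] * num_stages  # True: stage holds unread output
        self.cond = threading.Condition()
        self.producer_stage = 0
        self.consumer_stage = 0

    def acquire_for_mma(self):
        # Wait until the epilogue has drained the current producer stage.
        with self.cond:
            stage = self.producer_stage
            self.cond.wait_for(lambda: not self.full[stage])
            return stage

    def release_from_mma(self, stage):
        # Mark the stage as ready for the epilogue and advance.
        with self.cond:
            self.full[stage] = True
            self.producer_stage = (stage + 1) % self.num_stages
            self.cond.notify_all()

    def acquire_for_epilogue(self):
        # Wait until the MMA side has filled the current consumer stage.
        with self.cond:
            stage = self.consumer_stage
            self.cond.wait_for(lambda: self.full[stage])
            return stage

    def release_from_epilogue(self):
        # Free the stage for MMA reuse and advance.
        with self.cond:
            stage = self.consumer_stage
            self.full[stage] = False
            self.consumer_stage = (stage + 1) % self.num_stages
            self.cond.notify_all()


def run_model(num_tiles=8, num_stages=2):
    pipe = StagePipelineModel(num_stages)
    slots = [None] * num_stages  # stands in for per-stage accumulators
    seen = []

    def mma_warp():
        for tile in range(num_tiles):
            stage = pipe.acquire_for_mma()
            slots[stage] = tile * tile  # "compute" into the stage
            pipe.release_from_mma(stage)

    def epilogue_warp():
        for _ in range(num_tiles):
            stage = pipe.acquire_for_epilogue()
            seen.append(slots[stage])  # "write out" the stage
            pipe.release_from_epilogue()

    threads = [threading.Thread(target=mma_warp),
               threading.Thread(target=epilogue_warp)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

With more than one stage, the producer in this model can run ahead of the consumer by up to `num_stages` tiles, which is the overlap a multi-stage pipeline is meant to provide.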

producer

def producer[origin: MutOrigin, //](ref[opc] self) -> OutputProducer[origin, opc]

Get producer view for MMA warp.

Returns:

OutputProducer[origin, opc]

consumer

def consumer[origin: MutOrigin, //](ref[opc] self) -> OutputConsumer[origin, opc]

Get consumer view for epilogue warp.

Returns:

OutputConsumer[origin, opc]

acquire_mma_linear

def acquire_mma_linear[origin: MutOrigin, //](ref[opc] self) -> MmaStage[origin, opc]

Acquire a stage for MMA using linear types.

Waits for the epilogue to free the current stage, then returns a linear type handle that MUST be released (compiler-enforced).

Usage:

    var stage = output_pipeline.acquire_mma_linear()
    mma_op.mma(a_tile, b_tile, stage.tmem_offset())
    mma_op.commit(stage.mbar())
    stage^.release()  # Signals mma_arrive and advances

Returns:

MmaStage[origin, opc]: An MmaStage handle that must be released.

acquire_epilogue_linear

def acquire_epilogue_linear[origin: MutOrigin, //](ref[opc] self) -> EpilogueStage[origin, opc]

Acquire a stage for epilogue using linear types.

Waits for MMA to complete the current stage, then returns a linear type handle that MUST be released (compiler-enforced).

Usage:

    var stage = output_pipeline.acquire_epilogue_linear()
    process_tmem(stage.tmem())
    stage^.release()  # Advances to next stage

Returns:

EpilogueStage[origin, opc]: An EpilogueStage handle that must be released.
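The linear acquire methods return a handle that must be released exactly once. Python has no linear types, so the sketch below only approximates that contract with runtime checks; in Mojo the check is made by the compiler. `StageHandle`, `LinearPipelineModel`, and `visit_stages` are hypothetical names used purely for illustration.

```python
class StageHandle:
    """Toy stand-in for a must-release stage handle."""

    def __init__(self, owner, index):
        self._owner = owner
        self.index = index
        self._released = False

    def release(self):
        if self._released:
            raise RuntimeError("stage handle released twice")
        self._released = True
        self._owner.outstanding -= 1
        # Releasing the handle is what advances the pipeline.
        self._owner.next_stage = (self.index + 1) % self._owner.num_stages


class LinearPipelineModel:
    def __init__(self, num_stages):
        self.num_stages = num_stages
        self.next_stage = 0
        self.outstanding = 0

    def acquire(self):
        # Runtime analogue of the must-release rule: a forgotten
        # release is detected at the next acquire.
        if self.outstanding:
            raise RuntimeError("previous stage handle was never released")
        self.outstanding += 1
        return StageHandle(self, self.next_stage)


def visit_stages(num_tiles, num_stages):
    pipe = LinearPipelineModel(num_stages)
    order = []
    for _ in range(num_tiles):
        stage = pipe.acquire()
        order.append(stage.index)
        stage.release()
    return order
```

The design point this illustrates is that tying the stage advance to the handle's release makes a missing release an error rather than a silent hang.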

get_pipeline

def get_pipeline(self) -> Self.Pipeline

Get underlying pipeline (used during barrier initialization).

Returns:

Self.Pipeline

per_k

def per_k[origin: MutOrigin, //](ref[opc] self) -> OutputKPipeline[origin, opc]

Get per-K-iteration view for kernels with per-K signaling.

Unlike producer()/consumer(), which signal once per tile (after all K iterations), this view signals after each K iteration. Use it for kernels with per-K accumulation patterns (e.g., blockwise FP8).

Returns:

OutputKPipeline[origin, opc]: OutputKPipeline view that provides produce()/consume() context managers for per-K-iteration barrier signaling.
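To make the difference between per-tile and per-K signaling concrete, the Python sketch below counts how many signals each pattern emits. It is a host-side illustration only: `SignalCounter`, `per_tile_signals`, and `per_k_signals` are hypothetical names, and the context manager merely counts exits rather than touching any barrier.

```python
from contextlib import contextmanager


class SignalCounter:
    def __init__(self):
        self.signals = 0

    @contextmanager
    def produce(self):
        yield
        # Signal once when the block exits normally.
        self.signals += 1


def per_tile_signals(num_tiles, k_iters):
    counter = SignalCounter()
    for _ in range(num_tiles):
        with counter.produce():
            for _ in range(k_iters):
                pass  # all K iterations accumulate before one signal
    return counter.signals


def per_k_signals(num_tiles, k_iters):
    counter = SignalCounter()
    for _ in range(num_tiles):
        for _ in range(k_iters):
            with counter.produce():
                pass  # one signal after every K iteration
    return counter.signals
```

For 3 tiles with 4 K iterations each, the per-tile pattern signals 3 times and the per-K pattern signals 12 times.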

per_k_epilogue

def per_k_epilogue[output_origin: MutOrigin, input_origin: MutOrigin, num_input_stages: Int](ref[opc] self, ref[output_origin] input_pipeline: ProducerConsumerPipeline[num_input_stages]) -> EpilogueKContext[output_origin, input_origin, opc, num_input_stages]

Get combined per-K epilogue context for blockwise FP8.

Bundles the output pipeline (MMA→Epilogue sync) and the input pipeline (A-scales consumption) into a single context manager.

Example:

    for k_iter in range(num_iters):
        with output_pipeline.per_k_epilogue(input_pipeline) as stage:
            accum.promote(stage, ...)
        # Both pipelines signaled automatically

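Args:

  • input_pipeline (ProducerConsumerPipeline[num_input_stages]): The input pipeline (A-scales consumption) to bundle with this output pipeline.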

Returns:

EpilogueKContext[output_origin, input_origin, opc, num_input_stages]: EpilogueKContext context manager that handles both pipelines.
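The idea of bundling two pipelines behind one context manager can be sketched on the host in Python. This is an illustration under stated assumptions, not the library's behavior: `ToyPipeline`, `per_k_epilogue_model`, and `run_k_loop` are hypothetical names, and the acquire and release order shown here is chosen for the example rather than taken from the source.

```python
from contextlib import contextmanager


class ToyPipeline:
    """Records acquire/release events instead of waiting on barriers."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.stage = 0

    def acquire(self):
        self.log.append((self.name, "acquire", self.stage))
        return self.stage

    def release(self):
        self.log.append((self.name, "release", self.stage))
        self.stage += 1


@contextmanager
def per_k_epilogue_model(output_pipeline, input_pipeline):
    # Acquire both pipelines on entry (order is an assumption).
    out_stage = output_pipeline.acquire()
    input_pipeline.acquire()
    try:
        yield out_stage
    finally:
        # Release both on exit, so the loop body never signals by hand.
        input_pipeline.release()
        output_pipeline.release()


def run_k_loop(num_iters):
    log = []
    output_pipeline = ToyPipeline("output", log)
    input_pipeline = ToyPipeline("input", log)
    for _ in range(num_iters):
        with per_k_epilogue_model(output_pipeline, input_pipeline) as stage:
            log.append(("body", "promote", stage))
    return log
```

Each loop iteration in this model produces exactly one acquire and one release per pipeline, wrapped around the body.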