Mojo struct

OutputTilePipeline

@register_passable(trivial) struct OutputTilePipeline[num_stages: Int, stage_stride_cols: Int, cta_group: Int]

Pipeline for MMA→Epilogue TMEM stage synchronization.

Fields

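  • pipeline (OutputTilePipeline[num_stages, stage_stride_cols, cta_group].Pipeline): Underlying producer-consumer pipeline.
  • tmem (OutputTilePipeline[num_stages, stage_stride_cols, cta_group].Tmem): TMEM allocation.
  • mma_complete_mask (UInt16): Multicast mask used when signaling MMA completion.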

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDestructible, Movable

comptime members

__copyinit__is_trivial

comptime __copyinit__is_trivial = True

__del__is_trivial

comptime __del__is_trivial = True

__moveinit__is_trivial

comptime __moveinit__is_trivial = True

BarrierArray

comptime BarrierArray = SMemArray[SharedMemBarrier, (num_stages * 2)]

Pipeline

comptime Pipeline = ProducerConsumerPipeline[num_stages]

Stage

comptime Stage = OutputStage[num_stages, stage_stride_cols, cta_group]

Tmem

comptime Tmem = TmemAllocation[cta_group]

Methods

__init__

__init__(barriers: SMemArray[SharedMemBarrier, (num_stages * 2)], tmem: TmemAllocation[cta_group], mma_complete_mask: UInt16) -> Self

Initialize from barrier array, TMEM allocation, and multicast mask.

init_barriers

static init_barriers(storage_ptr: LegacyUnsafePointer[SharedMemBarrier, address_space=AddressSpace.SHARED], producer_arv_count: Int32, consumer_arv_count: Int32)

Initialize pipeline barriers. Called once, by the elect_one thread.
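To illustrate what the two arrival counts control, here is a minimal Python model (not Mojo, and not this library's API) of arrival-count barriers. It assumes that the `num_stages * 2` barriers form one producer-signaled and one consumer-signaled barrier per stage, and that they are laid out producer-first; neither the pairing nor the layout is confirmed by this page. `ArrivalBarrierModel` and `init_barriers_model` are hypothetical names.

```python
class ArrivalBarrierModel:
    """Hypothetical model: a barrier that flips phase after N arrivals."""

    def __init__(self, expected_arrivals):
        self.expected = expected_arrivals
        self.pending = expected_arrivals
        self.phase = 0

    def arrive(self):
        """Record one arrival and return the phase after it."""
        self.pending -= 1
        if self.pending == 0:
            # All expected arrivals seen: flip the phase and re-arm.
            self.phase ^= 1
            self.pending = self.expected
        return self.phase


def init_barriers_model(num_stages, producer_arv_count, consumer_arv_count):
    """Build num_stages * 2 barriers (assumed layout: producer-signaled first)."""
    mma_done = [ArrivalBarrierModel(producer_arv_count) for _ in range(num_stages)]
    epilogue_done = [ArrivalBarrierModel(consumer_arv_count) for _ in range(num_stages)]
    return mma_done + epilogue_done
```

In this model, a barrier created with an arrival count of 4 only flips phase on the fourth `arrive()` call, which is why the counts must match the number of threads or warps that actually signal.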

acquire_for_mma

acquire_for_mma(self) -> OutputTilePipeline[num_stages, stage_stride_cols, cta_group].Stage

Acquire stage for MMA, waiting for epilogue to finish.

Returns:

OutputStage

release_from_mma

release_from_mma(mut self, stage: OutputStage[num_stages, stage_stride_cols, cta_group])

Signal MMA completion using mma_arrive (1-SM) or multicast (2-SM).

acquire_for_epilogue

acquire_for_epilogue(self) -> OutputTilePipeline[num_stages, stage_stride_cols, cta_group].Stage

Acquire stage for epilogue, waiting for MMA to complete.

Returns:

OutputStage

release_from_epilogue

release_from_epilogue(mut self)

Signal epilogue completion, freeing stage for MMA reuse.
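The four acquire/release methods above form a handshake over a ring of stages. The following single-threaded Python model (not Mojo, and not the real implementation) sketches that ordering under two assumptions: stages are used round-robin, and each side advances to the next stage on release. A real kernel waits on shared-memory barriers; the model raises an error where a wait would block. `StagePipelineModel` and `run_tiles` are hypothetical names.

```python
class StagePipelineModel:
    """Single-threaded model of a multi-stage MMA/epilogue handshake."""

    def __init__(self, num_stages):
        self.num_stages = num_stages
        # True means the stage holds MMA output not yet read by the epilogue.
        self.full = [False] * num_stages
        self.mma_index = 0       # next stage the MMA side will use
        self.epilogue_index = 0  # next stage the epilogue side will use

    def acquire_for_mma(self):
        stage = self.mma_index
        # On the GPU this would be a barrier wait; the model raises instead.
        if self.full[stage]:
            raise RuntimeError("MMA would block: epilogue has not freed the stage")
        return stage

    def release_from_mma(self, stage):
        self.full[stage] = True
        self.mma_index = (stage + 1) % self.num_stages

    def acquire_for_epilogue(self):
        stage = self.epilogue_index
        if not self.full[stage]:
            raise RuntimeError("epilogue would block: MMA has not completed")
        return stage

    def release_from_epilogue(self):
        stage = self.epilogue_index
        self.full[stage] = False
        self.epilogue_index = (stage + 1) % self.num_stages


def run_tiles(num_stages, num_tiles):
    """Alternate MMA and epilogue steps and record the stage used per tile."""
    pipe = StagePipelineModel(num_stages)
    order = []
    for _ in range(num_tiles):
        stage = pipe.acquire_for_mma()
        pipe.release_from_mma(stage)
        assert pipe.acquire_for_epilogue() == stage
        pipe.release_from_epilogue()
        order.append(stage)
    return order
```

With two stages, the MMA side can run at most two tiles ahead before `acquire_for_mma` would block, which is the overlap that multiple stages are meant to allow.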

producer

producer[origin: MutOrigin, //](ref [origin] self) -> OutputProducer[origin, num_stages, stage_stride_cols, cta_group]

Get producer view for MMA warp.

Returns:

OutputProducer

consumer

consumer[origin: MutOrigin, //](ref [origin] self) -> OutputConsumer[origin, num_stages, stage_stride_cols, cta_group]

Get consumer view for epilogue warp.

Returns:

OutputConsumer

get_pipeline

get_pipeline(self) -> OutputTilePipeline[num_stages, stage_stride_cols, cta_group].Pipeline

Get underlying pipeline (used during barrier initialization).

Returns:

ProducerConsumerPipeline

per_k

per_k[origin: MutOrigin, //](ref [origin] self) -> OutputKPipeline[origin, num_stages, stage_stride_cols, cta_group]

Get per-K-iteration view for kernels with per-K signaling.

Unlike producer()/consumer(), which signal once per tile (after all K iterations), this view signals after each K iteration. Use it for kernels with per-K accumulation patterns (e.g., blockwise FP8).

Returns:

OutputKPipeline: View that provides produce()/consume() context managers for per-K-iteration barrier signaling.
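The difference between per-tile and per-K signaling can be made concrete by counting signals. This Python sketch (a model only, with the hypothetical helpers `signal_schedule` and `count_signals`) lists the order of MMA steps and barrier signals for both modes; it does not call any Mojo API.

```python
def signal_schedule(num_tiles, num_k_iters, per_k):
    """Return the ordered list of (kind, tile, k) events for one schedule."""
    events = []
    for tile in range(num_tiles):
        for k in range(num_k_iters):
            events.append(("mma", tile, k))
            if per_k:
                # Per-K mode: signal after every K iteration.
                events.append(("signal", tile, k))
        if not per_k:
            # Per-tile mode: signal once, after the last K iteration.
            events.append(("signal", tile, num_k_iters - 1))
    return events


def count_signals(events):
    """Count how many barrier signals a schedule contains."""
    return sum(1 for kind, _, _ in events if kind == "signal")
```

For 2 tiles with 4 K iterations each, the per-tile schedule contains 2 signals and the per-K schedule contains 8.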

per_k_epilogue

per_k_epilogue[output_origin: MutOrigin, input_origin: MutOrigin, num_input_stages: Int](ref [output_origin] self, ref [input_origin] input_pipeline: ProducerConsumerPipeline[num_input_stages]) -> EpilogueKContext[output_origin, input_origin, num_stages, stage_stride_cols, cta_group, num_input_stages]

Get combined per-K epilogue context for blockwise FP8.

Bundles output pipeline (MMA->Epilogue sync) and input pipeline (A-scales consumption) into a single context manager.

Example:

    for k_iter in range(num_iters):
        with output_pipeline.per_k_epilogue(input_pipeline) as stage:
            accum.promote(stage, ...)
        # Both pipelines signaled automatically

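Args:

  • input_pipeline (ProducerConsumerPipeline[num_input_stages]): Input pipeline used for A-scales consumption.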

Returns:

EpilogueKContext: Context manager that handles both pipelines.
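The idea of bundling two pipelines into one context manager can be sketched in plain Python with `contextlib`. This is a model only: `LoggedPipeline` and `per_k_epilogue_model` are hypothetical names, and the order in which the two pipelines are acquired and released is an assumption, not something this page specifies.

```python
from contextlib import contextmanager


class LoggedPipeline:
    """Hypothetical stand-in for a pipeline; records acquire/release calls."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def acquire(self):
        self.log.append((self.name, "acquire"))
        return 0  # the model always hands out stage 0

    def release(self):
        self.log.append((self.name, "release"))


@contextmanager
def per_k_epilogue_model(output_pipeline, input_pipeline):
    """Acquire both pipelines on entry and release both on exit."""
    stage = output_pipeline.acquire()
    input_pipeline.acquire()
    try:
        yield stage
    finally:
        # Assumed order: release in reverse order of acquisition.
        input_pipeline.release()
        output_pipeline.release()


def run_one_iteration():
    """Run one per-K step and return the recorded call order."""
    log = []
    output = LoggedPipeline("output", log)
    inputs = LoggedPipeline("input", log)
    with per_k_epilogue_model(output, inputs) as stage:
        log.append(("body", stage))
    return log
```

The benefit of this shape is that the loop body cannot forget to signal either pipeline, since both releases run when the `with` block exits, even if the body raises.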