Mojo struct

EpilogueWarpContext

@register_passable(trivial) struct EpilogueWarpContext[num_accum_stages: Int, stage_stride_cols: Int, cta_group: Int, mma_threads: Int, epilogue_threads: Int]

Context for the epilogue warp: consumes TMEM data and signals completion.

IMPORTANT: Call Sync.wait() BEFORE constructing, so that the TMEM address stored in shared memory is visible.

Fields

tmem (EpilogueWarpContext[num_accum_stages, stage_stride_cols, cta_group, mma_threads, epilogue_threads].Tmem):
output_pipeline (EpilogueWarpContext[num_accum_stages, stage_stride_cols, cta_group, mma_threads, epilogue_threads].Pipeline):
dealloc_barrier (EpilogueWarpContext[num_accum_stages, stage_stride_cols, cta_group, mma_threads, epilogue_threads].Dealloc):

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDestructible, Movable

`comptime` members

`__copyinit__is_trivial`

comptime __copyinit__is_trivial = True

`__del__is_trivial`

comptime __del__is_trivial = True

`__moveinit__is_trivial`

comptime __moveinit__is_trivial = True

`Dealloc`

comptime Dealloc = _WarpContextTypes[num_accum_stages, stage_stride_cols, cta_group, mma_threads, epilogue_threads].Dealloc

`Pipeline`

comptime Pipeline = _WarpContextTypes[num_accum_stages, stage_stride_cols, cta_group, mma_threads, epilogue_threads].Pipeline

`Sync`

comptime Sync = _WarpContextTypes[num_accum_stages, stage_stride_cols, cta_group, mma_threads, epilogue_threads].Sync

`Tmem`

comptime Tmem = _WarpContextTypes[num_accum_stages, stage_stride_cols, cta_group, mma_threads, epilogue_threads].Tmem

Methods

`__init__`

__init__(tmem: TmemAllocation[cta_group], output_pipeline: OutputTilePipeline[num_accum_stages, stage_stride_cols, cta_group], dealloc_barrier: TmemDeallocBarrier[cta_group]) -> Self

`create`

static create(tmem_addr_storage: SMemArray[UInt32, 1], accum_barriers: SMemArray[SharedMemBarrier, (num_accum_stages * 2)], dealloc_mbar: SMemArray[SharedMemBarrier, 1], mma_complete_mask: UInt16) -> Self

Create Epilogue warp context with all necessary components.

Reads the TMEM address from shared memory and creates the output pipeline. IMPORTANT: Call Sync.wait() BEFORE calling this to ensure the TMEM address is visible.

Args:

tmem_addr_storage (SMemArray): Shared storage containing TMEM address.
accum_barriers (SMemArray): Barrier array for accumulator pipeline.
dealloc_mbar (SMemArray): Barrier for TMEM deallocation synchronization.
mma_complete_mask (UInt16): Multicast mask for MMA completion signaling.

Returns:

Self: Fully initialized EpilogueWarpContext.
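
The ordering requirement above (wait, then read the address) can be illustrated with a small host-side Python model. This is only a sketch of the synchronization idea, not the Mojo API: the shared storage is modeled as a one-element list, Sync.wait() as a `threading.Barrier`, and the names `allocator_warp`, `epilogue_warp`, and the address value are hypothetical.

```python
import threading

# Hypothetical stand-ins: a one-slot "shared memory" cell and a two-party barrier.
tmem_addr_storage = [None]
sync = threading.Barrier(2)
seen = []

def allocator_warp():
    # The allocating side publishes the address, then arrives at the barrier.
    tmem_addr_storage[0] = 0x1000  # made-up address, for illustration only
    sync.wait()

def epilogue_warp():
    # Waiting first guarantees the slot was written before it is read.
    sync.wait()
    seen.append(tmem_addr_storage[0])

# Start the reader first to show that the barrier, not start order, gives the guarantee.
threads = [
    threading.Thread(target=epilogue_warp),
    threading.Thread(target=allocator_warp),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the `sync.wait()` call in `epilogue_warp`, the read could observe the unset slot. That is the failure mode the IMPORTANT note warns about.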

`__enter__`

__enter__(self) -> Self

`__exit__`

__exit__(self)

`per_k_stage`

per_k_stage[input_origin: MutOrigin, Payload: TilePayload, num_group_stages: Int, k_group_size: Int](mut self, ref [input_origin] input_pipeline: InputTilePipeline[Payload, num_group_stages, k_group_size]) -> EpilogueKContext[origin_of(self.output_pipeline), origin_of(input_origin._mlir_origin.pipeline), num_accum_stages, stage_stride_cols, cta_group, num_group_stages]

Get per-K stage context for blockwise FP8 epilogue.

Bundles output pipeline (MMA→Epilogue sync) and input pipeline (A-scales consumption) into a single context manager.

Example:

    for k_iter in range(num_iters):
        with epi_ctx.per_k_stage(input_pipeline) as epi_stage:
            accum.promote(epi_stage, ...)
        # Both pipelines signaled automatically

Args:

input_pipeline (InputTilePipeline): The input tile pipeline; its .pipeline member is extracted internally.

Returns:

EpilogueKContext: Context manager that handles both pipelines.
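
The bundling pattern described for per_k_stage (one context manager that acquires from two pipelines and signals both on exit) can be sketched in plain Python. This is a conceptual model only and assumes nothing about the real Mojo implementation: `ToyPipeline`, its `acquire`/`release` methods, and the two-argument `per_k_stage` function are hypothetical stand-ins.

```python
from contextlib import contextmanager

class ToyPipeline:
    """Hypothetical stand-in for a pipeline: counts acquired and released stages."""

    def __init__(self, name):
        self.name = name
        self.acquired = 0
        self.released = 0

    def acquire(self):
        self.acquired += 1
        return self.acquired - 1  # stage index handed to the consumer

    def release(self):
        self.released += 1

@contextmanager
def per_k_stage(output_pipeline, input_pipeline):
    # Acquire one stage from each pipeline on entry...
    out_stage = output_pipeline.acquire()
    in_stage = input_pipeline.acquire()
    try:
        yield out_stage, in_stage
    finally:
        # ...and signal both on exit, so the caller never releases by hand.
        input_pipeline.release()
        output_pipeline.release()

output_pipeline = ToyPipeline("output")
input_pipeline = ToyPipeline("input")
stages = []
for k_iter in range(3):
    with per_k_stage(output_pipeline, input_pipeline) as epi_stage:
        stages.append(epi_stage)  # work on the stage would go here
```

Putting both releases in the `finally` clause keeps the two pipelines in lockstep even if the loop body raises. A single bundled context manager provides that guarantee without the caller tracking each pipeline separately.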
