@register_passable(trivial)
struct KVConsumerPipeline[dtype: DType, config: FA4Config]
Pipeline for managing the consumption of K and V. This follows the order of the Tri Dao and CUTLASS implementations (modulo any rotation of the ops through the iterations).
We consume/produce in the following order:

0. S0 <- Q0 @ Kn'
1. O1 <- O1 + P1 @ V{n-1}
2. S1 <- Q1 @ Kn'
3. O0 <- O0 + P0 @ Vn
Note that there are two MMAs between calculating Si and consuming Pi, maximizing the overlap between MMAs and the softmax calculation. Oi <- Oi + Pi @ V also depends on the correction, which is computed asynchronously with the softmax in a correction warpgroup (as soon as the softmax writes the correction factor).
```
wait on K0
S0 <- Q0 @ K0'
S1 <- Q1 @ K0'
release K0
wait on V0
O0 <- P0 @ V0
for n in range(1, num_iters):
    # wait on Kn
    S0 <- Q0 @ Kn'
    O1 <- O1 + P1 @ V{n-1}
    # release V{n-1}
    S1 <- Q1 @ Kn'
    # release Kn
    # wait on Vn
    O0 <- O0 + P0 @ Vn
O1 <- O1 + P1 @ V{num_iters-1}
```
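The schedule above can be checked with a small model. The following is a hypothetical Python sketch (a model of the event order only, not the Mojo API) that emits the op trace and makes the two-MMA gap between producing Si and consuming Pi visible:

```python
def consumer_op_order(num_iters):
    """Model of the K/V consumer schedule; op strings mirror the pseudocode."""
    ops = [
        "wait K0",
        "S0 <- Q0 @ K0'", "S1 <- Q1 @ K0'",
        "release K0",
        "wait V0",
        "O0 <- P0 @ V0",
    ]
    for n in range(1, num_iters):
        ops += [
            f"wait K{n}",
            f"S0 <- Q0 @ K{n}'",        # produce S0 for this iteration
            f"O1 <- O1 + P1 @ V{n-1}",  # consume P1 from the previous S1
            f"release V{n-1}",
            f"S1 <- Q1 @ K{n}'",
            f"release K{n}",
            f"wait V{n}",
            f"O0 <- O0 + P0 @ V{n}",    # consume P0: two MMAs after S0
        ]
    ops.append(f"O1 <- O1 + P1 @ V{num_iters - 1}")
    return ops
```

For every n >= 1, exactly two MMAs (O1 <- O1 + P1 @ V{n-1} and S1 <- Q1 @ Kn') fall between the production of S0 and the consumption of P0, giving the softmax warpgroup time to turn S0 into P0.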
```
wK0, rK0, wV0
wK1, rV0, rK1, wV1
wK2, rV1, rK2, wV2
wK3, rV2, rK3, wV3
```

With barrier wait states, wKn(state):

```
wK0(0), rK0(0), wV0(1)
wK1(2), rV0(1), rK1(2), wV1(3)
wK2(4), rV1(3), rK2(4), wV2(5)
wK3(6), rV2(5), rK3(6), wV3(7)
```
Rules:
- wK backs up and increments prior to waiting, except K0
- rK increments after releasing
- rV uses the backup
```
wK0(0), rK0(0), wV0(1)
wK1(2), rV0(1), rK1(2), wV1(3)
wK2(4), rV1(3), rK2(4), wV2(5)
rV2(5)
```
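These rules reproduce the traces above. Here is a hypothetical Python sketch (index arithmetic only, not the Mojo implementation), assuming Kn occupies raw index 2n and Vn raw index 2n+1; in the real pipeline the raw index wraps modulo num_kv_stages and carries a phase bit:

```python
def kv_consumer_schedule(num_iters):
    """Emit the w/r trace with raw pipeline indices in parentheses."""
    events = []
    index = 0
    backup = 0
    for n in range(num_iters):
        if n > 0:
            backup = index  # wK backs up V{n-1}'s index...
            index += 1      # ...and increments prior to waiting (except K0)
        events.append(f"wK{n}({index})")
        if n > 0:
            events.append(f"rV{n - 1}({backup})")  # rV uses the backup
        events.append(f"rK{n}({index})")
        index += 1          # rK increments after releasing
        events.append(f"wV{n}({index})")
    events.append(f"rV{num_iters - 1}({index})")   # final V release
    return events
```

With num_iters = 3 this produces exactly the terminating trace above, ending in rV2(5) with no trailing K wait.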
Fields
- kv_pipeline (KVPipeline[config.num_kv_stages, config.num_mma_stages]):
- k_smem_descriptor (MMASmemDescriptor):
- v_smem_descriptor (MMASmemDescriptor):
- v_pipeline_release_index (UInt32):
Implemented traits
AnyType,
Copyable,
ImplicitlyCopyable,
Movable,
UnknownDestructibility
Aliases
__copyinit__is_trivial
alias __copyinit__is_trivial = True
__del__is_trivial
alias __del__is_trivial = True
__moveinit__is_trivial
alias __moveinit__is_trivial = True
full_kv_bytes
alias full_kv_bytes = ((config * config) * dtype.size_of())
mma_kv_bytes
alias mma_kv_bytes = ((config * config) * dtype.size_of())
Methods
__init__
__init__(kv_pipeline: KVPipeline[config.num_kv_stages, config.num_mma_stages], smem: UnsafePointer[Scalar[dtype], address_space=AddressSpace(3)]) -> Self
__init__(mbar: UnsafePointer[SharedMemBarrier, address_space=AddressSpace(3)], smem: UnsafePointer[Scalar[dtype], address_space=AddressSpace(3)]) -> Self
init
init(self)
Only one of the producer or consumer should call init().
wait
wait[*, mma_stage: Int](self) -> UInt32
Wait on k from the producer, and return the k smem descriptor.
wait_k
wait_k[*, mma_stage: Int = (config - 1), pre_increment: Bool = True](mut self) -> MMASmemDescriptor
Wait on k from the producer, and return the k smem descriptor. If pre_increment is true, the pipeline index is backed up and incremented prior to waiting.
wait_v
release_k
release_k[*, mma_stage: Int = (config - 1)](mut self)
Must call producer_commit on the tmem resource before calling consumer_release. release_k does increment the pipeline step.
release_v
release_v[*, mma_stage: Int = (config - 1)](self)
Must call producer_commit on the tmem resource before calling consumer_release. release_v does not increment the pipeline step.
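The contrast between release_k (increments the pipeline step) and release_v (does not) can be modeled directly. A hypothetical Python sketch (not the Mojo API) of the consumer-side index bookkeeping described above:

```python
class PipelineIndexModel:
    """Models the consumer's raw pipeline index per the rules above."""

    def __init__(self):
        self.index = 0   # raw pipeline index (K/V stages interleaved)
        self.backup = 0  # saved V index, consumed later by release_v

    def wait_k(self, pre_increment=True):
        # wK backs up and increments prior to waiting, except for K0.
        if pre_increment:
            self.backup = self.index
            self.index += 1
        return self.index

    def release_k(self):
        # release_k does increment the pipeline step.
        released = self.index
        self.index += 1
        return released

    def release_v(self):
        # release_v does not increment; it returns the backed-up index.
        return self.backup
```

This mirrors the signatures above: release_k takes mut self because it advances the step, while release_v takes self and leaves the index untouched.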