Mojo function

kv_cache_2m_iadd_dispatch

kv_cache_2m_iadd_dispatch[dtype: DType, collection_t: KVCollectionT, //, target: StringSlice[StaticConstantOrigin]](kv: LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], cache: collection_t, input_row_offsets: LayoutTensor[DType.uint32, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], lora_end_idx: LayoutTensor[DType.int64, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], batch_seq_len: LayoutTensor[DType.int64, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], layer_idx: UInt32, ctx: Optional[DeviceContext])

In-place add to paged KV cache with interleaved K/V layout. This kernel is only used for LoRA.

Performs an in-place addition of new key-value projections to the paged KV cache. The input tensor `kv` uses a "2M" layout in which keys and values are interleaved: rows [0, M) contain keys and rows [M, 2M) contain values, where M is the number of tokens in the batch. The `lora_end_idx` value, which we call m, is the cutoff that determines how many LoRA tokens are written to the KV cache; m is at most M, the total number of tokens in the batch. Keys are written to the cache from rows [0, m) of the input and values from rows [M, M + m).
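The row arithmetic of the 2M layout can be illustrated with a small NumPy sketch. This is not the actual Mojo kernel (which operates on paged cache blocks via the `KVCollectionT` interface); the function name, dense `k_cache`/`v_cache` arrays, and shapes are all illustrative assumptions.

```python
import numpy as np

def kv_2m_iadd_sketch(kv, k_cache, v_cache, m):
    """Illustrative NumPy sketch (not the real kernel) of the 2M
    interleaved in-place add.

    kv:       array of shape (2*M, D); rows [0, M) hold keys,
              rows [M, 2M) hold values.
    k_cache:  destination key array of shape (M, D) (stand-in for
              the paged K cache).
    v_cache:  destination value array of shape (M, D) (stand-in for
              the paged V cache).
    m:        lora_end_idx; only the first m tokens are written.
    """
    M = kv.shape[0] // 2
    # Keys come from rows [0, m) of the interleaved input.
    k_cache[:m] += kv[:m]
    # Values come from rows [M, M + m) of the interleaved input.
    v_cache[:m] += kv[M:M + m]
```

Rows [m, M) and [M + m, 2M) of the input are skipped, so tokens beyond the LoRA cutoff leave the cache untouched.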
