Mojo function
kv_cache_2m_iadd_dispatch
kv_cache_2m_iadd_dispatch[
    dtype: DType,
    collection_t: KVCollectionT, //,
    target: StringSlice[StaticConstantOrigin]
](
    kv: LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment],
    cache: collection_t,
    input_row_offsets: LayoutTensor[DType.uint32, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment],
    lora_end_idx: LayoutTensor[DType.int64, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment],
    batch_seq_len: LayoutTensor[DType.int64, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment],
    layer_idx: UInt32,
    ctx: Optional[DeviceContext]
)
In-place add to a paged KV cache with interleaved K/V layout. This kernel is only used for LoRA.
Performs an in-place addition of new key-value projections to the paged KV cache.
The input tensor kv uses a "2M" layout in which keys and values are interleaved:
rows [0, M) contain keys and rows [M, 2M) contain values, where M is the number
of tokens. lora_end_idx is the cutoff that determines how many tokens' LoRA
values are written to the KV cache. We call this value m, since it covers a
subset of the total tokens in the batch. Keys are written from rows [0, m) and
values from rows [M, M + m).
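To make the row arithmetic concrete, here is a minimal sketch in plain Mojo. The values of M and m are hypothetical stand-ins for the batch token count and the lora_end_idx cutoff; it only prints where each token's key and value rows land in the 2M layout, not the cache write itself:

```mojo
def main():
    var M = 8  # hypothetical total tokens in the batch
    var m = 5  # hypothetical lora_end_idx cutoff; only tokens [0, m) get LoRA deltas
    # Keys occupy rows [0, M); values occupy rows [M, 2M).
    for t in range(m):
        var k_row = t      # key row for token t, in [0, m)
        var v_row = M + t  # value row for token t, in [M, M + m)
        print("token", t, "-> K row", k_row, ", V row", v_row)
```

Tokens in [m, M) are left untouched, which is what makes lora_end_idx a cheap cutoff when only part of the batch carries LoRA values.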