Mojo function
generic_get_paged_cache_with_scales
generic_get_paged_cache_with_scales[dtype: DType, scale_dtype: DType, kv_params: KVCacheStaticParams, page_size: Int, quantization_granularity: Int](blocks: LayoutTensor[dtype, Layout.row_major[6]()], cache_lengths: LayoutTensor[DType.uint32, Layout(IntTuple(-1))], lookup_table: LayoutTensor[DType.uint32, Layout.row_major[2]()], max_lengths: LayoutTensor[DType.uint32, Layout.row_major[2]()], scales: LayoutTensor[scale_dtype, Layout.row_major[6]()], out result: PagedKVCacheCollection[dtype, kv_params, page_size, scale_dtype, quantization_granularity])
Create a PagedKVCacheCollection with scales for MLA attention.
Args:
- blocks (LayoutTensor[dtype, Layout.row_major[6]()]): KV cache blocks tensor [num_blocks, kv_dim, num_layers, page_size, num_heads, head_dim].
- cache_lengths (LayoutTensor[DType.uint32, Layout(IntTuple(-1))]): Cache lengths per batch [batch_size].
- lookup_table (LayoutTensor[DType.uint32, Layout.row_major[2]()]): Page lookup table [batch_size, max_pages].
- max_lengths (LayoutTensor[DType.uint32, Layout.row_major[2]()]): Max lengths tensor [[max_seq_length, max_cache_length]].
- scales (LayoutTensor[scale_dtype, Layout.row_major[6]()]): Scales tensor [num_blocks, kv_dim, num_layers, page_size, num_heads, head_dim_granularity].
Returns:
PagedKVCacheCollection[dtype, kv_params, page_size, scale_dtype, quantization_granularity]
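The shapes above suggest the usual paged-cache addressing, in which a token position is split into a page number (a column of lookup_table) and an offset along the page_size dimension of blocks. This page does not spell that scheme out, so the sketch below is an assumption for illustration only; page_index and page_offset are hypothetical helpers, not part of the documented API.

```mojo
# Illustrative helpers only; these are NOT part of the documented API.
# Assumption: logical token positions are split into fixed-size pages.


fn page_index(token_idx: Int, page_size: Int) -> Int:
    # Column of lookup_table[batch, :] assumed to hold the block ID for this token.
    return token_idx // page_size


fn page_offset(token_idx: Int, page_size: Int) -> Int:
    # Position of the token along the page_size dimension of that block.
    return token_idx % page_size


fn main():
    var page_size = 128
    var token_idx = 300
    # Token 300 with 128-token pages falls in page 2 at offset 44.
    print(page_index(token_idx, page_size), page_offset(token_idx, page_size))
```

Under this assumption, the block ID would be read from lookup_table[batch, page_index], and the same block ID and offset would address the matching entries in scales.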