Mojo function

generic_get_paged_cache_with_scales

def generic_get_paged_cache_with_scales[
    dtype: DType,
    scale_dtype: DType,
    kv_params: KVCacheStaticParams,
    page_size: Int,
    quantization_granularity: Int
](
    blocks: LayoutTensor[dtype, Layout.row_major[Int(6)]()],
    cache_lengths: LayoutTensor[DType.uint32, Layout(IntTuple(Int(-1)))],
    lookup_table: LayoutTensor[DType.uint32, Layout.row_major[Int(2)]()],
    max_lengths: LayoutTensor[DType.uint32, Layout.row_major[Int(2)]()],
    scales: LayoutTensor[scale_dtype, Layout.row_major[Int(6)]()],
    out result: PagedKVCacheCollection[dtype, kv_params, page_size, blocks.origin, cache_lengths.origin, lookup_table.origin, scales.origin, scale_dtype_=scale_dtype, quantization_granularity_=quantization_granularity]
)

Creates a PagedKVCacheCollection that carries a scales tensor alongside the KV cache blocks, for use with MLA (multi-head latent attention).

Args:

blocks (LayoutTensor[dtype, Layout.row_major[Int(6)]()]): KV cache blocks tensor with shape [num_blocks, kv_dim, num_layers, page_size, num_heads, head_dim].
cache_lengths (LayoutTensor[DType.uint32, Layout(IntTuple(Int(-1)))]): Cache length of each sequence in the batch, with shape [batch_size].
lookup_table (LayoutTensor[DType.uint32, Layout.row_major[Int(2)]()]): Page lookup table with shape [batch_size, max_pages].
max_lengths (LayoutTensor[DType.uint32, Layout.row_major[Int(2)]()]): Max lengths tensor, laid out as [[max_seq_length, max_cache_length]].
scales (LayoutTensor[scale_dtype, Layout.row_major[Int(6)]()]): Scales tensor with shape [num_blocks, kv_dim, num_layers, page_size, num_heads, head_dim_granularity].

Returns:

PagedKVCacheCollection[dtype, kv_params, page_size, blocks.origin, cache_lengths.origin, lookup_table.origin, scales.origin, scale_dtype_=scale_dtype, quantization_granularity_=quantization_granularity]
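
Example:

The argument shapes above imply a conventional paged layout, which the following sketch illustrates. It is written in plain Python rather than Mojo and does not call the MAX API. The helpers locate_token and scales_shape are hypothetical names introduced here. Two things are assumptions and not confirmed by this page: that lookup_table[b][p] holds the block index of the p-th page of sequence b, and that head_dim_granularity equals head_dim divided by quantization_granularity, rounded up.

```python
# Illustrative only: plain Python, not the Mojo API documented on this page.


def locate_token(lookup_table, cache_lengths, batch_idx, token_idx, page_size):
    """Return (block_index, offset_in_page) for one cached token.

    Assumes lookup_table[b][p] is the block index of page p of sequence b.
    """
    if not 0 <= token_idx < cache_lengths[batch_idx]:
        raise IndexError("token_idx is outside the cached range")
    page_slot = token_idx // page_size  # which page of the sequence
    offset = token_idx % page_size      # position inside that page
    return lookup_table[batch_idx][page_slot], offset


def scales_shape(blocks_shape, quantization_granularity):
    """Derive the scales shape from the blocks shape.

    Assumes head_dim_granularity = ceil(head_dim / quantization_granularity).
    """
    *leading, head_dim = blocks_shape
    return (*leading, -(-head_dim // quantization_granularity))


page_size = 4
lookup_table = [[7, 2, 9], [5, 0, 0]]  # [batch_size, max_pages]
cache_lengths = [10, 3]                # [batch_size]

# Token 9 of sequence 0 sits in the third page (block 9), at offset 1.
print(locate_token(lookup_table, cache_lengths, 0, 9, page_size))
# With head_dim = 576 and granularity 64, each head gets 9 scale entries.
print(scales_shape((16, 1, 2, 4, 1, 576), 64))
```

Under these assumptions, blocks and scales share their first five dimensions, so a (block_index, offset_in_page) pair could address both a cached vector and its scales.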