Python module
naive_cache
Naive KV cache for the Transformer.
NaiveKVCacheManager
class max.pipelines.kv_cache.naive_cache.NaiveKVCacheManager(params: KVCacheParams, max_cache_batch_size: int, max_seq_len: int, num_layers: int, devices: List[Device], session: InferenceSession)
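A minimal construction sketch based on the signature above. The KVCacheParams fields, the InferenceSession arguments, and the helper import paths are assumptions and may differ between MAX versions.

```python
# Construction sketch; KVCacheParams fields and InferenceSession arguments
# are assumptions and may differ between MAX versions.
from max.driver import CPU
from max.dtype import DType
from max.engine import InferenceSession
from max.pipelines.kv_cache import KVCacheParams
from max.pipelines.kv_cache.naive_cache import NaiveKVCacheManager

device = CPU()
session = InferenceSession(devices=[device])  # argument name assumed

params = KVCacheParams(  # field names assumed
    dtype=DType.float32,
    n_kv_heads=8,
    head_dim=128,
)

manager = NaiveKVCacheManager(
    params=params,
    max_cache_batch_size=16,
    max_seq_len=2048,
    num_layers=32,
    devices=[device],
    session=session,
)
```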
cache_shape
estimated_memory_size()
classmethod estimated_memory_size(params: KVCacheParams, max_cache_batch_size: int, max_seq_len: int, num_layers: int, devices: List[Device]) → int
Returns the estimated total memory usage of the KV cache.
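As a rough guide to what such an estimate covers, a naive cache holds fully padded key and value tensors for every slot in the batch. The sketch below is a back-of-the-envelope calculation under that assumption; the exact accounting performed by estimated_memory_size() is not specified on this page.

```python
def naive_kv_cache_bytes(
    max_cache_batch_size: int,
    max_seq_len: int,
    num_layers: int,
    n_kv_heads: int,
    head_dim: int,
    dtype_bytes: int,
) -> int:
    # 2x for the separate key and value tensors, each padded to max_seq_len
    # for every sequence slot in every layer.
    return (
        2
        * max_cache_batch_size
        * max_seq_len
        * num_layers
        * n_kv_heads
        * head_dim
        * dtype_bytes
    )

# Batch of 16, 2048-token context, 32 layers, 8 KV heads of dim 128, bfloat16:
print(naive_kv_cache_bytes(16, 2048, 32, 8, 128, 2))  # 4294967296 bytes = 4 GiB
```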
fetch()
fetch(seq_ids: list[int]) → List[tuple[max.driver.tensor.Tensor, max.driver.tensor.Tensor, max.driver.tensor.Tensor, max.driver.tensor.Tensor]]
increment_cache_lengths()
increment_cache_lengths(kv_cache_inputs: List[tuple[max.driver.tensor.Tensor, max.driver.tensor.Tensor, max.driver.tensor.Tensor, max.driver.tensor.Tensor]], prev_model_inputs: tuple[max.driver.tensor.Tensor, ...]) → List[tuple[max.driver.tensor.Tensor, max.driver.tensor.Tensor, max.driver.tensor.Tensor, max.driver.tensor.Tensor]]
Prepares the inputs for a multistep execution, generally by incrementing the cache lengths. This should not require a device synchronization, as that would defeat the purpose of multistep execution.
It should also not update the cache lengths in the manager, since this batch is still considered in-progress.
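A hedged sketch of where fetch() and increment_cache_lengths() sit in a multistep decode loop. The model, its execute() call, and the token-preparation step are placeholders and not part of this module; only the manager calls follow the signatures documented here.

```python
# Multistep decode sketch; `model`, `initial_model_inputs`, `num_steps`, and
# `next_token_inputs` are placeholders, not part of this module's API.
seq_ids = [0, 1, 2]

# One fetch per batch: a tuple of four cache tensors per device.
kv_cache_inputs = manager.fetch(seq_ids)

prev_model_inputs = initial_model_inputs  # placeholder
for _ in range(num_steps):
    outputs = model.execute(*prev_model_inputs, *kv_cache_inputs[0])

    # Advance the cache-length inputs for the next step without a device
    # synchronization and without committing lengths back to the manager.
    kv_cache_inputs = manager.increment_cache_lengths(
        kv_cache_inputs, prev_model_inputs
    )
    prev_model_inputs = next_token_inputs(outputs)  # placeholder
```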
input_symbols()
input_symbols() → List[tuple[max.graph.type.TensorType, max.graph.type.TensorType, max.graph.type.TensorType, max.graph.type.TensorType]]
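input_symbols() returns the TensorTypes for the cache inputs, one four-tuple per device, which would typically be spliced into a graph's input types. The sketch below assumes that usage; the graph name and the token input type are placeholders.

```python
# Graph-building sketch; `tokens_type` is a placeholder input type.
from max.graph import Graph

kv_types = manager.input_symbols()[0]  # the four TensorTypes for device 0

with Graph("transformer", input_types=[tokens_type, *kv_types]) as graph:
    ...  # build the model using graph.inputs
```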