kv_cache
KV cache management for efficient attention computation during inference.
This package provides implementations for managing the key-value (KV) caches used by transformer models during inference. The paged attention implementation partitions cache memory into fixed-size pages, which reduces fragmentation, improves memory utilization, and enables prefix caching.
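To make the paging idea concrete, here is a minimal toy sketch of an on-demand page allocator. It is not the actual PagedKVCacheManager API; the `PagedAllocator` class and its methods are illustrative inventions showing why paging helps: each sequence grabs pages only as it grows, instead of reserving a contiguous max-length buffer up front, and finished sequences return their pages to a shared pool.

```python
class PagedAllocator:
    """Toy page allocator: sequences acquire fixed-size pages on demand.

    Hypothetical sketch for illustration only, not the real manager.
    """

    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size                  # tokens per page
        self.free = list(range(num_pages))          # pool of free page ids
        self.tables: dict[int, list[int]] = {}      # seq_id -> its page ids
        self.lengths: dict[int, int] = {}           # seq_id -> token count

    def append(self, seq_id: int) -> int:
        """Reserve a slot for one new token; return the page holding it."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.page_size == 0:                 # all current pages full
            table.append(self.free.pop())           # allocate a fresh page
        self.lengths[seq_id] = n + 1
        return table[n // self.page_size]

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's pages to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedAllocator(num_pages=8, page_size=4)
for _ in range(6):          # 6 tokens need 2 pages of 4 slots each
    alloc.append(seq_id=0)
print(len(alloc.tables[0])) # -> 2
alloc.release(seq_id=0)     # both pages go back to the pool
```

Because page tables map logical token positions to physical pages, two sequences sharing a common prompt prefix can point at the same physical pages, which is the mechanism behind prefix caching.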
Functions
- load_kv_manager: Load and initialize a KV cache manager.
- estimate_kv_cache_size: Estimate KV cache memory requirements.
- infer_optimal_batch_size: Infer the optimal batch size based on available cache memory.
- available_port: Find an available TCP port for transfer engine communication.
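The arithmetic behind a size estimate like the one estimate_kv_cache_size performs is standard: two tensors (K and V) per layer, each of shape (kv_heads × head_dim) per token, held for every token of every sequence. The helpers below are a back-of-envelope sketch under that assumption, not the package's actual function signatures:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   max_seq_len: int, batch_size: int,
                   dtype_bytes: int = 2) -> int:
    """Bytes needed for a dense KV cache: 2 tensors (K and V) per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * max_seq_len * batch_size * dtype_bytes)


def max_batch_size(available_bytes: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, max_seq_len: int,
                   dtype_bytes: int = 2) -> int:
    """Largest batch whose full-length KV cache fits in available memory."""
    per_seq = kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                             max_seq_len, 1, dtype_bytes)
    return available_bytes // per_seq


# Example: a Llama-3-8B-like config (32 layers, 8 KV heads, head_dim 128)
# at 8K context, batch 1, fp16 (2 bytes per element):
gib = kv_cache_bytes(32, 8, 128, 8192, 1) / 2**30
print(f"{gib:.2f} GiB")  # -> 1.00 GiB
```

This same per-sequence figure is what an optimal-batch-size inference divides into the memory left over after weights and activations are accounted for.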
Modules
- registry: KV cache manager factory functions and utilities.
- null_cache_manager: Null KV cache manager implementation.
Packages
paged_cache: Paged attention KV cache implementation.
Classes
- PagedKVCacheManager: Manager for a paged KV cache with data- and tensor-parallelism support.
- NullKVCacheManager: Null KV cache manager for compile-only mode.
- KVTransferEngine: Manages KV cache transfers between devices in distributed settings.
- KVTransferEngineMetadata: Metadata for KV cache transfer engine configuration.
- PagedCacheInputSymbols: Symbolic inputs for paged KV cache operations.
- TransferReqData: Data structure for KV cache transfer requests.