Python package

kv_cache

KV cache management for efficient attention computation during inference.

This package provides implementations for managing key-value caches used in transformer models. The paged attention implementation enables efficient memory management by fragmenting cache memory into pages, allowing for better memory utilization and support for prefix caching.

Functions

load_kv_manager: Load and initialize a KV cache manager.
estimate_kv_cache_size: Estimate KV cache memory requirements.
infer_optimal_batch_size: Infer optimal batch size based on available cache memory.
available_port: Find an available TCP port for transfer engine communication.

Modules

registry: KV cache manager factory functions and utilities.
null_cache_manager: Null KV cache manager implementation.

Packages

paged_cache: Paged attention KV cache implementation.

Classes

PagedKVCacheManager: Manager for paged KV cache with data and tensor parallelism support.
NullKVCacheManager: Null KV cache manager for compile-only mode.
PagedCacheInputSymbols: Symbolic inputs for paged KV cache operations.
ResetPrefixCacheBackend: Backend component for coordinating prefix cache resets.
ResetPrefixCacheFrontend: Frontend component for triggering prefix cache resets.
KVTransferEngine: Manages KV cache transfers between devices in distributed settings.
KVTransferEngineMetadata: Metadata for KV cache transfer engine configuration.
TransferReqData: Data structure for KV cache transfer requests.

Functions​

Modules​

Packages​

Classes​

Functions

Modules

Packages

Classes