Python class
KVCacheConfig
class max.pipelines.KVCacheConfig(*, config_file=None, section_name=None, kv_cache_page_size=128, enable_prefix_caching=True, kv_connector=None, kv_connector_config=None, device_memory_utilization=0.9, kv_cache_format=None)
Bases: ConfigFileModel
Configuration for the paged KV cache.
-
Parameters:
cache_dtype
property cache_dtype: DType
Returns the data type used for KV cache storage.
device_memory_utilization
device_memory_utilization: float
The fraction of available device memory the process should consume.
enable_prefix_caching
enable_prefix_caching: bool
Whether to enable prefix caching for the paged KV cache.
kv_cache_format
An override for the default data type of the KV cache.
kv_cache_page_size
kv_cache_page_size: int
The number of tokens in a single page in the paged KV cache.
kv_connector
kv_connector: KVConnectorType | None
Type of KV cache connector to use.
kv_connector_config
kv_connector_config: KVConnectorConfig | None
Connector-specific configuration overrides.
model_config
model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'strict': False}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
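The fields above can all be passed as keyword arguments to the constructor shown in the class signature. A minimal sketch of building a config, assuming the `max` package is installed; the values below are illustrative, not defaults:

```python
# Sketch: assumes `max` is installed and KVCacheConfig is importable
# from max.pipelines as documented in this reference.
from max.pipelines import KVCacheConfig

# Keyword arguments mirror the constructor signature above;
# the chosen values are illustrative.
config = KVCacheConfig(
    kv_cache_page_size=128,
    enable_prefix_caching=True,
    device_memory_utilization=0.85,
)

print(config.kv_cache_page_size)
print(config.enable_prefix_caching)
```

Because `model_config` sets `'extra': 'forbid'`, passing an unrecognized keyword argument raises a validation error rather than being silently ignored.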
model_post_init()
model_post_init(context, /)
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
-
Parameters:
-
- self (BaseModel) – The BaseModel instance.
- context (Any) – The context.
-
Return type:
-
None
to_params()
to_params(dtype, n_kv_heads, head_dim, num_layers, devices, data_parallel_degree=1, is_mla=False, num_q_heads=None, kvcache_quant_config=None, num_eagle_speculative_tokens=0)
Returns KVCacheParams built from this config.
-
Parameters:
-
- dtype (DType) – Data type for KV cache storage.
- n_kv_heads (int) – Total number of KV heads across all devices.
- head_dim (int) – Dimension of each attention head.
- num_layers (int) – Number of model layers.
- devices (Sequence[DeviceRef]) – Devices that host the KV cache.
- data_parallel_degree (int) – Degree of data parallelism.
- is_mla (bool) – Whether the model uses Multi-Latent Attention.
- num_q_heads (int | None) – Number of query attention heads. Required when is_mla is True.
- kvcache_quant_config (KVCacheQuantizationConfig | None) – KV cache quantization configuration.
- num_eagle_speculative_tokens (int) – Number of draft tokens to generate for EAGLE speculative decoding.
-
Returns:
-
The constructed KV cache parameters.
-
Return type:
KVCacheParams
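A sketch of calling to_params with the required arguments, assuming the `max` package is available; the import paths for DType and DeviceRef and the model dimensions below are assumptions for illustration:

```python
# Sketch: assumes `max` is installed; the DType and DeviceRef import
# paths below are assumptions based on the MAX Python SDK layout.
from max.dtype import DType
from max.graph import DeviceRef
from max.pipelines import KVCacheConfig

config = KVCacheConfig(kv_cache_page_size=128)

# Illustrative model dimensions; substitute your model's real values.
params = config.to_params(
    dtype=DType.bfloat16,   # data type for KV cache storage
    n_kv_heads=8,           # total KV heads across all devices
    head_dim=128,           # dimension of each attention head
    num_layers=32,          # number of model layers
    devices=[DeviceRef.GPU()],  # devices hosting the KV cache
)
```

Note that when `is_mla=True` is passed, `num_q_heads` must also be supplied, per the parameter list above.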