Python class
KVCacheConfig
class max.pipelines.KVCacheConfig(*, config_file=None, section_name=None, kv_cache_page_size=128, enable_prefix_caching=True, kv_connector=None, kv_connector_config=None, device_memory_utilization=0.9, kv_cache_format=None)
Bases: ConfigFileModel
Configuration for the paged KV cache.
-
Parameters:
cache_dtype
property cache_dtype: DType
Returns the data type used for KV cache storage.
device_memory_utilization
device_memory_utilization: float
The fraction of available device memory the process should consume.
enable_prefix_caching
enable_prefix_caching: bool
Whether to enable prefix caching for the paged KV cache.
kv_cache_format
An override for the default data type of the KV cache.
kv_cache_page_size
kv_cache_page_size: int
The number of tokens in a single page in the paged KV cache.
kv_connector
kv_connector: KVConnectorType | None
Type of KV cache connector to use.
kv_connector_config
kv_connector_config: KVConnectorConfig | None
Connector-specific configuration overrides.
model_config
model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'strict': False}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
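The fields above can all be passed as keyword arguments to the constructor shown in the class signature. A minimal sketch of building a config, assuming the `max` package is installed; the values below are illustrative, not defaults:

```python
# Sketch: assumes `max` is installed and KVCacheConfig is importable
# from max.pipelines as documented in this reference.
from max.pipelines import KVCacheConfig

# Keyword arguments mirror the constructor signature above;
# the chosen values are illustrative.
config = KVCacheConfig(
    kv_cache_page_size=128,
    enable_prefix_caching=True,
    device_memory_utilization=0.85,
)

print(config.kv_cache_page_size)
print(config.enable_prefix_caching)
```

Because `model_config` sets `'extra': 'forbid'`, passing an unrecognized keyword argument raises a validation error rather than being silently ignored.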
model_post_init()
model_post_init(context, /)
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
-
Parameters:
-
- self (BaseModel) – The BaseModel instance.
- context (Any) – The context.
-
Return type:
-
None
to_params()
to_params(dtype, n_kv_heads, head_dim, num_layers, devices, data_parallel_degree=1, is_mla=False, num_q_heads=None, kvcache_quant_config=None, num_eagle_speculative_tokens=0)
Returns KVCacheParams built from this config.
-
Parameters:
-
- dtype (DType) – Data type for KV cache storage.
- n_kv_heads (int) – Total number of KV heads across all devices.
- head_dim (int) – Dimension of each attention head.
- num_layers (int) – Number of model layers.
- devices (Sequence[DeviceRef]) – Devices that host the KV cache.
- data_parallel_degree (int) – Degree of data parallelism.
- is_mla (bool) – Whether the model uses Multi-Latent Attention.
- num_q_heads (int | None) – Number of query attention heads. Required when is_mla is True.
- kvcache_quant_config (KVCacheQuantizationConfig | None) – KV cache quantization configuration.
- num_eagle_speculative_tokens (int) – Number of draft tokens to generate for EAGLE speculative decoding.
-
Returns:
-
The constructed KV cache parameters.
-
Return type:
KVCacheParams
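A sketch of calling to_params with the required arguments, assuming the `max` package is available; the import paths for DType and DeviceRef and the model dimensions below are assumptions for illustration:

```python
# Sketch: assumes `max` is installed; the DType and DeviceRef import
# paths below are assumptions based on the MAX Python SDK layout.
from max.dtype import DType
from max.graph import DeviceRef
from max.pipelines import KVCacheConfig

config = KVCacheConfig(kv_cache_page_size=128)

# Illustrative model dimensions; substitute your model's real values.
params = config.to_params(
    dtype=DType.bfloat16,   # data type for KV cache storage
    n_kv_heads=8,           # total KV heads across all devices
    head_dim=128,           # dimension of each attention head
    num_layers=32,          # number of model layers
    devices=[DeviceRef.GPU()],  # devices hosting the KV cache
)
```

Note that when `is_mla=True` is passed, `num_q_heads` must also be supplied, per the parameter list above.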