Python class

`KVCacheConfig`

class max.pipelines.kv_cache.KVCacheConfig(*, config_file=None, section_name=None, kv_cache_page_size=128, enable_prefix_caching=True, enable_dp_cross_replica_prefix_copy=True, kv_connector=None, kv_connector_config=None, device_memory_utilization=0.9, allow_kv_head_replication=False, kv_cache_format=None, kv_cache_hash_algo='ahash64', kv_cache_hash_seed=None)

Bases: ConfigFileModel

Configuration for the paged KV cache.

Parameters:

config_file (str | None)
section_name (str | None)
kv_cache_page_size (int)
enable_prefix_caching (bool)
enable_dp_cross_replica_prefix_copy (bool)
kv_connector (KVConnectorType | None)
kv_connector_config (KVConnectorConfig | None)
device_memory_utilization (float)
allow_kv_head_replication (bool)
kv_cache_format (str | None)
kv_cache_hash_algo (Literal['ahash64', 'sha256', 'sha256_64'])
kv_cache_hash_seed (str | None)
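
As a minimal sketch, the keyword-only arguments listed above can be collected in a dictionary and passed to the constructor. The values are illustrative only, not recommendations, and the import guard is an assumption added so the snippet still runs where MAX is not installed.

```python
# Illustrative overrides; every key is a keyword-only parameter from the
# signature above. The values are examples, not tuned recommendations.
kv_cache_kwargs = {
    "kv_cache_page_size": 256,
    "enable_prefix_caching": True,
    "device_memory_utilization": 0.8,
}

try:
    from max.pipelines.kv_cache import KVCacheConfig
except ImportError:
    # MAX is not installed in this environment; keep the kwargs for reference.
    KVCacheConfig = None

config = None
if KVCacheConfig is not None:
    # Unknown keys would be rejected because model_config sets extra='forbid'.
    config = KVCacheConfig(**kv_cache_kwargs)
    print(config.kv_cache_page_size)
```

Parameters that are left out keep the defaults shown in the class signature.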

`allow_kv_head_replication`

allow_kv_head_replication: bool

Default value for the allow_kv_head_replication argument of to_params().

`cache_dtype`

property cache_dtype: DType

Returns the data type used for KV cache storage.

`config_file`

config_file: str | None

Path to the configuration file.

`device_memory_utilization`

device_memory_utilization: float

The fraction of available device memory the process should consume.
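
To make the fraction concrete, the sketch below turns a free-memory figure into a byte budget. The helper `kv_cache_memory_budget` is hypothetical and not part of MAX; how MAX measures "available" memory is not stated here, and the accepted range of (0, 1] is an assumption.

```python
def kv_cache_memory_budget(free_bytes: int, device_memory_utilization: float = 0.9) -> int:
    """Return the byte budget implied by a utilization fraction.

    Hypothetical helper: the (0, 1] range check is an assumption, not a
    documented constraint of KVCacheConfig.
    """
    if not 0.0 < device_memory_utilization <= 1.0:
        raise ValueError("device_memory_utilization must be in (0, 1]")
    return int(free_bytes * device_memory_utilization)


# Example: an 80 GiB device with the default fraction of 0.9.
free_bytes = 80 * 1024**3
budget = kv_cache_memory_budget(free_bytes, 0.9)
print(f"{budget / 1024**3:.1f} GiB of {free_bytes / 1024**3:.0f} GiB")
```

Lowering the fraction leaves more headroom for other allocations on the same device.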

`enable_dp_cross_replica_prefix_copy`

enable_dp_cross_replica_prefix_copy: bool

Whether data-parallel (DP) cross-replica prefix-cache hits may be served by device-to-device copies.

`enable_prefix_caching`

enable_prefix_caching: bool

Whether to enable prefix caching for the paged KV cache.

`kv_cache_format`

kv_cache_format: str | None

An override for the default data type of the KV cache.

`kv_cache_hash_algo`

kv_cache_hash_algo: KVHashAlgo

Hash algorithm used for KV-cache block identity.
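
One common way to give cache blocks an identity, sketched below, is to hash each full page of token IDs together with the hash of the preceding page, so that two sequences share a block hash only while their prefixes match. This is a generic prefix-caching scheme and not necessarily what MAX does: the function `block_hashes`, the byte layout, and treating a 64-bit variant as a truncated SHA-256 digest are all assumptions. The ahash64 option is not covered because it has no standard-library equivalent.

```python
import hashlib
import struct


def block_hashes(token_ids, page_size, truncate_to_64_bits=False):
    """Hash each full page of tokens, chaining in the previous page's hash.

    Hypothetical sketch of chained block hashing; partial trailing pages
    are skipped because only full pages get an identity here.
    """
    hashes = []
    parent = b""
    for i in range(len(token_ids) // page_size):
        page = token_ids[i * page_size:(i + 1) * page_size]
        # Serialize token IDs as little-endian signed 64-bit integers.
        payload = parent + struct.pack(f"<{len(page)}q", *page)
        digest = hashlib.sha256(payload).digest()
        if truncate_to_64_bits:
            digest = digest[:8]  # Assumption: keep the first 8 bytes only.
        hashes.append(digest.hex())
        parent = digest
    return hashes


a = block_hashes([1, 2, 3, 4, 5, 6, 7, 8], page_size=4)
b = block_hashes([1, 2, 3, 4, 9, 9, 9, 9], page_size=4)
print(a[0] == b[0], a[1] == b[1])
```

The two sequences share their first page, so the first hashes match and the second ones differ.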

`kv_cache_hash_seed`

kv_cache_hash_seed: str | None

Optional 32-byte hex seed for sha256/sha256_64 hashing.
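
A 32-byte seed written in hexadecimal is 64 characters long. The sketch below generates and checks such a string with the standard library; the helper names are hypothetical, and whether MAX accepts prefixes such as `0x` or uppercase digits is not stated here.

```python
import secrets
import string


def new_hash_seed() -> str:
    """Return 32 random bytes encoded as 64 hexadecimal characters."""
    return secrets.token_hex(32)


def is_valid_hash_seed(seed: str) -> bool:
    """Check that a seed is exactly 64 hex characters (32 bytes)."""
    return len(seed) == 64 and all(c in string.hexdigits for c in seed)


seed = new_hash_seed()
print(len(seed), is_valid_hash_seed(seed))
```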

`kv_cache_page_size`

kv_cache_page_size: int

The number of tokens in a single page in the paged KV cache.
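
Because pages hold a fixed number of tokens, a sequence occupies a whole number of pages, rounded up. The helper `pages_needed` below is a hypothetical illustration of that arithmetic, not a MAX function.

```python
def pages_needed(num_tokens: int, page_size: int = 128) -> int:
    """Return how many pages are required to hold num_tokens tokens."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Ceiling division without floating point.
    return -(-num_tokens // page_size)


print(pages_needed(128))  # 1
print(pages_needed(129))  # 2
print(pages_needed(300))  # 3
```

With the default page size of 128, a 129-token sequence already needs a second page, so smaller pages waste less space per sequence at the cost of more pages to track.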

`kv_connector`

kv_connector: KVConnectorType | None

Type of KV cache connector to use.

`kv_connector_config`

kv_connector_config: KVConnectorConfig | None

Connector-specific configuration overrides.

`model_config`

model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'strict': False}

Pydantic configuration for the model. It should be a dictionary conforming to pydantic.config.ConfigDict.

`model_post_init()`

model_post_init(context, /)

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:

self (BaseModel) – The BaseModel instance.
context (Any) – The context.

Return type:

None

`section_name`

section_name: str | None

Optional section name for comprehensive/multi-section config files.

If not provided, values are loaded from the YAML top-level (treating the file as an “individual config” file).
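
The lookup rule described above can be sketched on an already-parsed YAML document, represented here as a plain dictionary. The function `select_section` and the section name `kv_cache` are hypothetical; the actual section names and loader behavior in MAX are not described here.

```python
from typing import Any, Dict, Optional


def select_section(parsed_yaml: Dict[str, Any], section_name: Optional[str] = None) -> Dict[str, Any]:
    """Pick the mapping that holds the config values.

    Hypothetical sketch: with no section name the top level is used
    (an "individual config" file); otherwise the named section is used.
    """
    if section_name is None:
        return parsed_yaml
    return parsed_yaml[section_name]


# An individual config file: values sit at the top level.
individual = {"kv_cache_page_size": 256}

# A multi-section file: values sit under a (hypothetical) section name.
multi_section = {"kv_cache": {"kv_cache_page_size": 256}}

print(select_section(individual))
print(select_section(multi_section, "kv_cache"))
```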

`to_params()`

to_params(dtype, n_kv_heads, head_dim, num_layers, devices, data_parallel_degree=1, is_mla=False, num_q_heads=None, kvcache_quant_config=None, speculative_method=None, num_draft_tokens=0, allow_kv_head_replication=None)

Returns KVCacheParams built from this config.

Selects the attention-type-specific subclass: an MLAKVCacheParams when is_mla is set, otherwise an MHAKVCacheParams.

Parameters:

dtype (DType) – Data type for KV cache storage.
n_kv_heads (int) – Total number of KV heads across all devices.
head_dim (int) – Dimension of each attention head.
num_layers (int) – Number of model layers.
devices (Sequence[DeviceRef]) – Devices that host the KV cache.
data_parallel_degree (int) – Degree of data parallelism.
is_mla (bool) – Whether the model uses Multi-head Latent Attention (MLA).
num_q_heads (int | None) – Number of query attention heads. Required when is_mla is True.
kvcache_quant_config (KVCacheQuantizationConfig | None) – KV cache quantization configuration.
speculative_method (Literal['eagle', 'mtp', 'dflash'] | None) – Speculative decoding method propagated from SpeculativeConfig. None when speculative decoding is disabled.
num_draft_tokens (int) – Total draft tokens generated per speculative iteration. Zero when speculative decoding is disabled.
allow_kv_head_replication (bool | None) – Whether to replicate KV heads when the tensor-parallel (TP) degree exceeds the KV head count. Defaults to None, which falls back to the config’s allow_kv_head_replication.

Returns:

The constructed KV cache parameters.

Return type:

KVCacheParams
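
To show how the shape arguments of to_params() relate to memory use, the sketch below applies the standard estimate for multi-head attention: one key and one value vector per KV head, per layer, per token. The helper `kv_bytes_per_token` is hypothetical, and the estimate is generic rather than MAX's exact accounting; it does not cover MLA or quantization metadata.

```python
def kv_bytes_per_token(n_kv_heads: int, head_dim: int, num_layers: int, dtype_bytes: int) -> int:
    """Estimate KV cache bytes stored per token for multi-head attention."""
    # The factor 2 accounts for one key tensor plus one value tensor per layer.
    return 2 * num_layers * n_kv_heads * head_dim * dtype_bytes


# Example shape: 32 layers, 8 KV heads of dimension 128, 2-byte elements.
per_token = kv_bytes_per_token(n_kv_heads=8, head_dim=128, num_layers=32, dtype_bytes=2)
per_page = per_token * 128  # Default kv_cache_page_size of 128 tokens.

print(per_token)  # 131072 bytes, i.e. 128 KiB per token
print(per_page)   # 16777216 bytes, i.e. 16 MiB per page
```

Dividing a memory budget by the per-page figure gives a rough upper bound on how many pages fit on a device.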
