Python class

KVCacheConfig

class max.pipelines.KVCacheConfig(*, config_file=None, section_name=None, kv_cache_page_size=128, enable_prefix_caching=True, kv_connector=None, kv_connector_config=None, device_memory_utilization=0.9, kv_cache_format=None)

Bases: ConfigFileModel

Configuration for the paged KV cache.

Parameters:

  • config_file (str | None)
  • section_name (str | None)
  • kv_cache_page_size (int)
  • enable_prefix_caching (bool)
  • kv_connector (KVConnectorType | None)
  • kv_connector_config (KVConnectorConfig | None)
  • device_memory_utilization (float)
  • kv_cache_format (str | None)
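
For example, a minimal sketch of constructing a config with non-default settings. The values are illustrative only; the keyword arguments come from the signature above and are keyword-only.

from max.pipelines import KVCacheConfig

# Illustrative, non-default settings; the defaults are
# kv_cache_page_size=128, enable_prefix_caching=True,
# device_memory_utilization=0.9.
config = KVCacheConfig(
    kv_cache_page_size=256,
    device_memory_utilization=0.85,
)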

cache_dtype

property cache_dtype: DType

Returns the data type used for KV cache storage.

device_memory_utilization

device_memory_utilization: float

The fraction of available device memory the process should consume.
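
For example, with the default value of 0.9 on a device with 24 GiB of available memory, roughly 0.9 × 24 ≈ 21.6 GiB is budgeted for this process (illustrative arithmetic; actual usage depends on the model and workload).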

enable_prefix_caching

enable_prefix_caching: bool

Whether to enable prefix caching, which lets requests that share a common prompt prefix reuse the cached KV pages for that prefix instead of recomputing them.

kv_cache_format

kv_cache_format: str | None

An override for the default data type of the KV cache.

kv_cache_page_size

kv_cache_page_size: int

The number of tokens in a single page in the paged KV cache.
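
With the default page size of 128 tokens, a 2048-token sequence occupies 2048 / 128 = 16 pages, while a 2050-token sequence rounds up to 17 (illustrative arithmetic).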

kv_connector

kv_connector: KVConnectorType | None

Type of KV cache connector to use.

kv_connector_config

kv_connector_config: KVConnectorConfig | None

Connector-specific configuration overrides.

model_config

model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'strict': False}

Configuration for the model; should be a dictionary conforming to Pydantic's ConfigDict.
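
Because extra is set to 'forbid', constructing a KVCacheConfig with an unrecognized keyword argument raises a validation error.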

model_post_init()

model_post_init(context, /)

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:

  • self (BaseModel) – The BaseModel instance.
  • context (Any) – The context.

Return type:

None

to_params()

to_params(dtype, n_kv_heads, head_dim, num_layers, devices, data_parallel_degree=1, is_mla=False, num_q_heads=None, kvcache_quant_config=None, num_eagle_speculative_tokens=0)

Returns KVCacheParams built from this config.

Parameters:

  • dtype (DType) – Data type for KV cache storage.
  • n_kv_heads (int) – Total number of KV heads across all devices.
  • head_dim (int) – Dimension of each attention head.
  • num_layers (int) – Number of model layers.
  • devices (Sequence[DeviceRef]) – Devices that host the KV cache.
  • data_parallel_degree (int) – Degree of data parallelism.
  • is_mla (bool) – Whether the model uses Multi-Latent Attention.
  • num_q_heads (int | None) – Number of query attention heads. Required when is_mla is True.
  • kvcache_quant_config (KVCacheQuantizationConfig | None) – KV cache quantization configuration.
  • num_eagle_speculative_tokens (int) – Number of draft tokens to generate for EAGLE speculative decoding.

Returns:

The constructed KV cache parameters.

Return type:

KVCacheParams
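
A minimal sketch of building cache parameters from a config. The model dimensions are hypothetical, chosen only for illustration, and the DType and DeviceRef import paths are assumptions about the surrounding MAX API rather than part of this method's contract.

from max.dtype import DType
from max.graph import DeviceRef
from max.pipelines import KVCacheConfig

config = KVCacheConfig()  # defaults: 128-token pages, prefix caching enabled

# Hypothetical model dimensions, chosen only for illustration.
params = config.to_params(
    dtype=DType.bfloat16,
    n_kv_heads=8,
    head_dim=128,
    num_layers=32,
    devices=[DeviceRef.GPU(0)],
)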