Python class

KVCacheConfig

class max.pipelines.kv_cache.KVCacheConfig(*, config_file=None, section_name=None, kv_cache_page_size=128, enable_prefix_caching=True, kv_connector=None, kv_connector_config=None, device_memory_utilization=0.9, kv_cache_format=None)

Bases: ConfigFileModel

Configuration for the paged KV cache.

cache_dtype​

property cache_dtype: DType

Returns the data type used for KV cache storage.

config_file​

config_file: str | None

Path to the configuration file.

device_memory_utilization​

device_memory_utilization: float

The fraction of available device memory the process should consume.
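
As a rough illustration of what a utilization fraction means, the sketch below turns a device memory size into a byte budget. The helper name `memory_budget_bytes` and the 80 GiB figure are hypothetical, and the way MAX measures available memory or divides it between weights and the KV cache is not described on this page.

```python
def memory_budget_bytes(available_bytes: int, utilization: float = 0.9) -> int:
    """Return how many bytes a process may use under a utilization cap.

    Hypothetical helper for illustration only; not part of the MAX API.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in the range (0, 1]")
    return int(available_bytes * utilization)


GIB = 1024**3

# Assumed device with 80 GiB of available memory and the default fraction 0.9.
budget = memory_budget_bytes(80 * GIB, utilization=0.9)
print(f"budget is roughly {budget / GIB:.1f} GiB")
```

Lowering the fraction leaves more headroom for other processes on the same device, at the cost of a smaller budget.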

enable_prefix_caching​

enable_prefix_caching: bool

Whether to enable prefix caching for the paged KV cache.
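
The general idea behind prefix caching in a paged cache can be sketched with a toy model: a full page is reusable only if every token up to the end of that page matches a previously seen request. The function names, the set-based store, and the page size of 4 below are all hypothetical; this is a conceptual sketch and not how MAX implements the feature.

```python
def register_pages(cached_prefixes: set, tokens: list, page_size: int) -> None:
    """Record every full-page prefix of `tokens` as cached (toy model)."""
    for end in range(page_size, len(tokens) + 1, page_size):
        cached_prefixes.add(tuple(tokens[:end]))


def count_reusable_pages(cached_prefixes: set, tokens: list, page_size: int) -> int:
    """Count leading full pages of `tokens` whose whole prefix is cached."""
    reused = 0
    for end in range(page_size, len(tokens) + 1, page_size):
        if tuple(tokens[:end]) not in cached_prefixes:
            break  # the first mismatch invalidates all later pages
        reused += 1
    return reused


cache = set()
first_request = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
register_pages(cache, first_request, page_size=4)  # caches prefixes of length 4 and 8

# Shares the first 8 tokens with the first request, then diverges.
second_request = [0, 1, 2, 3, 4, 5, 6, 7, 99, 98, 97, 96]
reused_pages = count_reusable_pages(cache, second_request, page_size=4)
print(reused_pages)
```

A real implementation would typically key pages by a hash rather than storing token tuples, and would also handle eviction, which this sketch leaves out.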

kv_cache_format​

kv_cache_format: str | None

An override for the default data type of the KV cache.

kv_cache_page_size​

kv_cache_page_size: int

The number of tokens in a single page in the paged KV cache.
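
To make the page size concrete, the sketch below counts how many pages a sequence occupies when each page holds a fixed number of tokens. The helper `pages_needed` is hypothetical and assumes pages are allocated whole, so a partially filled last page still counts as one page.

```python
def pages_needed(num_tokens: int, page_size: int = 128) -> int:
    """Number of fixed-size pages required to hold `num_tokens` tokens.

    Hypothetical helper for illustration only; not part of the MAX API.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    # Ceiling division: a partially filled last page still occupies a page.
    return -(-num_tokens // page_size)


# 1000 tokens at 128 tokens per page: 7 full pages (896 tokens) plus 1 partial.
print(pages_needed(1000))  # 8
```

Under this assumption, a smaller page size wastes fewer slots in the final page but means more pages to track per sequence.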

kv_connector​

kv_connector: KVConnectorType | None

Type of KV cache connector to use.

kv_connector_config​

kv_connector_config: KVConnectorConfig | None

Connector-specific configuration overrides.

model_config​

model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'strict': False}

Configuration for the pydantic model. It should be a dictionary conforming to pydantic.config.ConfigDict.

model_post_init()​

model_post_init(context, /)

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:

  • self (BaseModel) – The BaseModel instance.
  • context (Any) – The context.

Return type:

None

section_name​

section_name: str | None

Optional section name for comprehensive/multi-section config files.

If not provided, values are loaded from the YAML top-level (treating the file as an “individual config” file).
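
The two file layouts described above can be sketched with plain dictionaries standing in for parsed YAML. The helper `select_config_values` and the section name `kv_cache` are hypothetical; the actual loading logic lives in ConfigFileModel and may differ.

```python
from __future__ import annotations

from typing import Any


def select_config_values(
    parsed: dict[str, Any], section_name: str | None = None
) -> dict[str, Any]:
    """Pick the mapping that holds the config values.

    Hypothetical helper; `parsed` stands in for the result of parsing a
    YAML config file.
    """
    if section_name is None:
        # Individual config file: values live at the top level.
        return parsed
    if section_name not in parsed:
        raise KeyError(f"section {section_name!r} not found in config file")
    # Multi-section config file: values live under the named section.
    return parsed[section_name]


individual = {"kv_cache_page_size": 256}
multi_section = {
    "kv_cache": {"kv_cache_page_size": 256},  # assumed section name
    "other": {"some_option": 1},
}

print(select_config_values(individual))
print(select_config_values(multi_section, section_name="kv_cache"))
```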

to_params()​

to_params(dtype, n_kv_heads, head_dim, num_layers, devices, data_parallel_degree=1, is_mla=False, num_q_heads=None, kvcache_quant_config=None, speculative_method=None, num_draft_tokens=0)

Returns KVCacheParams built from this config.

Parameters:

  • dtype (DType) – Data type for KV cache storage.
  • n_kv_heads (int) – Total number of KV heads across all devices.
  • head_dim (int) – Dimension of each attention head.
  • num_layers (int) – Number of model layers.
  • devices (Sequence[DeviceRef]) – Devices that host the KV cache.
  • data_parallel_degree (int) – Degree of data parallelism.
  • is_mla (bool) – Whether the model uses Multi-head Latent Attention (MLA).
  • num_q_heads (int | None) – Number of query attention heads. Required when is_mla is True.
  • kvcache_quant_config (KVCacheQuantizationConfig | None) – KV cache quantization configuration.
  • speculative_method (Literal['standalone', 'eagle', 'mtp', 'dflash'] | None) – Speculative decoding method propagated from SpeculativeConfig. None when speculative decoding is disabled.
  • num_draft_tokens (int) – Total number of draft tokens generated per speculative iteration. Zero when speculative decoding is disabled.

Returns:

The constructed KV cache parameters.

Return type:

KVCacheParams
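
To show how the shape arguments of to_params() relate to cache size, the sketch below applies the common back-of-the-envelope estimate for standard multi-head or grouped-query attention. The helper `kv_bytes_per_token` and the model shape are hypothetical, and the formula is an assumption for illustration; it is not a statement of what KVCacheParams computes, and it does not apply to MLA models.

```python
def kv_bytes_per_token(
    num_layers: int, n_kv_heads: int, head_dim: int, dtype_bytes: int
) -> int:
    """Estimate bytes of keys and values stored per token across all layers.

    Hypothetical helper for illustration only; not part of the MAX API.
    """
    # Factor of 2: one key vector and one value vector per head, per layer.
    return 2 * num_layers * n_kv_heads * head_dim * dtype_bytes


# Assumed model shape: 32 layers, 8 KV heads, head_dim 128, 2-byte dtype.
per_token = kv_bytes_per_token(
    num_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2
)

# One page at the default kv_cache_page_size of 128 tokens.
per_page = per_token * 128

print(per_token)  # 131072 bytes (128 KiB)
print(per_page)   # 16777216 bytes (16 MiB)
```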