Python class

`ArchConfigWithAttentionKVCache`

class max.pipelines.lib.interfaces.ArchConfigWithAttentionKVCache(dtype, devices=<factory>, cache_dtype=None, kv_cache=<factory>, data_parallel_degree=1, user_provided_max_length=None, huggingface_config=None, _kv_params=None)

Bases: ArchConfigWithKVCache, ABC

Predefined configuration for architectures that use attention KV cache blocks.

Subclasses must define the following attributes:

num_key_value_heads: int
head_dim: int
num_layers: int
model_max_seq_len: int
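
The subclass contract above can be sketched with a stand-in base class. This is a minimal illustration, not the MAX implementation: `AttentionKVCacheConfigSketch`, `ToyArchConfig`, and the dictionary keys are hypothetical names chosen for the example, and the real base class also carries the dataclass fields listed under Parameters.

```python
from abc import ABC, abstractmethod


class AttentionKVCacheConfigSketch(ABC):
    """Stand-in for the real base class (assumption, not MAX code)."""

    @property
    @abstractmethod
    def num_key_value_heads(self) -> int: ...

    @property
    @abstractmethod
    def head_dim(self) -> int: ...

    @property
    @abstractmethod
    def num_layers(self) -> int: ...

    @property
    @abstractmethod
    def model_max_seq_len(self) -> int: ...


class ToyArchConfig(AttentionKVCacheConfigSketch):
    """Hypothetical architecture that derives values from a config dict."""

    def __init__(self, hf_config: dict) -> None:
        # The dict stands in for a Hugging Face config object.
        self._hf = hf_config

    @property
    def num_key_value_heads(self) -> int:
        return self._hf["num_key_value_heads"]

    @property
    def head_dim(self) -> int:
        # Common convention: hidden size split evenly across attention heads.
        return self._hf["hidden_size"] // self._hf["num_attention_heads"]

    @property
    def num_layers(self) -> int:
        return self._hf["num_hidden_layers"]

    @property
    def model_max_seq_len(self) -> int:
        return self._hf["max_position_embeddings"]


toy = ToyArchConfig(
    {
        "num_key_value_heads": 8,
        "hidden_size": 4096,
        "num_attention_heads": 32,
        "num_hidden_layers": 32,
        "max_position_embeddings": 8192,
    }
)
```

Because the four attributes are abstract properties, Python refuses to instantiate a subclass that omits any of them, so a missing definition surfaces as a `TypeError` at construction time rather than later during cache setup.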

Parameters:

dtype (DType)
devices (list[DeviceRef])
cache_dtype (DType | None)
kv_cache (KVCacheConfig)
data_parallel_degree (int)
user_provided_max_length (int | None)
huggingface_config (AutoConfig | None)
_kv_params (KVCacheParams | None)

`cache_dtype`

cache_dtype: DType | None = None

The data type to use for the KV cache.

`data_parallel_degree`

data_parallel_degree: int = 1

The data parallel degree to use when running the model.

`devices`

devices: list[DeviceRef]

The physical devices to use when running the model.

`dtype`

dtype: DType

The data type to use for the model.

`get_kv_params()`

get_kv_params()

Returns the KV cache parameters for this architecture.

Return type: KVCacheParams

`get_max_seq_len()`

get_max_seq_len()

Returns the maximum sequence length the model can process.

Returns user_provided_max_length if set, otherwise model_max_seq_len. Raises ValueError if user_provided_max_length exceeds model_max_seq_len.

Return type: int
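
The resolution rule described for get_max_seq_len() can be sketched as a standalone function. This is an illustration of the documented behavior, not the library's code: `resolve_max_seq_len` is a hypothetical helper, and it assumes the user-supplied maximum length in the description is the value stored in the `user_provided_max_length` field.

```python
from __future__ import annotations


def resolve_max_seq_len(
    user_provided_max_length: int | None,
    model_max_seq_len: int,
) -> int:
    """Pick the effective maximum sequence length (hypothetical helper)."""
    if user_provided_max_length is None:
        # No override: fall back to what the model itself supports.
        return model_max_seq_len
    if user_provided_max_length > model_max_seq_len:
        # An override may shrink the limit but never extend it.
        raise ValueError(
            f"max length {user_provided_max_length} exceeds the model "
            f"maximum of {model_max_seq_len}"
        )
    return user_provided_max_length


default_len = resolve_max_seq_len(None, 8192)   # 8192
capped_len = resolve_max_seq_len(2048, 8192)    # 2048
```

Under this rule the override can only lower the limit; a request above the model maximum fails early instead of being silently clamped.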

`head_dim`

abstract property head_dim: int

Dimensionality of each attention head.

`huggingface_config`

huggingface_config: AutoConfig | None = None

`initialize()`

classmethod initialize(pipeline_config, model_config=None)

Initialize the config from a PipelineConfig.

Parameters:

pipeline_config (PipelineConfig) – The pipeline configuration.
model_config (MAXModelConfig | None) – The model configuration to read from. When None (the default), pipeline_config.model is used. Pass an explicit config (e.g. pipeline_config.draft_model) to initialize the arch config for a different model.

Return type: Self
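
The model_config fallback can be illustrated with stand-in types. Everything here is hypothetical scaffolding for the example: `ModelConfigSketch`, `PipelineConfigSketch`, `ArchConfigSketch`, and the `model_path` field are not the MAX classes, and the real initialize() reads far more than one field.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ModelConfigSketch:
    """Stand-in for a model configuration (assumption)."""
    model_path: str


@dataclass
class PipelineConfigSketch:
    """Stand-in for a pipeline configuration (assumption)."""
    model: ModelConfigSketch
    draft_model: ModelConfigSketch | None = None


@dataclass
class ArchConfigSketch:
    model_path: str

    @classmethod
    def initialize(
        cls,
        pipeline_config: PipelineConfigSketch,
        model_config: ModelConfigSketch | None = None,
    ) -> ArchConfigSketch:
        # Default to the pipeline's main model when no config is passed.
        if model_config is None:
            model_config = pipeline_config.model
        return cls(model_path=model_config.model_path)


pipeline = PipelineConfigSketch(
    model=ModelConfigSketch("main-model"),
    draft_model=ModelConfigSketch("draft-model"),
)
main_arch = ArchConfigSketch.initialize(pipeline)
draft_arch = ArchConfigSketch.initialize(pipeline, pipeline.draft_model)
```

Passing the second argument lets one pipeline configuration produce arch configs for both the main model and a draft model without duplicating the pipeline settings.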

`kv_cache`

kv_cache: KVCacheConfig

The KV cache configuration to use when running the model.

`model_max_seq_len`

abstract property model_max_seq_len: int

The maximum sequence length that can be processed by the model.

`num_key_value_heads`

abstract property num_key_value_heads: int

Number of key-value heads to use for the KV cache.

`num_layers`

abstract property num_layers: int

Number of hidden layers in the model.
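
Together with num_key_value_heads and head_dim, num_layers determines how much memory a dense attention KV cache needs per token. The sketch below uses the standard back-of-the-envelope estimate; `kv_cache_bytes_per_token` is a hypothetical helper, the sizes are made up, and the estimate does not necessarily match how MAX allocates cache memory (paging, sharding, and padding are ignored).

```python
def kv_cache_bytes_per_token(
    num_layers: int,
    num_key_value_heads: int,
    head_dim: int,
    bytes_per_element: int,
) -> int:
    """Estimate KV cache bytes stored for one token (hypothetical helper)."""
    # Factor of 2: each layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_key_value_heads * head_dim * bytes_per_element


# Made-up example: 32 layers, 8 KV heads, head_dim 128, 2-byte cache dtype.
per_token = kv_cache_bytes_per_token(32, 8, 128, 2)  # 131072 bytes (128 KiB)
per_sequence = per_token * 8192                      # 1073741824 bytes (1 GiB)
```

The estimate scales linearly in every factor, which is why a smaller cache_dtype or fewer key-value heads reduces the cache footprint proportionally.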

`user_provided_max_length`

user_provided_max_length: int | None = None

Override for the maximum sequence length.
