Python class

ArchConfigWithAttentionKVCache

class max.pipelines.lib.interfaces.ArchConfigWithAttentionKVCache(dtype, devices=<factory>, cache_dtype=None, kv_cache=<factory>, data_parallel_degree=1, user_provided_max_length=None, huggingface_config=None, _kv_params=None)

Bases: ArchConfigWithKVCache, ABC

Predefined configuration for architectures that use attention KV cache blocks.

Subclasses must define the following attributes:

  • num_key_value_heads: int
  • head_dim: int
  • num_layers: int
  • model_max_seq_len: int
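As a rough illustration of the subclass contract above, the following sketch uses a stand-in abstract base class rather than the real `ArchConfigWithAttentionKVCache`, so it runs without MAX installed. The names `AttentionKVCacheConfigSketch` and `ToyModelConfig`, and all of the dataclass fields, are hypothetical; only the four property names come from this page.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class AttentionKVCacheConfigSketch(ABC):
    """Stand-in base class (an assumption, not the MAX class).

    It models only the four abstract properties listed above.
    """

    @property
    @abstractmethod
    def num_key_value_heads(self) -> int: ...

    @property
    @abstractmethod
    def head_dim(self) -> int: ...

    @property
    @abstractmethod
    def num_layers(self) -> int: ...

    @property
    @abstractmethod
    def model_max_seq_len(self) -> int: ...


@dataclass
class ToyModelConfig(AttentionKVCacheConfigSketch):
    # Hypothetical fields, loosely in the style of a model config file.
    hidden_size: int
    num_attention_heads: int
    kv_heads: int
    hidden_layers: int
    max_position_embeddings: int

    @property
    def num_key_value_heads(self) -> int:
        return self.kv_heads

    @property
    def head_dim(self) -> int:
        # Assumes the hidden size splits evenly across attention heads.
        return self.hidden_size // self.num_attention_heads

    @property
    def num_layers(self) -> int:
        return self.hidden_layers

    @property
    def model_max_seq_len(self) -> int:
        return self.max_position_embeddings


cfg = ToyModelConfig(
    hidden_size=4096,
    num_attention_heads=32,
    kv_heads=8,
    hidden_layers=32,
    max_position_embeddings=8192,
)
print(cfg.head_dim)  # 128
```

Because the properties are abstract, instantiating the stand-in base class directly (or a subclass that omits one of them) raises `TypeError`.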

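Parameters:

  • dtype (DType)
  • devices (list[DeviceRef])
  • cache_dtype (DType | None)
  • kv_cache (KVCacheConfig)
  • data_parallel_degree (int)
  • user_provided_max_length (int | None)
  • huggingface_config (AutoConfig | None)
  • _kv_params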

cache_dtype​

cache_dtype: DType | None = None

The data type to use for the KV cache.

data_parallel_degree​

data_parallel_degree: int = 1

The data parallel degree to use when running the model.

devices​

devices: list[DeviceRef]

The physical devices to use when running the model.

dtype​

dtype: DType

The data type to use for the model.

get_kv_params()​

get_kv_params()

Returns the KV cache parameters for this architecture.

Return type:

KVCacheParams

get_max_seq_len()​

get_max_seq_len()

Returns the maximum sequence length the model can process.

Returns the user-provided maximum length (user_provided_max_length) if it is set, otherwise model_max_seq_len. Raises ValueError if the user-provided value exceeds model_max_seq_len.

Return type:

int
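The resolution rule described for get_max_seq_len() can be sketched as a small standalone function. This is not the MAX implementation; it assumes the override is the `user_provided_max_length` field, and the function name `resolve_max_seq_len` and the error message are made up for illustration.

```python
from typing import Optional


def resolve_max_seq_len(
    user_provided_max_length: Optional[int],
    model_max_seq_len: int,
) -> int:
    """Pick the effective maximum sequence length (illustrative only)."""
    # No override: fall back to the model's own limit.
    if user_provided_max_length is None:
        return model_max_seq_len
    # An override may shrink the limit but must not exceed the model's limit.
    if user_provided_max_length > model_max_seq_len:
        raise ValueError(
            f"max length {user_provided_max_length} exceeds the model's "
            f"maximum sequence length {model_max_seq_len}"
        )
    return user_provided_max_length


print(resolve_max_seq_len(None, 8192))  # 8192
print(resolve_max_seq_len(4096, 8192))  # 4096
```

Calling `resolve_max_seq_len(16384, 8192)` raises `ValueError`, mirroring the documented behavior when the override is larger than the model limit.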

head_dim​

abstract property head_dim: int

Dimensionality of each attention head.

huggingface_config​

huggingface_config: AutoConfig | None = None

initialize()​

classmethod initialize(pipeline_config, model_config=None)

Initialize the config from a PipelineConfig.

Parameters:

  • pipeline_config (PipelineConfig) – The pipeline configuration.
  • model_config (MAXModelConfig | None) – The model configuration to read from. When None (the default), pipeline_config.model is used. Pass an explicit config (e.g. pipeline_config.draft_model) to initialize the arch config for a different model.

Return type:

Self
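The model_config fallback described above can be pictured with stand-in types. The classes `ModelConfigStub` and `PipelineConfigStub`, the `name` field, and the helper `select_model_config` are all hypothetical; they only mirror the documented rule that `pipeline_config.model` is used when `model_config` is None, and say nothing about how MAX implements it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfigStub:
    # Hypothetical field; the real MAXModelConfig is not modeled here.
    name: str


@dataclass
class PipelineConfigStub:
    model: ModelConfigStub
    draft_model: Optional[ModelConfigStub] = None


def select_model_config(
    pipeline_config: PipelineConfigStub,
    model_config: Optional[ModelConfigStub] = None,
) -> ModelConfigStub:
    """Return the config to read from, defaulting to the main model."""
    return pipeline_config.model if model_config is None else model_config


pipeline = PipelineConfigStub(
    model=ModelConfigStub("target"),
    draft_model=ModelConfigStub("draft"),
)

# Default: the main model's config is used.
main_cfg = select_model_config(pipeline)
# Explicit: pass the draft model's config instead.
draft_cfg = select_model_config(pipeline, pipeline.draft_model)
print(main_cfg.name, draft_cfg.name)  # target draft
```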

kv_cache​

kv_cache: KVCacheConfig

The KV cache configuration to use when running the model.

model_max_seq_len​

abstract property model_max_seq_len: int

The maximum sequence length that can be processed by the model.

num_key_value_heads​

abstract property num_key_value_heads: int

Number of key-value heads to use for the KV cache.

num_layers​

abstract property num_layers: int

Number of hidden layers in the model.
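To see why num_key_value_heads, head_dim, and num_layers matter for the KV cache, the sketch below applies the usual back-of-the-envelope sizing for a standard attention cache: one key vector and one value vector per KV head, per layer, per token. It is generic arithmetic, not a description of how MAX allocates cache blocks; the function name and the example values are assumptions.

```python
def kv_cache_bytes_per_token(
    num_layers: int,
    num_key_value_heads: int,
    head_dim: int,
    bytes_per_element: int,
) -> int:
    """Estimate KV cache bytes needed for a single token."""
    # Factor of 2: each layer stores both a key and a value per KV head.
    return 2 * num_layers * num_key_value_heads * head_dim * bytes_per_element


# Example values (assumed): 32 layers, 8 KV heads, head_dim 128,
# and a 2-byte cache element type such as bfloat16.
per_token = kv_cache_bytes_per_token(32, 8, 128, 2)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# A single 8192-token sequence would then need about 1 GiB of cache.
print(per_token * 8192)  # 1073741824
```

The estimate scales linearly in each argument, which is why a smaller cache_dtype or fewer key-value heads directly reduces cache memory.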

user_provided_max_length​

user_provided_max_length: int | None = None

Override for the maximum sequence length.