Python class

ArchConfigWithAttentionKVCache

class max.pipelines.lib.interfaces.ArchConfigWithAttentionKVCache(dtype, devices=<factory>, cache_dtype=None, kv_cache=<factory>, data_parallel_degree=1, user_provided_max_length=None, huggingface_config=None, _kv_params=None)

Bases: ArchConfigWithKVCache, ABC

Predefined configuration for architectures that use attention KV cache blocks.

Subclasses must define the following attributes:

  • num_key_value_heads: int
  • head_dim: int
  • num_layers: int
  • model_max_seq_len: int
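
As a rough sketch of what a subclass has to supply, the example below uses a stand-in abstract base class rather than the real ArchConfigWithAttentionKVCache, so it runs without MAX installed. The names AttentionKVCacheShape and ExampleArchConfig and all numeric values are hypothetical; the real class also has constructor fields and inherited requirements that are not reproduced here.

```python
from abc import ABC, abstractmethod


class AttentionKVCacheShape(ABC):
    """Hypothetical stand-in that mirrors only the four documented attributes.

    This is NOT the real max.pipelines class.
    """

    @property
    @abstractmethod
    def num_key_value_heads(self) -> int: ...

    @property
    @abstractmethod
    def head_dim(self) -> int: ...

    @property
    @abstractmethod
    def num_layers(self) -> int: ...

    @property
    @abstractmethod
    def model_max_seq_len(self) -> int: ...


class ExampleArchConfig(AttentionKVCacheShape):
    """Hypothetical subclass; the values are made up for illustration."""

    @property
    def num_key_value_heads(self) -> int:
        return 8

    @property
    def head_dim(self) -> int:
        return 128

    @property
    def num_layers(self) -> int:
        return 32

    @property
    def model_max_seq_len(self) -> int:
        return 4096


# Instantiation succeeds only because all four properties are defined.
config = ExampleArchConfig()
```

Leaving out any one of the four properties makes the subclass abstract, so Python raises TypeError when it is instantiated.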

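Parameters:

  • dtype (DType)
  • devices (list[DeviceRef])
  • cache_dtype (DType | None)
  • kv_cache (KVCacheConfig)
  • data_parallel_degree (int)
  • user_provided_max_length (int | None)
  • huggingface_config (AutoConfig | None)
  • _kv_params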

cache_dtype

cache_dtype: DType | None = None

The data type to use for the KV cache.

data_parallel_degree

data_parallel_degree: int = 1

The data parallel degree to use when running the model.

devices

devices: list[DeviceRef]

The physical devices to use when running the model.

dtype

dtype: DType

The data type to use for the model.

get_kv_params()

get_kv_params()

Returns the KV cache parameters for this architecture.

Return type:

KVCacheParams

get_max_seq_len()

get_max_seq_len()

Returns the maximum sequence length the model can process.

Returns the user-provided maximum length (user_provided_max_length) if it is set, otherwise model_max_seq_len. Raises ValueError if the user-provided value exceeds model_max_seq_len.

Return type:

int
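
The fallback rule described for get_max_seq_len() can be sketched as a standalone function. This is a minimal illustration of the documented behavior, not the MAX implementation: the helper name resolve_max_seq_len and its parameter names are hypothetical, and override stands for the user-supplied maximum length.

```python
from typing import Optional


def resolve_max_seq_len(override: Optional[int], model_max_seq_len: int) -> int:
    """Hypothetical helper mirroring the documented rule."""
    # No override supplied: use the model's own limit.
    if override is None:
        return model_max_seq_len
    # An override may shrink the limit but must not exceed what the model supports.
    if override > model_max_seq_len:
        raise ValueError(
            f"max length {override} exceeds model_max_seq_len {model_max_seq_len}"
        )
    return override
```

For example, resolve_max_seq_len(None, 4096) falls back to the model limit, while resolve_max_seq_len(8192, 4096) raises ValueError.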

head_dim

abstract property head_dim: int

Dimensionality of each attention head.

huggingface_config

huggingface_config: AutoConfig | None = None

initialize()

classmethod initialize(pipeline_config, model_config=None)

Initializes the config from a PipelineConfig.

Parameters:

  • pipeline_config (PipelineConfig) – The pipeline configuration.
  • model_config (MAXModelConfig | None) – The model configuration to read from. When None (the default), pipeline_config.model is used. Pass an explicit config (e.g. pipeline_config.draft_model) to initialize the arch config for a different model.

Return type:

Self
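
The model_config default described above amounts to a simple fallback, sketched here with stand-in dataclasses so it runs without MAX installed. FakeModelConfig, FakePipelineConfig, select_model_config, and the model_path field are all hypothetical names for illustration; only the model and draft_model attributes come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeModelConfig:
    """Hypothetical stand-in for MAXModelConfig."""

    model_path: str  # hypothetical field, used only to tell configs apart


@dataclass
class FakePipelineConfig:
    """Hypothetical stand-in for PipelineConfig."""

    model: FakeModelConfig
    draft_model: Optional[FakeModelConfig] = None


def select_model_config(
    pipeline_config: FakePipelineConfig,
    model_config: Optional[FakeModelConfig] = None,
) -> FakeModelConfig:
    """Pick the config to read from, following the documented default."""
    # When no explicit config is passed, fall back to pipeline_config.model.
    if model_config is None:
        return pipeline_config.model
    return model_config


pipeline = FakePipelineConfig(
    model=FakeModelConfig("main-model"),
    draft_model=FakeModelConfig("draft-model"),
)
main_cfg = select_model_config(pipeline)
draft_cfg = select_model_config(pipeline, pipeline.draft_model)
```

Passing the draft model explicitly is how one arch config class can be initialized for two different models in the same pipeline.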

kv_cache

kv_cache: KVCacheConfig

The KV cache configuration to use when running the model.

model_max_seq_len

abstract property model_max_seq_len: int

The maximum sequence length that can be processed by the model.

num_key_value_heads

abstract property num_key_value_heads: int

Number of key-value heads to use for the KV cache.

num_layers

abstract property num_layers: int

Number of hidden layers in the model.

user_provided_max_length

user_provided_max_length: int | None = None

Optional user-provided override for the maximum sequence length. It must not exceed model_max_seq_len.