Python module

config

Configuration classes for MAX pipelines.

AudioGenerationConfig

class max.pipelines.lib.config.AudioGenerationConfig(audio_decoder, audio_decoder_weights='', chunk_size=None, buffer=0, block_causal=False, prepend_prompt_speech_tokens='never', prepend_prompt_speech_tokens_causal=False, run_model_test_mode=False, prometheus_metrics_mode='instrument_only', *, config_file=None, section_name=None, pipeline_role='prefill_and_decode', max_batch_size=None, max_queue_size_tg=None, min_batch_size_tg=None, ep_size=1, ce_delay_ms=0.0, enable_prioritize_first_decode=False, enable_chunked_prefill=True, enable_in_flight_batching=False, max_num_steps=-1, max_batch_input_tokens=8192, zmq_endpoint_base=<factory>, execute_empty_batches=False, max_batch_total_tokens=None, debug_verify_replay=False, enable_overlap_scheduler=False, prefer_module_v3=False, model=<factory>, draft_model=None, sampling=<factory>, profiling=<factory>, lora=None, speculative=None, runtime=<factory>, audio_decoder_config=<factory>)

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

  • audio_decoder (str)
  • audio_decoder_weights (str)
  • chunk_size (list[int] | None)
  • buffer (int)
  • block_causal (bool)
  • prepend_prompt_speech_tokens (Literal['never', 'once', 'rolling'])
  • prepend_prompt_speech_tokens_causal (bool)
  • run_model_test_mode (bool)
  • prometheus_metrics_mode (Literal['instrument_only', 'launch_server', 'launch_multiproc_server'])
  • config_file (str | None)
  • section_name (str | None)
  • pipeline_role (Literal['prefill_and_decode', 'prefill_only', 'decode_only'])
  • max_batch_size (int | None)
  • max_queue_size_tg (int | None)
  • min_batch_size_tg (int | None)
  • ep_size (int)
  • ce_delay_ms (float)
  • enable_prioritize_first_decode (bool)
  • enable_chunked_prefill (bool)
  • enable_in_flight_batching (bool)
  • max_num_steps (int)
  • max_batch_input_tokens (int)
  • zmq_endpoint_base (str)
  • execute_empty_batches (bool)
  • max_batch_total_tokens (int | None)
  • debug_verify_replay (bool)
  • enable_overlap_scheduler (bool)
  • prefer_module_v3 (bool)
  • model (MAXModelConfig)
  • draft_model (MAXModelConfig | None)
  • sampling (SamplingConfig)
  • profiling (ProfilingConfig)
  • lora (LoRAConfig | None)
  • speculative (SpeculativeConfig | None)
  • runtime (PipelineRuntimeConfig)
  • audio_decoder_config (dict[str, Any])

audio_decoder

audio_decoder: str

audio_decoder_config

audio_decoder_config: dict[str, Any]

audio_decoder_weights

audio_decoder_weights: str

block_causal

block_causal: bool

buffer

buffer: int

chunk_size

chunk_size: list[int] | None

from_flags()

classmethod from_flags(audio_flags, **config_flags)

Builds an AudioGenerationConfig from audio CLI flags and config kwargs.

Parameters:

  • audio_flags – Audio-specific CLI flags.
  • **config_flags – Additional pipeline config keyword arguments.

Return type:

AudioGenerationConfig

model_config

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'strict': False}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_post_init()

model_post_init(context, /)

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:

  • self (BaseModel) – The BaseModel instance.
  • context (Any) – The context.

Return type:

None

prepend_prompt_speech_tokens

prepend_prompt_speech_tokens: PrependPromptSpeechTokens

prepend_prompt_speech_tokens_causal

prepend_prompt_speech_tokens_causal: bool

prometheus_metrics_mode

prometheus_metrics_mode: PrometheusMetricsMode

KVCacheConfig

class max.pipelines.lib.config.KVCacheConfig(*, config_file=None, section_name=None, cache_strategy='model_default', kv_cache_page_size=128, enable_prefix_caching=True, enable_kvcache_swapping_to_host=False, device_memory_utilization=0.9, host_kvcache_swap_space_gb=50.0, kv_cache_format=None, disk_offload_dir=None, disk_offload_max_gb=50.0, disk_offload_direct_io=False, lmcache_config_file=None)

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

  • config_file (str | None)
  • section_name (str | None)
  • cache_strategy (Literal['model_default', 'paged'])
  • kv_cache_page_size (int)
  • enable_prefix_caching (bool)
  • enable_kvcache_swapping_to_host (bool)
  • device_memory_utilization (float)
  • host_kvcache_swap_space_gb (float)
  • kv_cache_format (str | None)
  • disk_offload_dir (str | None)
  • disk_offload_max_gb (float)
  • disk_offload_direct_io (bool)
  • lmcache_config_file (str | None)
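To make the memory knobs concrete, the split between device and host budgets can be sketched in plain Python (the formula and function name are illustrative assumptions, not MAX internals):

```python
def kv_cache_budget_bytes(
    total_device_memory_bytes: int,
    device_memory_utilization: float = 0.9,
    host_kvcache_swap_space_gb: float = 50.0,
    enable_kvcache_swapping_to_host: bool = False,
) -> tuple[int, int]:
    """Illustrative split: bytes budgeted on-device for the KV cache,
    plus host swap bytes (zero when swapping is disabled)."""
    device_budget = int(total_device_memory_bytes * device_memory_utilization)
    host_budget = (
        int(host_kvcache_swap_space_gb * 1024**3)
        if enable_kvcache_swapping_to_host
        else 0
    )
    return device_budget, host_budget

# e.g. an 80 GiB device at the default 0.9 utilization, swapping enabled
device_bytes, host_bytes = kv_cache_budget_bytes(80 * 1024**3, 0.9, 50.0, True)
```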

cache_dtype

property cache_dtype: DType

Returns the data type used for KV cache storage.

cache_strategy

cache_strategy: Literal['model_default', 'paged']

device_memory_utilization

device_memory_utilization: float

disk_offload_dir

disk_offload_dir: str | None

disk_offload_direct_io

disk_offload_direct_io: bool

disk_offload_max_gb

disk_offload_max_gb: float

enable_kvcache_swapping_to_host

enable_kvcache_swapping_to_host: bool

enable_prefix_caching

enable_prefix_caching: bool

host_kvcache_swap_space_gb

host_kvcache_swap_space_gb: float

kv_cache_format

kv_cache_format: str | None

kv_cache_page_size

kv_cache_page_size: int

lmcache_config_file

lmcache_config_file: str | None

model_config

model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'strict': False}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_post_init()

model_post_init(context, /)

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:

  • self (BaseModel) – The BaseModel instance.
  • context (Any) – The context.

Return type:

None

to_params()

to_params(dtype, n_kv_heads, head_dim, num_layers, devices, data_parallel_degree=1, is_mla=False, kvcache_quant_config=None)

Return KVCacheParams built from this config.

Parameters:

  • dtype (DType) – Data type for KV cache storage.
  • n_kv_heads (int) – Total number of KV heads across all devices.
  • head_dim (int) – Dimension of each attention head.
  • num_layers (int) – Number of model layers.
  • devices (Sequence[DeviceRef]) – Devices that host the KV cache.
  • data_parallel_degree (int) – Degree of data parallelism.
  • is_mla (bool) – Whether the model uses Multi-Latent Attention.
  • kvcache_quant_config (KVCacheQuantizationConfig | None) – KV cache quantization configuration.

Returns:

The constructed KV cache parameters.

Return type:

KVCacheParams
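to_params gathers the geometry needed to size the cache. As a rough back-of-the-envelope check (standard KV-cache arithmetic, not the library's implementation), the per-token and per-page footprint follows from those same parameters:

```python
def kv_bytes_per_token(
    dtype_size_bytes: int, n_kv_heads: int, head_dim: int, num_layers: int
) -> int:
    # Both keys and values are stored, hence the factor of 2.
    return 2 * num_layers * n_kv_heads * head_dim * dtype_size_bytes

# e.g. a bfloat16 cache (2 bytes), 8 KV heads, head_dim 128, 32 layers,
# with the default kv_cache_page_size of 128 tokens per page
per_token = kv_bytes_per_token(2, 8, 128, 32)
per_page = per_token * 128
```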

LoRAConfig

class max.pipelines.lib.config.LoRAConfig(*, config_file=None, section_name=None, enable_lora=False, lora_paths=<factory>, max_lora_rank=16, max_num_loras=1)

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

  • config_file (str | None)
  • section_name (str | None)
  • enable_lora (bool)
  • lora_paths (list[str])
  • max_lora_rank (int)
  • max_num_loras (int)

enable_lora

enable_lora: bool

lora_paths

lora_paths: list[str]

max_lora_rank

max_lora_rank: int

max_num_loras

max_num_loras: int

model_config

model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'strict': False}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_post_init()

model_post_init(context, /)

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:

  • self (BaseModel) – The BaseModel instance.
  • context (Any) – The context.

Return type:

None

MAXModelConfig

class max.pipelines.lib.config.MAXModelConfig(*, config_file=None, section_name=None, use_subgraphs=True, data_parallel_degree=1, pool_embeddings=True, max_length=None, model_path='', served_model_name=None, weight_path=<factory>, quantization_encoding=None, allow_safetensors_weights_fp32_bf6_bidirectional_cast=False, huggingface_model_revision='main', huggingface_weight_revision='main', trust_remote_code=False, device_specs=<factory>, force_download=False, vision_config_overrides=<factory>, rope_type=None, enable_echo=False, chat_template=None, kv_cache=<factory>)

Initialize config, allowing tests/internal callers to seed PrivateAttrs.

Pydantic PrivateAttrs are not regular model fields, so they are not accepted as constructor kwargs by default. Some tests (and debugging utilities) intentionally seed _huggingface_config to avoid network access and to validate config override plumbing, so this __init__ is defined explicitly to seed the PrivateAttr(s).

Parameters:

  • config_file (str | None)
  • section_name (str | None)
  • use_subgraphs (bool)
  • data_parallel_degree (int)
  • pool_embeddings (bool)
  • max_length (int | None)
  • model_path (str)
  • served_model_name (str | None)
  • weight_path (list[Path])
  • quantization_encoding (Literal['float32', 'bfloat16', 'q4_k', 'q4_0', 'q6_k', 'float8_e4m3fn', 'float4_e2m1fnx2', 'gptq'] | None)
  • allow_safetensors_weights_fp32_bf6_bidirectional_cast (bool)
  • huggingface_model_revision (str)
  • huggingface_weight_revision (str)
  • trust_remote_code (bool)
  • device_specs (list[DeviceSpec])
  • force_download (bool)
  • vision_config_overrides (dict[str, Any])
  • rope_type (Literal['none', 'normal', 'neox', 'longrope', 'yarn'] | None)
  • enable_echo (bool)
  • chat_template (Path | None)
  • kv_cache (KVCacheConfig)

allow_safetensors_weights_fp32_bf6_bidirectional_cast

allow_safetensors_weights_fp32_bf6_bidirectional_cast: bool

chat_template

chat_template: Path | None

create_kv_cache_config()

create_kv_cache_config(**kv_cache_kwargs)

Create and set the KV cache configuration with the given parameters.

This method creates a new KVCacheConfig from the provided keyword arguments and automatically sets the cache_dtype based on the model’s quantization encoding (or any explicit override in kv_cache_kwargs).

Parameters:

**kv_cache_kwargs – Keyword arguments to pass to KVCacheConfig constructor. Common options include:

  • cache_strategy: The KV cache strategy (model_default, paged)
  • kv_cache_page_size: Number of tokens per page for paged cache
  • enable_prefix_caching: Whether to enable prefix caching
  • device_memory_utilization: Fraction of device memory to use
  • cache_dtype: Override for the cache data type

Return type:

None

data_parallel_degree

data_parallel_degree: int

default_device_spec

property default_device_spec: DeviceSpec

Returns the default device spec for the model.

This is the first device spec in the list, used for device spec checks throughout config validation.

Returns:

The default device spec for the model.

device_specs

device_specs: list[DeviceSpec]

diffusers_config

property diffusers_config: dict[str, Any] | None

Retrieve the diffusers config for diffusion pipelines.

Note: For multiprocessing, __getstate__ clears _diffusers_config before pickling. Each worker process will reload the config fresh.

Returns:

The diffusers config dict if this is a diffusion pipeline, None otherwise. The dict will have a structure with “_class_name” and “components” keys, where each component includes “class_name” and “config_dict” fields.

enable_echo

enable_echo: bool

force_download

force_download: bool

generation_config

property generation_config: GenerationConfig

Retrieve the Hugging Face GenerationConfig for this model.

This property lazily loads the GenerationConfig from the model repository and caches it to avoid repeated remote fetches.

Returns:

The GenerationConfig for the model, containing generation parameters like max_length, temperature, top_p, etc. If loading fails, returns a default GenerationConfig.

graph_quantization_encoding

property graph_quantization_encoding: QuantizationEncoding | None

Converts the CLI encoding to a MAX Graph quantization encoding.

Returns:

The graph quantization encoding corresponding to the CLI encoding.

Raises:

ValueError – If no CLI encoding was specified.

huggingface_config

property huggingface_config: AutoConfig | None

Returns the Hugging Face model config (loaded on first access).

huggingface_model_repo

property huggingface_model_repo: HuggingFaceRepo

Returns the Hugging Face repo handle for the model.

huggingface_model_revision

huggingface_model_revision: str

huggingface_weight_repo

property huggingface_weight_repo: HuggingFaceRepo

Returns the Hugging Face repo handle for weight files.

huggingface_weight_repo_id

property huggingface_weight_repo_id: str

Returns the Hugging Face repo ID used for weight files.

huggingface_weight_revision

huggingface_weight_revision: str

kv_cache

kv_cache: KVCacheConfig

max_length

max_length: int | None

model_config

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_name

property model_name: str

Returns the served model name or model path.

model_path

model_path: str

model_post_init()

model_post_init(context, /)

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:

  • self (BaseModel) – The BaseModel instance.
  • context (Any) – The context.

Return type:

None

pool_embeddings

pool_embeddings: bool

quantization_encoding

quantization_encoding: SupportedEncoding | None

resolve()

resolve()

Validates and resolves the config.

This method is called after the model config is initialized, to ensure that all config fields have been initialized to a valid state. It also sets and updates fields that may not be determined or initialized by the default factories.

In order:

  1. Resolve chat_template if it’s a Path
  2. Validate that the device_specs provided are available
  3. Parse the weight path(s) and initialize the _weights_repo_id

Return type:

None

retrieve_chat_template()

retrieve_chat_template()

Returns the chat template string, or None if not set.

Return type:

str | None

rope_type

rope_type: RopeType | None

sampling_params_defaults

property sampling_params_defaults: SamplingParamsGenerationConfigDefaults

Returns sampling defaults derived from the generation config.

served_model_name

served_model_name: str | None

set_cache_dtype_given_quantization_encoding()

set_cache_dtype_given_quantization_encoding()

Determine the KV cache dtype based on quantization encoding configuration.

The dtype is determined in the following priority order:

  1. Explicit override from kv_cache.kv_cache_format (if set)
  2. Derived from the model’s quantization_encoding
  3. Falls back to float32 if no encoding is specified

Returns:

The DType to use for the KV cache. Typical values:

  • DType.float32 for float32, q4_k, q4_0, q6_k encodings
  • DType.bfloat16 for bfloat16, float8_e4m3fn, float4_e2m1fnx2, gptq encodings

Return type:

DType

trust_remote_code

trust_remote_code: bool

use_subgraphs

use_subgraphs: bool

validate_and_resolve_quantization_encoding_weight_path()

validate_and_resolve_quantization_encoding_weight_path(default_encoding)

Verifies that the quantization encoding and weight path are consistent.

Parameters:

  • weight_path – The path to the weight file.
  • default_encoding (Literal['float32', 'bfloat16', 'q4_k', 'q4_0', 'q6_k', 'float8_e4m3fn', 'float4_e2m1fnx2', 'gptq']) – The default encoding to use if no encoding is provided.

Return type:

None

validate_and_resolve_rope_type()

validate_and_resolve_rope_type(arch_rope_type)

Resolves rope_type from architecture default if not set.

Parameters:

arch_rope_type (Literal['none', 'normal', 'neox', 'longrope', 'yarn'])

Return type:

None

validate_and_resolve_with_resolved_quantization_encoding()

validate_and_resolve_with_resolved_quantization_encoding(supported_encodings, default_weights_format)

Validates model path and weight path against resolved quantization encoding.

Also resolves the KV cache strategy and finalizes the encoding config.

Parameters:

  • supported_encodings (dict[Literal['float32', 'bfloat16', 'q4_k', 'q4_0', 'q6_k', 'float8_e4m3fn', 'float4_e2m1fnx2', 'gptq'], list[Literal['model_default', 'paged']]]) – A dictionary of supported encodings and their corresponding KV cache strategies.
  • default_weights_format (WeightsFormat) – The default weights format to use if no weights format is provided.

Return type:

None

validate_lora_compatibility()

validate_lora_compatibility()

Validates that LoRA configuration is compatible with model settings.

Raises:

ValueError – If LoRA is enabled but incompatible with current model configuration.

Return type:

None

validate_max_length()

classmethod validate_max_length(v)

Validate that max_length is non-negative if provided.

Parameters:

v (int | None)

Return type:

int | None

validate_multi_gpu_supported()

validate_multi_gpu_supported(multi_gpu_supported)

Validates that the model architecture supports multi-GPU inference.

Parameters:

multi_gpu_supported (bool) – Whether the model architecture supports multi-GPU inference.

Return type:

None

vision_config_overrides

vision_config_overrides: dict[str, Any]

weight_path

weight_path: list[Path]

weights_size()

weights_size()

Calculates the total size in bytes of all weight files in weight_path.

Attempts to find the weights locally first to avoid network calls, checking in the following order:

  1. If repo_type is "local", it checks if the path in weight_path exists directly as a local file path.
  2. Otherwise, if repo_type is "online", it first checks the local Hugging Face cache using huggingface_hub.try_to_load_from_cache(). If not found in the cache, it falls back to querying the Hugging Face Hub API via HuggingFaceRepo.size_of().

Returns:

The total size of all weight files in bytes.

Raises:

  • FileNotFoundError – If repo_type is "local" and a file specified in weight_path is not found within the local repo directory.
  • ValueError – If HuggingFaceRepo.size_of() fails to retrieve the file size from the Hugging Face Hub API (e.g., file metadata not available or API error).
  • RuntimeError – If the determined repo_type is unexpected.

Return type:

int

MAXModelConfigBase

class max.pipelines.lib.config.MAXModelConfigBase(*, config_file=None, section_name=None)

Abstract base class for all (required) MAX model configs.

This base class configures the model used by a pipeline; it is also handy for sidestepping the need to pass optional fields when subclassing MAXModelConfig.

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

  • config_file (str | None)
  • section_name (str | None)

model_config

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

PipelineConfig

class max.pipelines.lib.config.PipelineConfig(*, config_file=None, section_name=None, pipeline_role='prefill_and_decode', max_batch_size=None, max_queue_size_tg=None, min_batch_size_tg=None, ep_size=1, ce_delay_ms=0.0, enable_prioritize_first_decode=False, enable_chunked_prefill=True, enable_in_flight_batching=False, max_num_steps=-1, max_batch_input_tokens=8192, zmq_endpoint_base=<factory>, execute_empty_batches=False, max_batch_total_tokens=None, debug_verify_replay=False, enable_overlap_scheduler=False, prefer_module_v3=False, model=<factory>, draft_model=None, sampling=<factory>, profiling=<factory>, lora=None, speculative=None, runtime=<factory>)

Configuration for a pipeline.

WIP: once a PipelineConfig is fully initialized, it should be as immutable as possible (frozen=True). All underlying dataclass fields should have been initialized to valid values, whether user-specified via a CLI flag, config file, or environment variable, or internally set to a reasonable default.

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

  • config_file (str | None)
  • section_name (str | None)
  • pipeline_role (Literal['prefill_and_decode', 'prefill_only', 'decode_only'])
  • max_batch_size (int | None)
  • max_queue_size_tg (int | None)
  • min_batch_size_tg (int | None)
  • ep_size (int)
  • ce_delay_ms (float)
  • enable_prioritize_first_decode (bool)
  • enable_chunked_prefill (bool)
  • enable_in_flight_batching (bool)
  • max_num_steps (int)
  • max_batch_input_tokens (int)
  • zmq_endpoint_base (str)
  • execute_empty_batches (bool)
  • max_batch_total_tokens (int | None)
  • debug_verify_replay (bool)
  • enable_overlap_scheduler (bool)
  • prefer_module_v3 (bool)
  • model (MAXModelConfig)
  • draft_model (MAXModelConfig | None)
  • sampling (SamplingConfig)
  • profiling (ProfilingConfig)
  • lora (LoRAConfig | None)
  • speculative (SpeculativeConfig | None)
  • runtime (PipelineRuntimeConfig)

ce_delay_ms

ce_delay_ms: float

configure_session()

configure_session(session)

Configure an InferenceSession with standard pipeline settings.

Parameters:

session (InferenceSession)

Return type:

None

debug_verify_replay

debug_verify_replay: bool

draft_model

draft_model: MAXModelConfig | None

enable_chunked_prefill

enable_chunked_prefill: bool

enable_in_flight_batching

enable_in_flight_batching: bool

enable_overlap_scheduler

enable_overlap_scheduler: bool

enable_prioritize_first_decode

enable_prioritize_first_decode: bool

ep_size

ep_size: int

execute_empty_batches

execute_empty_batches: bool

graph_quantization_encoding

property graph_quantization_encoding: QuantizationEncoding | None

Converts the CLI encoding to a MAX graph quantization encoding.

Returns:

The graph quantization encoding corresponding to the CLI encoding.

log_basic_config()

log_basic_config()

Log minimal pipeline configuration information.

Logs basic PipelineConfig options including model name, pipeline task, weight path, max_batch_size, max_seq_len, and reserved memory.

Return type:

None

log_pipeline_info()

log_pipeline_info()

Logs comprehensive pipeline and KVCache configuration information.

Retrieves all necessary information from self and the PIPELINE_REGISTRY. Raises an error if the architecture is not found (which should not happen after config resolution).

Return type:

None

lora

lora: LoRAConfig | None

max_batch_input_tokens

max_batch_input_tokens: int

max_batch_size

max_batch_size: int | None

max_batch_total_tokens

max_batch_total_tokens: int | None

max_num_steps

max_num_steps: int

max_queue_size_tg

max_queue_size_tg: int | None

min_batch_size_tg

min_batch_size_tg: int | None

model

model: MAXModelConfig

model_config

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'strict': False}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_post_init()

model_post_init(context, /)

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:

  • self (BaseModel) – The BaseModel instance.
  • context (Any) – The context.

Return type:

None

pipeline_role

pipeline_role: PipelineRole

prefer_module_v3

prefer_module_v3: bool

profiling

profiling: ProfilingConfig

resolve()

resolve()

Validates and resolves the config.

Called after the config is initialized to ensure all config fields are in a valid state.

Return type:

None

runtime

runtime: PipelineRuntimeConfig

sampling

sampling: SamplingConfig

speculative

speculative: SpeculativeConfig | None

zmq_endpoint_base

zmq_endpoint_base: str

ProfilingConfig

class max.pipelines.lib.config.ProfilingConfig(*, config_file=None, section_name=None, gpu_profiling='off')

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

  • config_file (str | None)
  • section_name (str | None)
  • gpu_profiling (Literal['off', 'on', 'detailed'])

gpu_profiling

gpu_profiling: GPUProfilingMode

model_config

model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'strict': False}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_post_init()

model_post_init(context, /)

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:

  • self (BaseModel) – The BaseModel instance.
  • context (Any) – The context.

Return type:

None

SpeculativeConfig

class max.pipelines.lib.config.SpeculativeConfig(*, config_file=None, section_name=None, speculative_method=None, num_speculative_tokens=5)

Configuration for speculative decoding.

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

  • config_file (str | None)
  • section_name (str | None)
  • speculative_method (Literal['standalone', 'eagle', 'mtp'] | None)
  • num_speculative_tokens (int)

is_eagle()

is_eagle()

Returns whether the speculative method is EAGLE (shared embedding/lm_head).

Return type:

bool

is_mtp()

is_mtp()

Returns whether the speculative method is MTP.

Return type:

bool

is_standalone()

is_standalone()

Returns whether the speculative method is a standalone model.

Return type:

bool

model_config

model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'strict': False}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_post_init()

model_post_init(context, /)

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:

  • self (BaseModel) – The BaseModel instance.
  • context (Any) – The context.

Return type:

None

num_speculative_tokens

num_speculative_tokens: int

speculative_method

speculative_method: SpeculativeMethod | None

is_float4_encoding()

max.pipelines.lib.config.is_float4_encoding(encoding)

Returns whether the given encoding is a float4 type.

Parameters:

encoding (Literal['float32', 'bfloat16', 'q4_k', 'q4_0', 'q6_k', 'float8_e4m3fn', 'float4_e2m1fnx2', 'gptq'])

Return type:

bool

parse_supported_encoding_from_file_name()

max.pipelines.lib.config.parse_supported_encoding_from_file_name(name)

Infers a SupportedEncoding from a file name string.

Parameters:

name (str)

Return type:

Literal['float32', 'bfloat16', 'q4_k', 'q4_0', 'q6_k', 'float8_e4m3fn', 'float4_e2m1fnx2', 'gptq'] | None

supported_encoding_dtype()

max.pipelines.lib.config.supported_encoding_dtype(encoding)

Returns the underlying model dtype for the given encoding.

Parameters:

encoding (Literal['float32', 'bfloat16', 'q4_k', 'q4_0', 'q6_k', 'float8_e4m3fn', 'float4_e2m1fnx2', 'gptq'])

Return type:

DType

supported_encoding_quantization()

max.pipelines.lib.config.supported_encoding_quantization(encoding)

Returns the QuantizationEncoding for the given encoding.

Parameters:

encoding (Literal['float32', 'bfloat16', 'q4_k', 'q4_0', 'q6_k', 'float8_e4m3fn', 'float4_e2m1fnx2', 'gptq'])

Return type:

QuantizationEncoding | None

supported_encoding_supported_devices()

max.pipelines.lib.config.supported_encoding_supported_devices(encoding)

Returns the devices that the given encoding is supported on.

Parameters:

encoding (Literal['float32', 'bfloat16', 'q4_k', 'q4_0', 'q6_k', 'float8_e4m3fn', 'float4_e2m1fnx2', 'gptq'])

Return type:

tuple[str, …]

supported_encoding_supported_on()

max.pipelines.lib.config.supported_encoding_supported_on(encoding, device_spec)

Returns whether the given encoding is supported on a device.

Parameters:

  • encoding (Literal['float32', 'bfloat16', 'q4_k', 'q4_0', 'q6_k', 'float8_e4m3fn', 'float4_e2m1fnx2', 'gptq'])
  • device_spec (DeviceSpec)

Return type:

bool
