Python module

config

Standardized configuration for Pipeline Inference.

AudioGenerationConfig

class max.pipelines.lib.config.AudioGenerationConfig(audio_decoder, audio_decoder_weights='', chunk_size=None, buffer=0, block_causal=False, prepend_prompt_speech_tokens=PrependPromptSpeechTokens.NEVER, prepend_prompt_speech_tokens_causal=False, run_model_test_mode=False, prometheus_metrics_mode=PrometheusMetricsMode.INSTRUMENT_ONLY, *, config_file=None, section_name=None, max_length=None, pipeline_role=PipelineRole.PrefillAndDecode, max_batch_size=None, max_queue_size_tg=None, min_batch_size_tg=None, ep_size=1, ce_delay_ms=0.0, enable_prioritize_first_decode=False, enable_chunked_prefill=True, enable_in_flight_batching=False, max_num_steps=-1, max_batch_input_tokens=8192, enable_echo=False, pool_embeddings=True, chat_template=None, use_experimental_kernels='false', use_vendor_blas='false', pdl_level='0', custom_architectures=<factory>, zmq_endpoint_base=<factory>, execute_empty_batches=False, max_batch_total_tokens=None, force=False, kvcache_ce_watermark=0.95, enable_overlap_scheduler=False, use_legacy_module=True, defer_resolve=False, model=<factory>, draft_model=None, sampling=<factory>, profiling=<factory>, lora=None, speculative=None, audio_decoder_config=<factory>)

Create a new model by parsing and validating input data from keyword arguments.

Raises a pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

  • audio_decoder (str)
  • audio_decoder_weights (str)
  • chunk_size (list[int] | None)
  • buffer (int)
  • block_causal (bool)
  • prepend_prompt_speech_tokens (PrependPromptSpeechTokens)
  • prepend_prompt_speech_tokens_causal (bool)
  • run_model_test_mode (bool)
  • prometheus_metrics_mode (PrometheusMetricsMode)
  • config_file (str | None)
  • section_name (str | None)
  • max_length (int | None)
  • pipeline_role (PipelineRole)
  • max_batch_size (int | None)
  • max_queue_size_tg (int | None)
  • min_batch_size_tg (int | None)
  • ep_size (int)
  • ce_delay_ms (float)
  • enable_prioritize_first_decode (bool)
  • enable_chunked_prefill (bool)
  • enable_in_flight_batching (bool)
  • max_num_steps (int)
  • max_batch_input_tokens (int)
  • enable_echo (bool)
  • pool_embeddings (bool)
  • chat_template (Path | None)
  • use_experimental_kernels (str)
  • use_vendor_blas (str)
  • pdl_level (str)
  • custom_architectures (list[str])
  • zmq_endpoint_base (str)
  • execute_empty_batches (bool)
  • max_batch_total_tokens (int | None)
  • force (bool)
  • kvcache_ce_watermark (float)
  • enable_overlap_scheduler (bool)
  • use_legacy_module (bool)
  • defer_resolve (bool)
  • model (MAXModelConfig)
  • draft_model (MAXModelConfig | None)
  • sampling (SamplingConfig)
  • profiling (ProfilingConfig)
  • lora (LoRAConfig | None)
  • speculative (SpeculativeConfig | None)
  • audio_decoder_config (dict[str, Any])

audio_decoder

audio_decoder: str

audio_decoder_config

audio_decoder_config: dict[str, Any]

audio_decoder_weights

audio_decoder_weights: str

block_causal

block_causal: bool

buffer

buffer: int

chunk_size

chunk_size: list[int] | None

from_flags()

classmethod from_flags(audio_flags, **config_flags)

Parameters:

  • audio_flags
  • **config_flags

Return type:

AudioGenerationConfig

model_config

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; it should be a dictionary conforming to pydantic.config.ConfigDict.

model_post_init()

model_post_init(context, /)

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:

  • self (BaseModel) – The BaseModel instance.
  • context (Any) – The context.

Return type:

None

prepend_prompt_speech_tokens

prepend_prompt_speech_tokens: PrependPromptSpeechTokens

prepend_prompt_speech_tokens_causal

prepend_prompt_speech_tokens_causal: bool

prometheus_metrics_mode

prometheus_metrics_mode: PrometheusMetricsMode

PipelineConfig

class max.pipelines.lib.config.PipelineConfig(*, config_file=None, section_name=None, max_length=None, pipeline_role=PipelineRole.PrefillAndDecode, max_batch_size=None, max_queue_size_tg=None, min_batch_size_tg=None, ep_size=1, ce_delay_ms=0.0, enable_prioritize_first_decode=False, enable_chunked_prefill=True, enable_in_flight_batching=False, max_num_steps=-1, max_batch_input_tokens=8192, enable_echo=False, pool_embeddings=True, chat_template=None, use_experimental_kernels='false', use_vendor_blas='false', pdl_level='0', custom_architectures=<factory>, zmq_endpoint_base=<factory>, execute_empty_batches=False, max_batch_total_tokens=None, force=False, kvcache_ce_watermark=0.95, enable_overlap_scheduler=False, use_legacy_module=True, defer_resolve=False, model=<factory>, draft_model=None, sampling=<factory>, profiling=<factory>, lora=None, speculative=None)

Configuration for a pipeline.

WIP: Once a PipelineConfig is fully initialized, it should be as immutable as possible (frozen=True). All underlying dataclass fields should have been initialized to their default values, whether user-specified via a CLI flag, config file, or environment variable, or set internally to a reasonable default.

Create a new model by parsing and validating input data from keyword arguments.

Raises a pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

  • config_file (str | None)
  • section_name (str | None)
  • max_length (int | None)
  • pipeline_role (PipelineRole)
  • max_batch_size (int | None)
  • max_queue_size_tg (int | None)
  • min_batch_size_tg (int | None)
  • ep_size (int)
  • ce_delay_ms (float)
  • enable_prioritize_first_decode (bool)
  • enable_chunked_prefill (bool)
  • enable_in_flight_batching (bool)
  • max_num_steps (int)
  • max_batch_input_tokens (int)
  • enable_echo (bool)
  • pool_embeddings (bool)
  • chat_template (Path | None)
  • use_experimental_kernels (str)
  • use_vendor_blas (str)
  • pdl_level (str)
  • custom_architectures (list[str])
  • zmq_endpoint_base (str)
  • execute_empty_batches (bool)
  • max_batch_total_tokens (int | None)
  • force (bool)
  • kvcache_ce_watermark (float)
  • enable_overlap_scheduler (bool)
  • use_legacy_module (bool)
  • defer_resolve (bool)
  • model (MAXModelConfig)
  • draft_model (MAXModelConfig | None)
  • sampling (SamplingConfig)
  • profiling (ProfilingConfig)
  • lora (LoRAConfig | None)
  • speculative (SpeculativeConfig | None)

ce_delay_ms

ce_delay_ms: float

chat_template

chat_template: Path | None

configure_session()

configure_session(session)

Configure an InferenceSession with standard pipeline settings.

Parameters:

session (InferenceSession)

Return type:

None

custom_architectures

custom_architectures: list[str]

defer_resolve

defer_resolve: bool

draft_model

draft_model: MAXModelConfig | None

enable_chunked_prefill

enable_chunked_prefill: bool

enable_echo

enable_echo: bool

enable_in_flight_batching

enable_in_flight_batching: bool

enable_overlap_scheduler

enable_overlap_scheduler: bool

enable_prioritize_first_decode

enable_prioritize_first_decode: bool

ep_size

ep_size: int

execute_empty_batches

execute_empty_batches: bool

force

force: bool

graph_quantization_encoding

property graph_quantization_encoding: QuantizationEncoding | None

Converts the CLI encoding to a MAX graph quantization encoding.

Returns:

The graph quantization encoding corresponding to the CLI encoding.

kvcache_ce_watermark

kvcache_ce_watermark: float

log_basic_config()

log_basic_config()

Log minimal pipeline configuration information.

Logs basic PipelineConfig options including model name, pipeline task, weight path, max_batch_size, max_seq_len, and reserved memory.

Return type:

None

log_pipeline_info()

log_pipeline_info()

Log comprehensive pipeline and KVCache configuration information.

Retrieves all necessary information from self and the PIPELINE_REGISTRY. Raises an error if the architecture is not found (which should not happen after config resolution).

Return type:

None

lora

lora: LoRAConfig | None

max_batch_input_tokens

max_batch_input_tokens: int

max_batch_size

max_batch_size: int | None

max_batch_total_tokens

max_batch_total_tokens: int | None

max_length

max_length: int | None

max_num_steps

max_num_steps: int

max_queue_size_tg

max_queue_size_tg: int | None

min_batch_size_tg

min_batch_size_tg: int | None

model

model: MAXModelConfig

model_config

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; it should be a dictionary conforming to pydantic.config.ConfigDict.

model_post_init()

model_post_init(context, /)

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:

  • self (BaseModel) – The BaseModel instance.
  • context (Any) – The context.

Return type:

None

pdl_level

pdl_level: str

pipeline_role

pipeline_role: PipelineRole

pool_embeddings

pool_embeddings: bool

profiling

profiling: ProfilingConfig

resolve()

resolve()

Validates and resolves the config.

This method is called after the config is initialized, to ensure that all config fields have been initialized to a valid state.

Return type:

None

retrieve_chat_template()

retrieve_chat_template()

Return type:

str | None

sampling

sampling: SamplingConfig

speculative

speculative: SpeculativeConfig | None

use_experimental_kernels

use_experimental_kernels: str

use_legacy_module

use_legacy_module: bool

use_vendor_blas

use_vendor_blas: str

zmq_endpoint_base

zmq_endpoint_base: str

PrependPromptSpeechTokens

class max.pipelines.lib.config.PrependPromptSpeechTokens(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

NEVER

NEVER = 'never'

Never prepend the prompt speech tokens sent to the audio decoder.

ONCE

ONCE = 'once'

Prepend the prompt speech tokens to the first block of the audio decoder.

ROLLING

ROLLING = 'rolling'

Prepend the prompt speech tokens to the first block of the audio decoder, and to later blocks to reach the requested buffer size.
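The three modes form a string-valued enum, so the string from a flag or config file maps back to a member by value. A stdlib sketch mirroring the documented values (not an import of the real class):

```python
from enum import Enum

# Stdlib mirror of the documented PrependPromptSpeechTokens values.
class PrependPromptSpeechTokens(str, Enum):
    NEVER = "never"      # never prepend prompt speech tokens
    ONCE = "once"        # prepend only to the first decoder block
    ROLLING = "rolling"  # prepend to the first block, and to later
                         # blocks to reach the requested buffer size

# String values round-trip through lookup-by-value.
mode = PrependPromptSpeechTokens("rolling")
print(mode is PrependPromptSpeechTokens.ROLLING)  # True
```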

PrometheusMetricsMode

class max.pipelines.lib.config.PrometheusMetricsMode(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

INSTRUMENT_ONLY

INSTRUMENT_ONLY = 'instrument_only'

Instrument metrics through the Prometheus client library, relying on the application to handle the metrics server.

LAUNCH_MULTIPROC_SERVER

LAUNCH_MULTIPROC_SERVER = 'launch_multiproc_server'

Launch a Prometheus server in multiprocess mode to report metrics.

LAUNCH_SERVER

LAUNCH_SERVER = 'launch_server'

Launch a Prometheus server to handle metrics requests.
