Python module

config

Standardized config for Pipeline Inference.

HuggingFaceRepo

class max.pipelines.config.HuggingFaceRepo(repo_id: 'str', trust_remote_code: 'bool' = False, repo_type: 'Optional[RepoType]' = None)

download()

download(filename: str, force_download: bool = False) → Path

encoding_for_file()

encoding_for_file(file: str | Path) → SupportedEncoding

file_exists()

file_exists(filename: str) → bool

files_for_encoding()

files_for_encoding(encoding: SupportedEncoding, weights_format: WeightsFormat | None = None, alternate_encoding: SupportedEncoding | None = None) → dict[max.pipelines.config.WeightsFormat, list[pathlib.Path]]

formats_available

property formats_available: list[max.pipelines.config.WeightsFormat]

info

property info: ModelInfo

repo_id

repo_id: str

repo_type

repo_type: RepoType | None = None

size_of()

size_of(filename: str) → int | None

supported_encodings

property supported_encodings: list[max.pipelines.config.SupportedEncoding]

trust_remote_code

trust_remote_code: bool = False

weight_files

property weight_files: dict[max.pipelines.config.WeightsFormat, list[str]]
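
A minimal usage sketch of the members above; the repo ID is illustrative, and an online repository is assumed:

```python
from max.pipelines.config import HuggingFaceRepo, SupportedEncoding

# Point at a Hugging Face repository (the repo ID here is only illustrative).
repo = HuggingFaceRepo(repo_id="modularai/llama-3.1", trust_remote_code=False)

# Inspect what the repository offers before downloading anything.
print(repo.formats_available)    # e.g. [WeightsFormat.safetensors, ...]
print(repo.supported_encodings)  # e.g. [SupportedEncoding.bfloat16, ...]

# Check for a specific file and fetch it into the local cache.
if repo.file_exists("config.json"):
    local_path = repo.download("config.json")
    print(local_path, repo.size_of("config.json"))

# Map an encoding to the weight files that provide it.
for fmt, paths in repo.files_for_encoding(SupportedEncoding.bfloat16).items():
    print(fmt, paths)
```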

PipelineConfig

class max.pipelines.config.PipelineConfig(model_path: 'str' = '', huggingface_repo_id: 'str' = '', engine: 'Optional[PipelineEngine]' = None, architecture: 'Optional[str]' = None, weight_path: 'list[Path]' = <factory>, device_specs: 'list[DeviceSpec]' = <factory>, quantization_encoding: 'Optional[SupportedEncoding]' = None, serialized_model_path: 'Optional[str]' = None, save_to_serialized_model_path: 'Optional[str]' = None, max_length: 'Optional[int]' = None, max_new_tokens: 'int' = -1, max_batch_size: 'Optional[int]' = None, max_ce_batch_size: 'int' = 32, enable_chunked_prefill: 'bool' = True, enable_in_flight_batching: 'bool' = False, cache_strategy: 'KVCacheStrategy' = model_default, max_num_steps: 'int' = -1, pad_to_multiple_of: 'int' = 2, kv_cache_page_size: 'int' = 128, enable_prefix_caching: 'bool' = False, device_memory_utilization: 'float' = 0.9, target_num_new_tokens: 'Optional[int]' = None, top_k: 'int' = 1, enable_structured_output: 'bool' = False, trust_remote_code: 'bool' = False, force_download: 'bool' = False, enable_echo: 'bool' = False, rope_type: 'Optional[RopeType]' = None, pool_embeddings: 'bool' = True, _huggingface_config: 'Optional[AutoConfig]' = None, _devices: 'list[Device]' = <factory>, _weights_converter: 'Optional[type[WeightsConverter]]' = None, _weights_repo_id: 'Optional[str]' = None, _available_cache_memory: 'Optional[int]' = None, _quant_config: 'Optional[QuantizationConfig]' = None, max_cache_batch_size: 'Optional[int]' = None, gpu_profiling: 'str' = 'false', use_experimental_kernels: 'str' = 'false')
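
A construction sketch using keyword arguments from the signature above; the model path is illustrative:

```python
from max.pipelines.config import PipelineConfig, SupportedEncoding

# Build a config programmatically. Unlisted fields keep their defaults.
config = PipelineConfig(
    model_path="modularai/llama-3.1",             # illustrative repo ID
    quantization_encoding=SupportedEncoding.bfloat16,
    max_batch_size=8,
    max_length=4096,
    top_k=1,                                      # greedy sampling
    enable_chunked_prefill=True,
)
```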

architecture

architecture: str | None = None

Model architecture to run.

cache_dtype

property cache_dtype: DType

cache_strategy

cache_strategy: KVCacheStrategy = 'model_default'

The cache strategy to use. This defaults to model_default, which will set the cache strategy based on the default strategy for the architecture requested.

You can also force the engine to use a specific caching strategy: naive | continuous | paged.

device_memory_utilization

device_memory_utilization: float = 0.9

The fraction of available device memory that the process should consume.

This is used to inform the size of the KVCache workspace: kv_cache_workspace = (total_free_memory * device_memory_utilization) - model_weights_size
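
For a rough sense of the arithmetic, with assumed numbers (24 GiB free device memory, 16 GiB of weights, the default utilization of 0.9):

```python
# Illustrative numbers only.
total_free_memory = 24 * 1024**3          # 24 GiB free on the device
model_weights_size = 16 * 1024**3         # 16 GiB of weights
device_memory_utilization = 0.9           # the default

kv_cache_workspace = int(total_free_memory * device_memory_utilization) - model_weights_size
print(kv_cache_workspace / 1024**3)       # ~5.6 GiB left for the KVCache
```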

device_specs

device_specs: list[max.driver.driver.DeviceSpec]

Devices to run inference upon. This option is not documented in help() as it shouldn’t be used directly via the CLI entrypoint.

devices

property devices: list[max._core.driver.Device]

Initialize and return a list of devices, given a list of device specs.

download_weights()

download_weights() → None

dtype

property dtype: DType

enable_chunked_prefill

enable_chunked_prefill: bool = True

Enable chunked prefill to split context encoding requests into multiple chunks based on ‘target_num_new_tokens’.

enable_echo

enable_echo: bool = False

Whether the model should be built with echo capabilities.

enable_in_flight_batching

enable_in_flight_batching: bool = False

When enabled, prioritizes token generation by batching it with context encoding requests. Requires chunked prefill.

enable_prefix_caching

enable_prefix_caching: bool = False

Whether to enable prefix caching for the paged attention KVCache.

enable_structured_output

enable_structured_output: bool = False

Enable structured generation/guided decoding for the server. This allows the user to pass a JSON schema in the response_format field, which the LLM will adhere to.
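
As a hedged sketch, a client payload might look like the following, assuming an OpenAI-style response_format field; the exact schema accepted by the server may differ:

```python
# Assumed request shape when enable_structured_output is on; the model name
# and the response_format contract here are illustrative, not definitive.
payload = {
    "model": "modularai/llama-3.1",
    "messages": [{"role": "user", "content": "Name a city and its country."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
            }
        },
    },
}
```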

engine

engine: PipelineEngine | None = None

Engine backend to use for serving: 'max' for the MAX engine, or 'huggingface' as a fallback option for improved model coverage.

finalize_encoding_config()

finalize_encoding_config()

Depending on the encoding picked, retrieves additional parameters from the Hugging Face config.

force_download

force_download: bool = False

Whether to force download a given file if it’s not already present in the local cache.

gpu_profiling

gpu_profiling: str = 'false'

Whether to enable GPU profiling of the model.

graph_quantization_encoding

property graph_quantization_encoding: QuantizationEncoding | None

Converts the CLI encoding to a MAX graph quantization encoding.

  • Returns:

    The graph quantization encoding corresponding to the CLI encoding.

  • Raises:

    ValueError – If no CLI encoding was specified.
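
A small sketch of guarding against the unset case:

```python
from max.pipelines.config import PipelineConfig

config = PipelineConfig(model_path="modularai/llama-3.1")  # illustrative repo ID

try:
    encoding = config.graph_quantization_encoding
except ValueError:
    # No CLI encoding was specified, so there is nothing to convert.
    encoding = None
```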

help()

static help() → dict[str, str]

huggingface_config

property huggingface_config: AutoConfig

Given the model_path, return the Hugging Face Config.

huggingface_repo_id

huggingface_repo_id: str = ''

DEPRECATED: repo_id of a Hugging Face model repository to use. Use model_path instead.

huggingface_weights_repo()

huggingface_weights_repo() → HuggingFaceRepo

kv_cache_page_size

kv_cache_page_size: int = 128

The number of tokens in a single page in the paged KVCache.

load_weights()

load_weights() → Weights
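
A sketch of the download-then-load flow (repo ID illustrative):

```python
from max.pipelines.config import PipelineConfig

config = PipelineConfig(model_path="modularai/llama-3.1")  # illustrative repo ID

config.download_weights()        # fetch the weight files into the local cache
weights = config.load_weights()  # returns a Weights object for graph construction
```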

max_batch_size

max_batch_size: int | None = None

Maximum batch size to execute with the model. This is set to 1 to minimize memory consumption for the base case of running a local server to try out MAX. For server deployments, this value should be set higher based on server capacity.

max_cache_batch_size

max_cache_batch_size: int | None = None

DEPRECATED: The maximum cache batch size to use for the model. Use max_batch_size instead.

max_ce_batch_size

max_ce_batch_size: int = 32

Maximum cache size to reserve for a single context encoding batch. The actual limit is the lesser of this and max_batch_size.

max_length

max_length: int | None = None

Maximum sequence length of the model.

max_new_tokens

max_new_tokens: int = -1

Maximum number of new tokens to generate during a single inference pass of the model.

max_num_steps

max_num_steps: int = -1

The number of steps to run for multi-step scheduling. -1 specifies a default value based on configuration and platform. Ignored for models which are not auto-regressive (e.g. embedding models).

model_path

model_path: str = ''

repo_id of a Hugging Face model repository to use.

pad_to_multiple_of

pad_to_multiple_of: int = 2

Pad input tensors to a multiple of the provided value.

pool_embeddings

pool_embeddings: bool = True

Whether to pool embedding outputs.

quantization_encoding

quantization_encoding: SupportedEncoding | None = None

Weight encoding type.

rope_type

rope_type: RopeType | None = None

Force using a specific rope type: none | normal | neox. Only matters for GGUF weights.

sampling_params

property sampling_params: SamplingParams

save_to_serialized_model_path

save_to_serialized_model_path: str | None = None

If specified, tries to save a serialized model to this path.

serialized_model_path

serialized_model_path: str | None = None

If specified, tries to load a serialized model from this path.

target_num_new_tokens

target_num_new_tokens: int | None = None

The target number of un-encoded tokens to include in each batch. If not set, this will be set to a best-guess optimal value based on model, hardware, and available memory.

top_k

top_k: int = 1

Limits the sampling to the K most probable tokens. This defaults to 1, which enables greedy sampling.

trust_remote_code

trust_remote_code: bool = False

Whether to allow custom modeling files from Hugging Face.

update_architecture()

update_architecture() → None

use_experimental_kernels

use_experimental_kernels: str = 'false'

weight_path

weight_path: list[pathlib.Path]

Optional path or URL of the model weights to use.

weights_format

property weights_format: WeightsFormat

Identify which format the weights are expected in.

weights_size()

weights_size() → int
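
A sketch combining weights_format and weights_size(); the repo ID is illustrative, and weights_size() is assumed to report the combined size of the weight files in bytes:

```python
from max.pipelines.config import PipelineConfig, WeightsFormat

config = PipelineConfig(model_path="modularai/llama-3.1")  # illustrative repo ID

if config.weights_format is WeightsFormat.gguf:
    print("Expecting GGUF weight files.")

print(config.weights_size())  # assumed: total bytes across the weight files
```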

PipelineEngine

class max.pipelines.config.PipelineEngine(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

HUGGINGFACE

HUGGINGFACE = 'huggingface'

MAX

MAX = 'max'

RepoType

class max.pipelines.config.RepoType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

local

local = 'local'

online

online = 'online'

RopeType

class max.pipelines.config.RopeType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

neox

neox = 'neox'

none

none = 'none'

normal

normal = 'normal'

SamplingParams

class max.pipelines.config.SamplingParams(top_k: 'int', enable_structured_output: 'bool', in_dtype: 'DType', out_dtype: 'DType')

enable_structured_output

enable_structured_output: bool

in_dtype

in_dtype: DType

out_dtype

out_dtype: DType

top_k

top_k: int
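
A minimal construction sketch, assuming DType is importable from max.dtype:

```python
from max.dtype import DType  # assumed import path for DType
from max.pipelines.config import SamplingParams

# Greedy decoding with float32 logits in and out.
params = SamplingParams(
    top_k=1,
    enable_structured_output=False,
    in_dtype=DType.float32,
    out_dtype=DType.float32,
)
```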

SupportedEncoding

class max.pipelines.config.SupportedEncoding(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

All possible encodings which may be supported by a particular model.

bfloat16

bfloat16 = 'bfloat16'

cache_dtype

property cache_dtype: DType

The dtype that must be used in the kvcache for correctness.

dtype

property dtype: DType

The underlying model dtype associated with a quantization_encoding.

float32

float32 = 'float32'

gptq

gptq = 'gptq'

parse_from_file_name()

classmethod parse_from_file_name(name: str)

q4_0

q4_0 = 'q4_0'

q4_k

q4_k = 'q4_k'

q6_k

q6_k = 'q6_k'

quantization_encoding

property quantization_encoding: QuantizationEncoding | None

supported_on()

supported_on(device_spec: DeviceSpec) → bool

Returns whether this quantization encoding is supported on a device.
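
A compatibility-check sketch, assuming DeviceSpec is importable from max.driver and exposes cpu()/accelerator() helper constructors:

```python
from max.driver import DeviceSpec  # assumed import, per the DeviceSpec annotations above
from max.pipelines.config import SupportedEncoding

encoding = SupportedEncoding.bfloat16

# cpu()/accelerator() are assumed helpers; adjust to however DeviceSpec
# instances are created in your MAX version.
for spec in (DeviceSpec.cpu(), DeviceSpec.accelerator(id=0)):
    print(spec, encoding.supported_on(spec))

# Related dtype information derived from the encoding.
print(encoding.dtype, encoding.cache_dtype, encoding.quantization_encoding)
```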

WeightsFormat

class max.pipelines.config.WeightsFormat(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

gguf

gguf = 'gguf'

pytorch

pytorch = 'pytorch'

safetensors

safetensors = 'safetensors'