Python module
config
Standardized configuration for Pipeline Inference.
AudioGenerationConfig
class max.pipelines.lib.config.AudioGenerationConfig(audio_config: 'dict[str, str]', **kwargs: 'Any')
audio_decoder
audio_decoder*: str* = ''
The name of the audio decoder model architecture.
audio_decoder_weights
audio_decoder_weights*: str* = ''
The path to the audio decoder weights file.
audio_prompt_speakers
audio_prompt_speakers*: str* = ''
The path to the audio prompt speakers file.
block_causal
block_causal*: bool* = False
Whether prior buffered tokens should attend to tokens in the current block. Has no effect if buffer is not set.
block_sizes
The block sizes to use for streaming. If this is an int, then fixed-size blocks of the given size are used. If this is a list, then variable block sizes are used.
buffer
The number of previous speech tokens to pass to the audio decoder on each generation step.
prepend_prompt_speech_tokens
prepend_prompt_speech_tokens*: PrependPromptSpeechTokens* = 'never'
Whether the prompt speech tokens should be forwarded to the audio decoder. If “never”, the prompt tokens are not forwarded. If “once”, the prompt tokens are only forwarded on the first block. If “always”, the prompt tokens are forwarded on all blocks.
prepend_prompt_speech_tokens_causal
prepend_prompt_speech_tokens_causal*: bool* = False
Whether the prompt speech tokens should attend to tokens in the currently generated audio block. Has no effect if prepend_prompt_speech_tokens is “never”. If False (default), the prompt tokens do not attend to the current block. If True, the prompt tokens attend to the current block.
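A minimal construction sketch follows. The specific keys and values are illustrative assumptions only, since the mapping from audio_config entries to the fields documented above is not spelled out here.

```python
from max.pipelines.lib.config import AudioGenerationConfig

# Illustrative values only; the keys accepted in audio_config are assumed here
# to mirror the field names documented above.
audio_gen_config = AudioGenerationConfig(
    audio_config={
        "audio_decoder": "some-audio-decoder-arch",           # hypothetical architecture name
        "audio_decoder_weights": "/path/to/decoder_weights",  # hypothetical weights path
        "prepend_prompt_speech_tokens": "once",               # 'never' | 'once' | 'always'
    },
)
```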
PipelineConfig
class max.pipelines.lib.config.PipelineConfig(**kwargs)
Configuration for a pipeline.
WIP - Once a PipelineConfig is fully initialized, it should be as immutable as possible (frozen=True). All underlying dataclass fields should have been initialized to their final values, whether specified by the user via a CLI flag, config file, or environment variable, or set internally to a reasonable default.
Parameters:
kwargs (Any)
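A minimal construction sketch, using only fields documented below; the values are illustrative, and a real pipeline would also need model settings that are omitted here.

```python
from max.pipelines.lib.config import PipelineConfig

# Illustrative values only; any field documented below can be passed as a
# keyword argument.
config = PipelineConfig(
    max_batch_size=8,             # raise above the default for server workloads
    max_length=4096,              # maximum sequence length of the model
    enable_chunked_prefill=True,  # split long context encoding requests into chunks
)
```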
custom_architectures
A list of custom architecture implementations to register. Each input can either be a raw module name or an import path followed by a colon and the module name. Ex:
- my_module
- folder/path/to/import:my_module
Each module must expose an ARCHITECTURES list of architectures to register.
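A sketch of both accepted forms, assuming a hypothetical my_module that exposes the required ARCHITECTURES list:

```python
from max.pipelines.lib.config import PipelineConfig

# Hypothetical module layout: my_module (or folder/path/to/import/my_module.py)
# must expose an ARCHITECTURES list of architecture implementations to register.
config = PipelineConfig(
    custom_architectures=[
        "my_module",                        # raw module name importable from the current path
        "folder/path/to/import:my_module",  # import path, a colon, then the module name
    ],
)
```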
draft_model_config
property draft_model_config*: MAXModelConfig | None*
enable_chunked_prefill
enable_chunked_prefill*: bool* = True
Enable chunked prefill to split context encoding requests into multiple chunks based on ‘target_num_new_tokens’.
enable_echo
enable_echo*: bool* = False
Whether the model should be built with echo capabilities.
enable_in_flight_batching
enable_in_flight_batching*: bool* = False
When enabled, prioritizes token generation by batching it with context encoding requests.
engine
engine*: PipelineEngine | None* = None
Engine backend to use for serving: 'max' for the MAX engine, or 'huggingface' as a fallback option for improved model coverage.
graph_quantization_encoding
property graph_quantization_encoding*: QuantizationEncoding | None*
Converts the CLI encoding to a MAX graph quantization encoding.
Returns:
The graph quantization encoding corresponding to the CLI encoding.
help()
static help()
Documentation for this config class. Returns a dictionary of config options and their descriptions.
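For example, to print the available options at runtime (assuming the returned mapping is option name to description, as stated above):

```python
from max.pipelines.lib.config import PipelineConfig

# help() is a static method, so no instance is needed.
for name, description in PipelineConfig.help().items():
    print(f"{name}: {description}")
```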
ignore_eos
ignore_eos*: bool* = False
Ignore EOS and continue generating tokens, even when an EOS token is hit.
max_batch_size
Maximum batch size to execute with the model. This defaults to 1 to minimize memory consumption for the base case of running a local server to test out MAX. When launching in a server scenario, set this value higher based on server capacity.
max_ce_batch_size
max_ce_batch_size*: int* = 192
Maximum cache size to reserve for a single context encoding batch. The actual limit is the lesser of this and max_batch_size.
max_length
Maximum sequence length of the model.
max_new_tokens
max_new_tokens*: int* = -1
Maximum number of new tokens to generate during a single inference pass of the model.
max_num_steps
max_num_steps*: int* = -1
The number of steps to run for multi-step scheduling. -1 specifies a default value based on configuration and platform. Ignored for models which are not auto-regressive (e.g. embedding models).
model_config
property model_config*: MAXModelConfig*
pad_to_multiple_of
pad_to_multiple_of*: int* = 2
Pad input tensors to be a multiple of the provided value.
pdl_level
pdl_level*: str* = '1'
Level of overlap of kernel launch via programmatic dependent grid control.
pipeline_role
pipeline_role*: PipelineRole* = 'prefill_and_decode'
Whether the pipeline should serve a prefill role, a decode role, or both.
pool_embeddings
pool_embeddings*: bool* = True
Whether to pool embedding outputs.
profiling_config
property profiling_config*: ProfilingConfig*
resolve()
resolve()
Validates and resolves the config.
This method is called after the config is initialized, to ensure that all config fields have been initialized to a valid state.
Return type:
None
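A short sketch; whether resolve() must be called explicitly depends on how the config was constructed, since it is normally invoked after initialization:

```python
from max.pipelines.lib.config import PipelineConfig

config = PipelineConfig(max_batch_size=4)  # illustrative kwargs only
config.resolve()                           # validates all fields; returns None
```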
sampling_config
property sampling_config*: SamplingConfig*
target_num_new_tokens
The target number of un-encoded tokens to include in each batch. If not set, this will be set to a best-guess optimal value based on model, hardware, and available memory.
use_experimental_kernels
use_experimental_kernels*: str* = 'false'
PrependPromptSpeechTokens
class max.pipelines.lib.config.PrependPromptSpeechTokens(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
ALWAYS
ALWAYS = 'always'
NEVER
NEVER = 'never'
ONCE
ONCE = 'once'
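These values are the string forms accepted by AudioGenerationConfig.prepend_prompt_speech_tokens. A short usage sketch:

```python
from max.pipelines.lib.config import PrependPromptSpeechTokens

mode = PrependPromptSpeechTokens.ONCE
print(mode.value)  # 'once' -- the string form shown as the field default above
```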