Python module
config
Standardized configuration for Pipeline Inference.
PipelineConfig
class max.pipelines.config.PipelineConfig(**kwargs: Any)
Configuration for a pipeline.
WIP - Once a PipelineConfig is fully initialized, it should be as immutable as possible (frozen=True). All underlying dataclass fields should have been initialized to their default values, be it user specified via some CLI flag, config file, environment variable, or internally set to a reasonable default.
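Fields are supplied as keyword arguments to the constructor. A minimal sketch, assuming the dataclass fields documented below are accepted directly as kwargs and that model selection options (not listed on this page) are left at their defaults:

```python
# Minimal sketch: building a PipelineConfig from keyword arguments.
# Assumes the fields documented on this page are accepted directly as kwargs;
# model selection options are omitted and left at their defaults.
from max.pipelines.config import PipelineConfig

config = PipelineConfig(
    max_batch_size=32,           # raise from the local-testing default for server use
    enable_chunked_prefill=True,
    enable_echo=False,
)

# Validate and finalize all fields before handing the config to a pipeline.
config.resolve()
```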
draft_model
Draft model for use during Speculative Decoding.
enable_chunked_prefill
enable_chunked_prefill: bool = True
Enable chunked prefill to split context encoding requests into multiple chunks based on target_num_new_tokens.
enable_echo
enable_echo: bool = False
Whether the model should be built with echo capabilities.
enable_in_flight_batching
enable_in_flight_batching: bool = False
When enabled, prioritizes token generation by batching it with context encoding requests. Requires chunked prefill.
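Since in-flight batching requires chunked prefill, the two flags are typically enabled together. A sketch using only the fields documented above:

```python
# Sketch: enabling in-flight batching, which requires chunked prefill per the
# description above. All other fields are assumed to keep their defaults.
from max.pipelines.config import PipelineConfig

config = PipelineConfig(
    enable_chunked_prefill=True,     # split context encoding into chunks
    enable_in_flight_batching=True,  # batch token generation with context encoding
)
```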
engine
engine: PipelineEngine | None = None
Engine backend to use for serving: 'max' for the MAX engine, or 'huggingface' as a fallback option for improved model coverage.
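As a sketch, the backend can be pinned explicitly rather than left to fall back. The import path and member name of PipelineEngine are assumptions inferred from the 'max' and 'huggingface' values above; check the installed version.

```python
# Sketch: pinning the serving backend to the MAX engine. The PipelineEngine
# import path and its MAX member name are assumed, not confirmed by this page.
from max.pipelines.config import PipelineConfig, PipelineEngine

config = PipelineConfig(engine=PipelineEngine.MAX)
```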
graph_quantization_encoding
property graph_quantization_encoding: QuantizationEncoding | None
Converts the CLI encoding to a MAX graph quantization encoding.
Returns:
The graph quantization encoding corresponding to the CLI encoding.
help()
Documentation for this config class. Returns a dictionary of config options and their descriptions.
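A short sketch of inspecting the available options via help(); whether it is exposed as a class method or an instance method is an assumption here.

```python
# Sketch: printing every config option and its description. help() is assumed
# to be callable on the class and to return a dict of name -> description.
from max.pipelines.config import PipelineConfig

for option, description in PipelineConfig.help().items():
    print(f"{option}: {description}")
```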
ignore_eos
ignore_eos: bool = False
Ignore EOS and continue generating tokens, even when an EOS token is hit.
max_batch_size
Maximum batch size to execute with the model. This defaults to 1 to minimize memory consumption for the base case of a user running a local server to try out MAX. For server deployments, this value should be set higher based on server capacity.
max_ce_batch_size
max_ce_batch_size: int = 192
Maximum cache size to reserve for a single context encoding batch. The actual limit is the lesser of this and max_batch_size.
max_length
Maximum sequence length of the model.
max_new_tokens
max_new_tokens: int = -1
Maximum number of new tokens to generate during a single inference pass of the model.
max_num_steps
max_num_steps: int = -1
The number of steps to run for multi-step scheduling. -1 specifies a default value based on configuration and platform. Ignored for models which are not auto-regressive (e.g. embedding models).
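Taken together, the batching and length fields above bound how much work a single pipeline invocation can do. A sketch with illustrative values (the specific numbers are assumptions, not recommendations):

```python
# Sketch: bounding batch size, sequence length, and generation length.
# The numeric values are illustrative only; -1 keeps the documented
# "choose a default" behaviour for multi-step scheduling.
from max.pipelines.config import PipelineConfig

config = PipelineConfig(
    max_batch_size=16,     # server-scale batching instead of the base case of 1
    max_ce_batch_size=64,  # cap on a single context encoding batch
    max_length=4096,       # maximum sequence length of the model
    max_new_tokens=256,    # cap tokens produced per inference pass
    max_num_steps=-1,      # let the platform-specific default apply
)
```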
model_config
property model_config: MAXModelConfig
pad_to_multiple_of
pad_to_multiple_of: int = 2
Pad input tensors to be a multiple of the provided value.
pool_embeddings
pool_embeddings: bool = True
Whether to pool embedding outputs.
profiling_config
property profiling_config: ProfilingConfig
resolve()
resolve() → None
Validates and resolves the config.
This method is called after the config is initialized, to ensure that all config fields have been initialized to a valid state.
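A sketch of the intended construct-then-resolve flow; the exact exception raised for an invalid combination is not documented on this page and is left unspecified.

```python
# Sketch: resolve() is called after construction to validate fields and fill
# in derived defaults. An invalid combination is assumed to raise an exception
# whose exact type is not documented here.
from max.pipelines.config import PipelineConfig

config = PipelineConfig(
    enable_in_flight_batching=True,
    enable_chunked_prefill=True,  # required by in-flight batching
)
config.resolve()  # validate and finalize before use
```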
rope_type
rope_type: RopeType | None = None
Force using a specific rope type: none | normal | neox. Only matters for GGUF weights.
sampling_config
property sampling_config: SamplingConfig
target_num_new_tokens
The target number of un-encoded tokens to include in each batch. If not set, this will be set to a best-guess optimal value based on model, hardware, and available memory.
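When the best-guess default is not appropriate, the chunk target can be set explicitly alongside chunked prefill. A sketch with an assumed, purely illustrative value:

```python
# Sketch: overriding the best-guess chunk target used by chunked prefill.
# 8192 is an illustrative value, not a recommendation.
from max.pipelines.config import PipelineConfig

config = PipelineConfig(
    enable_chunked_prefill=True,
    target_num_new_tokens=8192,  # un-encoded tokens to pack into each batch
)
```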
use_experimental_kernels
use_experimental_kernels: str = 'false'