Python module

config

Standardized configuration for Pipeline Inference.

PipelineConfig

class max.pipelines.config.PipelineConfig(**kwargs: Any)

Configuration for a pipeline.

WIP - Once a PipelineConfig is fully initialized, it should be as immutable as possible (frozen=True). All underlying dataclass fields should have been initialized to their default values, be it user specified via some CLI flag, config file, environment variable, or internally set to a reasonable default.
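
A hedged sketch of typical usage follows; the field names and the resolve() method are documented on this page, the import path follows the module name above, and the specific values are illustrative placeholders rather than recommended settings.

```python
# Illustrative sketch only: field names and resolve() come from this page;
# the chosen values are placeholders, not a prescribed configuration.
from max.pipelines.config import PipelineConfig

config = PipelineConfig(
    max_batch_size=32,            # default is None; see max_batch_size below
    max_length=4096,              # maximum sequence length of the model
    enable_chunked_prefill=True,  # already the default
)
config.resolve()  # validates and resolves all fields (see resolve() below)
```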

draft_model

draft_model: str | None = None

Draft model for use during Speculative Decoding.

enable_chunked_prefill

enable_chunked_prefill: bool = True

Enable chunked prefill to split context encoding requests into multiple chunks based on target_num_new_tokens.

enable_echo

enable_echo: bool = False

Whether the model should be built with echo capabilities.

enable_in_flight_batching

enable_in_flight_batching: bool = False

When enabled, prioritizes token generation by batching it with context encoding requests. Requires chunked prefill.
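
For example, a minimal sketch that turns on in-flight batching together with its chunked prefill prerequisite (the construction style and values are assumptions, not a prescribed configuration):

```python
# Sketch: enable_in_flight_batching requires chunked prefill, so both
# flags are set together here.
config = PipelineConfig(
    enable_chunked_prefill=True,     # default, shown for clarity
    enable_in_flight_batching=True,  # batch token generation with CE requests
)
```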

engine

engine: PipelineEngine | None = None

Engine backend to use for serving: 'max' for the MAX engine, or 'huggingface' as a fallback option for improved model coverage.

graph_quantization_encoding

property graph_quantization_encoding: QuantizationEncoding | None

Converts the CLI encoding to a MAX graph quantization encoding.

  • Returns:

    The graph quantization encoding corresponding to the CLI encoding.

help()

static help() → dict[str, str]

Documentation for this config class. Returns a dictionary of config options and their descriptions.
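
A minimal sketch of using help() to list the available options (the output formatting here is an assumption):

```python
# Print each documented config option alongside its description.
for option, description in PipelineConfig.help().items():
    print(f"{option}: {description}")
```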

ignore_eos

ignore_eos: bool = False

Ignore EOS and continue generating tokens, even when an EOS token is hit.

max_batch_size

max_batch_size: int | None = None

Maximum batch size to execute with the model. This is set to 1 by default to minimize memory consumption for the base case, in which a user runs a local server to try out MAX. When deploying in a server scenario, this value should be set higher based on server capacity.

max_ce_batch_size

max_ce_batch_size: int = 192

Maximum cache size to reserve for a single context encoding batch. The actual limit is the lesser of this and max_batch_size.
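
A sketch of a server-oriented configuration that raises both batch limits (the numbers are placeholders; as described above, the effective context encoding batch is the lesser of the two fields):

```python
# Placeholder capacity numbers; the effective CE batch limit is
# min(max_ce_batch_size, max_batch_size).
config = PipelineConfig(
    max_batch_size=64,
    max_ce_batch_size=128,
)
```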

max_length

max_length: int | None = None

Maximum sequence length of the model.

max_new_tokens

max_new_tokens: int = -1

Maximum number of new tokens to generate during a single inference pass of the model.

max_num_steps

max_num_steps: int = -1

The number of steps to run for multi-step scheduling. -1 specifies a default value based on configuration and platform. Ignored for models which are not auto-regressive (e.g. embedding models).

model_config

property model_config: MAXModelConfig

pad_to_multiple_of

pad_to_multiple_of: int = 2

Pad input tensors to be a multiple of the value provided.

pool_embeddings

pool_embeddings: bool = True

Whether to pool embedding outputs.

profiling_config

property profiling_config: ProfilingConfig

resolve()

resolve() → None

Validates and resolves the config.

This method is called after the config is initialized, to ensure that all config fields have been initialized to a valid state.

rope_type

rope_type: RopeType | None = None

Force using a specific rope type: none | normal | neox. Only matters for GGUF weights.

sampling_config

property sampling_config: SamplingConfig

target_num_new_tokens

target_num_new_tokens: int | None = None

The target number of un-encoded tokens to include in each batch. If not set, this will be set to a best-guess optimal value based on model, hardware, and available memory.
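
For instance, a sketch that pins this target instead of relying on the best-guess default (the specific number is an assumption):

```python
# Cap each chunked-prefill batch at 512 un-encoded tokens rather than
# letting the value be auto-selected from model, hardware, and memory.
config = PipelineConfig(
    enable_chunked_prefill=True,  # default; chunking is what consumes this target
    target_num_new_tokens=512,
)
```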

use_experimental_kernels

use_experimental_kernels: str = 'false'