Python module

registry

Model registry, for tracking various model variants.

PipelineRegistry

class max.pipelines.lib.registry.PipelineRegistry(architectures)

Registry for managing supported model architectures and their pipelines.

This class maintains a collection of SupportedArchitecture instances, each defining how a particular model architecture should be loaded, configured, and executed.

Use PIPELINE_REGISTRY when you want to:

  • Register new MAX model architectures with register().
  • Look up the SupportedArchitecture that matches a model repository.
  • Retrieve tokenizers and instantiated pipelines for a PipelineConfig.

Parameters:

architectures (list[SupportedArchitecture]) – The architectures to register when the registry is constructed.

get_active_diffusers_config()

get_active_diffusers_config(huggingface_repo)

Retrieves or creates a cached diffusers config for the given repository.

This method checks whether the repository is a diffusion pipeline by looking for model_index.json. If found, it downloads and caches the config; otherwise it returns None.

Parameters:

huggingface_repo (HuggingFaceRepo) – The HuggingFaceRepo containing the model.

Returns:

The diffusers config dict if this is a diffusion pipeline, None otherwise.

Return type:

dict | None
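For illustration, a minimal sketch of distinguishing diffusion pipelines from other models; the HuggingFaceRepo import path and constructor shown here are assumptions, not part of this page:

from max.pipelines.lib.registry import PIPELINE_REGISTRY
from max.pipelines.lib import HuggingFaceRepo  # import path assumed

repo = HuggingFaceRepo(repo_id="your-org/your-diffusion-model")  # hypothetical repo
diffusers_config = PIPELINE_REGISTRY.get_active_diffusers_config(repo)

if diffusers_config is None:
    print("Not a diffusion pipeline (no model_index.json found)")
else:
    print("Diffusion pipeline components:", list(diffusers_config.keys()))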

get_active_huggingface_config()

get_active_huggingface_config(huggingface_repo)

Retrieves or creates a cached Hugging Face AutoConfig for the given model.

Maintains a cache of Hugging Face configurations to avoid unnecessary reloads, each of which incurs a Hugging Face Hub API call. If a config for the given model hasn't been loaded before, a new one is created using AutoConfig.from_pretrained() with the model's settings.

Note: The cache key (HuggingFaceRepo) includes trust_remote_code in its hash, so configs with different trust settings are cached separately. For multiprocessing, each worker process has its own registry instance with an empty cache, so configs are loaded fresh in each worker.

Parameters:

huggingface_repo (HuggingFaceRepo) – The HuggingFaceRepo containing the model.

Returns:

The Hugging Face configuration object for the model.

Return type:

AutoConfig
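A short sketch of the caching behavior described above, assuming HuggingFaceRepo can be built from a repo id:

from max.pipelines.lib.registry import PIPELINE_REGISTRY
from max.pipelines.lib import HuggingFaceRepo  # import path assumed

repo = HuggingFaceRepo(repo_id="your-org/your-model-name")  # hypothetical repo

# First call loads the config via AutoConfig.from_pretrained(), hitting the Hub.
config = PIPELINE_REGISTRY.get_active_huggingface_config(repo)

# A second call with the same HuggingFaceRepo is served from the in-process cache.
config_again = PIPELINE_REGISTRY.get_active_huggingface_config(repo)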

get_active_tokenizer()

get_active_tokenizer(huggingface_repo)

Retrieves or creates a cached Hugging Face AutoTokenizer for the given model.

Maintains a cache of Hugging Face tokenizers to avoid unnecessary reloads, each of which incurs a Hugging Face Hub API call. If a tokenizer for the given model hasn't been loaded before, a new one is created using AutoTokenizer.from_pretrained() with the model's settings.

Parameters:

huggingface_repo (HuggingFaceRepo) – The HuggingFaceRepo containing the model.

Returns:

The Hugging Face tokenizer for the model.

Return type:

PreTrainedTokenizer | PreTrainedTokenizerFast
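A small sketch of fetching and using the cached tokenizer; the HuggingFaceRepo import path and constructor are assumptions, and the returned object follows the standard Hugging Face tokenizer API:

from max.pipelines.lib.registry import PIPELINE_REGISTRY
from max.pipelines.lib import HuggingFaceRepo  # import path assumed

repo = HuggingFaceRepo(repo_id="your-org/your-model-name")  # hypothetical repo
tokenizer = PIPELINE_REGISTRY.get_active_tokenizer(repo)

# Standard Hugging Face tokenizer usage.
encoded = tokenizer("Hello, world!")
print(encoded["input_ids"])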

register()

register(architecture, *, allow_override=False)

Add a new architecture to the registry.

If multiple architectures share the same name but have different tasks, they are registered in a secondary lookup table keyed by (name, task).

Parameters:

  • architecture (SupportedArchitecture) – The architecture to register.
  • allow_override (bool) – Whether an existing registration with the same name may be replaced (default=False).

Return type:

None

reset()

reset()

Clears all registered architectures (mainly for tests).

Return type:

None

retrieve()

retrieve(pipeline_config, task=PipelineTask.TEXT_GENERATION, override_architecture=None)

Retrieves the tokenizer and an instantiated pipeline for the given config.

Parameters:

  • pipeline_config (PipelineConfig) – The configuration for the pipeline.
  • task (PipelineTask) – The pipeline task to build (default=PipelineTask.TEXT_GENERATION).
  • override_architecture (str | None) – Optional architecture name to use instead of looking it up from the model repository.

Return type:

tuple[PipelineTokenizer[Any, Any, Any], PipelineTypes]
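A minimal sketch of retrieving a ready-to-use pipeline; the PipelineConfig constructor and import paths are assumptions for illustration:

from max.pipelines.lib.registry import PIPELINE_REGISTRY
from max.pipelines.lib import PipelineConfig, PipelineTask  # import paths assumed

config = PipelineConfig(model_path="your-org/your-model-name")  # hypothetical constructor
tokenizer, pipeline = PIPELINE_REGISTRY.retrieve(
    config, task=PipelineTask.TEXT_GENERATION
)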

retrieve_architecture()

retrieve_architecture(huggingface_repo, use_legacy_module=True, task=None)

Retrieve architecture matching the Hugging Face model config.

Parameters:

  • huggingface_repo (HuggingFaceRepo) – The Hugging Face repository to match against.
  • use_legacy_module (bool) – Whether to use the legacy Module architecture (default=True). When True, appends the “_Legacy” suffix to find the legacy graph-based architecture. When False, uses the standard Hugging Face architecture name for the new API.
  • task (PipelineTask | None) – Optional task to disambiguate when multiple architectures share the same name. If not provided and multiple architectures share the same name, the task will be inferred from the Hugging Face Hub’s pipeline_tag.

Returns:

The matching SupportedArchitecture or None if no match found.

Return type:

SupportedArchitecture | None
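A sketch of looking up the architecture for a repository, passing task to disambiguate shared names; the import paths and constructors are assumptions:

from max.pipelines.lib.registry import PIPELINE_REGISTRY
from max.pipelines.lib import HuggingFaceRepo, PipelineTask  # import paths assumed

repo = HuggingFaceRepo(repo_id="your-org/your-model-name")  # hypothetical repo
arch = PIPELINE_REGISTRY.retrieve_architecture(
    repo, use_legacy_module=False, task=PipelineTask.TEXT_GENERATION
)
if arch is None:
    raise ValueError("No supported architecture matches this repository")
print(arch.name, arch.task)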

retrieve_context_type()

retrieve_context_type(pipeline_config, override_architecture=None, task=None)

Retrieve the context class type associated with the architecture for the given pipeline configuration.

The context type defines how the pipeline manages request state and inputs during model execution. Different architectures may use different context implementations that adhere to either the TextGenerationContext or EmbeddingsContext protocol.

Parameters:

  • pipeline_config (PipelineConfig) – The configuration for the pipeline.
  • override_architecture (str | None) – Optional architecture name to use instead of looking up based on the model repository. This is useful for cases like audio generation where the pipeline uses a different architecture (e.g., audio decoder) than the underlying model repository.
  • task (PipelineTask | None) – Optional pipeline task to disambiguate when multiple architectures share the same name but serve different tasks.

Returns:

The context class type associated with the architecture, which implements either the TextGenerationContext or EmbeddingsContext protocol.

Raises:

ValueError – If no supported architecture is found for the given model repository or override architecture name.

Return type:

type[TextGenerationContext] | type[EmbeddingsContext]
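A brief sketch of inspecting the context type for a configuration; the PipelineConfig constructor and import path are assumed:

from max.pipelines.lib.registry import PIPELINE_REGISTRY
from max.pipelines.lib import PipelineConfig  # import path assumed

config = PipelineConfig(model_path="your-org/your-model-name")  # hypothetical constructor
context_cls = PIPELINE_REGISTRY.retrieve_context_type(config)

# The returned class implements TextGenerationContext or EmbeddingsContext.
print(context_cls.__name__)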

retrieve_factory()

retrieve_factory(pipeline_config, task=PipelineTask.TEXT_GENERATION, override_architecture=None)

Retrieves the tokenizer and a factory that creates the pipeline instance.

Parameters:

  • pipeline_config (PipelineConfig) – The configuration for the pipeline.
  • task (PipelineTask) – The pipeline task to build (default=PipelineTask.TEXT_GENERATION).
  • override_architecture (str | None) – Optional architecture name to use instead of looking it up from the model repository.

Return type:

tuple[PipelineTokenizer[Any, Any, Any], Callable[[], PipelineTypes]]
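Compared with retrieve(), the factory defers building the pipeline, which can be useful when construction is expensive or should happen elsewhere (for example in a worker process). A sketch, with the PipelineConfig constructor and import path assumed:

from max.pipelines.lib.registry import PIPELINE_REGISTRY
from max.pipelines.lib import PipelineConfig  # import path assumed

config = PipelineConfig(model_path="your-org/your-model-name")  # hypothetical constructor

tokenizer, pipeline_factory = PIPELINE_REGISTRY.retrieve_factory(config)
# No pipeline has been built yet; instantiate it when and where it is needed.
pipeline = pipeline_factory()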

retrieve_pipeline_task()

retrieve_pipeline_task(pipeline_config)

Retrieves the pipeline task for the given pipeline configuration.

Parameters:

pipeline_config (PipelineConfig) – The configuration for the pipeline.

Returns:

The task associated with the architecture.

Raises:

ValueError – If no supported architecture is found for the given model repository.

Return type:

PipelineTask
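A short sketch of resolving the task for a configuration; the constructor and import paths are assumptions:

from max.pipelines.lib.registry import PIPELINE_REGISTRY
from max.pipelines.lib import PipelineConfig, PipelineTask  # import paths assumed

config = PipelineConfig(model_path="your-org/your-model-name")  # hypothetical constructor
task = PIPELINE_REGISTRY.retrieve_pipeline_task(config)
if task == PipelineTask.TEXT_GENERATION:
    print("This model runs as a text generation pipeline")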

retrieve_tokenizer()

retrieve_tokenizer(pipeline_config, override_architecture=None, task=None)

Retrieves a tokenizer for the given pipeline configuration.

Parameters:

  • pipeline_config (PipelineConfig) – The configuration for the pipeline.
  • override_architecture (str | None) – Optional architecture name to use instead of looking it up from the model repository.
  • task (PipelineTask | None) – Optional pipeline task to disambiguate when multiple architectures share the same name but serve different tasks.

Returns:

The configured tokenizer

Return type:

PipelineTokenizer

Raises:

ValueError – If no matching architecture is found.
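A sketch of retrieving a tokenizer, including the override case described above (e.g. an audio pipeline whose architecture differs from the model repository); the architecture name, PipelineConfig constructor, and import path are hypothetical:

from max.pipelines.lib.registry import PIPELINE_REGISTRY
from max.pipelines.lib import PipelineConfig  # import path assumed

config = PipelineConfig(model_path="your-org/your-model-name")  # hypothetical constructor

# Default lookup based on the model repository in the config.
tokenizer = PIPELINE_REGISTRY.retrieve_tokenizer(config)

# Force a specific architecture instead of inferring it from the repository.
tokenizer = PIPELINE_REGISTRY.retrieve_tokenizer(
    config, override_architecture="MyAudioDecoderArchitecture"  # hypothetical name
)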

SupportedArchitecture

class max.pipelines.lib.registry.SupportedArchitecture(name, example_repo_ids, default_encoding, supported_encodings, pipeline_model, task, tokenizer, default_weights_format, context_type, config, rope_type='none', weight_adapters=<factory>, multi_gpu_supported=False, required_arguments=<factory>, context_validators=<factory>, supports_empty_batches=False, requires_max_batch_context_length=False)

Represents a model architecture configuration for MAX pipelines.

Defines the components and settings required to support a specific model architecture within the MAX pipeline system. Each SupportedArchitecture instance encapsulates the model implementation, tokenizer, supported encodings, and other architecture-specific configuration.

New architectures should be registered into the PipelineRegistry using the register() method.

Example:

my_architecture = SupportedArchitecture(
    name="MyModelForCausalLM",  # Must match your Hugging Face model class name
    example_repo_ids=[
        "your-org/your-model-name",  # Add example model repository IDs
    ],
    default_encoding="q4_k",
    supported_encodings={
        "q4_k": ["paged"],
        "bfloat16": ["paged"],
        # Add other encodings your model supports
    },
    pipeline_model=MyModel,
    tokenizer=TextTokenizer,
    context_type=TextContext,
    config=MyModelConfig,  # Architecture-specific config class
    default_weights_format=WeightsFormat.safetensors,
    rope_type="none",
    weight_adapters={
        WeightsFormat.safetensors: weight_adapters.convert_safetensor_state_dict,
        # Add other weight formats if needed
    },
    multi_gpu_supported=True,  # Set based on your implementation capabilities
    required_arguments={"some_arg": True},
    task=PipelineTask.TEXT_GENERATION,
)

Parameters:

Each constructor parameter corresponds to the attribute of the same name, documented below.

config

config: type[ArchConfig]

The architecture-specific configuration class for the model.

This class must implement the ArchConfig protocol, providing an initialize method that creates a configuration instance from a PipelineConfig. For models with KV cache, this should be a class implementing ArchConfigWithKVCache to enable KV cache memory estimation.

context_type

context_type: type[TextGenerationContext] | type[EmbeddingsContext]

The context class type that this architecture uses for managing request state and inputs.

This should be a class (not an instance) that implements either the TextGenerationContext or EmbeddingsContext protocol, defining how the pipeline processes and tracks requests.

context_validators

context_validators: list[Callable[[TextContext | TextAndVisionContext], None]]

A list of callable validators that verify context inputs before model execution.

These validators are called during context creation to ensure inputs meet model-specific requirements. Validators should raise InputError for invalid inputs, providing early error detection before expensive model operations.

def validate_single_image(context: TextContext | TextAndVisionContext) -> None:
    if isinstance(context, TextAndVisionContext):
        if context.pixel_values and len(context.pixel_values) > 1:
            raise InputError(f"Model supports only 1 image, got {len(context.pixel_values)}")

my_architecture = SupportedArchitecture(
    # ... other fields ...
    context_validators=[validate_single_image],
)

default_encoding

default_encoding: Literal['float32', 'bfloat16', 'q4_k', 'q4_0', 'q6_k', 'float8_e4m3fn', 'float4_e2m1fnx2', 'gptq']

The default quantization encoding to use when no specific encoding is requested.

default_weights_format

default_weights_format: WeightsFormat

The weights format expected by the pipeline_model.

example_repo_ids

example_repo_ids: list[str]

A list of Hugging Face repository IDs that use this architecture for testing and validation purposes.

multi_gpu_supported

multi_gpu_supported: bool = False

Whether the architecture supports multi-GPU execution.

name

name: str

The name of the model architecture that must match the Hugging Face model class name.

pipeline_model

pipeline_model: type[PipelineModel[Any]]

The PipelineModel class that defines the model graph structure and execution logic.

required_arguments

required_arguments: dict[str, bool | int | float]

A dictionary specifying required values for PipelineConfig options.

requires_max_batch_context_length

requires_max_batch_context_length: bool = False

Whether the architecture requires a max batch context length to be specified.

If True and max_batch_context_length is not specified, it defaults to the model's maximum sequence length.

rope_type

rope_type: Literal['none', 'normal', 'neox', 'longrope', 'yarn'] = 'none'

The type of RoPE (Rotary Position Embedding) used by the model.

supported_encodings

supported_encodings: dict[Literal['float32', 'bfloat16', 'q4_k', 'q4_0', 'q6_k', 'float8_e4m3fn', 'float4_e2m1fnx2', 'gptq'], list[Literal['model_default', 'paged']]]

A dictionary mapping supported quantization encodings to their compatible KV cache strategies.

supports_empty_batches

supports_empty_batches: bool = False

Whether the architecture can handle empty batches during inference.

When set to True, the pipeline can process requests with zero-sized batches without errors. This is useful for certain execution modes and expert parallelism. Most architectures do not require empty batch support and should leave this as False.

task

task: PipelineTask

The pipeline task type that this architecture supports.

tokenizer

tokenizer: Callable[[...], PipelineTokenizer[Any, Any, Any]]

A callable that returns a PipelineTokenizer instance for preprocessing model inputs.

tokenizer_cls

property tokenizer_cls: type[PipelineTokenizer[Any, Any, Any]]

Returns the tokenizer class for this architecture.

weight_adapters

weight_adapters: dict[WeightsFormat, Callable[[...], dict[str, WeightData]]]

A dictionary of weight format adapters for converting checkpoints from different formats to the default format.

get_pipeline_for_task()

max.pipelines.lib.registry.get_pipeline_for_task(task, pipeline_config)

Returns the pipeline class for the given task and config.

Parameters:

  • task (PipelineTask) – The pipeline task (e.g. text generation, embeddings).
  • pipeline_config (PipelineConfig) – The pipeline configuration (may select a speculative decoding path).

Returns:

The pipeline class to use for this task and config.

Return type:

type[TextGenerationPipeline[TextContext]] | type[EmbeddingsPipeline] | type[AudioGeneratorPipeline] | type[PixelGenerationPipeline[Any]] | type[StandaloneSpeculativeDecodingPipeline] | type[SpeechTokenGenerationPipeline] | type[EAGLESpeculativeDecodingPipeline] | type[OverlapTextGenerationPipeline[TextContext]]
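A minimal sketch of mapping a task to its pipeline class; the constructor and import paths are assumptions:

from max.pipelines.lib.registry import get_pipeline_for_task
from max.pipelines.lib import PipelineConfig, PipelineTask  # import paths assumed

config = PipelineConfig(model_path="your-org/your-model-name")  # hypothetical constructor
pipeline_cls = get_pipeline_for_task(PipelineTask.TEXT_GENERATION, config)
print(pipeline_cls.__name__)  # e.g. a text generation pipeline class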

PIPELINE_REGISTRY

max.pipelines.lib.registry.PIPELINE_REGISTRY: PipelineRegistry

Global registry of supported model architectures.

This is the singleton PipelineRegistry instance you can use to register new MAX model architectures and query supported models.
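A sketch of using the global registry to register a new architecture and confirm it can be resolved; my_architecture is the SupportedArchitecture example above, and the HuggingFaceRepo constructor and import path are assumed:

from max.pipelines.lib.registry import PIPELINE_REGISTRY
from max.pipelines.lib import HuggingFaceRepo  # import path assumed

# Register the architecture defined in the SupportedArchitecture example.
PIPELINE_REGISTRY.register(my_architecture)

# Confirm the registry resolves it for one of its example repositories.
repo = HuggingFaceRepo(repo_id="your-org/your-model-name")  # hypothetical repo
arch = PIPELINE_REGISTRY.retrieve_architecture(repo, use_legacy_module=False)
assert arch is not None and arch.name == "MyModelForCausalLM"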
