IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /max/get-started.md). For the complete documentation index, see llms.txt.
Skip to main content
For the complete documentation index, see llms.txt. Markdown versions of all pages are available by appending .md to any URL (e.g. /max/get-started.md).

Model development overview

The Model development section explains how to write model architectures that MAX doesn't support out of the box. You'll learn how to use the MAX Python API to convert trained models from Hugging Face or PyTorch into models that MAX can serve.

While MAX supports multiple modalities (including image and video generation), the guides in this section currently focus on transformer-based text generation models, as they follow the most stable and established implementation paradigms.

What you implement​

MAX provides the serving infrastructure, runtime, and model execution features required to run a model in production. When you add support for a new architecture, you only need to implement the architecture-specific components.

To add support for a new model architecture, implement the:

  • Model graph: Define the model layers (modules), its attention pattern, and data flow.
  • Weight adapter: Define how to map the Hugging Face checkpoint key names to the corresponding weight names for each layer in your MAX model.
  • Pipeline model: Define the pipeline that connects the graph to the serving interface.
  • Configuration: Translate Hugging Face's config.json fields into parameters that MAX needs to build the graph and allocate its caches.
  • Architecture registry: Define how to connect all the pieces, such as the model, tokenizer, weight loader, and quantization formats.

Once you've integrated the model architecture into the MAX inference pipeline, MAX provides the following functionality for you:

  • Serving: Request scheduling, continuous batching, and an OpenAI-compatible endpoint API.
  • KV cache management: Paged allocation, cache eviction, prefix caching, and chunked prefill.
  • Parallelism: Data parallelism, tensor parallelism, weight sharding, collective communication, and device management.
  • Tokenization: Wrapping Hugging Face tokenizers for the serving loop.
  • Weight loading: Reading safetensors and GGUF files from disk or Hugging Face Hub.
  • Compilation and execution: The graph compiler, kernel fusion, and runtime, with built-in quantization support for types like FP8 and FP4.

Architecture components​

Each model architecture in MAX requires the following five components, each supported by a corresponding part of the MAX Python API.

Model graph​

You'll use Module subclasses to define the graph computation layers, and assemble them into a Graph object that MAX uses to compile the model.

Standard architectures like Llama 3 compose entirely from built-in modules such as Linear, MLP, AttentionWithRope, Embedding, RMSNorm, Transformer, TransformerBlock, and more. You'll need a custom Module subclass only for model-specific behavior (such as novel attention or custom MoE gating).

Learn more in Build a model graph with Module.

Weight adapter​

You need a simple function that renames Hugging Face checkpoint keys to match your Module hierarchy. Usually this is simple string replacement: the Llama 3 adapter just strips a "model." prefix. The adapter is registered in SupportedArchitecture and runs once when MAX loads the model.

Pipeline model​

This is the central class that builds, compiles, and executes the model graph. For models with a KV cache (most LLMs), you'll create it as a subclass of PipelineModelWithKVCache.

When instantiated, the pipeline model receives the InferenceSession used at runtime and is responsible for:

  • Graph construction: Assemble Module layers, call load_state_dict(state_dict) to bind the adapted checkpoint weights, and build the Graph object.
  • Compilation: Call session.compile(graph) to compile the graph into a CompiledModel, then session.init(compiled, weights_registry=state_dict) to bind weights and produce an executable Model.
  • Input preparation: Convert tokenized requests into graph inputs.
  • Execution: Its execute() function is called by the serving loop to run the compiled model and return the logits.

Learn more in Model pipelines.

Configuration​

Your config class (in model_config.py) implements the ArchConfig protocol. Its initialize() class method reads the Hugging Face config.json and pipeline settings (devices, quantization, cache sizing) to produce a config object. The model hyperparameters pass through unchanged: your config class is necessary because Hugging Face configs aren't standardized across model families (different field names, nesting conventions, derived vs. explicit values).

Learn more in Serve custom model architectures.

Architecture registry​

Every model architecture needs an arch.py file that connects all the pieces.

It's just a SupportedArchitecture object that registers your model to the pipeline system. It wires together your pipeline model class, config class, tokenizer, weight adapter, and supported quantization formats. When a user serves a model based on a Hugging Face repo ID, MAX looks up this registration to find your code.

Learn more in Serve custom model architectures.

Deploy your model​

After you build your model architecture in MAX, you can test it locally and optionally contribute it back to MAX so everybody can use it.

Test locally​

Use the --custom-architectures flag to serve a model with your local architecture implementation.

max serve --model your-org/your-model \
  --custom-architectures path/to/your/architecture

For details, see Serve custom model architectures.

Contribute to MAX​

To make your architecture available to all MAX users:

  1. Register the architecture in architectures/__init__.py.
  2. Submit a pull request.

After your changes are merged, users can serve compatible models by specifying their Hugging Face repository ID with max serve.

For details, see Contributing new model architectures.

Was this page helpful?