For the complete documentation index, see llms.txt. Markdown versions of all pages are available by appending .md to any URL (e.g. /max/get-started.md).

Model development overview

The Model development section explains how to write model architectures that MAX doesn't support out of the box. You'll learn how to use the MAX Python API to convert trained models from Hugging Face or PyTorch into models that MAX can serve.

While MAX supports multiple modalities (including image and video generation), the guides in this section currently focus on transformer-based text generation models, as they follow the most stable and established implementation paradigms.

Stable and experimental APIs

Most guides in this section demonstrate usage of the max.graph and max.nn packages. These are the stable APIs used by all production MAX model architectures. However, MAX also provides an experimental API (max.experimental) that you can learn about in the Eager fundamentals section. Be aware that its namespace may change in the future as the API matures.

What you implement

MAX provides the serving infrastructure, runtime, and model execution features required to run a model in production. When you add support for a new architecture, you only need to implement the architecture-specific components.

To add support for a new model architecture, implement the:

Model graph: Define the model layers (modules), its attention pattern, and data flow.
Weight adapter: Define how to map the Hugging Face checkpoint key names to the corresponding weight names for each layer in your MAX model.
Pipeline model: Define the pipeline that connects the graph to the serving interface.
Configuration: Translate Hugging Face's config.json fields into parameters that MAX needs to build the graph and allocate its caches.
Architecture registry: Define how to connect all the pieces, such as the model, tokenizer, weight loader, and quantization formats.

Once you've integrated the model architecture into the MAX inference pipeline, MAX provides the following functionality for you:

Serving: Request scheduling, continuous batching, and an OpenAI-compatible endpoint API.
KV cache management: Paged allocation, cache eviction, prefix caching, and chunked prefill.
Parallelism: Data parallelism, tensor parallelism, weight sharding, collective communication, and device management.
Tokenization: Wrapping Hugging Face tokenizers for the serving loop.
Weight loading: Reading safetensors and GGUF files from disk or Hugging Face Hub.
Compilation and execution: The graph compiler, kernel fusion, and runtime, with built-in quantization support for types like FP8 and FP4.

Architecture components

Each model architecture in MAX requires the following five components, each supported by a corresponding part of the MAX Python API.

Many models are implemented as variations of an existing architecture. If your model is based on an existing architecture such as Llama, Qwen, or DeepSeek, you can often reuse the existing implementation and override only the components that differ.

Model graph

You'll use Module subclasses to define the graph computation layers, and assemble them into a Graph object that MAX uses to compile the model.

Standard architectures like Llama 3 compose entirely from built-in modules such as Linear, MLP, AttentionWithRope, Embedding, RMSNorm, Transformer, TransformerBlock, and more. You'll need a custom Module subclass only for model-specific behavior (such as novel attention or custom MoE gating).

Learn more in Build a model graph with Module.

Weight adapter

You need a simple function that renames Hugging Face checkpoint keys to match your Module hierarchy. Usually this is simple string replacement: the Llama 3 adapter just strips a "model." prefix. The adapter is registered in SupportedArchitecture and runs once when MAX loads the model.

Pipeline model

This is the central class that builds, compiles, and executes the model graph. For models with a KV cache (most LLMs), you'll create it as a subclass of PipelineModelWithKVCache.

When instantiated, the pipeline model receives the InferenceSession used at runtime and is responsible for:

Graph construction: Assemble Module layers, call load_state_dict(state_dict) to bind the adapted checkpoint weights, and build the Graph object.
Compilation: Call session.compile(graph) to compile the graph into a CompiledModel, then session.init(compiled, weights_registry=state_dict) to bind weights and produce an executable Model.
Input preparation: Convert tokenized requests into graph inputs.
Execution: Its execute() function is called by the serving loop to run the compiled model and return the logits.

Learn more in Model pipelines.

Configuration

Your config class (in model_config.py) implements the ArchConfig protocol. Its initialize() class method reads the Hugging Face config.json and pipeline settings (devices, quantization, cache sizing) to produce a config object. The model hyperparameters pass through unchanged: your config class is necessary because Hugging Face configs aren't standardized across model families (different field names, nesting conventions, derived vs. explicit values).

Learn more in Serve custom model architectures.

Architecture registry

Every model architecture needs an arch.py file that connects all the pieces.

It's just a SupportedArchitecture object that registers your model to the pipeline system. It wires together your pipeline model class, config class, tokenizer, weight adapter, and supported quantization formats. When a user serves a model based on a Hugging Face repo ID, MAX looks up this registration to find your code.

Learn more in Serve custom model architectures.

Deploy your model

After you build your model architecture in MAX, you can test it locally and optionally contribute it back to MAX so everybody can use it.

Test locally

Use the --custom-architectures flag to serve a model with your local architecture implementation.

max serve --model your-org/your-model \
  --custom-architectures path/to/your/architecture

For details, see Serve custom model architectures.

Contribute to MAX

To make your architecture available to all MAX users:

Register the architecture in architectures/__init__.py.
Submit a pull request.

After your changes are merged, users can serve compatible models by specifying their Hugging Face repository ID with max serve.

For details, see Contributing new model architectures.

What you implement​

Architecture components​

Model graph​

Weight adapter​

Pipeline model​

Configuration​

Architecture registry​

Deploy your model​

Test locally​

Contribute to MAX​