Model development overview
The Model development section explains how to write model architectures that MAX doesn't support out of the box. You'll learn how to use the MAX Python API to convert trained models from Hugging Face or PyTorch into models that MAX can serve.
While MAX supports multiple modalities (including image and video generation), the guides in this section currently focus on transformer-based text generation models, as they follow the most stable and established implementation paradigms.
What you implement
MAX provides the serving infrastructure, runtime, and model execution features required to run a model in production. When you add support for a new architecture, you only need to implement the architecture-specific components.
To add support for a new model architecture, implement the following components:
- Model graph: Define the model's layers (modules), attention pattern, and data flow.
- Weight adapter: Define how to map the Hugging Face checkpoint key names to the corresponding weight names for each layer in your MAX model.
- Pipeline model: Define the pipeline that connects the graph to the serving interface.
- Configuration: Translate Hugging Face's config.json fields into parameters that MAX needs to build the graph and allocate its caches.
- Architecture registry: Define how to connect all the pieces, such as the model, tokenizer, weight loader, and quantization formats.
Once you've integrated the model architecture into the MAX inference pipeline, MAX provides the following functionality for you:
- Serving: Request scheduling, continuous batching, and an OpenAI-compatible endpoint API.
- KV cache management: Paged allocation, cache eviction, prefix caching, and chunked prefill.
- Parallelism: Data parallelism, tensor parallelism, weight sharding, collective communication, and device management.
- Tokenization: Wrapping Hugging Face tokenizers for the serving loop.
- Weight loading: Reading safetensors and GGUF files from disk or Hugging Face Hub.
- Compilation and execution: The graph compiler, kernel fusion, and runtime, with built-in quantization support for types like FP8 and FP4.
Architecture components
Each model architecture in MAX requires the following five components, each supported by a corresponding part of the MAX Python API.
Model graph
You'll use Module subclasses to define the graph computation layers, and
assemble them into a Graph object that MAX uses to compile the model.
Standard architectures like Llama 3 compose entirely from built-in modules such
as Linear, MLP, AttentionWithRope, Embedding, RMSNorm, Transformer,
TransformerBlock, and more. You'll need a custom Module subclass only for
model-specific behavior (such as novel attention or custom MoE gating).
Learn more in Build a model graph with Module.
Weight adapter
You need a function that renames Hugging Face checkpoint keys to match
your Module hierarchy. Usually this is simple string replacement: the Llama 3
adapter just strips a "model." prefix. The adapter is registered in
SupportedArchitecture and runs once when MAX loads the model.
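The prefix-stripping idea described above can be sketched in plain Python. This is an illustration only, not the MAX adapter itself: the function name `strip_prefix_adapter`, its signature, and the placeholder string values standing in for tensors are all assumptions made for the example.

```python
from typing import Any, Dict


def strip_prefix_adapter(
    state_dict: Dict[str, Any], prefix: str = "model."
) -> Dict[str, Any]:
    """Return a new dict whose keys have `prefix` removed when present.

    Hypothetical helper for illustration; the real MAX adapter signature
    may differ.
    """
    adapted: Dict[str, Any] = {}
    for key, value in state_dict.items():
        # Only rename keys that actually carry the prefix.
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        # Guard against two checkpoint keys collapsing to the same name.
        if new_key in adapted:
            raise ValueError(f"duplicate key after renaming: {new_key!r}")
        adapted[new_key] = value
    return adapted


# Placeholder strings stand in for weight tensors.
checkpoint = {
    "model.embed_tokens.weight": "tensor-0",
    "model.layers.0.self_attn.q_proj.weight": "tensor-1",
    "lm_head.weight": "tensor-2",
}
adapted = strip_prefix_adapter(checkpoint)
print(sorted(adapted))
```

Returning a new dictionary instead of mutating the input keeps the original checkpoint keys available for error messages if a later lookup fails.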
Pipeline model
This is the central class that builds, compiles, and executes the model graph.
For models with a KV cache (most LLMs), you'll create it as a subclass of
PipelineModelWithKVCache.
When instantiated, the pipeline model receives the InferenceSession used at
runtime and is responsible for:
- Graph construction: Assemble Module layers, call load_state_dict(state_dict) to bind the adapted checkpoint weights, and build the Graph object.
- Compilation: Call session.compile(graph) to compile the graph into a CompiledModel, then session.init(compiled, weights_registry=state_dict) to bind weights and produce an executable Model.
- Input preparation: Convert tokenized requests into graph inputs.
- Execution: Its execute() function is called by the serving loop to run the compiled model and return the logits.
Learn more in Model pipelines.
Configuration
Your config class (in model_config.py) implements the ArchConfig protocol.
Its initialize() class method reads the Hugging Face config.json and
pipeline settings (devices, quantization, cache sizing) to produce a config
object. The model hyperparameters pass through unchanged; the config class
exists because Hugging Face configs aren't standardized across model families
(different field names, nesting conventions, and derived versus explicit values).
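To make the normalization problem concrete, here is a minimal sketch in plain Python. It does not implement the ArchConfig protocol and ignores pipeline settings; `ExampleConfig`, `first_present`, and `from_hf_config` are hypothetical names invented for this example. The field names come from Llama-style configs (`hidden_size`, `num_attention_heads`) and GPT-2-style configs (`n_embd`, `n_head`).

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class ExampleConfig:
    """Hypothetical normalized hyperparameters (not a MAX class)."""
    hidden_size: int
    num_layers: int
    num_heads: int
    num_kv_heads: int
    head_dim: int


def first_present(cfg: Dict[str, Any], *names: str) -> Any:
    """Return the value of the first field name found in the config."""
    for name in names:
        if name in cfg:
            return cfg[name]
    raise KeyError(f"none of {names} found in config")


def from_hf_config(cfg: Dict[str, Any]) -> ExampleConfig:
    hidden = first_present(cfg, "hidden_size", "n_embd")
    heads = first_present(cfg, "num_attention_heads", "n_head")
    layers = first_present(cfg, "num_hidden_layers", "n_layer")
    # Derived vs. explicit values: fall back when a field is omitted.
    kv_heads = cfg.get("num_key_value_heads", heads)
    head_dim = cfg.get("head_dim", hidden // heads)
    return ExampleConfig(hidden, layers, heads, kv_heads, head_dim)


llama_style = json.loads(
    '{"hidden_size": 4096, "num_attention_heads": 32,'
    ' "num_key_value_heads": 8, "num_hidden_layers": 32}'
)
gpt2_style = {"n_embd": 768, "n_head": 12, "n_layer": 12}

print(from_hf_config(llama_style))
print(from_hf_config(gpt2_style))
```

Both inputs produce the same `ExampleConfig` shape, so downstream graph-building code would only need to handle one set of field names.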
Learn more in Serve custom model architectures.
Architecture registry
Every model architecture needs an arch.py file that connects all the pieces.
It defines a SupportedArchitecture object that registers your model with the
pipeline system. It wires together your pipeline model class, config class,
tokenizer, weight adapter, and supported quantization formats. When a user
serves a model based on a Hugging Face repo ID, MAX looks up this registration
to find your code.
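The registration-and-lookup pattern described above can be sketched with a plain dictionary. This is a conceptual illustration, not the SupportedArchitecture API: `ArchEntry`, `REGISTRY`, `register`, and `resolve` are hypothetical names, and keying the lookup on the `architectures` field of config.json is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class ArchEntry:
    """Hypothetical stand-in for one registration (not a MAX class)."""
    name: str
    pipeline_model: str
    config_class: str
    tokenizer: str
    weight_adapter: Callable[[Dict[str, Any]], Dict[str, Any]]
    quantization_formats: Tuple[str, ...]


REGISTRY: Dict[str, ArchEntry] = {}


def register(entry: ArchEntry) -> None:
    """Add an entry, refusing to overwrite an existing name."""
    if entry.name in REGISTRY:
        raise ValueError(f"architecture already registered: {entry.name}")
    REGISTRY[entry.name] = entry


def resolve(hf_config: Dict[str, Any]) -> ArchEntry:
    """Find the entry for a parsed config.json (assumed lookup key)."""
    for name in hf_config.get("architectures", []):
        if name in REGISTRY:
            return REGISTRY[name]
    raise LookupError("no registered architecture matches this config")


register(
    ArchEntry(
        name="MyModelForCausalLM",
        pipeline_model="MyPipelineModel",
        config_class="MyModelConfig",
        tokenizer="MyTokenizer",
        # Identity adapter: returns a shallow copy of the checkpoint.
        weight_adapter=lambda state_dict: dict(state_dict),
        quantization_formats=("bfloat16",),
    )
)

entry = resolve({"architectures": ["MyModelForCausalLM"]})
print(entry.pipeline_model)
```

Keeping every piece in one record means the serving layer needs only a single lookup to find the pipeline model, config class, tokenizer, and adapter for a given checkpoint.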
Learn more in Serve custom model architectures.
Deploy your model
After you build your model architecture in MAX, you can test it locally and optionally contribute it back to MAX so everybody can use it.
Test locally
Use the --custom-architectures flag to serve a model with your local
architecture implementation.
max serve --model your-org/your-model \
  --custom-architectures path/to/your/architecture
For details, see Serve custom model architectures.
Contribute to MAX
To make your architecture available to all MAX users:
- Register the architecture in architectures/__init__.py.
- Submit a pull request.
After your changes are merged, users can serve compatible models by specifying
their Hugging Face repository ID with max serve.
For details, see Contributing new model architectures.