
Serve custom model architectures

MAX comes with built-in support for popular model architectures like Gemma3ForCausalLM, Qwen2ForCausalLM, and LlamaForCausalLM, so you can deploy them instantly by passing a Hugging Face model name to the max serve command (explore our model repo). You can also use the same command to serve a custom model architecture through its OpenAI-compatible API.

In this tutorial, you'll build a custom architecture for a model called MyModel using our Python API, implement components for MAX, and serve your model with an OpenAI-compatible endpoint. By the end of this tutorial, you'll understand how to:

  • Set up the required file structure for custom architectures.
  • Register the model for MAX.
  • Configure weight format conversions.
  • Serve your model and make inference requests.

Set up your environment

Create a Python project and install the necessary dependencies:

  1. Create a project folder:
    mkdir my_model && cd my_model
  2. Create and activate a virtual environment:
    python3 -m venv .venv/my_model \
    && source .venv/my_model/bin/activate
  3. Install the modular Python package:
    pip install modular \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    --index-url https://dl.modular.com/public/nightly/python/simple/

Understand the architecture structure

Before creating your custom architecture, let's understand how MAX organizes model implementations. Each architecture follows a consistent structure that separates different concerns:

my_model/
├── __init__.py
├── arch.py
├── model.py
├── model_config.py
└── weight_adapters.py

Here's what each file does:

  • __init__.py: Makes your architecture discoverable.

  • arch.py: Registers your model with MAX, specifying supported encodings and capabilities.

  • model.py: Contains the core model implementation and computation graph logic.

  • model_config.py: Handles configuration parsing.

  • weight_adapters.py: Converts model weights from formats like SafeTensors or GGUF into the layout your implementation expects.

Implement the main model class

Start by creating model.py with your core model implementation. This is where you'll implement the main logic:

model.py
from typing import Dict, List, Optional
from max.pipelines.lib import PipelineModel
from max.graph import Graph, TensorType, DeviceRef, DType
from transformers import AutoConfig, AutoTokenizer

from .model_config import MyModelConfig
from .weight_adapters import convert_safetensor_state_dict


class MyModel(PipelineModel):
    """Main model class that implements your custom architecture."""

    def __init__(self, config: MyModelConfig, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.config = config

    @classmethod
    def from_huggingface(cls, model_path: str, **kwargs) -> "MyModel":
        """Create a MyModel instance from a Hugging Face model."""
        # Load the Hugging Face configuration
        hf_config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)

        # Convert to our internal configuration format
        config = MyModelConfig.from_huggingface_config(hf_config)

        return cls(config=config, **kwargs)

    def build_graph(self) -> Graph:
        """Build the computation graph for your model.

        This method defines how your model processes inputs and produces outputs.
        You'll implement the actual neural network logic here.
        """
        # Define input types for your model
        input_types = [
            TensorType(
                DType.int64,  # Token IDs
                shape=["batch_size", "sequence_length"],
                device=DeviceRef.GPU(),
            )
        ]

        # Create the computation graph
        with Graph("my_model", input_types=input_types) as graph:
            # Get graph inputs
            (input_ids,) = graph.inputs

            # TODO: Implement your model's forward pass here.
            # This is where you'd add your custom layers, attention mechanisms, etc.
            # For now, we'll add a placeholder.

            # Example placeholder - replace with your actual model logic
            output = input_ids  # Placeholder

            # Set graph outputs
            graph.output(output)

        return graph

The build_graph() method is where you'll implement your model's actual neural network logic. This example uses a placeholder; in a real implementation, you'll add your specific layers, attention mechanisms, and forward-pass logic based on your architecture. For more information, see the Get started with MAX graphs tutorial.

Define your architecture registration

Create the arch.py file that tells MAX about your model's capabilities using the SupportedArchitecture class.

arch.py
from max.graph.weights import WeightsFormat
from max.nn.kv_cache import KVCacheStrategy
from max.pipelines.core import PipelineTask
from max.pipelines.lib import (
    SupportedArchitecture,
    SupportedEncoding,
    TextTokenizer,
)

from . import weight_adapters
from .model import MyModel

my_model_arch = SupportedArchitecture(
    name="MyModelForCausalLM",
    example_repo_ids=[
        "your-org/your-model-name",  # Add example model repository IDs
    ],
    default_encoding=SupportedEncoding.q4_k,
    supported_encodings={
        SupportedEncoding.q4_k: [KVCacheStrategy.PAGED],
        SupportedEncoding.bfloat16: [KVCacheStrategy.PAGED],
        # Add other encodings your model supports
    },
    pipeline_model=MyModel,
    tokenizer=TextTokenizer,
    default_weights_format=WeightsFormat.safetensors,
    multi_gpu_supported=True,  # Set based on your implementation capabilities
    weight_adapters={
        WeightsFormat.safetensors: weight_adapters.convert_safetensor_state_dict,
        # Add other weight formats if needed
    },
    task=PipelineTask.TEXT_GENERATION,
)

The SupportedArchitecture configuration defines:

  • name: Matches the model class name in your Hugging Face model's configuration, for example MyModelForCausalLM (you can verify this with the check after this list).

  • example_repo_ids: List of repository IDs that use this architecture. This doesn't need to be an exhaustive list, but it should be a representative sample of the model variants you support.

  • supported_encodings: Which quantization formats and KV cache strategies your model supports.

  • pipeline_model: The main model class, MyModel, that you implemented in model.py.

  • task: Specifies the pipeline task that the model supports.
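
To confirm the exact string to pass as name, you can read the architectures field from the model's config.json using the transformers library. A minimal check; the repository ID is a placeholder:

from transformers import AutoConfig

# "your-org/your-model-name" is a placeholder repository ID.
hf_config = AutoConfig.from_pretrained(
    "your-org/your-model-name", trust_remote_code=True
)

# The name registered in arch.py must match one of these entries,
# for example ["MyModelForCausalLM"].
print(hf_config.architectures)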

Implement model configuration handling

Most models hosted on Hugging Face contain a config.json file that defines the model's architecture parameters, such as layer dimensions, attention heads, and activation functions. However, MAX's internal graph-building system requires these parameters in a specific format optimized for performance and graph construction.

To bridge this gap, you'll create a translation layer that helps you handle situations where parameter names differ between Hugging Face and MAX, set sensible defaults for missing values, and ensure all the configuration data is in the right format for your MAX implementation.

Create model_config.py to handle this configuration:

model_config.py
from dataclasses import dataclass
from typing import Any, Dict

from transformers import AutoConfig


@dataclass
class MyModelConfig:
    """Configuration class for your custom model.

    This handles the translation between Hugging Face's config.json format
    and your model's internal parameter requirements for MAX graph building.
    """

    # Core model parameters
    vocab_size: int
    hidden_size: int
    num_attention_heads: int
    num_hidden_layers: int
    max_sequence_length: int

    # Add your model-specific parameters here
    intermediate_size: int = 11008
    rms_norm_eps: float = 1e-6

    @classmethod
    def from_huggingface_config(cls, hf_config: AutoConfig) -> "MyModelConfig":
        """Create MyModelConfig from a Hugging Face AutoConfig."""
        return cls(
            vocab_size=hf_config.vocab_size,
            hidden_size=hf_config.hidden_size,
            num_attention_heads=hf_config.num_attention_heads,
            num_hidden_layers=hf_config.num_hidden_layers,
            max_sequence_length=getattr(hf_config, "max_position_embeddings", 2048),
            # Map other parameters from your Hugging Face config
            intermediate_size=getattr(hf_config, "intermediate_size", 11008),
            rms_norm_eps=getattr(hf_config, "rms_norm_eps", 1e-6),
        )

This configuration class acts as a critical bridge between Hugging Face's standardized config.json format and your model's specific implementation needs within MAX's graph system.
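
As a quick sanity check, you can exercise this translation layer on its own before wiring it into the rest of the pipeline. A minimal sketch, assuming the my_model package is importable from your working directory and using a placeholder repository ID:

from transformers import AutoConfig

from my_model.model_config import MyModelConfig

# Load the upstream config.json ("your-org/your-model-name" is a placeholder).
hf_config = AutoConfig.from_pretrained(
    "your-org/your-model-name", trust_remote_code=True
)

# Translate it into the internal format used for MAX graph building.
config = MyModelConfig.from_huggingface_config(hf_config)
print(config.hidden_size, config.num_hidden_layers, config.max_sequence_length)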

Implement weight format conversion

Different model formats and frameworks store weights in varying layouts and naming conventions that may not match your model's expected format. Create weight_adapters.py to handle weight format conversions:

weight_adapters.py
from typing import Dict, Any

from max.graph.weights import WeightData


def convert_safetensor_state_dict(
    state_dict: Dict[str, WeightData],
) -> Dict[str, WeightData]:
    """Convert SafeTensors weights to the format expected by your model.

    Args:
        state_dict: Raw weights loaded from SafeTensors format.

    Returns:
        Converted weights ready for your model implementation.
    """
    converted_weights = {}

    for key, weight in state_dict.items():
        # Apply any necessary transformations to weight names or values.
        # This is where you handle differences between Hugging Face naming
        # conventions and what your model expects.

        # Example: Remove prefixes that your model doesn't expect
        clean_key = key.replace("model.", "")

        # Example: Transpose weights if needed for your architecture
        if "linear" in clean_key and len(weight.shape) == 2:
            # Your model might expect different weight orientations
            converted_weights[clean_key] = weight  # Apply transpose if needed
        else:
            converted_weights[clean_key] = weight

    return converted_weights

Weight adapters ensure that regardless of how weights are stored (SafeTensors, GGUF, etc.), they get converted to the format your model expects.
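
If your checkpoint's parameter names differ from your implementation's by more than a simple prefix, an explicit rename map is often easier to maintain than ad hoc string replacement. A minimal sketch of that pattern; the key prefixes below are hypothetical examples, not a real checkpoint layout:

from typing import Dict

from max.graph.weights import WeightData

# Hypothetical mapping from checkpoint key prefixes to the names your
# model implementation expects.
_NAME_MAP = {
    "model.embed_tokens.": "embeddings.",
    "model.layers.": "blocks.",
    "lm_head.": "output_head.",
}


def remap_keys(state_dict: Dict[str, WeightData]) -> Dict[str, WeightData]:
    """Rename checkpoint keys using the explicit prefix map above."""
    remapped = {}
    for key, weight in state_dict.items():
        new_key = key
        for old_prefix, new_prefix in _NAME_MAP.items():
            if new_key.startswith(old_prefix):
                new_key = new_prefix + new_key[len(old_prefix):]
                break
        remapped[new_key] = weight
    return remapped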

Load your architecture

Create an __init__.py file to make your architecture discoverable by MAX:

__init__.py
from .arch import my_model_arch

# MAX looks for this variable when loading custom architectures
ARCHITECTURES = [my_model_arch]

__all__ = ["my_model_arch", "ARCHITECTURES"]

MAX automatically loads any architectures listed in the ARCHITECTURES variable when you specify your module with the --custom-architectures flag.
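
Before starting the server, you can confirm that the module imports cleanly and exposes the ARCHITECTURES list; a quick check run from the parent directory of my_model/:

import my_model

# MAX discovers custom architectures through this module-level list.
print(my_model.ARCHITECTURES)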

Test your custom architecture

You can now test your custom architecture using the --custom-architectures flag:

max serve \
--model-path your-org/your-model-name \
--custom-architectures my_model

The --model-path flag tells MAX which model to use; it accepts either a Hugging Face model ID or a local directory containing a model. The --custom-architectures flag tells MAX to load custom architectures from the Python module we just built.

The server is ready when you see this message:

Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)

Now you can test your custom architecture. If your architecture performs text generation, send a request to the chat completions endpoint. For example:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-org/your-model-name",
    "messages": [
      {"role": "user", "content": "Hello! Can you help me with a simple task?"}
    ],
    "max_tokens": 100
  }'
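
Because the server exposes an OpenAI-compatible API, you can also call it with the openai Python client (pip install openai). A minimal sketch, where the model name mirrors the --model-path value and the API key is a placeholder for the local server:

from openai import OpenAI

# Point the client at the local MAX server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="your-org/your-model-name",
    messages=[
        {"role": "user", "content": "Hello! Can you help me with a simple task?"}
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)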

Next steps

Congratulations! You've successfully created a custom architecture for MAX pipelines and served it with the max serve command.

For implementation details, see our supported model architectures. Each subdirectory represents a different model family with its own implementation, and the repository contains a variety of architectures you can use as a starting point for your own custom architecture.
