Serve your custom model

You can serve a model with a custom architecture using the same max serve command that runs MAX-supported models. Once max serve loads your architecture package, your model gets batching, KV cache management, tokenization, and the rest of the MAX serving stack on top of it. This page teaches you how to make your custom architecture package compatible with max serve.

If you don't have a complete custom architecture package, see the Model bring-up workflow to learn how to implement one.

How MAX registers custom architectures

A custom model architecture package is a Python package (a directory containing an __init__.py) that MAX can import. To serve the model, the __init__.py must expose an ARCHITECTURES list of SupportedArchitecture instances. Each SupportedArchitecture instance bundles the model class, config, tokenizer, encodings, and weight adapters under a name.

A minimal __init__.py looks like this:

__init__.py
from .arch import my_arch

ARCHITECTURES = [my_arch]

When you pass your package to max serve with --custom-architectures, MAX imports it, reads the ARCHITECTURES list, and registers every entry. When it loads the checkpoint, MAX matches the architectures[0] field in the checkpoint's config.json against the registered names to pick the right implementation. If you expose a SupportedArchitecture whose name matches a built-in architecture's name, your custom architecture takes precedence.
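The lookup and precedence rules described above can be pictured with a small, self-contained sketch. This is not MAX's implementation: ArchEntry, build_registry, and resolve are hypothetical names, and ArchEntry stands in for SupportedArchitecture with only the fields needed to show name matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchEntry:
    """Hypothetical stand-in for SupportedArchitecture (name only)."""

    name: str
    source: str  # "built-in" or "custom", used only for illustration


def build_registry(builtin, custom):
    """Index entries by name; custom entries replace built-ins with the same name."""
    registry = {entry.name: entry for entry in builtin}
    # Custom entries are added last, so a matching name overrides the built-in.
    registry.update({entry.name: entry for entry in custom})
    return registry


def resolve(registry, hf_config):
    """Pick the entry whose name equals the checkpoint's architectures[0]."""
    name = hf_config["architectures"][0]
    if name not in registry:
        raise ValueError(f"No registered architecture named {name!r}")
    return registry[name]


builtin = [ArchEntry("Qwen2ForCausalLM", "built-in")]
custom = [ArchEntry("Qwen2ForCausalLM", "custom")]
registry = build_registry(builtin, custom)
chosen = resolve(registry, {"architectures": ["Qwen2ForCausalLM"]})
print(chosen.source)
```

In this sketch the custom entry wins because it is written into the dictionary after the built-in one, which mirrors the precedence rule without claiming anything about how MAX stores its registry.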

Run `max serve` with your custom architecture

From the directory that contains your architecture package, run:

max serve \
  --model Qwen/Qwen2.5-7B-Instruct \
  --custom-architectures my_arch_package

--model specifies the Hugging Face model ID or local path for the checkpoint to load.
--custom-architectures specifies your custom architecture package. If you run the command from a directory other than the one that contains your package, pass the import path, a colon, and the Python package name, such as folder/path/to/import:my_arch_package.
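To make the two halves of the path:package form concrete, here is a rough Python sketch of how such a value could be interpreted. The helpers split_arch_spec and load_architectures are hypothetical and only illustrate the idea; how MAX parses the flag internally is not shown here.

```python
import importlib
import sys
from typing import Optional, Tuple


def split_arch_spec(spec: str) -> Tuple[Optional[str], str]:
    """Split 'import/path:package' into (import_path, package_name).

    A bare package name (no colon) yields (None, package_name).
    """
    import_path, sep, package = spec.rpartition(":")
    if not sep:
        return None, spec
    return import_path, package


def load_architectures(spec: str) -> list:
    """Import the package named by spec and return its ARCHITECTURES list."""
    import_path, package = split_arch_spec(spec)
    if import_path is not None and import_path not in sys.path:
        # Make the package importable when it lives outside the working directory.
        sys.path.insert(0, import_path)
    module = importlib.import_module(package)
    return list(module.ARCHITECTURES)
```

Under these assumptions, the part before the colon only tells Python where to look, and the part after it is the name that gets imported.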

Trust remote code

Some Hugging Face repos ship custom Python files (such as modeling_*.py or a custom tokenizer) that the loader executes when it instantiates the model. If you see a trust_remote_code error, add --trust-remote-code to opt in. Only use this flag with model repositories you trust.

Send a request

When the server is ready to accept requests, you'll see:

Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
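If you start the server from a script, you may want to wait until it accepts connections before sending requests. The following is a minimal sketch that polls the TCP port using only the Python standard library. The helper name wait_for_port and its default timeouts are assumptions for illustration, and an open port is only an approximation of readiness; the log line above remains the authoritative signal.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 300.0, interval_s: float = 1.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out; wait and try again.
            time.sleep(interval_s)
    return False


# Example: wait_for_port("localhost", 8000) before sending the first request.
```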

The server exposes an OpenAI-compatible API. Send a chat completion request with cURL or the OpenAI Python client:

cURL

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-hf-repo>",
    "messages": [
      {"role": "user", "content": "Hello! Can you help me with a simple task?"}
    ],
    "max_tokens": 100
  }'

Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # Required by the API but not used by MAX
)

response = client.chat.completions.create(
    model="<your-hf-repo>",
    messages=[
        {"role": "user", "content": "Hello! Can you help me with a simple task?"}
    ],
    max_tokens=100,
)

print(response.choices[0].message.content)

The model field in the request body must match the value you passed to --model when you started the server.

Troubleshooting

Serve-time errors about unsupported quantization encodings, missing weight adapters, or an architecture name that doesn't match architectures[0] usually indicate a mismatch between your SupportedArchitecture and the checkpoint. If you hit one of these errors, check that:

name matches the architectures[0] field in your checkpoint's config.json exactly.
supported_encodings includes every encoding your checkpoint ships with.
weight_adapters has an entry for each weight format you want to load (a .safetensors checkpoint needs a WeightsFormat.safetensors entry, a .gguf checkpoint needs a WeightsFormat.gguf entry, and so on).

To learn more about these fields, see Model pipeline.
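Two of the checks above can be done by inspecting a local checkpoint directory before you start the server. The sketch below uses only the Python standard library; checkpoint_summary and name_matches are hypothetical helpers, and the sketch assumes the checkpoint has been downloaded to a local directory with config.json and the weight files at its top level.

```python
import json
from pathlib import Path

# Weight file suffixes mentioned in the checklist above.
WEIGHT_SUFFIXES = {".safetensors", ".gguf"}


def checkpoint_summary(checkpoint_dir: str) -> dict:
    """Report the checkpoint's architectures[0] and the weight formats it ships."""
    root = Path(checkpoint_dir)
    config = json.loads((root / "config.json").read_text())
    architectures = config.get("architectures") or []
    suffixes = sorted({p.suffix for p in root.iterdir() if p.suffix in WEIGHT_SUFFIXES})
    return {
        "architecture": architectures[0] if architectures else None,
        "weight_suffixes": suffixes,
    }


def name_matches(expected_name: str, checkpoint_dir: str) -> bool:
    """True when expected_name equals architectures[0] exactly (case-sensitive)."""
    return checkpoint_summary(checkpoint_dir)["architecture"] == expected_name
```

Compare the reported architecture with the name of your SupportedArchitecture, and compare the reported suffixes with the formats covered by weight_adapters. Checking supported_encodings still requires knowing which encodings the checkpoint uses, which this sketch does not attempt to detect.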

Next steps

Now that you know how to serve your model, you can layer on serving features and performance optimizations that work with any architecture that MAX loads.

Prefix caching: Reuse the KV cache across requests that share a prompt prefix to cut time-to-first-token on repeated workloads.
LoRA adapters: Serve multiple fine-tuned adapters on top of your base architecture without loading separate model copies.
