Serve your custom model
You can serve a model with a custom architecture using the same max serve command that runs MAX-supported
models. Once max serve loads your architecture package, your
model gets batching, KV cache management, tokenization, and the rest of the MAX
serving stack on top of it. This page teaches you how to make your custom
architecture package compatible with max serve.
How MAX registers custom architectures
A custom model architecture package is a Python package (a directory containing
an __init__.py) that MAX can import. To serve the model, the __init__.py
must expose an ARCHITECTURES list of
SupportedArchitecture
instances. Each SupportedArchitecture
instance bundles the model class, config, tokenizer, encodings, and weight
adapters under a name.
A minimal __init__.py looks like this:

    from .arch import my_arch

    ARCHITECTURES = [my_arch]

When you pass your package to max serve with --custom-architectures, MAX
imports it, reads the ARCHITECTURES list, and registers every entry. On each
request, MAX matches the checkpoint's architectures[0] field against the
registered names to pick the right implementation. If you expose a
SupportedArchitecture whose name matches a built-in architecture's name,
your custom architecture takes precedence.
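The matching and precedence rules described here can be sketched with a toy registry. This is not MAX's implementation: ToyArchitecture, build_registry, and resolve are hypothetical names invented for illustration, and Qwen2ForCausalLM is used only as an example architecture name.

```python
# Illustrative only: a toy registry, not MAX's internal code.
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyArchitecture:
    """Hypothetical stand-in for SupportedArchitecture (name only)."""

    name: str
    source: str  # "built-in" or "custom", used here just for the demo


def build_registry(builtin, custom):
    """Register built-ins first, then custom entries.

    A custom entry whose name matches a built-in one overwrites it,
    which models the "custom takes precedence" rule.
    """
    registry = {arch.name: arch for arch in builtin}
    for arch in custom:
        registry[arch.name] = arch
    return registry


def resolve(registry, checkpoint_config):
    """Pick an implementation using the checkpoint's architectures[0]."""
    name = checkpoint_config["architectures"][0]
    if name not in registry:
        raise LookupError(f"no registered architecture named {name!r}")
    return registry[name]


builtin = [ToyArchitecture("Qwen2ForCausalLM", "built-in")]
custom = [ToyArchitecture("Qwen2ForCausalLM", "custom")]
registry = build_registry(builtin, custom)

selected = resolve(registry, {"architectures": ["Qwen2ForCausalLM"]})
print(selected.source)  # prints "custom"
```

The lookup is an exact string match, so a name that differs from architectures[0] by case or spelling would not be found in this sketch.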
Run max serve with your custom architecture
From the directory that contains your architecture package, run:
    max serve \
      --model Qwen/Qwen2.5-7B-Instruct \
      --custom-architectures my_arch_package

- --model specifies the Hugging Face model ID or local path for the checkpoint to load.
- --custom-architectures specifies your custom architecture package. If you run the command from outside your architecture's root directory, you can pass an import path followed by a colon and the Python package name, such as folder/path/to/import:my_arch_package.
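Before starting the server, it can help to confirm that the value you plan to pass to --custom-architectures points at an importable package that exposes ARCHITECTURES. The following is a minimal preflight sketch using only the Python standard library. The helpers split_spec and load_architectures are hypothetical, and the way they interpret the path:package form is an assumption based on the description above, not MAX's own loader.

```python
import importlib
import sys
from pathlib import Path


def split_spec(spec):
    """Split 'import/path:package' into (path, package).

    Returns (None, package) when no import path is given.
    """
    path, sep, package = spec.rpartition(":")
    return (path if sep else None), package


def load_architectures(spec):
    """Import the package named by spec and return its ARCHITECTURES list.

    Assumption: with no import path, the package is looked up relative to
    the current working directory.
    """
    path, package = split_spec(spec)
    search_dir = Path(path) if path else Path.cwd()
    sys.path.insert(0, str(search_dir.resolve()))

    module = importlib.import_module(package)
    architectures = getattr(module, "ARCHITECTURES", None)
    if not isinstance(architectures, list) or not architectures:
        raise ValueError(f"{package} must expose a non-empty ARCHITECTURES list")
    return architectures
```

Calling load_architectures("folder/path/to/import:my_arch_package") in the same Python environment you use for max serve would surface import errors or a missing ARCHITECTURES list before the server starts.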
Send a request
When the server is ready to accept requests, you'll see:
    Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)

The server exposes an OpenAI-compatible API. Send a chat completion request with cURL or the OpenAI Python client:

cURL:

    curl -X POST http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "<your-hf-repo>",
        "messages": [
          {"role": "user", "content": "Hello! Can you help me with a simple task?"}
        ],
        "max_tokens": 100
      }'

Python:

    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="EMPTY",  # Required by the API but not used by MAX
    )

    response = client.chat.completions.create(
        model="<your-hf-repo>",
        messages=[
            {"role": "user", "content": "Hello! Can you help me with a simple task?"}
        ],
        max_tokens=100,
    )

    print(response.choices[0].message.content)

The model field in the request body must match the value you passed to
--model when you started the server.
Troubleshooting
Errors can pop up at serve time related to issues like unsupported quantization
encodings, missing weight adapters, or a name that doesn't match
architectures[0]. These usually indicate a mismatch between your
SupportedArchitecture and the checkpoint. If you encounter a similar error,
double check that:
- name matches the architectures[0] field in your checkpoint's config.json exactly.
- supported_encodings includes every encoding your checkpoint ships with.
- weight_adapters has an entry for each weight format you want to load (a .safetensors checkpoint needs a WeightsFormat.safetensors entry, a .gguf checkpoint needs a WeightsFormat.gguf entry, and so on).
To learn more about these fields, see Model pipeline.
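The name check in the list above can be done by hand for a local checkpoint. The sketch below reads config.json and compares architectures[0] with the name you registered. The helpers checkpoint_architecture and check_name are hypothetical, and the sketch covers only the name check. It does not inspect supported_encodings or weight_adapters, which depend on MAX types not reproduced here.

```python
import json
from pathlib import Path


def checkpoint_architecture(checkpoint_dir):
    """Return architectures[0] from a local checkpoint's config.json."""
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text())
    return config["architectures"][0]


def check_name(checkpoint_dir, registered_name):
    """Return None when the names match exactly, else a short message."""
    declared = checkpoint_architecture(checkpoint_dir)
    if declared == registered_name:
        return None
    return (
        f"checkpoint declares {declared!r} but the registered name is "
        f"{registered_name!r}"
    )
```

For example, check_name("path/to/checkpoint", "MyArchForCausalLM") returns None only when the two strings are identical, including case. Both the path and the name in that call are placeholders.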
Next steps
Now that you know how to serve your model, you can layer on serving features and performance optimizations that work with any architecture that MAX loads.
- Prefix caching: Reuse the KV cache across requests that share a prompt prefix to cut time-to-first-token on repeated workloads.
- LoRA adapters: Serve multiple fine-tuned adapters on top of your base architecture without loading separate model copies.