Serve your custom model

You can serve a model with a custom architecture using the same max serve command that runs MAX-supported models. Once max serve loads your architecture package, your model gets batching, KV cache management, tokenization, and the rest of the MAX serving stack on top of it. This page teaches you how to make your custom architecture package compatible with max serve.

How MAX registers custom architectures

A custom model architecture package is a Python package (a directory containing an __init__.py) that MAX can import. To serve the model, the __init__.py must expose an ARCHITECTURES list of SupportedArchitecture instances. Each SupportedArchitecture instance bundles the model class, config, tokenizer, encodings, and weight adapters under a name.

A minimal __init__.py looks like this:

__init__.py
from .arch import my_arch

ARCHITECTURES = [my_arch]

When you pass your package to max serve with --custom-architectures, MAX imports it, reads the ARCHITECTURES list, and registers every entry. When it loads the checkpoint, MAX matches the architectures[0] field in the checkpoint's config.json against the registered names to pick the right implementation. If you expose a SupportedArchitecture whose name matches a built-in architecture's name, your custom architecture takes precedence.
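
The name lookup described above can be pictured as a dictionary keyed by architecture name. The sketch below is a toy model of that behavior, not MAX's actual registry: ArchEntry, build_registry, and pick_architecture are hypothetical names, and ArchEntry only stands in for SupportedArchitecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchEntry:
    """Hypothetical stand-in for SupportedArchitecture; only the name matters here."""

    name: str
    source: str  # "built-in" or "custom", used only to show which entry wins


def build_registry(builtin, custom):
    """Map architecture names to entries, letting custom entries win on a clash."""
    registry = {entry.name: entry for entry in builtin}
    # Custom entries are added last, so a same-named entry replaces the built-in one.
    registry.update({entry.name: entry for entry in custom})
    return registry


def pick_architecture(registry, checkpoint_config):
    """Select the entry whose name equals the checkpoint's architectures[0]."""
    name = checkpoint_config["architectures"][0]
    if name not in registry:
        raise LookupError(f"no registered architecture named {name!r}")
    return registry[name]
```

With a built-in and a custom entry that share a name, pick_architecture returns the custom one, which mirrors the precedence rule stated above.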

Run max serve with your custom architecture

From the directory that contains your architecture package, run:

max serve \
  --model Qwen/Qwen2.5-7B-Instruct \
  --custom-architectures my_arch_package

  • --model specifies the Hugging Face model ID or local path for the checkpoint to load.

  • --custom-architectures specifies your custom architecture package. If you run the command from outside your architecture's root directory, you can pass an import path followed by a colon and the Python package name, such as folder/path/to/import:my_arch_package.
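
Before starting the server, it can help to confirm that the package imports cleanly and exposes ARCHITECTURES. The sketch below resolves the same two forms that --custom-architectures accepts (a bare package name, or an import path and package name separated by a colon). It is an assumption about how such a value can be interpreted, not MAX's own loader, and load_architectures is a hypothetical helper.

```python
import importlib
import sys
from pathlib import Path


def load_architectures(spec):
    """Import 'package' or 'import/path:package' and return its ARCHITECTURES list."""
    # Split on the last colon so only the final segment is treated as the package name.
    import_path, _, package_name = spec.rpartition(":")
    # With no import path, look in the current directory, as when running
    # the command from the directory that contains the package.
    search_dir = Path(import_path) if import_path else Path.cwd()
    sys.path.insert(0, str(search_dir.resolve()))
    importlib.invalidate_caches()
    module = importlib.import_module(package_name)
    architectures = getattr(module, "ARCHITECTURES", None)
    if not isinstance(architectures, list) or not architectures:
        raise ValueError(f"{package_name} must expose a non-empty ARCHITECTURES list")
    return architectures
```

For example, load_architectures("folder/path/to/import:my_arch_package") raises an ImportError or ValueError early if the package cannot be imported or has no ARCHITECTURES list.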

Send a request

When the server is ready to accept requests, you'll see:

Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)

The server exposes an OpenAI-compatible API, so you can send a chat completion request with cURL or the OpenAI Python client. With cURL:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-hf-repo>",
    "messages": [
      {"role": "user", "content": "Hello! Can you help me with a simple task?"}
    ],
    "max_tokens": 100
  }'

Replace <your-hf-repo> with the value you passed to --model when you started the server (Qwen/Qwen2.5-7B-Instruct in the command above). The model field in the request body must match that value.
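
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the cURL call: same endpoint, same JSON body. build_chat_request and send_chat_request are hypothetical helper names, and the response parsing assumes the usual OpenAI chat completion shape (choices[0].message.content).

```python
import json
import urllib.request


def build_chat_request(model, prompt, base_url="http://localhost:8000", max_tokens=100):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,  # must match the value passed to --model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request):
    """Send the request to a running server and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, once the server is running:
# request = build_chat_request("Qwen/Qwen2.5-7B-Instruct", "Hello! Can you help me with a simple task?")
# print(send_chat_request(request))
```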

Troubleshooting

At serve time, you might see errors about unsupported quantization encodings, missing weight adapters, or a name that doesn't match architectures[0]. These errors usually indicate a mismatch between your SupportedArchitecture and the checkpoint. If you encounter one, double check that:

  • name matches the architectures[0] field in your checkpoint's config.json exactly.

  • supported_encodings includes every encoding your checkpoint ships with.

  • weight_adapters has an entry for each weight format you want to load (a .safetensors checkpoint needs a WeightsFormat.safetensors entry, a .gguf checkpoint needs a WeightsFormat.gguf entry, and so on).

To learn more about these fields, see Model pipeline.
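
The first check in the list above is easy to automate for a local checkpoint. The sketch below reads architectures[0] from config.json and compares it, case-sensitively, with the names you registered. checkpoint_architecture and name_matches are hypothetical helpers; for a Hugging Face model ID, this assumes you have already downloaded config.json to a local directory.

```python
import json
from pathlib import Path


def checkpoint_architecture(checkpoint_dir):
    """Return architectures[0] from the checkpoint's config.json."""
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    architectures = config.get("architectures") or []
    if not architectures:
        raise ValueError(f"{config_path} has no 'architectures' entry")
    return architectures[0]


def name_matches(checkpoint_dir, registered_names):
    """Check that some registered name equals architectures[0] exactly."""
    return checkpoint_architecture(checkpoint_dir) in set(registered_names)
```

A False result from name_matches points at the first bullet above: the name in your SupportedArchitecture differs from the checkpoint's architectures[0], possibly only in capitalization.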

Next steps

Now that you know how to serve your model, you can layer on serving features and performance optimizations that work with any architecture that MAX loads.

  • Prefix caching: Reuse the KV cache across requests that share a prompt prefix to cut time-to-first-token on repeated workloads.
  • LoRA adapters: Serve multiple fine-tuned adapters on top of your base architecture without loading separate model copies.
