
Serve a fine-tuned model on a supported architecture

This page covers serving a fine-tune built on an architecture MAX already supports. To bring up a new architecture, see the model bring-up workflow.

When you fine-tune a Llama 3, Mistral, or Qwen model, the resulting checkpoint runs the same computation as the base: the same attention mechanism, the same normalization placement, the same residual connections. MAX already implements that computation for every architecture in its supported models list, so serving your fine-tune doesn't require writing or modifying any architecture code. The work here is configuration: you point MAX at your weights, and it loads them into the implementation it already has.

Where fine-tuned models fit in the customization spectrum

MAX customization spans three levels, ordered by how much implementation work each requires:

  • Fine-tuned weights on a supported base architecture: no implementation work. This page covers it.
  • Weight adapter: a small adapter resolves mismatched weight or config field names that some fine-tuning pipelines introduce, without touching the architecture implementation. Covered in the model bring-up workflow.
  • Full architecture package: required when the compute graph itself is new (a different attention variant, routing mechanism, or normalization scheme that MAX hasn't implemented). Also covered in the model bring-up workflow.

The decision signal is the architectures field in your model's config.json. If it lists an architecture that MAX already supports (LlamaForCausalLM, MistralForCausalLM, Qwen2ForCausalLM, or another entry from the supported models list), you're in the right place and no code is required.
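
The check described above can be scripted before you try to serve anything. The following is a minimal sketch and not part of MAX: the function names are hypothetical, and KNOWN_SUPPORTED holds only the three class names mentioned on this page, so treat the supported models list as the authority.

```python
import json
from pathlib import Path

# Illustrative subset: only the class names mentioned on this page.
# The authoritative source is MAX's supported models list.
KNOWN_SUPPORTED = frozenset(
    {"LlamaForCausalLM", "MistralForCausalLM", "Qwen2ForCausalLM"}
)


def declared_architectures(model_dir):
    """Return the `architectures` list from <model_dir>/config.json."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # A missing or null field is treated as "nothing declared".
    return list(config.get("architectures") or [])


def is_supported_fine_tune(model_dir, supported=KNOWN_SUPPORTED):
    """True if any declared architecture appears in `supported`."""
    return any(name in supported for name in declared_architectures(model_dir))
```

Calling is_supported_fine_tune("/path/to/your/merged-model") returns True only when config.json names one of the listed classes. A False result may simply mean the illustrative set is too small, so check the supported models list before concluding that code is required.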

Serve a merged fine-tune

MAX serves merged fine-tunes in two formats: safetensors directly (covered next) and GGUF after quantization (covered in Quantize for production).

When you've merged LoRA weights (or the result of any other fine-tuning method) into the base model, for example with Unsloth's save_pretrained_merged or PEFT's model-merging API, the result is a standard Hugging Face model repo in safetensors format with your fine-tuned weights baked in. MAX can serve this checkpoint directly, with no format conversion required.
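
For readers who have not merged yet, here is one way the merge step might look with PEFT. This is a sketch under assumptions: it requires the peft and transformers packages, the function name and its three parameters are hypothetical, and your training pipeline may need different loading options (dtype, device placement, trust settings).

```python
def merge_lora_into_base(base_model_id, adapter_dir, output_dir):
    """Merge a PEFT LoRA adapter into its base model and save safetensors."""
    # Imported lazily so the sketch loads even where the heavyweight
    # dependencies are not installed.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype="auto")
    model = PeftModel.from_pretrained(base, adapter_dir)

    # Fold the LoRA deltas into the base weights and drop the adapter wrappers.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir, safe_serialization=True)

    # Save the tokenizer next to the weights so the directory is self-contained.
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)
    return output_dir
```

The output directory then holds config.json, the tokenizer files, and one or more safetensors shards, which is the layout the rest of this section assumes.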

If your merged checkpoint is published on the Hugging Face Hub, serve it by passing the repo ID to max serve:

max serve --model <your-hf-org>/<your-fine-tuned-model>

MAX downloads the model, reads config.json to identify the architecture class, loads the matching MAX implementation, and applies your weights. The architectures field in config.json must name an architecture that MAX supports; if it doesn't, MAX exits with an error indicating no matching architecture was found.

If your merged checkpoint is saved to a local directory, pass the directory path instead:

max serve --model /path/to/your/merged-model

MAX reads the local config.json and loads weights from the safetensors files in that directory. The architecture matching works the same way.
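
Because MAX relies on config.json and the safetensors files being present in that directory, a quick layout check can catch an incomplete export before you start the server. This is a hypothetical helper, not a MAX API, and it only checks that the files exist, not that their contents are valid.

```python
from pathlib import Path


def merged_checkpoint_problems(model_dir):
    """Return a list of problems; an empty list means the layout looks fine."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    # Merged exports may be a single file or several shards.
    if not any(root.glob("*.safetensors")):
        problems.append("no .safetensors weight files found")
    return problems
```

An empty list from merged_checkpoint_problems("/path/to/your/merged-model") means the directory has the two things this page says MAX reads from it.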

Quantize for production

For production workloads, a quantized checkpoint typically reduces memory requirements and improves throughput. The most common format is GGUF, and two tools handle the conversion from safetensors.

The gguf-my-repo space on Hugging Face converts your merged model without any local setup: log in, select your merged model's repo, and choose a quantization method such as Q4_K_M. After conversion, the GGUF file appears under your Hugging Face username and can be downloaded for local use.

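To convert locally instead, use the llama.cpp converter script. The script itself writes GGUF in formats such as f16 or q8_0; producing a K-quant such as Q4_K_M takes a further pass with llama.cpp's llama-quantize tool. Run the converter against your merged model directory: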

git clone https://github.com/ggerganov/llama.cpp
python llama.cpp/convert_hf_to_gguf.py /path/to/your/merged-model

Once you have the GGUF file, serve it with the following command, which uses a base model repo for architecture registration and your converted weights for inference:

max serve \
  --model modularai/Llama-3.1-8B-Instruct-GGUF \
  --quantization-encoding q4_k \
  --weight-path ./models/your-fine-tuned-model-q4_k_m.gguf

--model points to a base GGUF model repo that provides the architecture registration. --weight-path tells MAX to serve your fine-tuned weights instead of the base model's weights. Both flags are required when serving a custom GGUF checkpoint; omitting --weight-path causes MAX to serve the base model's weights rather than yours. All supported quantization encodings are listed in QuantizationEncoding.
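
Since leaving out --weight-path silently serves the base weights, it can help to assemble the command programmatically and fail early if the GGUF file is not where you expect. The sketch below uses only the flags shown above; the function name is hypothetical, and it builds the argument list without launching anything.

```python
from pathlib import Path


def gguf_serve_argv(base_repo, quantization_encoding, weight_path):
    """Build the argv list for serving a custom GGUF checkpoint."""
    weights = Path(weight_path)
    if weights.suffix.lower() != ".gguf":
        raise ValueError(f"expected a .gguf file, got {weights.name!r}")
    if not weights.is_file():
        raise FileNotFoundError(f"weight file not found: {weights}")

    # Both --model and --weight-path are always emitted, so the base
    # model's weights cannot be served by accident.
    return [
        "max", "serve",
        "--model", base_repo,
        "--quantization-encoding", quantization_encoding,
        "--weight-path", str(weights),
    ]
```

The returned list can be passed to subprocess.run(argv, check=True) to start the server. Keeping it as a list rather than a shell string avoids quoting problems with paths that contain spaces.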

Use LoRA adapters without merging

If you're serving multiple task-specific adapters from a single base model, or if you want to swap adapters at runtime without restarting the server, you can skip the merge step and load LoRA adapters at serve time instead. Pass each adapter as a name-to-path pair with --lora-paths and enable the adapter API with --enable-lora:

max serve \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --lora-paths finance=/path/to/finance-adapter \
  --enable-lora \
  --no-enable-prefix-caching

Prefix caching is enabled by default and is incompatible with LoRA serving, so --no-enable-prefix-caching is required. LoRA serving currently supports Llama 3 base models with PEFT-trained adapters in safetensors format; other base architectures are not yet supported. For dynamic loading and unloading of adapters while the server is running, see LoRA adapters.
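
The format requirement above (a PEFT-trained adapter in safetensors format) can be checked against the files PEFT normally writes: adapter_config.json and adapter_model.safetensors. The helper below is a hypothetical pre-flight check based on those PEFT conventions. It is an assumption that this matches what MAX validates, so a clean result does not guarantee the adapter will load.

```python
import json
from pathlib import Path


def lora_adapter_problems(adapter_dir):
    """Return a list of problems; an empty list means the layout looks fine."""
    root = Path(adapter_dir)
    config_path = root / "adapter_config.json"
    if not config_path.is_file():
        return ["adapter_config.json is missing (is this a PEFT adapter directory?)"]

    problems = []
    config = json.loads(config_path.read_text(encoding="utf-8"))
    peft_type = config.get("peft_type")
    if str(peft_type or "").upper() != "LORA":
        problems.append(f"peft_type is {peft_type!r}, expected 'LORA'")
    # Older PEFT exports use adapter_model.bin, which is not safetensors.
    if not (root / "adapter_model.safetensors").is_file():
        problems.append("adapter_model.safetensors is missing")
    return problems
```

The base_model_name_or_path field in the same adapter_config.json records which base model the adapter was trained against, which is worth comparing by hand with the value you pass to --model.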
