Serve a fine-tuned model on a supported architecture
This page covers serving a fine-tune built on an architecture MAX already supports. To bring up a new architecture, see the model bring-up workflow.
When you fine-tune a Llama 3, Mistral, or Qwen model, the resulting checkpoint runs the same computation as the base: the same attention mechanism, the same normalization placement, the same residual connections. MAX already implements that computation for every architecture in its supported models list, so serving your fine-tune doesn't require writing or modifying any architecture code. The work here is configuration: you point MAX at your weights and it loads them into the implementation it already has.
Where fine-tuned models fit in the customization spectrum
MAX customization spans three levels, ordered by how much implementation work each requires:
- Fine-tuned weights on a supported base architecture: no implementation work. This page covers it.
- Weight adapter: a small adapter resolves mismatched weight or config field names that some fine-tuning pipelines introduce, without touching the architecture implementation. Covered in the model bring-up workflow.
- Full architecture package: required when the compute graph itself is new (a different attention variant, routing mechanism, or normalization scheme that MAX hasn't implemented). Also covered in the model bring-up workflow.
The decision signal is the architectures field in your model's config.json.
If it lists an architecture that MAX already supports (LlamaForCausalLM,
MistralForCausalLM, Qwen2ForCausalLM, or another entry from the
supported models list), you're in the right place and no code
is required.
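The check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of MAX: the function name matching_architectures is hypothetical, and the SUPPORTED_ARCHITECTURES set contains only the three class names quoted on this page, so treat the supported models list as the authoritative source.

```python
import json
from pathlib import Path

# Assumption: only the three names quoted on this page. The authoritative
# list is MAX's supported models list, which is longer.
SUPPORTED_ARCHITECTURES = {
    "LlamaForCausalLM",
    "MistralForCausalLM",
    "Qwen2ForCausalLM",
}


def matching_architectures(model_dir, supported=SUPPORTED_ARCHITECTURES):
    """Return the entries of config.json's `architectures` field found in `supported`."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # A missing or null field is treated as "no architecture declared".
    declared = config.get("architectures") or []
    return [name for name in declared if name in supported]
```

An empty result suggests the checkpoint needs one of the other two customization levels rather than the configuration-only path on this page.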
Serve a merged fine-tune
MAX serves merged fine-tunes in two formats: safetensors directly (covered next) and GGUF after quantization (covered in Quantize for production).
When you've merged LoRA weights (or any other fine-tuned deltas) into the base
model, for example with Unsloth's save_pretrained_merged or PEFT's
merge_and_unload, the result is a standard Hugging Face model repo in
safetensors format with your fine-tuned weights baked in. MAX can serve this
checkpoint directly, with no format conversion required.
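For readers who have not merged yet, the following is a minimal sketch of the merge step using PEFT and Transformers. It assumes both packages are installed and that the adapter was trained with PEFT; the function name merge_lora_adapter and its arguments are illustrative, and your training pipeline may need extra options (dtype, device placement) that are omitted here.

```python
def merge_lora_adapter(base_model_id, adapter_dir, output_dir):
    """Merge a PEFT LoRA adapter into its base model and save it as safetensors."""
    # Imported lazily so this module still loads where the packages are absent.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_model_id)
    model = PeftModel.from_pretrained(base, adapter_dir)
    # Fold the LoRA deltas into the base weights and drop the adapter wrappers.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir, safe_serialization=True)
    # Save the tokenizer alongside so the directory is a complete model repo.
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)
    return output_dir
```

The output directory then contains config.json, the tokenizer files, and the safetensors weights, which is the layout the rest of this section expects.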
If your merged checkpoint is published on the Hugging Face Hub, serve it by
passing the repo ID to max serve:
max serve --model <your-hf-org>/<your-fine-tuned-model>

MAX downloads the model, reads config.json to identify the architecture class,
loads the matching MAX implementation, and applies your weights. The
architectures field in config.json must name an architecture that MAX
supports; if it doesn't, MAX exits with an error indicating no matching
architecture was found.
If your merged checkpoint is saved to a local directory, pass the directory path instead:
max serve --model /path/to/your/merged-model

MAX reads the local config.json and loads weights from the safetensors files
in that directory. The architecture matching works the same way.
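Before pointing max serve at a local directory, it can help to confirm that the files mentioned above are actually present. The sketch below is a hypothetical pre-flight helper, not something MAX provides; it checks only for config.json and at least one .safetensors file at the top level of the directory.

```python
from pathlib import Path


def check_merged_checkpoint(model_dir):
    """Return a list of problems found in a local merged-checkpoint directory."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    # Merged checkpoints may be sharded, so any *.safetensors file counts.
    if not any(root.glob("*.safetensors")):
        problems.append("no .safetensors weight files found")
    return problems
```

An empty list means the basic layout looks right; it does not verify that the weights match the declared architecture.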
Quantize for production
For production workloads, a quantized checkpoint typically reduces memory requirements and improves throughput. The most common format is GGUF, and two tools handle the conversion from safetensors.
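As a rough illustration of the memory reduction, the arithmetic below estimates the size of the weights alone at two precisions. The numbers are assumptions chosen for illustration: an 8-billion-parameter model, 16 bits per weight for an unquantized checkpoint, and about 4.85 bits per weight as an approximate average for Q4_K_M. Actual file sizes vary by model, and the estimate ignores the KV cache and activations.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


params = 8_000_000_000  # assumed parameter count, for illustration only

bf16_gib = weight_memory_gib(params, 16)
q4_gib = weight_memory_gib(params, 4.85)  # assumed average for Q4_K_M

print(f"16-bit weights: {bf16_gib:.1f} GiB")
print(f"Q4_K_M weights: {q4_gib:.1f} GiB")
```

Under these assumptions the weights shrink from roughly 14.9 GiB to roughly 4.5 GiB.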
The gguf-my-repo space on
Hugging Face converts your merged model without any local setup: log in, select
your merged model's repo, and choose a quantization method such as Q4_K_M.
After conversion, the GGUF file appears under your Hugging Face username and can
be downloaded for local use.
To convert locally using the llama.cpp script instead, run the converter against your merged model directory:
git clone https://github.com/ggerganov/llama.cpp
python llama.cpp/convert_hf_to_gguf.py /path/to/your/merged-model

Once you have the GGUF file, serve it with the following command, which uses a base model repo for architecture registration and your converted weights for inference:
max serve \
  --model modularai/Llama-3.1-8B-Instruct-GGUF \
  --quantization-encoding q4_k \
  --weight-path ./models/your-fine-tuned-model-q4_k_m.gguf

--model points to a base GGUF model repo that provides the architecture
registration. --weight-path tells MAX to serve your fine-tuned weights instead
of the base model's weights. Both flags are required when serving a custom GGUF
checkpoint; omitting --weight-path causes MAX to serve the base model's
weights rather than yours. All supported quantization encodings are listed in
QuantizationEncoding.
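Because --weight-path must point at a real GGUF file, a quick sanity check on the converted file can catch a wrong path or a truncated download before you start the server. The sketch below reads the GGUF header, which begins with the four magic bytes GGUF followed by a 32-bit format version. The helper name read_gguf_version is hypothetical, and the code assumes the common little-endian layout.

```python
import struct


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if `path` is not GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    # Assumption: little-endian file, so the version is an unsigned 32-bit LE int.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

This only confirms the container format; it says nothing about which quantization the tensors inside use.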
Use LoRA adapters without merging
If you're serving multiple task-specific adapters from a single base model, or
if you want to swap adapters at runtime without restarting the server, you can
skip the merge step and load LoRA adapters at serve time instead. Pass each
adapter as a name-to-path pair with --lora-paths and enable the adapter API
with --enable-lora:
max serve \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --lora-paths finance=/path/to/finance-adapter \
  --enable-lora \
  --no-enable-prefix-caching

Prefix caching is enabled by default and is incompatible with LoRA serving, so
--no-enable-prefix-caching is required. LoRA serving currently supports Llama
3 base models with PEFT-trained adapters in safetensors format; other base
architectures are not yet supported. For dynamic loading and unloading of
adapters while the server is running, see
LoRA adapters.
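Since LoRA serving expects PEFT-trained adapters in safetensors format, it can be useful to validate an adapter directory before building the name-to-path pair. The sketch below relies on the standard PEFT output layout (adapter_config.json with a peft_type field, plus adapter_model.safetensors); the helper name lora_path_argument is hypothetical and not part of MAX.

```python
import json
from pathlib import Path


def lora_path_argument(name, adapter_dir):
    """Validate a PEFT LoRA adapter directory and return its name=path pair."""
    root = Path(adapter_dir)
    config_file = root / "adapter_config.json"
    if not config_file.is_file():
        raise FileNotFoundError(f"{config_file} not found")
    config = json.loads(config_file.read_text(encoding="utf-8"))
    # PEFT records the adapter type in the `peft_type` field.
    if str(config.get("peft_type", "")).upper() != "LORA":
        raise ValueError(f"{root} is not a LoRA adapter")
    if not (root / "adapter_model.safetensors").is_file():
        raise FileNotFoundError(f"{root / 'adapter_model.safetensors'} not found")
    return f"{name}={root}"
```

For example, lora_path_argument("finance", "/path/to/finance-adapter") returns the string to pass after --lora-paths once the directory passes these checks.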