Using LoRA adapters
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) technique that allows you to adapt a large model to new tasks or domains without modifying the original model weights.
Instead of updating the full model, LoRA adds small pairs of trainable rank-decomposition matrices in parallel to existing weight matrices; it is these low-rank pairs that capture the task-specific behavior. The adapters are small, fast to train, and can be loaded at runtime, making them especially useful in production environments where model reuse, modularity, and memory efficiency are critical.
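To make the idea concrete, here's a minimal NumPy sketch of a LoRA-style forward pass; the dimensions, scaling factor, and initialization below are illustrative, not values used by MAX.

```python
import numpy as np

d_out, d_in, rank = 512, 512, 8
alpha = 16.0  # scaling factor; the effective update is (alpha / rank) * B @ A

# Frozen pretrained weight: never updated during adaptation.
W = np.random.randn(d_out, d_in).astype(np.float32)

# Trainable low-rank pair: A starts small and random, B starts at zero,
# so the adapter is a no-op until training moves it.
A = np.random.randn(rank, d_in).astype(np.float32) * 0.01
B = np.zeros((d_out, rank), dtype=np.float32)

def forward(x: np.ndarray) -> np.ndarray:
    """Base projection plus the adapter's low-rank correction."""
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)

x = np.random.randn(2, d_in).astype(np.float32)
print(forward(x).shape)  # (2, 512)
```

Only A and B are trained and shipped as the adapter, which is why adapter files are a tiny fraction of the base model's size.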
MAX supports loading and switching between multiple LoRA adapters when serving a base model.
When to use LoRA adapters
LoRA adapters are ideal when you need to customize a foundation model for specific tasks or modalities without the overhead of full fine-tuning or maintaining multiple model variants.
While prompt engineering can steer tone, format, or structure, LoRA adapters are better suited for cases where consistent, domain-specific behavior is required:
- Text: Apply domain-specific fine-tuning. For example, using a FinGPT LoRA adapter trained for financial jargon and reasoning.
- Speech: Swap adapters to switch between different voice profiles in text-to-speech systems.
- Vision: Use separate adapters for image style transfer or other workflows that involve changing visual characteristics.
By encoding task-specific behavior into the model, LoRA adapters can reduce prompt length, eliminate the need for repeated context, and improve inference efficiency.
LoRA adapters also enable you to serve a single base model with multiple specializations, minimizing memory usage and simplifying deployment.
Adapters are especially effective at capturing specialized vocabulary, tone, or structure, and can help address model drift through targeted fine-tuning in production.
How LoRA adapters work in MAX
MAX loads LoRA adapters at server startup (or at runtime, if dynamic loading is enabled) and applies them at inference time based on your input request. Each adapter is identified by a unique name and loaded from a local file path.
MAX CLI arguments
You can statically or dynamically load LoRA adapters when serving a model with
the max CLI. To use LoRA adapters, configure the appropriate
max serve arguments for your use case:
- --lora-paths {name}={path} [{name}={path} ...]: (optional) A mapping from each adapter's name to its local path. You can list multiple adapters, separated by spaces.
- --max-lora-rank: (optional, int) Any LoRA adapter loaded when serving a model must have a rank less than or equal to --max-lora-rank. Use this to limit resource usage or enforce consistency across adapters.
- --max-num-loras: (optional, int) The maximum number of LoRA adapters to manage concurrently.
- --enable-lora: (optional) Allows LoRA adapters to be used in inference requests and enables the API for dynamic loading and unloading. For more information, see dynamic serving.
- --no-enable-lora: (optional) Disallows the use of LoRA adapters. Models served with the max CLI use --no-enable-lora by default, in which case any LoRA-related arguments in an inference request are ignored and the LoRA dynamic serving APIs are unavailable.
- --no-enable-prefix-caching: LoRA adapters are not compatible with prefix caching, which is enabled by default, so you must disable prefix caching to use LoRA adapters.
Each {name} is a user-defined identifier for an adapter. Each {path} is a
local path to the LoRA adapter's weights. Multiple adapters can be specified in a
single command.
Dynamic serving
To dynamically load and unload LoRA adapters, you must first serve your model
with the --enable-lora argument:
```
max serve \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --no-enable-prefix-caching
```

To dynamically load a LoRA adapter, send a POST request to the
v1/load_lora_adapter endpoint, specifying the LoRA adapter name and path:
```
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{
      "lora_name": "example",
      "lora_path": "'"$HOME"'/.cache/huggingface/hub/models--example--lora-adapter/snapshots/abc123"
  }'
```

You should see the following response:
{"status":"success","message":"LoRA adapter 'example' loaded successfully"}To unload a LoRA adapter, send a POST request to the v1/unload_lora_adapter
endpoint specifying the name of the LoRA adapter to unload:
```
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "example"}'
```

You should see the following response if the adapter was unloaded successfully:
{"status":"success","message":"LoRA adapter 'example' unloaded successfully"}Compatibility
Compatibility

LoRA adapters must be saved in the safetensors format and trained using PEFT.
At this time, only Llama 3 base models are supported.
Only query, key, value, and output (QKVO) layer adapters are supported. Your adapter must only use the following layer projections:
- q_proj
- k_proj
- v_proj
- o_proj
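PEFT adapters typically record their target layers in an adapter_config.json file alongside the weights, so you can check compatibility before serving. Here's a rough sketch that assumes that layout; the path and helper name are illustrative.

```python
import json
from pathlib import Path

SUPPORTED = {"q_proj", "k_proj", "v_proj", "o_proj"}

def uses_only_qkvo(adapter_dir: str) -> bool:
    """Return True if the adapter only targets QKVO projections."""
    config = json.loads((Path(adapter_dir) / "adapter_config.json").read_text())
    targets = config.get("target_modules", [])
    if isinstance(targets, str):  # PEFT also allows a single string/regex here
        targets = [targets]
    unsupported = set(targets) - SUPPORTED
    if unsupported:
        print(f"Unsupported target modules: {sorted(unsupported)}")
    return not unsupported

print(uses_only_qkvo("/path/to/lora-adapter/snapshot"))  # placeholder path
```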
Quickstart
We can quickly deploy Llama 3.1 8B Instruct using MAX as a backend with LoRA adapters.
- Create a virtual environment and install the max CLI with whichever package manager you prefer: pixi, uv, pip, or conda.

  pixi

  - If you don't have it, install pixi:

    ```
    curl -fsSL https://pixi.sh/install.sh | sh
    ```

    Then restart your terminal for the changes to take effect.

  - Create a project:

    ```
    pixi init lora-adapter \
      -c https://conda.modular.com/max-nightly/ -c conda-forge \
      && cd lora-adapter
    ```

  - Install the modular conda package (nightly or stable):

    ```
    # Nightly
    pixi add modular

    # Stable
    pixi add "modular==25.6"
    ```

  - Start the virtual environment:

    ```
    pixi shell
    ```

  uv

  - If you don't have it, install uv:

    ```
    curl -LsSf https://astral.sh/uv/install.sh | sh
    ```

    Then restart your terminal to make uv accessible.

  - Create a project:

    ```
    uv init lora-adapter && cd lora-adapter
    ```

  - Create and start a virtual environment:

    ```
    uv venv && source .venv/bin/activate
    ```

  - Install the modular Python package (nightly or stable):

    ```
    # Nightly
    uv pip install modular \
      --index-url https://dl.modular.com/public/nightly/python/simple/ \
      --prerelease allow

    # Stable
    uv pip install modular \
      --extra-index-url https://modular.gateway.scarf.sh/simple/
    ```

  pip

  - Create a project folder:

    ```
    mkdir lora-adapter && cd lora-adapter
    ```

  - Create and activate a virtual environment:

    ```
    python3 -m venv .venv/lora-adapter \
      && source .venv/lora-adapter/bin/activate
    ```

  - Install the modular Python package (nightly or stable):

    ```
    # Nightly
    pip install --pre modular \
      --index-url https://dl.modular.com/public/nightly/python/simple/

    # Stable
    pip install modular \
      --extra-index-url https://modular.gateway.scarf.sh/simple/
    ```

  conda

  - If you don't have it, install conda. A common choice is with brew:

    ```
    brew install miniconda
    ```

  - Initialize conda for shell interaction:

    ```
    conda init
    ```

    If you're on a Mac, instead use:

    ```
    conda init zsh
    ```

    Then restart your terminal for the changes to take effect.

  - Create a project:

    ```
    conda create -n lora-adapter
    ```

  - Start the virtual environment:

    ```
    conda activate lora-adapter
    ```

  - Install the modular conda package (nightly or stable):

    ```
    # Nightly
    conda install -c conda-forge -c https://conda.modular.com/max-nightly/ modular

    # Stable
    conda install -c conda-forge -c https://conda.modular.com/max/ modular
    ```
- Find the path to your local LoRA adapter.

  First, download an adapter that is trained on Llama 3.1 8B Instruct and specifically fine-tunes the QKVO layers. You can explore available adapters on Hugging Face.

  ```
  pip install -U "huggingface_hub[cli]"
  hf download FinGPT/fingpt-mt_llama3-8b_lora
  ```

  Copy the location of the downloaded snapshot.
- Serve a model with a LoRA adapter available.

  Change the --lora-paths path to the location of the downloaded LoRA adapter snapshot.

  ```
  max serve \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --enable-lora \
    --no-enable-prefix-caching \
    --lora-paths finance=$HOME/.cache/huggingface/hub/models--FinGPT--fingpt-mt_llama3-8b_lora/snapshots/5b5850574ec13e4ce7c102e24f763205992711b7
  ```

  This command serves the base model and statically loads a LoRA adapter named finance.
- Run inference using a specific adapter.

  When sending an inference request, specify the name of the adapter to apply. For example:

  ```
  curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "What is an iron condor?",
        "max_tokens": 150,
        "lora": "finance"
    }'
  ```

  This tells MAX to apply the finance adapter during inference.
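The same request can also be sent from Python instead of curl. Here's a minimal sketch using the requests library; the lora value must match a name passed to --lora-paths (or loaded dynamically), and the address assumes the default max serve port.

```python
import requests

# OpenAI-compatible completions request, with the extra "lora" field
# selecting the adapter by the name it was registered under.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "What is an iron condor?",
        "max_tokens": 150,
        "lora": "finance",
    },
)
print(resp.json()["choices"][0]["text"])
```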
Next steps
If you're using PEFT weights that have already been merged with the base model, check out our guide on bringing your own model into MAX.
If you're eager for LoRA support for a different base model, you can check out the community to start contributing, or start a discussion in the forum. We'd love to hear from you!
