# Using LoRA adapters with MAX
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) technique that allows you to adapt a large model to new tasks or domains without modifying the original model weights.
Instead of updating the full model, LoRA adds trainable pairs of rank-decomposition matrices in parallel to the existing weight matrices, and these small matrices capture the task-specific behavior. The resulting adapters are small, fast to train, and can be loaded at runtime, making them especially useful in production environments where model reuse, modularity, and memory efficiency are critical.
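To see why this is parameter-efficient, consider a single weight matrix: a rank-`r` adapter factors the update into two small matrices, so the trainable parameter count drops from `d * k` to `r * (d + k)`. The following is a rough sketch only; the dimensions are illustrative, loosely based on a Llama-style 4096 × 4096 attention projection:

```python
# Illustrative only: trainable parameters for a full weight update vs. a rank-16 LoRA pair.
d, k, r = 4096, 4096, 16

full_update_params = d * k        # updating the full d x k weight matrix
lora_params = r * (d + k)         # low-rank pair: B (d x r) and A (r x k)

print(full_update_params)         # 16777216
print(lora_params)                # 131072 -- 128x fewer trainable parameters
```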
MAX supports loading and switching between multiple LoRA adapters when serving a base model.
## When to use LoRA adapters
LoRA adapters are ideal when you need to customize a foundation model for specific tasks or modalities without the overhead of full fine-tuning or maintaining multiple model variants.
While prompt engineering can steer tone, format, or structure, LoRA adapters are better suited for cases where consistent, domain-specific behavior is required:
- Text: Apply domain-specific fine-tuning. For example, use a `fingpt` LoRA adapter trained for financial jargon and reasoning.
- Speech: Swap adapters to switch between different voice profiles in text-to-speech systems.
- Vision: Use separate adapters for image style transfer or other workflows that involve changing visual characteristics.
By encoding task-specific behavior into the model, LoRA adapters can reduce prompt length, eliminate the need for repeated context, and improve inference efficiency.
LoRA adapters also enable you to serve a single base model with multiple specializations, minimizing memory usage and simplifying deployment.
Adapters are especially effective at capturing specialized vocabulary, tone, or structure, and can help address model drift through targeted fine-tuning in production.
## How LoRA adapters work in MAX
MAX loads LoRA adapters at model startup and applies them at inference time based on your input request. Each adapter is identified by a unique name and loaded from a local file path.
### MAX CLI argument
To load LoRA adapters, use the `--lora-paths` argument when serving a model with the `max` CLI:

- `--lora-paths {name}={path} {name}={path}`: (required) A mapping from each adapter's name to its path, in the form of `{name}={path} {name}={path}`.
- `--max-lora-rank`: (optional, `int`) Any LoRA adapter loaded when serving a model must have a rank less than or equal to `--max-lora-rank`. Use this to limit resource usage or enforce consistency across adapters.
- `--max-num-loras`: (optional, `int`) The maximum number of LoRA adapters to manage concurrently.

Each `{name}` is a user-defined identifier for an adapter. Each `{path}` is a local path to the LoRA adapter's weights. You can specify multiple adapters in a single command, as shown in the example below.
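For example, a single serve command might load two adapters and cap their rank. The adapter names and paths below are placeholders, not real adapters:

```bash
max serve \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --lora-paths finance=/path/to/finance-lora support=/path/to/support-lora \
  --max-lora-rank 16 \
  --max-num-loras 2
```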
## Compatibility
LoRA adapters must be saved in the `safetensors` format and trained using PEFT.
At this time, only Llama 3 base models are supported.
Only query, key, value, and output (QKVO) layer adapters are supported. Your adapter must only use the following layer projections:
- `q_proj`
- `k_proj`
- `v_proj`
- `o_proj`
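If you're training your own adapter with the Hugging Face PEFT library, a minimal sketch of a compatible configuration might look like the following. The rank and alpha values are illustrative placeholders, not recommendations:

```python
from peft import LoraConfig, get_peft_model

# Illustrative values; the rank must not exceed the --max-lora-rank you serve with.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # QKVO layers only
    task_type="CAUSAL_LM",
)

# model = get_peft_model(base_model, lora_config)
# ...train, then model.save_pretrained("my-adapter") writes the adapter weights
# (safetensors by default in recent PEFT versions).
```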
## Quickstart
We can quickly deploy Llama 3.1 8B Instruct using MAX as a backend with LoRA adapters.
1. Create a virtual environment and install the `max` CLI using your preferred tool: pixi, uv, pip, or conda.

   **pixi**

   - If you don't have it, install `pixi`:

     ```bash
     curl -fsSL https://pixi.sh/install.sh | sh
     ```

     Then restart your terminal for the changes to take effect.

   - Create a project:

     ```bash
     pixi init lora-adapter \
       -c https://conda.modular.com/max-nightly/ -c conda-forge \
       && cd lora-adapter
     ```

   - Install the `modular` conda package (nightly or stable):

     ```bash
     # Nightly
     pixi add modular

     # Stable
     pixi add "modular=25.4"
     ```

   - Start the virtual environment:

     ```bash
     pixi shell
     ```

   **uv**

   - If you don't have it, install `uv`:

     ```bash
     curl -LsSf https://astral.sh/uv/install.sh | sh
     ```

     Then restart your terminal to make `uv` accessible.

   - Create a project:

     ```bash
     uv init lora-adapter && cd lora-adapter
     ```

   - Create and start a virtual environment:

     ```bash
     uv venv && source .venv/bin/activate
     ```

   - Install the `modular` Python package (nightly or stable):

     ```bash
     # Nightly
     uv pip install modular \
       --extra-index-url https://download.pytorch.org/whl/cpu \
       --index-url https://dl.modular.com/public/nightly/python/simple/ \
       --index-strategy unsafe-best-match --prerelease allow

     # Stable
     uv pip install modular \
       --extra-index-url https://download.pytorch.org/whl/cpu \
       --extra-index-url https://modular.gateway.scarf.sh/simple/ \
       --index-strategy unsafe-best-match
     ```

   **pip**

   - Create a project folder:

     ```bash
     mkdir lora-adapter && cd lora-adapter
     ```

   - Create and activate a virtual environment:

     ```bash
     python3 -m venv .venv/lora-adapter \
       && source .venv/lora-adapter/bin/activate
     ```

   - Install the `modular` Python package (nightly or stable):

     ```bash
     # Nightly
     pip install --pre modular \
       --extra-index-url https://download.pytorch.org/whl/cpu \
       --index-url https://dl.modular.com/public/nightly/python/simple/

     # Stable
     pip install modular \
       --extra-index-url https://download.pytorch.org/whl/cpu \
       --extra-index-url https://modular.gateway.scarf.sh/simple/
     ```

   **conda**

   - If you don't have it, install conda. A common choice is with `brew`:

     ```bash
     brew install miniconda
     ```

   - Initialize `conda` for shell interaction:

     ```bash
     conda init
     ```

     If you're on a Mac, instead use:

     ```bash
     conda init zsh
     ```

     Then restart your terminal for the changes to take effect.

   - Create a project:

     ```bash
     conda create -n lora-adapter
     ```

   - Start the virtual environment:

     ```bash
     conda activate lora-adapter
     ```

   - Install the `modular` conda package (nightly or stable):

     ```bash
     # Nightly
     conda install -c conda-forge -c https://conda.modular.com/max-nightly/ modular

     # Stable
     conda install -c conda-forge -c https://conda.modular.com/max/ modular
     ```
2. Find the path to your local LoRA adapter.

   First, download an adapter that is trained on Llama 3.1 8B Instruct and specifically fine-tunes the QKVO layers. You can explore available adapters on Hugging Face.

   ```bash
   pip install -U "huggingface_hub[cli]"
   hf download FinGPT/fingpt-mt_llama3-8b_lora
   ```

   Copy the location of the downloaded snapshot.
3. Serve a model with a LoRA adapter available.

   ```bash
   max serve \
     --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
     --lora-paths finance=$HOME/.cache/huggingface/hub/models--FinGPT--fingpt-mt_llama3-8b_lora/snapshots/5b5850574ec13e4ce7c102e24f763205992711b7
   ```

   This command serves the base model and loads a LoRA adapter named `finance`.
4. Run inference using a specific adapter.

   When sending an inference request, specify the name of the adapter to apply. For example:

   ```bash
   curl http://localhost:8000/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
       "prompt": "What is an iron condor?",
       "max_tokens": 150,
       "lora": "finance"
     }'
   ```

   This tells MAX to apply the `finance` adapter during inference.
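If you prefer Python, you can send the same request with the `openai` client library. This is a minimal sketch that assumes the server accepts the `lora` field shown in the curl request above, passed through the client's `extra_body` parameter:

```python
from openai import OpenAI

# The endpoint is OpenAI-compatible; the API key is a placeholder for local serving.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    prompt="What is an iron condor?",
    max_tokens=150,
    extra_body={"lora": "finance"},  # assumption: same "lora" field as the curl example
)
print(response.choices[0].text)
```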
## Next steps
If you're using PEFT weights that have already been merged with the base model, check out our guide on bringing your own model into MAX.
If you're eager for LoRA support for a different base model, you can check out the community to start contributing, or start a discussion in the forum. We'd love to hear from you!