Using LoRA adapters with MAX

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) technique that allows you to adapt a large model to new tasks or domains without modifying the original model weights.

Instead of updating the full model, LoRA adds pairs of trainable low-rank decomposition matrices in parallel to existing weight matrices; these small matrices capture the task-specific behavior. Adapters are small, fast to train, and can be loaded at runtime, making them especially useful in production environments where model reuse, modularity, and memory efficiency are critical.
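
To make this concrete, the following NumPy sketch shows the basic LoRA computation for a single linear layer: the frozen weight stays untouched while a scaled low-rank update runs in parallel with it. The dimensions, rank, and scaling factor are illustrative assumptions, not MAX internals.

  # Minimal LoRA sketch (illustrative only): y = x W^T + (alpha / r) * x A^T B^T
  import numpy as np

  d_out, d_in, r, alpha = 64, 64, 8, 16        # hypothetical layer and adapter sizes

  rng = np.random.default_rng(0)
  W = rng.normal(size=(d_out, d_in))           # frozen base weight, never updated
  A = rng.normal(size=(r, d_in)) * 0.01        # trainable low-rank factor
  B = np.zeros((d_out, r))                     # trainable low-rank factor, starts at zero

  def lora_linear(x: np.ndarray) -> np.ndarray:
      """Base projection plus the scaled low-rank adapter path."""
      return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

  y = lora_linear(rng.normal(size=(1, d_in)))  # same output shape as the base layer

Because only A and B are trained, an adapter stores roughly r * (d_in + d_out) parameters per adapted layer, which is why adapters stay small and cheap to swap.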

MAX supports loading and switching between multiple LoRA adapters when serving a base model.

When to use LoRA adapters

LoRA adapters are ideal when you need to customize a foundation model for specific tasks or modalities without the overhead of full fine-tuning or maintaining multiple model variants.

While prompt engineering can steer tone, format, or structure, LoRA adapters are better suited for cases where consistent, domain-specific behavior is required:

  • Text: Apply domain-specific fine-tuning. For example, use a FinGPT LoRA adapter trained for financial jargon and reasoning.
  • Speech: Swap adapters to switch between different voice profiles in text-to-speech systems.
  • Vision: Use separate adapters for image style transfer or other workflows that involve changing visual characteristics.

By encoding task-specific behavior into the model, LoRA adapters can reduce prompt length, eliminate the need for repeated context, and improve inference efficiency.

LoRA adapters also enable you to serve a single base model with multiple specializations, minimizing memory usage and simplifying deployment.

Adapters are especially effective at capturing specialized vocabulary, tone, or structure, and can help address model drift through targeted fine-tuning in production.

How LoRA adapters work in MAX

MAX loads LoRA adapters at model startup and applies them at inference time based on your input request. Each adapter is identified by a unique name and loaded from a local file path.

MAX CLI argument

To load LoRA adapters, use the --lora-paths argument when serving a model with the max CLI:

  • --lora-paths {name}={path} {name}={path}: (required) A space-separated list of adapter mappings, each in the form {name}={path}.
  • --max-lora-rank: (optional, int) Any LoRA adapter loaded when serving a model must have a rank less than or equal to --max-lora-rank. Use this to limit resource usage or enforce consistency across adapters.
  • --max-num-loras: (optional, int) The maximum number of LoRA adapters to manage concurrently.

Each {name} is a user-defined identifier for an adapter. Each {path} is a local path to the LoRA adapter's weights. Multiple adapters can be specified in a single command.

Compatibility

LoRA adapters must be saved in the safetensors format and trained using PEFT.

At this time, only Llama 3 base models are supported.

Only query, key, value, and output (QKVO) projection adapters are supported. Your adapter must target only the following layer projections (see the example configuration after this list):

  • q_proj
  • k_proj
  • v_proj
  • o_proj
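
For reference, a PEFT configuration that satisfies these constraints might look like the sketch below. The rank and alpha values are illustrative assumptions; only the target_modules list reflects the requirement above.

  # Hedged PEFT config sketch: restrict the adapter to the QKVO projections only.
  from peft import LoraConfig

  config = LoraConfig(
      r=8,                    # adapter rank; keep at or below --max-lora-rank
      lora_alpha=16,
      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # QKVO only
      task_type="CAUSAL_LM",
  )
  # Training with this config (for example, via get_peft_model) and then calling
  # save_pretrained() on the PEFT model writes the adapter weights as safetensors,
  # the format MAX expects.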

Quickstart

This quickstart deploys Llama 3.1 8B Instruct with MAX and loads a LoRA adapter alongside it.

  1. Create a virtual environment and install the max CLI:

    1. If you don't have it, install pixi:
      curl -fsSL https://pixi.sh/install.sh | sh

      Then restart your terminal for the changes to take effect.

    2. Create a project:
      pixi init lora-adapter \
      -c https://conda.modular.com/max-nightly/ -c conda-forge \
      && cd lora-adapter
    3. Install the modular conda package:
      pixi add modular
    4. Start the virtual environment:
      pixi shell
  2. Find the path to your local LoRA adapter

    First, download an adapter that was trained on Llama 3.1 8B Instruct and fine-tunes only the QKVO layers. You can explore available adapters on Hugging Face.

    pip install -U "huggingface_hub[cli]"

    hf download FinGPT/fingpt-mt_llama3-8b_lora

    Copy the location of the downloaded snapshot.
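
    If you prefer to resolve the snapshot path programmatically, huggingface_hub's snapshot_download function returns the local directory. This optional sketch assumes the same FinGPT adapter:

    # Optional: look up the local snapshot path in Python instead of copying it by hand.
    from huggingface_hub import snapshot_download

    adapter_path = snapshot_download("FinGPT/fingpt-mt_llama3-8b_lora")
    print(adapter_path)  # pass this value to --lora-paths, e.g. finance=<adapter_path>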

  3. Serve a model with a LoRA adapter available

    max serve \
    --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
    --lora-paths finance=$HOME/.cache/huggingface/hub/models--FinGPT--fingpt-mt_llama3-8b_lora/snapshots/5b5850574ec13e4ce7c102e24f763205992711b7

    This command serves the base model and loads a LoRA adapter named finance.

  4. Run inference using a specific adapter

    When sending an inference request, specify the name of the adapter to apply. For example:

    curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "prompt": "What is an iron condor?",
    "max_tokens": 150,
    "lora": "finance"
    }'

    This tells MAX to apply the finance adapter during inference.
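
    If you prefer Python to curl, you can send the same request with the openai package against MAX's OpenAI-compatible endpoint. This is a sketch: the extra_body field simply mirrors the "lora" key in the JSON body above, and the api_key value is a placeholder.

    # Hedged Python equivalent of the curl request above, using the openai package.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local MAX server

    response = client.completions.create(
        model="modularai/Llama-3.1-8B-Instruct-GGUF",
        prompt="What is an iron condor?",
        max_tokens=150,
        extra_body={"lora": "finance"},  # select the named adapter
    )
    print(response.choices[0].text)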

Next steps

If you're using PEFT weights that have already been merged with the base model, check out our guide on bringing your own model into MAX.

If you're eager for LoRA support for a different base model, you can check out the community to start contributing, or start a discussion in the forum. We'd love to hear from you!