Bring your own fine-tuned model to MAX pipelines

In the MAX 24.4 release, we introduced native support for quantization and the GGUF weight format. In this tutorial, we'll guide you through the steps to integrate your fine-tuned custom model into the MAX pipelines. More specifically, we start with the initial configuration and then demonstrate how to download a model from the Hugging Face Hub. If the model is not already available in a supported quantized GGUF format, we show you how to convert it to prepare it for ingestion into the MAX pipelines. Finally, we explore how to use the quantized GGUF model via the MAX pipelines CLI.

note

This tutorial requires MAX nightly. Please ensure that you install the nightly build.

Trouble?

If you experience any issues in this tutorial, please let us know on GitHub.

About model customization

Model customization in machine learning typically involves modifying a pre-trained model to better suit specific tasks or datasets. One effective approach is fine-tuning, where a model trained on a large dataset is further trained (or fine-tuned) on a smaller, task-specific dataset. In this tutorial, we focus on Low-Rank Adaptation (LoRA). LoRA (and its quantized variant QLoRA) allows for efficient adaptation of large models by updating only a small set of additional parameters, preserving the original model's structure by integrating LoRA layers without altering the primary architecture. For this tutorial, we assume the LoRA weights have already been merged into the original model, such as Llama 3.1. Such functionality is provided by major fine-tuning libraries, for example Unsloth's save_pretrained_merged or the PEFT model-merging APIs; a minimal merging sketch with PEFT is shown below.
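
If you still have a separate LoRA adapter, a minimal sketch of merging it into the base model with the PEFT API might look like the following. The base model ID, adapter ID, and output path are placeholders, not values from this tutorial.

# Minimal sketch: fold LoRA adapter weights into the base model with PEFT.
# The base model ID, adapter ID, and output path below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
model = PeftModel.from_pretrained(base, "<USERNAME>/my-lora-adapter")

# merge_and_unload() folds the LoRA deltas into the base weights and
# removes the adapter wrappers, leaving a plain transformers model.
merged = model.merge_and_unload()

merged.save_pretrained("./merged-model")
AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B").save_pretrained("./merged-model")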

Step 1: Set up Hugging Face access

To interact with models hosted on Hugging Face, authenticated access is required, either via SSH or an access token. Follow the instructions in the Hugging Face documentation to set up SSH. We can verify our configuration by running:

ssh -T git@hf.co

A successful setup will display Hi <USERNAME>, welcome to Hugging Face.
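
If you prefer an access token over SSH, you can authenticate programmatically with the huggingface_hub library (installed in the next step). Here is a minimal sketch; the token value is a placeholder for one created at huggingface.co/settings/tokens.

# Sketch: token-based authentication as an alternative to SSH.
from huggingface_hub import login, whoami

login(token="hf_xxxxxxxxxxxxxxxxxxxx")  # placeholder access token

# Confirms the authenticated account, analogous to `ssh -T git@hf.co`.
print(whoami()["name"])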

Step 2: Install the required packages

Install the huggingface_hub library to enable interactions with the Hugging Face Hub. This package facilitates the download and management of models and datasets:

python -m pip install -U "huggingface_hub[hf_transfer]"

Download a fine-tuned model via the Hugging Face CLI

We can go directly to the Hugging Face model Hub and download a fine-tuned Llama 3.1 model with the following command.

HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download <REPO-ID/MODEL-ID> \
--repo-type model \
--local-dir ./models

This step pulls the specified model and saves it locally, allowing for offline access and integration. You can just as easily use your own fine-tuned model by following the next steps.
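
If you'd rather script the download than call the CLI, a minimal sketch using the huggingface_hub Python API is shown below; <REPO-ID/MODEL-ID> is the same placeholder as in the command above.

# Sketch: download a model repo with huggingface_hub instead of the CLI.
import os

# Enable the fast transfer backend before importing huggingface_hub.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="<REPO-ID/MODEL-ID>",  # placeholder, as in the CLI command
    repo_type="model",
    local_dir="./models",
)
print(local_path)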

Step 3: Convert to GGUF format

If your model is already in the GGUF format, you can skip this conversion step and proceed directly to the next step. If not, here are the most common methods to convert a model to a quantized GGUF format suitable for deployment:

  • Manually convert via the llama.cpp script: Use the convert_hf_to_gguf.py script that ships with llama.cpp to convert your model yourself.

    git clone https://github.com/ggerganov/llama.cpp

    # The converter script has its own Python dependencies
    python -m pip install -r llama.cpp/requirements.txt

    HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download <REPO-ID/MODEL-ID> \
    --repo-type model \
    --local-dir ./models

    python llama.cpp/convert_hf_to_gguf.py models

    Ensure you replace <REPO-ID/MODEL-ID> with the appropriate repository or model ID from Hugging Face.

  • Automated conversion via a Hugging Face Space: Alternatively, use the gguf-my-repo Space for a streamlined conversion to a supported quantized GGUF format. Remember to log in first; for this tutorial, we choose the Q4_K_M quantization method.

You can see all the supported quantization encodings in the encodings module. For demonstration, we use mlabonne/FineLlama-3.1-8B.

After conversion, the model will be available under your username, ready for download and deployment.
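
Before moving on, you can sanity-check the converted file. One option (an assumption on our part, not part of the MAX tooling) is the gguf Python package published from the llama.cpp repository; the file path below is a placeholder for wherever your converted model landed.

# Sketch: inspect a converted GGUF file with the `gguf` package
# (python -m pip install gguf). The path below is a placeholder.
from gguf import GGUFReader

reader = GGUFReader("./models/finellama-3.1-8b-q4_k_m.gguf")

# Print the metadata keys stored in the GGUF header, e.g. the
# architecture and quantization-related fields.
for field_name in reader.fields:
    print(field_name)

print(f"{len(reader.tensors)} tensors")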

Step 4: Set up MAX pipelines

Next, clone the MAX GitHub repository, check out the nightly branch, and navigate to the graph API pipeline examples:

git clone https://github.com/modularml/max && cd max && git checkout nightly
cd examples/graph-api

The following command downloads the converted GGUF model:

HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download <USERNAME>/FineLlama-3.1-8B-Q4_K_M-GGUF \
--repo-type model \
--local-dir ./models

With all the requirements in place, we are now ready to use our custom model in MAX pipelines.
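
If you're not sure of the exact filename the download produced, a quick way to locate the GGUF file under the directory used above:

# List the GGUF files under ./models to find the exact path to pass
# to --model-path in the next step.
from pathlib import Path

for gguf_file in Path("./models").rglob("*.gguf"):
    print(gguf_file)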

Step 5: Run the custom model

We are ready to run our custom model through the MAX pipelines CLI by providing its path with --model-path:

mojo run_pipeline.mojo llama3 --version 3.1 \
--model-path "./models/finellama-3.1-8b-q4_k_m.gguf" \
--prompt "What's the meaning of life?"

It generates the following beautiful answer:

The meaning of life is a question that has been pondered by philosophers, scientists,
and spiritual leaders for centuries. It is a question that has no definitive answer,
as it is deeply personal and subjective to each individual. However, many have
attempted to provide their own interpretations or explanations.

One interpretation of the meaning of life is that it is simply to live and experience
the world around us. This view suggests that the purpose of life is to experience all
that it has to offer, whether it be through the senses, emotions, or intellectual
pursuits. In this sense, the meaning of life is not necessarily tied to any specific goal
or achievement, but rather to the process of living itself.

Another interpretation is that the meaning of life is to find purpose and meaning in
our lives. This view suggests that we are here to seek out our own unique purpose and
to strive to achieve it. This can be achieved through various means, such as through our
work, relationships, or personal pursuits.

A third interpretation is that the meaning of life is to connect with something larger
than ourselves. This view suggests that we are here to connect with a higher power,
whether it be through religion, spirituality, or a sense of awe and wonder at the
universe. In this sense, the meaning of life is to find a sense of purpose and
connection that transcends our individual lives.

Ultimately, the meaning of life is a question that each person must answer for themselves.
It is a question that requires us to reflect on our own values, beliefs, and experiences.
As the saying goes, "Ask a flower" - the meaning of life is not something that can be
answered in words, but rather in the experience of living itself.

Next steps

Congratulations on successfully integrating your fine-tuned Llama 3.1 model into the MAX pipelines! πŸŽ‰

We have walked through setting up secure access, downloading and converting models, and finally running your custom model in MAX pipelines.

We encourage you to further customize your models via the MAX Graph API, test your pipeline, and explore other MAX features.
