The MAX 24.4 release introduced native support for quantization and the GGUF weight format. In this tutorial, we'll guide you through the steps to integrate your fine-tuned custom model into the MAX pipelines. More specifically, we will start with the initial configuration, then demonstrate how to download a model from the Hugging Face Hub. If the model is not already available in a supported quantized GGUF format, we'll show you how to convert it to prepare for ingestion into the MAX pipelines. Finally, we will explore how to use the quantized GGUF model via the MAX pipelines CLI.
About model customization
Model customization in machine learning typically involves modifying a pre-trained model to better suit specific tasks or datasets.
One effective approach is fine-tuning, where a model trained on a large dataset is further trained (or fine-tuned) on a smaller,
task-specific dataset. In this tutorial, we focus on Low Rank Adaptation (LoRA).
LoRA (and its quantized variant QLoRA) adapts large models efficiently by updating only a small set of additional parameters, integrating LoRA layers without altering the primary architecture and thus preserving the original model's structure. For this tutorial, we assume the LoRA weights have already been merged into the original model, such as Llama 3.1. This functionality is provided by major fine-tuning libraries, for example unsloth's save_pretrained_merged or the PEFT model-merging APIs.
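If you still need to merge an adapter yourself, the following is a minimal sketch using PEFT's merge_and_unload; the base-model and adapter repo IDs shown are placeholders for your own:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Placeholder repo IDs: substitute your own base model and LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Attach the fine-tuned LoRA adapter, then fold its weights into the base model.
model = PeftModel.from_pretrained(base, "<USERNAME>/my-lora-adapter")
merged = model.merge_and_unload()

# Save the merged checkpoint so it can be converted to GGUF in the steps below.
merged.save_pretrained("merged-model")
tokenizer.save_pretrained("merged-model")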
Step 1: Set up Hugging Face access
To interact with models hosted on Hugging Face, you need secure access via either SSH or an access token. Follow the instructions in the Hugging Face documentation to set up SSH. We can verify our configuration by running:
ssh -T git@hf.co
A successful setup will display Hi <USERNAME>, welcome to Hugging Face.
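If you prefer an access token over SSH, you can perform the same check from Python (this uses the huggingface_hub package installed in the next step):

from huggingface_hub import login, whoami

login()  # prompts for an access token and caches it locally
print(whoami()["name"])  # prints your Hugging Face username on success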
Step 2: Install the required packages
Install the huggingface_hub library to enable interactions with the Hugging Face Hub. This package facilitates the download and management of models and datasets:
python -m pip install -U "huggingface_hub[hf_transfer]"
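The hf_transfer extra provides a faster download backend, which the HF_HUB_ENABLE_HF_TRANSFER=1 environment variable switches on in the commands below. As a quick sanity check that the installation works, here is a minimal sketch that fetches a single file; the repo ID is a placeholder:

import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # enable the faster transfer backend

from huggingface_hub import hf_hub_download

# Placeholder repo ID; any public model repo with a config.json works.
path = hf_hub_download(repo_id="<REPO-ID/MODEL-ID>", filename="config.json")
print(path)  # local cache path of the downloaded file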
Step 3: Convert to GGUF format
If your model is already in the GGUF format, you can skip this conversion step and proceed directly to the next step. If not, here are the most common methods to convert a model to a quantized GGUF format suitable for deployment:
- Manually convert via llama.cpp script: Use the llama.cpp converter script to convert your model manually:

git clone https://github.com/ggerganov/llama.cpp
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download <REPO-ID/MODEL-ID> \
--repo-type model \
--local-dir ./models
python llama.cpp/convert_hf_to_gguf.py models

Ensure you replace <REPO-ID/MODEL-ID> with the appropriate repository or model ID from Hugging Face.
- Automated conversion via Hugging Face space: Alternatively, use the gguf-my-repo space for a streamlined conversion to a supported quantized GGUF format. Remember to log in; for this tutorial, we choose the Q4_K_M quantization method.
You can see all the supported quantization encodings in the encodings module.
For demonstration, we will choose mlabonne/FineLlama-3.1-8B.
After conversion, the model will be available under your username, ready for download and deployment.
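Before moving on, you can verify that the converted repo actually contains a GGUF file; a minimal sketch, with <USERNAME> as a placeholder for your Hugging Face username:

from huggingface_hub import list_repo_files

# List the files in the converted repo and keep only the GGUF weights.
files = list_repo_files("<USERNAME>/FineLlama-3.1-8B-Q4_K_M-GGUF")
print([f for f in files if f.endswith(".gguf")])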
Step 4: Set up MAX pipelines
Next, clone the MAX GitHub repository and navigate to the MAX pipelines directory:
git clone https://github.com/modularml/max && cd max && git checkout nightly
cd examples/graph-api
The following command downloads the converted GGUF model:
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download <USERNAME>/FineLlama-3.1-8B-Q4_K_M-GGUF \
--repo-type model \
--local-dir ./models
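If you prefer to stay in Python, the same download can be done with snapshot_download; a sketch equivalent to the CLI command above, again with <USERNAME> as a placeholder:

import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

# Mirrors the huggingface-cli invocation above.
snapshot_download(
    repo_id="<USERNAME>/FineLlama-3.1-8B-Q4_K_M-GGUF",
    repo_type="model",
    local_dir="./models",
)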
With all the requirements in place, we are now ready to use our custom model in MAX pipelines.
Step 5: Run the custom model
We can now run our custom model via the MAX pipelines CLI by providing --model-path:
mojo run_pipeline.mojo llama3 --version 3.1 \
--model-path "./models/finellama-3.1-8b-q4_k_m.gguf" \
--prompt "What's the meaning of life?"
It generates the following beautiful answer:
The meaning of life is a question that has been pondered by philosophers, scientists,
and spiritual leaders for centuries. It is a question that has no definitive answer,
as it is deeply personal and subjective to each individual. However, many have
attempted to provide their own interpretations or explanations.
One interpretation of the meaning of life is that it is simply to live and experience
the world around us. This view suggests that the purpose of life is to experience all
that it has to offer, whether it be through the senses, emotions, or intellectual
pursuits. In this sense, the meaning of life is not necessarily tied to any specific goal
or achievement, but rather to the process of living itself.
Another interpretation is that the meaning of life is to find purpose and meaning in
our lives. This view suggests that we are here to seek out our own unique purpose and
to strive to achieve it. This can be achieved through various means, such as through our
work, relationships, or personal pursuits.
A third interpretation is that the meaning of life is to connect with something larger
than ourselves. This view suggests that we are here to connect with a higher power,
whether it be through religion, spirituality, or a sense of awe and wonder at the
universe. In this sense, the meaning of life is to find a sense of purpose and
connection that transcends our individual lives.
Ultimately, the meaning of life is a question that each person must answer for themselves.
It is a question that requires us to reflect on our own values, beliefs, and experiences.
As the saying goes, "Ask a flower" - the meaning of life is not something that can be
answered in words, but rather in the experience of living itself.
Next steps
Congratulations on successfully integrating your fine-tuned Llama 3.1 model into the MAX pipelines! 🎉
We have navigated through setting up secure access, downloading and converting models, and finally running your custom model in MAX pipelines.
We encourage you to further customize your models via the MAX Graph API, test your pipeline, and explore other MAX features.
Here are some other topics to explore next: