Deploy a PyTorch model from Hugging Face
We designed MAX to simplify the entire AI development workflow, and that includes deploying PyTorch models with a high-performance serving endpoint. As we'll show you in this tutorial, deploying an endpoint with MAX is as simple as deploying a Docker container—you don't have to write any new code to use MAX.
Currently, the MAX container includes a REST API that supports large language models (LLMs) only, so that's what we'll deploy. Specifically, we'll deploy the Qwen2.5 model, but you can select a different PyTorch LLM from Hugging Face. (See our README for a list of model architectures we currently support.) We've also included instructions to deploy to the cloud provider of your choice: AWS, GCP, or Azure.
If you want to instead deploy a highly-optimized LLM built with MAX, see Deploy Llama 3 with MAX Serve on GPU.
Deploy to a local endpoint
In this section, you'll download the MAX repository, then use MAX to serve the Qwen2.5 model.
Prerequisites
Before following the steps in this topic, make sure you've downloaded and installed the Hugging Face CLI.
Use the huggingface-cli login command to authenticate.
huggingface-cli login
This command requests an authentication token. If you don't have one already, Hugging Face's User access tokens topic explains how to create one.
Clone the repository
To start, clone the max repository.
git clone git@github.com:modularml/max.git
After you clone the repository, navigate to the max/pipelines/python directory.
cd max/pipelines/python/
Serve the endpoint
Let's get our endpoint up and running!
We'll use the magic CLI to create a virtual environment and install the required packages.
If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:
curl -ssL https://magic.modular.com/ | bash
Then run the source command that's printed in your terminal.
The command we'll run is magic run serve, which supports a huggingface-repo-id parameter. This parameter lets you specify an LLM hosted on Hugging Face.
magic run serve --huggingface-repo-id=Qwen/Qwen2.5-1.5B-Instruct
This command downloads the model and sets up MAX Serve to host a local endpoint.
The endpoint is ready when you see output similar to the following in your terminal:
uvicorn running on http://0.0.0.0:8000
Test the endpoint
Let's test the endpoint! In a new terminal, run the following curl command:
curl -N http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"stream": true,
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of Mongolia?"}
]
}' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
You should see a response in your command line similar to the following:
The capital city of Mongolia is Ulaanbaatar.
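The filter at the end of the curl command is easier to follow in isolation. The sketch below wraps the same grep, sed, and tr stages in a function and feeds it a few hand-written sample chunks. The names extract_content and sample_stream are hypothetical, and the sample lines are an assumption about the shape of the streamed output, not captured server output.

```shell
#!/bin/sh
# extract_content: the same filter stages as the curl pipeline above,
# wrapped in a hypothetical helper that reads the stream on stdin.
extract_content() {
  grep -o '"content":"[^"]*"' |   # keep only the "content":"..." fragments
    sed 's/"content":"//g' |      # drop the key and the opening quote
    sed 's/"//g' |                # drop the closing quote
    tr -d '\n' |                  # join the fragments into one line
    sed 's/\\n/\n/g'              # escaped \n becomes a line break (GNU sed)
}

# Hand-written sample chunks (an assumption, for illustration only).
sample_stream() {
  printf '%s\n' \
    'data: {"choices":[{"delta":{"content":"Ulaan"}}]}' \
    'data: {"choices":[{"delta":{"content":"baatar"}}]}' \
    'data: [DONE]'
}
```

Running sample_stream | extract_content prints Ulaanbaatar. Because the pattern stops at the first double quote, a fragment that contains an escaped quote is cut short; the filter is a convenience for reading the stream in a terminal, not a JSON parser.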
That's it! In just a few steps, you've connected a Hugging Face LLM to an endpoint so it can receive and respond to inference requests.
Deploy to a cloud provider
In the first part of this tutorial, you used MAX to deploy a Hugging Face model to a local endpoint. In this next part, you'll use a prebuilt Docker container to deploy a model to a cloud provider.
Prerequisites
This tutorial shows you how to deploy a model to one of three cloud providers:
- AWS
- GCP
- Azure
To complete this tutorial, you should:
- Be familiar with the basics of at least one of these cloud providers
- Have the appropriate CLI tool installed: the AWS CLI (aws), the Google Cloud CLI (gcloud), or the Azure CLI (az).
- Have a project set up that you can use to deploy the Docker container.
- Verify that you have access to the Qwen2.5 model.
- Enable any billing permissions so you can install the appropriate APIs and launch the designated GPU instances.
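As a quick sanity check on the CLI prerequisite, the sketch below confirms that a command is on your PATH. The helper name require_cmd is hypothetical (it isn't part of MAX or any cloud SDK); it relies only on the standard command -v builtin.

```shell
#!/bin/sh
# require_cmd NAME...: hypothetical helper, not part of any cloud SDK.
# Returns 0 if every named command is on PATH, 1 otherwise.
require_cmd() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, run require_cmd aws, require_cmd gcloud, or require_cmd az, depending on the provider you plan to use.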
Initialize CLI tools
If you haven't already done so, make sure that you've initialized your CLI tools and logged in to your account.
AWS
Configure the AWS CLI:
aws configure
Log in to your AWS account:
aws sso login
Check the credentials with cat ~/.aws/credentials to make sure they are set up correctly.
You can also provide the credentials as environment variables:
export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"
GCP
Initialize the Google Cloud SDK:
gcloud init
Log in to your Google Cloud account:
gcloud auth login
Azure
Initialize the Azure CLI:
az init
Log in to your Azure account:
az login
Create your deployment
In this section, you'll go through the steps needed to create a deployment. These steps vary depending on the cloud provider you prefer to use.
AWS
For AWS, we'll create an AWS CloudFormation template to define and configure our deployment.
- Create a working directory for the Infrastructure as Code files.
mkdir aws
Then, navigate to that directory.
cd aws
- Set the AWS region. In this case, we'll use us-east-1, but you can use whatever region you prefer.
export REGION="us-east-1"
- Create an AWS CloudFormation file, max-serve-aws.yaml.
touch max-serve-aws.yaml
Then, using the editor of your choice, paste the following:
- Create the stack.
aws cloudformation create-stack --stack-name max-serve-stack \
--template-body file://max-serve-aws.yaml \
--parameters ParameterKey=InstanceType,ParameterValue=p4d.24xlarge \
ParameterKey=HuggingFaceHubToken,ParameterValue=<YOUR_HUGGING_FACE_HUB_TOKEN> \
ParameterKey=HuggingFaceRepoId,ParameterValue=Qwen/Qwen2.5-1.5b-instruct \
--capabilities CAPABILITY_IAM \
--region $REGION
Note that you must replace <YOUR_HUGGING_FACE_HUB_TOKEN> with your actual token.
In addition, this command defines the model that we want to deploy. For this tutorial, we'll use the Qwen2.5 model.
This deployment can take a few minutes to complete. Track the status of the deployment by running the following command:
aws cloudformation describe-stacks --stack-name max-serve-stack \
--region $REGION --query 'Stacks[0].StackStatus' --output text
When the CloudFormation stack is deployed, you should see a status of CREATE_COMPLETE. Type q to exit this prompt in your CLI.
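Rather than rerunning the AWS status command by hand, you could poll it in a loop. The sketch below is one way to do that, assuming the same describe-stacks query shown above; the function name wait_for_stack and its attempt and delay defaults are hypothetical choices, not part of the AWS CLI.

```shell
#!/bin/sh
# wait_for_stack STACK REGION [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: polls the describe-stacks query from this tutorial
# until the stack reaches CREATE_COMPLETE, fails, or attempts run out.
wait_for_stack() {
  stack=$1
  region=$2
  attempts=${3:-60}
  delay=${4:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(aws cloudformation describe-stacks --stack-name "$stack" \
      --region "$region" --query 'Stacks[0].StackStatus' --output text)
    echo "status: $status"
    case $status in
      CREATE_COMPLETE) return 0 ;;               # stack is ready
      CREATE_FAILED | ROLLBACK_*) return 1 ;;    # creation failed
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  return 1   # still not complete after all attempts
}
```

A call such as wait_for_stack max-serve-stack "$REGION" returns 0 once the stack is created. The AWS CLI also ships a built-in waiter, aws cloudformation wait stack-create-complete, which blocks until creation finishes; the loop above is only useful if you want to print intermediate statuses.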
GCP
For GCP, we'll create a .jinja and a .yaml file to define and configure our deployment.
- Create a working directory for the Infrastructure as Code files.
mkdir gcp
Then, navigate to that directory.
cd gcp
- Next, let's define a PROJECT_ID variable, which you'll use for some of the other commands you'll run later.
PROJECT_ID="YOUR_PROJECT_ID"
Remember to replace YOUR_PROJECT_ID with the ID of your GCP project.
- Enable the required APIs by running the following command:
gcloud services enable deploymentmanager.googleapis.com --project=${PROJECT_ID} && \
gcloud services enable logging.googleapis.com --project=${PROJECT_ID} && \
gcloud services enable compute.googleapis.com --project=${PROJECT_ID}
- Create a file, max-serve-gcp.jinja.
touch max-serve-gcp.jinja
Then, using the editor of your choice, paste in the following:
This file contains a couple of variables:
  - hugging_face_hub_token: Defines your Hugging Face hub token so you can access the appropriate model.
  - pytorch_model: Defines the PyTorch model that you want to deploy.
We'll define those variables in the next step.
- Your next step is to define the deployment. This deployment file defines a number of properties, in particular the model that we want to deploy. For this tutorial, we'll use the Qwen2.5 model.
In your working directory, create a file, max-serve-gcp.yaml.
touch max-serve-gcp.yaml
Then, using the editor of your choice, paste in the following:
- Create your deployment by running the following command:
gcloud deployment-manager deployments create max-serve-deployment \
--config max-serve-gcp.yaml \
--project ${PROJECT_ID}
The deployment might take a few minutes to complete. To track the status of the deployment, run the following command:
gcloud deployment-manager deployments describe max-serve-deployment \
--project=${PROJECT_ID}
Azure
- Create a working directory for the Infrastructure as Code files.
mkdir azure
Then, navigate to that directory.
cd azure
- Set the Azure region. In this case, we'll use eastus, but you can use whatever region you prefer.
export REGION="eastus"
- Create the resource group.
az group create --name maxServeResourceGroup --location $REGION
The following is the expected output:
{
  "id": "/subscriptions/SUBSCRIPTION_ID/resourceGroups/RESOURCE_GROUP_NAME",
  "location": "eastus",
  "managedBy": null,
  "name": "RESOURCE_GROUP_NAME",
  "properties": {
    "provisioningState": "Succeeded"
  },
  "tags": null,
  "type": "Microsoft.Resources/resourceGroups"
}
- Verify that the resource group was created successfully:
az group show -n maxServeResourceGroup --query properties.provisioningState -o tsv
The following is the expected output:
Succeeded
- Create a file named startup.sh and paste in the following contents:
Then, encode the script using base64:
base64 -i startup.sh | tr -d '\n' > encoded-script.txt
Use the contents of encoded-script.txt for the placeholder <ENCODED_STARTUP_SCRIPT> in the next step.
- Create a new file, parameters.json, and paste in the following contents.
Be sure to replace <ENCODED_STARTUP_SCRIPT> with the encoded output from the previous step, and <YOUR_SECURE_PASSWORD> with your own secure password.
- Create a new file, max-serve-azure.json, and paste in the following:
- Create the deployment.
az deployment group create \
--name maxServeDeployment \
--resource-group maxServeResourceGroup \
--template-file max-serve-azure.json \
--parameters @parameters.json location="$REGION"
- Track the status of the deployment by running the following command:
az deployment group wait --name maxServeDeployment \
--resource-group maxServeResourceGroup \
--created
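If the Azure deployment fails at startup, one thing worth ruling out is a damaged encoded script. The sketch below, under the assumption that your base64 tool accepts --decode (GNU coreutils and macOS both do), checks that an encoded file decodes back to the original. The helper names encode_script and roundtrip_ok are hypothetical.

```shell
#!/bin/sh
# encode_script FILE: prints FILE as single-line base64. Hypothetical helper
# that mirrors the base64 | tr step above, reading the file on stdin.
encode_script() {
  base64 < "$1" | tr -d '\n'
}

# roundtrip_ok FILE ENCODED_FILE: returns 0 if ENCODED_FILE decodes to
# exactly the bytes in FILE, non-zero otherwise.
roundtrip_ok() {
  base64 --decode < "$2" | cmp -s - "$1"
}
```

For example, roundtrip_ok startup.sh encoded-script.txt returns 0 when the encoded file still matches the script you intend to run.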
Retrieve instance information
At this point, you should have confirmation that your instance is up and running! Let's get some of the information we need to test the deployment.
AWS
Let's get the instance ID and public IP address and assign them to environment variables:
INSTANCE_ID=$(aws cloudformation describe-stacks --stack-name max-serve-stack --query "Stacks[0].Outputs[?OutputKey=='InstanceId'].OutputValue" --output text --region $REGION)
PUBLIC_IP=$(aws ec2 describe-instances --instance-ids $INSTANCE_ID --query 'Reservations[0].Instances[0].PublicIpAddress' --output text --region $REGION)
echo "Instance ID: $INSTANCE_ID"
echo "Public IP: $PUBLIC_IP"
aws ec2 wait instance-running --instance-ids $INSTANCE_ID --region $REGION
GCP
- Get the instance name and zone. Be sure to update the INSTANCE_NAME variable if you changed it from max-serve-instance.
INSTANCE_NAME=max-serve-instance
ZONE=$(gcloud compute instances list \
--filter="name:${INSTANCE_NAME}" \
--format="value(zone)")
echo "Instance Name: $INSTANCE_NAME"
echo "Zone: $ZONE"
- Add a tag to the instance.
gcloud compute instances add-tags "${INSTANCE_NAME}" \
--project=${PROJECT_ID} \
--zone "${ZONE}" \
--tags http-server
- Retrieve the public IP address for the instance:
PUBLIC_IP=$(gcloud compute instances describe "${INSTANCE_NAME}" \
--zone "${ZONE}" \
--format="get(networkInterfaces[0].accessConfigs[0].natIP)" \
--project=${PROJECT_ID})
echo "Public IP: $PUBLIC_IP"
Azure
Get the public IP address of our deployment.
PUBLIC_IP=$(az network public-ip show \
--resource-group maxServeResourceGroup \
--name maxServePublicIP \
--query ipAddress -o tsv)
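Whichever provider you use, the later steps depend on PUBLIC_IP holding a real address. As a minimal sketch, the hypothetical helper below checks that a value has the dotted-quad shape of an IPv4 address; it does not range-check each octet.

```shell
#!/bin/sh
# looks_like_ipv4 VALUE: hypothetical helper. Returns 0 if VALUE has the
# dotted-quad shape of an IPv4 address (octets are not range-checked).
looks_like_ipv4() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}
```

For example, looks_like_ipv4 "$PUBLIC_IP" || echo "PUBLIC_IP is not set yet" flags an empty value, and it also catches the literal None that the AWS CLI prints in text output when a field is null.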
Test the endpoint
We've confirmed that the instance is available. However, it can still take a few minutes to pull the MAX Docker image and start it. In this section, you'll learn how to check whether the service is ready to receive inference requests, then run a curl command to send a request to the container and receive a response.
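Because the container can take a few minutes to start, a small retry loop can save you from resending requests by hand. The sketch below is a generic retry helper plus a probe that treats any HTTP response as a sign that the server is listening. Both names, wait_until and endpoint_up, are hypothetical, and the probe is an assumption about what counts as ready; the log entry described in this section remains the authoritative signal.

```shell
#!/bin/sh
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: reruns COMMAND until it succeeds or ATTEMPTS run out.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# endpoint_up URL: succeeds once the server returns any HTTP response,
# even an error status, because curl exits 0 when it gets a reply.
endpoint_up() {
  curl -s -o /dev/null --max-time 5 "$1"
}
```

A call such as wait_until 60 10 endpoint_up "http://$PUBLIC_IP:8000/v1/chat/completions" retries every 10 seconds for up to 10 minutes. Use the same host and port as the curl request for your provider.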
AWS
To track when the instance is ready, you can use the AWS CloudWatch console to view the log group, /aws/ec2/max-serve-stack-logs, and find the logs for instance-logs. Alternatively, you can use the following bash script:
The instance is ready when you can see a log entry similar to the following:
Uvicorn running on http://0.0.0.0:8000
After you see this log entry, you can test the endpoint by running the following curl command:
curl -N http://$PUBLIC_IP/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5b-instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of Mongolia?"}
]
}' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
GCP
- Assign the instance ID to an environment variable, INSTANCE_ID.
INSTANCE_ID=$(gcloud compute instances describe ${INSTANCE_NAME} \
--zone=${ZONE} \
--project=${PROJECT_ID} \
--format="value(id)")
- Get the current logs by running the following command:
gcloud logging read \
"resource.type=gce_instance AND \
resource.labels.instance_id=${INSTANCE_ID} AND \
jsonPayload.message:*" \
--project=${PROJECT_ID} \
--format="table(timestamp,jsonPayload.message)" \
--limit=10
The instance is ready when you can see a log entry similar to the following:
uvicorn running on http://0.0.0.0:8000
- Test the endpoint by sending the following curl request:
curl -N http://$PUBLIC_IP:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"stream": true,
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of Mongolia?"}
]
}' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
Azure
- Verify that the container is running.
ssh azureuser@$PUBLIC_IP
# Use the password that you set in your parameters.json file.
sudo cat /var/log/azure/custom-script/handler.log
sudo cat /var/lib/waagent/custom-script/download/0/stdout
sudo cat /var/lib/waagent/custom-script/download/0/stderr
The instance is ready when you can see a log entry similar to the following:
uvicorn running on http://0.0.0.0:8000
- Test the endpoint by sending the following curl request:
curl -N http://$PUBLIC_IP:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"stream": true,
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of Mongolia?"}
]
}' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
You should see a response in your command line similar to the following:
The capital city of Mongolia is Ulaanbaatar.
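The requests in this tutorial repeat the model ID and the prompt inline, and each one uses the same ID in the model field that was passed at deploy time. If you plan to send several prompts, a small helper that builds the JSON body keeps the two in sync. This is a sketch only: build_chat_request is a hypothetical name, and it does no JSON escaping.

```shell
#!/bin/sh
# build_chat_request MODEL PROMPT: hypothetical helper that prints the JSON
# body used in this tutorial. It does no JSON escaping, so MODEL and PROMPT
# must not contain double quotes or backslashes.
build_chat_request() {
  printf '{"model": "%s", "stream": true, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "%s"}]}\n' "$1" "$2"
}

# Example use (not run here): pass the body to curl with -d.
#   MODEL_ID="Qwen/Qwen2.5-1.5B-Instruct"
#   curl -N "http://$PUBLIC_IP:8000/v1/chat/completions" \
#     -H "Content-Type: application/json" \
#     -d "$(build_chat_request "$MODEL_ID" "What is the capital of Mongolia?")"
```

Set MODEL_ID to the repository ID you deployed, and adjust the host and port to match the request shown for your provider.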
Delete the cloud resources
Take a few minutes to explore your deployment. When you're finished, be sure to delete the resources created in this tutorial so you don't incur any unnecessary charges.
AWS
- Delete the stack.
aws cloudformation delete-stack --stack-name max-serve-stack --region $REGION
- Verify that the stack was deleted successfully.
aws cloudformation describe-stacks --stack-name max-serve-stack \
--region $REGION --query 'Stacks[0].StackStatus' --output text
While deletion is in progress, this command returns DELETE_IN_PROGRESS. After the stack is gone, it returns an error saying that the stack does not exist.
GCP
gcloud deployment-manager deployments delete max-serve-deployment \
--project=${PROJECT_ID}
Azure
az group delete --name maxServeResourceGroup
Next steps
In this tutorial, you've deployed a Hugging Face PyTorch model to the cloud using a MAX Docker container.
Keep in mind that this is just a preview of MAX Serve for PyTorch models and it's currently compatible with LLMs only. We're working on support for more models and more model optimizations with the MAX graph compiler.
Here are some other topics to explore next:
Deploy Llama 3 on GPU with MAX Serve
Learn how to deploy Llama 3 on GPU with MAX Serve.
Benchmark MAX Serve on an NVIDIA A100 GPU
Learn how to use our benchmarking script to measure the performance of MAX Serve.
Bring your own fine-tuned model to MAX pipelines
Learn how to customize your own model in MAX pipelines.
Deploy Llama 3 on GPU-powered Kubernetes clusters
Learn how to deploy Llama 3 using Kubernetes, MAX, and NVIDIA GPUs.
To stay up to date with new releases, sign up for our newsletter and join our community. And if you're interested in becoming a design partner to get early access and give us feedback, please contact us.