Deploy Llama 3 on GPU with MAX Serve
This guide walks through serving Llama 3 models with MAX Serve, from local testing to production deployment on major cloud platforms. You'll learn to automate the deployment process using Infrastructure-as-Code (IaC) and optimize performance with GPU resources.
MAX Serve provides a streamlined way to deploy large language models (LLMs) with production-ready features like GPU acceleration, automatic scaling, and monitoring capabilities. Whether you're building a prototype or preparing for production deployment, this guide will help you set up a robust serving infrastructure for Llama 3.
The tutorial is organized into the following sections:
- Local setup: Run Llama 3 locally to verify its basic functionality.
- Cloud deployment: Deploy Llama 3 to AWS, GCP, or Azure using IaC templates and CLI commands.
Local setup
In this section, you will set up and run Llama 3 locally to understand its capabilities and validate functionality before moving to the cloud.
1. Clone the MAX repository
To get started, let's clone the MAX repository and navigate to the appropriate directory:
git clone https://github.com/modularml/max && cd max/pipelines/python
2. Run Llama 3 locally
Next, use the magic CLI tool to interact with the Llama 3 model locally and ensure that the model runs as expected before deploying it in the cloud.
- Generate a response to a prompt with the following command:
magic run llama3 --prompt "What is the meaning of life?"
- Start the model server using magic run serve. The --huggingface-repo-id flag specifies which model to load.
magic run serve --huggingface-repo-id modularai/llama-3.1
This starts a local server where you can test Llama 3's response generation capabilities. You should see the following log:
Uvicorn running on http://0.0.0.0:8000
For more information on MAX-optimized models, see the Python Pipelines README.
3. Test the local endpoint
After starting the model server, you can test its functionality by sending a curl request:
curl -N http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "modularai/llama-3.1",
"stream": true,
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the World Series in 2020?"}
]
}' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
After starting your server, you can go to http://0.0.0.0:8000/docs to learn more about available endpoints and API specifications.
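Because the server exposes an OpenAI-compatible chat completions API, you can also request a complete, non-streamed response. The following optional sketch assumes the endpoint accepts "stream": false as in the OpenAI API, and pretty-prints the JSON with Python's built-in json.tool:
# Request a full (non-streamed) response and format the JSON output.
curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/llama-3.1",
    "stream": false,
    "messages": [
      {"role": "user", "content": "Summarize the rules of chess in one sentence."}
    ]
  }' | python3 -m json.tool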
Now that the model works locally, we'll transition to cloud deployment.
Cloud deployment paths
We will use Infrastructure-as-Code (IaC) to create, configure, and deploy Llama 3 in the cloud. The cloud deployment instructions are divided by provider: AWS, GCP, and Azure.
Cloud deployment overview
For AWS, we will use CloudFormation; for GCP, Deployment Manager; and for Azure, Azure Resource Manager (ARM) templates. These IaC templates handle resource provisioning, networking, and security configuration. This approach simplifies deployments and ensures they are repeatable.
The key steps are:
- Create and deploy the stack/resources: Use IaC templates for each cloud provider to deploy Llama 3.
- Test the endpoint: Retrieve the public IP address after deployment and send a request to test the Llama 3 endpoint in the cloud.
Each cloud-specific tab provides complete commands for setup, configuration, deployment, and testing.
To better understand the flow of the deployment, here is a high-level overview of the architecture:
This architecture diagram illustrates the two-phase deployment setup for serving the Llama 3 model with MAX on cloud provider infrastructure.
The deployment process is divided into two phases:
- Phase 1: Cloud stack creation: In this initial phase, the following
infrastructure is provisioned and configured to prepare for serving requests:
- Public IP assignment: The cloud provider assigns a public IP to the virtual machine (VM), allowing it to be accessed externally.
- Firewall/Security group configuration: Security settings, such as firewall rules or security groups, are applied to allow traffic on port 80. This setup ensures that only HTTP requests can access the instance securely.
- GPU compute instance setup: A GPU-enabled VM is created to handle model
inference efficiently. This instance includes:
- GPU drivers/runtime installation: Necessary GPU drivers and runtime libraries are installed to enable hardware acceleration for model processing.
- Docker container initialization: A Docker container is launched on the VM, where it pulls the necessary images from the Docker Container Registry. This registry serves as a central repository for storing Docker images, making it easy to deploy and update the application.
Inside the container, MAX Serve is set up alongside the Llama 3 model. This setup prepares the environment for serving requests but does not yet expose the endpoint to users.
- Phase 2: Serving the user endpoint: Once the cloud stack is configured and
the VM is set up, the deployment enters the second phase, where it starts
serving user requests:
- HTTP endpoint exposure: With the VM and Docker container ready, the system opens an OpenAI-compatible HTTP endpoint on port 80, allowing users to interact with the deployed Llama 3 model.
- Request handling by MAX Serve: When a user sends an HTTP request to the public IP, MAX Serve processes the incoming request within the Docker container and forwards it to the Llama 3 model for inference. The model generates a response, which is then returned to the user via the endpoint.
Prerequisites
Be sure that you have the following prerequisites, as well as appropriate access and permissions for the cloud provider of your choice.
- GPU resources: You'll need access to GPU resources in your cloud account
with the following specifications:
- Minimum GPU memory: 24GB
- Supported GPU types: NVIDIA A100 (most optimized), A10G, L4 and L40
- A Hugging Face user access token: A valid Hugging Face token is required to access the model. To create a Hugging Face user access token, see Access Tokens. You must make your token available in your environment with the following command:
export HUGGING_FACE_HUB_TOKEN="<YOUR-HUGGING-FACE-HUB-TOKEN>"
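To verify the token before deploying anything, you can optionally call the Hugging Face Hub whoami endpoint with it; a valid token returns your account details:
# Returns your Hugging Face account info if the token is valid.
curl -s https://huggingface.co/api/whoami-v2 \
  -H "Authorization: Bearer ${HUGGING_FACE_HUB_TOKEN}"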
- Docker installation: Install the Docker Engine and CLI. We use a pre-configured GPU-enabled Docker container from our public repository. The container image (modular/max-openai-api:24.6.0) is available on Docker Hub. For more information, see MAX container.
- Cloud CLI tools: Before deploying, ensure that you have the respective cloud provider CLI tools installed.
- AWS CLI v2 installed and configured with appropriate credentials
- Google Cloud SDK installed and initialized
- Azure CLI installed, configured, and logged in
- AWS
- GCP
- Azure
Configure the AWS CLI:
aws configure
Log in to your AWS account:
aws sso login
Check the credentials via cat ~/.aws/credentials to make sure they are set up correctly. You can also provide the credentials as environment variables:
export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"
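Optionally, confirm that the CLI can authenticate with these credentials before moving on:
# Prints the account and IAM identity associated with your current credentials.
aws sts get-caller-identity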
Initialize the Google Cloud SDK:
gcloud init
Log in to your Google Cloud account:
gcloud auth login
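Optionally, confirm which account and project are active before continuing:
# Shows authenticated accounts and the currently configured project.
gcloud auth list
gcloud config get-value project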
Initialize the Azure CLI:
az init
Log in to your Azure account:
az login
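Optionally, verify which subscription you are logged in to:
# Shows the active Azure subscription for the logged-in account.
az account show --output table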
1. Create stack/deployment
In this section, we'll walk through creating a deployment stack on AWS, GCP, and Azure. Each cloud provider has its own configuration steps, detailed below, but we simplify the setup by using Infrastructure-as-Code (IaC) templates.
Start by navigating to the max/tutorials/max-serve-cloud-configs/ directory, where the necessary IaC templates and configuration files are organized for each cloud provider. This directory includes all files required to deploy the MAX Serve setup to AWS, GCP, or Azure:
max/tutorials/max-serve-cloud-configs/
├── aws
│ ├── max-serve-aws.yaml
│ └── notify.sh
├── azure
│ ├── max-serve-azure.json
│ └── notify.sh
└── gcp
├── max-serve-gcp.jinja
└── notify.sh
With these IaC templates ready, choose your preferred cloud provider and follow the step-by-step instructions specific to each platform.
- AWS
- GCP
- Azure
First navigate to the AWS directory:
cd aws
Set the region in your environment:
export REGION="REGION" # example: `us-east-1`
Then, create the stack. You can explore the max-serve-aws.yaml file for AWS CloudFormation configuration information.
export STACK_NAME="max-serve-stack"
aws cloudformation create-stack --stack-name ${STACK_NAME} \
--template-body file://max-serve-aws.yaml \
--parameters \
ParameterKey=InstanceType,ParameterValue=g5.4xlarge \
ParameterKey=HuggingFaceHubToken,ParameterValue=${HUGGING_FACE_HUB_TOKEN} \
ParameterKey=HuggingFaceRepoId,ParameterValue=modularai/llama-3.1 \
--capabilities CAPABILITY_IAM \
--region $REGION
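Stack creation takes several minutes. If you want to check progress at any point, the following optional command reports the current stack status:
# Reports the stack status, for example CREATE_IN_PROGRESS or CREATE_COMPLETE.
aws cloudformation describe-stacks \
  --stack-name ${STACK_NAME} \
  --region ${REGION} \
  --query "Stacks[0].StackStatus" \
  --output text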
First, navigate to the GCP directory:
cd gcp
Set the project ID and zone:
PROJECT_ID="YOUR PROJECT ID"
export ZONE="ZONE" # example `us-east1-d`
Enable the required APIs:
gcloud services enable deploymentmanager.googleapis.com --project=${PROJECT_ID} && \
gcloud services enable logging.googleapis.com --project=${PROJECT_ID} && \
gcloud services enable compute.googleapis.com --project=${PROJECT_ID}
Create the deployment with the following command. You can explore the max-serve-gcp.jinja file for more information on the Deployment Manager configuration.
export DEPLOYMENT_NAME="max-serve-deployment"
export INSTANCE_NAME="max-serve-instance"
gcloud deployment-manager deployments create ${DEPLOYMENT_NAME} \
--template max-serve-gcp.jinja \
--properties "\
instanceName:${INSTANCE_NAME},\
zone:${ZONE},\
machineType:g2-standard-8,\
acceleratorType:nvidia-l4,\
acceleratorCount:1,\
sourceImage:common-cu123-v20240922-ubuntu-2204-py310,\
huggingFaceHubToken:${HUGGING_FACE_HUB_TOKEN},\
huggingFaceRepoId:modularai/llama-3.1" \
--project ${PROJECT_ID}
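Deployment creation also takes a few minutes. Optionally, you can list the resources that Deployment Manager is provisioning:
# Lists the resources managed by this deployment.
gcloud deployment-manager resources list \
  --deployment ${DEPLOYMENT_NAME} \
  --project=${PROJECT_ID}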
First, navigate to the Azure directory:
cd azure
Set the region:
export REGION="REGION" # example `westus3`
Then, create the resource group:
export RESOURCE_GROUP_NAME="maxServeResourceGroup"
export DEPLOYMENT_NAME="maxServeDeployment"
az group create --name ${RESOURCE_GROUP_NAME} --location ${REGION}
Check the status of the resource group:
az group show -n ${RESOURCE_GROUP_NAME} --query properties.provisioningState -o tsv
Create and encode the startup script:
STARTUP_SCRIPT='#!/bin/bash
sudo usermod -aG docker $USER
sudo systemctl restart docker
sleep 10
HUGGING_FACE_HUB_TOKEN=$1
HUGGING_FACE_REPO_ID=${2:-modularai/llama-3.1}
sudo docker run -d \
--restart unless-stopped \
--env "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" \
--env "HF_HUB_ENABLE_HF_TRANSFER=1" \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
--gpus 1 \
-p 80:8000 \
--ipc=host \
modular/max-openai-api:24.6.0 \
--huggingface-repo-id ${HUGGING_FACE_REPO_ID}'
export STARTUP_SCRIPT=$(echo "$STARTUP_SCRIPT" | base64)
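Before deploying, you can optionally confirm that the script round-trips cleanly by decoding it back (on macOS, use base64 -D if --decode isn't recognized):
# Prints the decoded startup script so you can review it.
echo "${STARTUP_SCRIPT}" | base64 --decode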
Then, create the deployment:
export VM_PASSWORD="YOUR-SECURE-PASSWORD-123"
az deployment group create \
--name ${DEPLOYMENT_NAME} \
--resource-group ${RESOURCE_GROUP_NAME} \
--template-file max-serve-azure.json \
--parameters \
adminUsername="azureuser" \
adminPassword=${VM_PASSWORD} \
vmSize="Standard_NV36ads_A10_v5" \
osDiskSizeGB=128 \
vnetAddressPrefix="10.0.0.0/16" \
subnetAddressPrefix="10.0.0.0/24" \
startupScript="${STARTUP_SCRIPT}" \
location="${REGION}"
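The deployment can take a while to provision the GPU VM. Optionally, you can check its provisioning state from another terminal:
# Reports the deployment's provisioning state (for example, Succeeded or Failed).
az deployment group show \
  --name ${DEPLOYMENT_NAME} \
  --resource-group ${RESOURCE_GROUP_NAME} \
  --query properties.provisioningState -o tsv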
2. Wait for resources to be ready
In this step, we'll wait for the resources to be ready. Stack and deployment creation may take some time to complete.
- AWS
- GCP
- Azure
aws cloudformation wait stack-create-complete \
--stack-name ${STACK_NAME} \
--region ${REGION}
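If the wait command fails or times out, inspecting the most recent stack events usually reveals the cause:
# Shows the latest stack events; failed resources include a status reason.
aws cloudformation describe-stack-events \
  --stack-name ${STACK_NAME} \
  --region ${REGION} \
  --max-items 5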
gcloud deployment-manager deployments describe ${DEPLOYMENT_NAME} \
--project=${PROJECT_ID}
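Deployment Manager has no built-in wait command analogous to CloudFormation's, so if you want to block until the deployment finishes, here is a rough polling sketch. It greps the describe output for a DONE operation status; the exact field layout is an assumption, so adapt it to what you see in your output:
# Poll every 30 seconds until the deployment's operation reports DONE (assumed field).
until gcloud deployment-manager deployments describe ${DEPLOYMENT_NAME} \
    --project=${PROJECT_ID} | grep -q "status: DONE"; do
  echo "Waiting for deployment to complete..."
  sleep 30
done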
Wait for the deployment to complete and report its status:
az deployment group wait \
--name ${DEPLOYMENT_NAME} \
--resource-group ${RESOURCE_GROUP_NAME} \
--created
3. Retrieve instance information
After the resources are deployed, you'll need to get the instance information, such as the public DNS or IP address that we will use to test the endpoint.
- AWS
- GCP
- Azure
INSTANCE_ID=$(aws cloudformation describe-stacks --stack-name ${STACK_NAME} \
--query "Stacks[0].Outputs[?OutputKey=='InstanceId'].OutputValue" \
--output text \
--region ${REGION})
PUBLIC_IP=$(aws ec2 describe-instances --instance-ids ${INSTANCE_ID} \
--query 'Reservations[0].Instances[0].PublicIpAddress' \
--output text \
--region ${REGION})
echo "Instance ID: ${INSTANCE_ID}"
echo "Public IP: ${PUBLIC_IP}"
aws ec2 wait instance-running --instance-ids ${INSTANCE_ID} --region ${REGION}
First, check if the firewall rule already exists:
EXISTING_RULE=$(gcloud compute firewall-rules list \
--filter="name=allow-http" \
--format="value(name)" \
--project=${PROJECT_ID})
if [ -z "$EXISTING_RULE" ]; then
echo "Creating firewall rule..."
gcloud compute firewall-rules create allow-http \
--allow tcp:80 \
--source-ranges 0.0.0.0/0 \
--target-tags http-server \
--description "Allow HTTP traffic on port 80" \
--project=${PROJECT_ID}
else
echo "Firewall rule 'allow-http' already exists"
fi
Check if the instance exists and tag it with http-server:
INSTANCE_EXISTS=$(gcloud compute instances list \
--filter="name=${INSTANCE_NAME}" \
--format="value(name)" \
--project=${PROJECT_ID})
if [ -n "$INSTANCE_EXISTS" ]; then
echo "Adding tags to instance ${INSTANCE_NAME}"
gcloud compute instances add-tags "${INSTANCE_NAME}" \
--project=${PROJECT_ID} \
--zone "${ZONE}" \
--tags http-server
else
echo "Error: Instance ${INSTANCE_NAME} not found"
exit 1
fi
Then, get the public IP:
PUBLIC_IP=$(gcloud compute instances describe "${INSTANCE_NAME}" \
--zone "${ZONE}" \
--format="get(networkInterfaces[0].accessConfigs[0].natIP)" \
--project=${PROJECT_ID})
echo "Public IP: $PUBLIC_IP"
PUBLIC_IP=$(az network public-ip show \
--resource-group ${RESOURCE_GROUP_NAME} \
--name maxServePublicIP \
--query ipAddress -o tsv)
echo "Public IP: ${PUBLIC_IP}"
4. Test the endpoint
We will use the public IP address obtained in the previous step to test the endpoint with the following curl request:
curl -N http://$PUBLIC_IP/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "modularai/llama-3.1",
"stream": true,
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the World Series in 2020?"}
]
}'
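If the request hangs or the connection is refused, the instance may still be pulling the container image and loading the model weights, which can take several minutes after the infrastructure reports success. Here is a simple readiness-poll sketch; adjust the interval to taste:
# Retry the endpoint every 30 seconds until it returns a successful response.
until curl -sf --max-time 10 -o /dev/null http://$PUBLIC_IP/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "modularai/llama-3.1", "messages": [{"role": "user", "content": "ping"}]}'; do
  echo "Server not ready yet; retrying in 30 seconds..."
  sleep 30
done
echo "Server is ready."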
5. Delete the cloud resources
Cleaning up resources when you finish is critical to avoid unwanted costs. Use the following commands to safely terminate all of the resources created in this tutorial for your chosen platform.
- AWS
- GCP
- Azure
First, delete the stack:
aws cloudformation delete-stack --stack-name ${STACK_NAME} --region ${REGION}
Wait for the stack to be deleted:
aws cloudformation wait stack-delete-complete \
--stack-name ${STACK_NAME} \
--region ${REGION}
gcloud deployment-manager deployments delete ${DEPLOYMENT_NAME} \
--project=${PROJECT_ID}
az group delete --name ${RESOURCE_GROUP_NAME}
Cost estimate
When deploying Llama 3 in a cloud environment, several cost factors come into play:
Primary cost components:
- Compute resources: GPU instances (like AWS g5.4xlarge, GCP g2-standard-8, or Azure Standard_NV36ads_A10_v5) form the bulk of the costs
- Network transfer: Costs associated with data ingress/egress, which is critical for high-traffic applications
- Storage: Expenses for boot volumes and any additional storage requirements
- Additional services: Costs for logging, monitoring, and other supporting cloud services
For detailed cost estimates specific to your use case, we recommend using the official pricing calculators for AWS, GCP, and Azure.
Next steps
Congratulations on successfully running MAX Pipelines locally and deploying Llama 3 to the cloud! 🎉
Now that you've mastered the essentials of setting up and deploying the Llama 3 model with MAX Serve, here are some other topics to explore next:
Deploy a PyTorch model from Hugging Face
Learn how to deploy a PyTorch model to the cloud using MAX Serve.
Get started with MAX Graph in Python
Learn how to build a model graph with our Python API for inference with MAX Engine.
Bring your own fine-tuned model to MAX pipelines
Learn how to customize your own model in MAX pipelines.
Deploy Llama 3.1 on GPU-powered Kubernetes clusters
Learn how to deploy Llama 3.1 using Kubernetes, MAX, and NVIDIA GPUs.