Deploy Llama 3 on GPU-powered Kubernetes clusters
MAX simplifies the process of deploying an LLM on GPUs with high performance. And if you want to deploy at scale using Kubernetes' built-in monitoring, scaling, and cluster management, then you're in the right place. In this tutorial, you'll learn how to deploy our MAX container on Kubernetes, using your pick of AWS, GCP, or Azure.
Specifically, you'll deploy Llama 3.1 as a containerized service using MAX Serve, our GPU-optimized inference server, running on a Kubernetes cluster. You'll create a GPU-enabled Kubernetes cluster with your chosen cloud provider, then use Helm to deploy the MAX Serve container, which exposes a REST API endpoint for making inference requests to Llama 3.1.
MAX Serve supports the following NVIDIA GPU instances:
- A100
- A10
- L4
- L40
While this tutorial uses A100 instances, you can use any of the supported instances with minor modifications to the CLI commands.
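For example, on AWS you could swap the A100-based node type for an A10-class instance such as g5.xlarge (which carries a single NVIDIA A10G GPU) in the cluster-creation command shown later in this tutorial; GCP and Azure offer comparable machine types, but confirm GPU availability and quota in your region before choosing one:
eksctl create cluster \
--name max-cluster \
--region us-east-1 \
--node-type g5.xlarge \
--nodes 1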
Install the required tools
Most of this tutorial involves interacting with your cloud provider, so make sure you have the appropriate access and permissions. In particular, creating GPU-powered Kubernetes clusters may require special privileges.
To get started, select your cloud provider below and install the corresponding required tools.
- AWS
- GCP
- Azure
To work with AWS, you'll need to install and configure two command-line tools. Begin by installing the AWS CLI using the AWS CLI installation guide, then install eksctl following the eksctl installation guide.
After installation, authenticate your AWS account using:
aws configure
This will prompt you for your AWS credentials. For a complete setup walkthrough, refer to the AWS authentication guide or the Amazon EKS setup documentation.
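To confirm that your credentials are configured correctly, you can check which identity the CLI is using:
aws sts get-caller-identity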
Start by installing the Google Cloud CLI following the GCP CLI installation guide.
Once installed, authenticate and configure your GCP environment:
gcloud auth login
gcloud config set project YOUR_PROJECT_ID
gcloud config set compute/region us-central1
For additional configuration options, see the GCP authentication guide.
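To double-check which account and project the CLI will use, list your active credentials and configuration:
gcloud auth list
gcloud config list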
Begin by installing the Azure CLI following the Azure CLI installation guide.
Once installed, authenticate your Azure account:
az login
This command will open your default browser to complete the authentication process. For additional authentication methods, consult the Azure authentication guide.
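To confirm that you're signed in to the subscription you intend to use, you can run:
az account show --output table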
You'll also need kubectl and Helm, regardless of which cloud provider you choose:
- Install kubectl: Follow the kubectl installation guide.
- Install Helm: Follow the Helm installation guide.
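To confirm that both tools are installed and on your PATH, check their versions:
kubectl version --client
helm version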
Now that you have the prerequisites out of the way, you can create a Kubernetes cluster with GPU nodes on your preferred cloud provider.
Create a Kubernetes cluster with GPU nodes
To get started, you'll need a Kubernetes cluster equipped with GPU nodes to handle the compute demands of LLM inference. We recommend instance types with NVIDIA A100 GPUs for their high performance and efficiency on AI workloads.
- AWS
- GCP
- Azure
Run the following command to create an EKS cluster named max-cluster in the us-east-1 region with a single p4d.24xlarge node, which includes NVIDIA A100 GPUs:
eksctl create cluster \
--name max-cluster \
--region us-east-1 \
--node-type p4d.24xlarge \
--nodes 1
For more information on eksctl create cluster, see Create an Amazon EKS Cluster.
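eksctl adds the new cluster to your kubeconfig when it finishes, so you can confirm that the GPU node registered and is in the Ready state:
kubectl get nodes -o wide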
Run the following command to create a regional GKE cluster named max-cluster with a single a2-highgpu-1g node that includes one NVIDIA A100 GPU:
gcloud container clusters create max-cluster \
--region us-central1 \
--node-locations us-central1-a \
--machine-type a2-highgpu-1g \
--num-nodes 1 \
--accelerator type=nvidia-tesla-a100,count=1
Then set up the required NVIDIA driver:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
For more information on gcloud container clusters create, see Creating a zonal cluster.
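Once the driver DaemonSet finishes installing, you can confirm that the GPU is exposed as an allocatable resource on the node:
kubectl describe nodes | grep "nvidia.com/gpu"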
First, run the following command to create a resource group in your chosen region:
az group create --name my-resource-group --location eastus
Then, run the following command to create the AKS cluster:
az aks create \
--resource-group my-resource-group \
--name max-cluster \
--node-count 1 \
--generate-ssh-keys \
--node-vm-size "standard_nc24ads_a100_v4"
After the cluster is created, configure your local environment to connect to it by retrieving the cluster credentials:
az aks get-credentials --resource-group my-resource-group --name max-cluster
For more information on az aks create, see Deploy an AKS cluster using Azure CLI.
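You can then confirm that kubectl is pointed at the new cluster and that the GPU node is in the Ready state:
kubectl get nodes -o wide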
Set up a Kubernetes namespace
Next, we'll create a dedicated namespace:
kubectl create namespace max-openai-api-demo
Then set this namespace as our default:
kubectl config set-context --current --namespace=max-openai-api-demo
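To verify that the namespace took effect for your current context, you can print it back out; it should return max-openai-api-demo:
kubectl config view --minify --output 'jsonpath={..namespace}'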
Deploy using Helm
- AWS
- GCP
- Azure
Now we'll deploy the Llama 3.1 model graph with MAX using Helm:
helm install max-openai-api oci://registry-1.docker.io/modular/max-openai-api-chart \
--version 24.6.0 \
--namespace max-openai-api-demo \
--set huggingfaceRepoId=modularai/llama-3.1 \
--set maxServe.maxLength=512 \
--set maxServe.maxCacheBatchSize=16 \
--set env.HF_HUB_ENABLE_HF_TRANSFER=1 \
--timeout 15m0s \
--wait
Now we'll deploy the Llama 3.1 model graph with MAX using Helm:
helm install max-openai-api oci://registry-1.docker.io/modular/max-openai-api-chart \
--version 24.6.0 \
--namespace max-openai-api-demo \
--set huggingfaceRepoId=modularai/llama-3.1 \
--set maxServe.maxLength=512 \
--set maxServe.maxCacheBatchSize=16 \
--set env.HF_HUB_ENABLE_HF_TRANSFER=1 \
--set "resources.limits.nvidia\\.com/gpu=1" \
--set "resources.requests.nvidia\\.com/gpu=1" \
--timeout 15m0s \
--wait
Now we'll deploy the Llama 3.1 model graph with MAX using Helm:
helm install max-openai-api oci://registry-1.docker.io/modular/max-openai-api-chart \
--version 24.6.0 \
--namespace max-openai-api-demo \
--set huggingfaceRepoId=modularai/llama-3.1 \
--set maxServe.maxLength=512 \
--set maxServe.maxCacheBatchSize=16 \
--set env.HF_HUB_ENABLE_HF_TRANSFER=1 \
--timeout 15m0s \
--wait
When you run this command, Helm begins a multi-stage deployment process. First, it pulls the MAX container image from Docker Hub, which contains the essential components: the MAX Engine, MAX Serve, and the Llama components. Next, it downloads the Llama 3.1 GGUF model weights. Finally, it configures and launches the model as an endpoint, making it accessible on port 8000. You'll need to set up port forwarding to access this endpoint.
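Before moving on, you can also ask Helm for the status of the release at any time; this reports whether the install succeeded and reprints the chart's notes:
helm status max-openai-api --namespace max-openai-api-demo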
Verify and test the deployment
After deploying, follow these steps to verify and test your deployment:
- Watch the pod status to ensure it's running:
kubectl get pods -w
- Check the logs for any startup issues:
kubectl logs -f POD_NAME
- Set up port forwarding to access the service locally:
Get the name of your MAX pod:
POD_NAME=$(kubectl get pods -l "app.kubernetes.io/name=max-openai-api-chart,app.kubernetes.io/instance=max-openai-api" -o jsonpath="{.items[0].metadata.name}")
Then, retrieve the container port that MAX is listening on:
CONTAINER_PORT=$(kubectl get pod $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
Finally, set up port forwarding to make MAX accessible on localhost:8000:
kubectl port-forward $POD_NAME 8000:$CONTAINER_PORT &
Send an inference request
Now that your deployment is verified and port forwarding is set up, you can test the model by sending it a chat request. You'll send the request to the server's OpenAI-compatible chat completions endpoint.
Open a new tab in your terminal and run the following command:
curl -N http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "modularai/llama-3.1",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"}
]
}'
The following is the expected output:
The Los Angeles Dodgers won the 2020 World Series, defeating the Tampa Bay Rays in the series 4 games to 2. This was the Dodgers' first World Series title since 1988.
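Because this is an OpenAI-compatible endpoint, you can also try a streamed response. The example below assumes the server honors the standard stream parameter from the chat completions API, in which case tokens arrive incrementally instead of as a single JSON response:
curl -N http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "modularai/llama-3.1",
"messages": [
{"role": "user", "content": "Write a haiku about Kubernetes."}
],
"stream": true
}'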
Monitoring
Once deployed, you can use the following optional commands to monitor your deployment's health and performance:
- Check pod logs:
kubectl logs -f $POD_NAME
- Monitor node resources:
kubectl top nodes
- Monitor pod resources:
kubectl top pods
- Monitor GPU utilization:
kubectl exec -it $(kubectl get pods --namespace max-openai-api-demo -l app.kubernetes.io/name=max-openai-api-chart -o jsonpath='{.items[0].metadata.name}') -- nvidia-smi
For more information on benchmarking and additional performance metrics, see Benchmark MAX performance.
Cleanup
When you're done testing or need to tear down the environment, clean up your resources so you aren't billed for idle GPU nodes.
First, uninstall the Helm release:
helm uninstall max-openai-api --namespace max-openai-api-demo
Then, delete the Kubernetes namespace:
kubectl delete namespace max-openai-api-demo
Finally, delete your Kubernetes cluster:
- AWS
- GCP
- Azure
The following command deletes an Amazon EKS cluster and all associated resources in a specified region:
eksctl delete cluster --name max-cluster --region us-east-1
For more information on eksctl delete cluster, see Delete a cluster.
The following command deletes a GKE cluster and its associated resources in the specified region:
gcloud container clusters delete max-cluster --region us-central1
For more information on gcloud container clusters delete, see Deleting a cluster.
The following command deletes an AKS cluster and its associated resources in a specified resource group:
az aks delete --resource-group my-resource-group --name max-cluster
For more information on az aks delete, see Delete an Azure Kubernetes Service cluster.
Next steps
You now have a GPU-powered MAX deployment running in the cloud, ready to handle LLM inference at scale with features like optimized GPU utilization, automatic scaling, and robust monitoring. Be sure to monitor performance and costs, and tailor configurations to your specific workload needs.
Keep in mind that this is just a preview of MAX on NVIDIA GPUs. We're working hard to add support for more hardware, including AMD GPUs, and optimize performance for more GenAI models.
To stay up to date with new releases, sign up for our newsletter, check out the community, and join our forum.
And if you're interested in becoming a design partner to get early access and give us feedback, please contact us.