Deploy Llama 3 on GPU-powered Kubernetes clusters

MAX simplifies the process of deploying an LLM with high performance on GPUs. And if you want to deploy at scale using Kubernetes' built-in monitoring, scaling, and cluster management, then you're in the right place. In this tutorial, you'll learn how to deploy our MAX container on Kubernetes, using your pick of AWS, GCP, or Azure.

You'll deploy Llama 3.1 as a containerized service using MAX Serve, our GPU-optimized inference server, running on a Kubernetes cluster. First you'll create a GPU-enabled Kubernetes cluster with your chosen cloud provider (AWS, GCP, or Azure), then use Helm to deploy the MAX Serve container, which provides a REST API endpoint for making inference requests to Llama 3.1.

MAX Serve supports the following NVIDIA GPUs:

  • A100
  • A10
  • L4
  • L40

While this tutorial uses A100 instances, you can use any of the supported instances with minor modifications to the CLI commands.
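
If you're on AWS, each GPU corresponds to an EC2 instance type that you pass to eksctl when you create the cluster later in this tutorial. The mapping below is a rough guide based on AWS's published instance specs, not part of the tutorial itself, so verify availability and GPU memory in your region before relying on it:

# Approximate NVIDIA GPU -> AWS EC2 instance type (verify before use):
#   A100 -> p4d.24xlarge   # 8x A100; used in this tutorial
#   A10  -> g5.xlarge      # AWS offers the closely related A10G
#   L4   -> g6.xlarge
#   L40  -> g6e.xlarge     # AWS offers the closely related L40S
#
# For example, to use an L4 node instead of A100s, change the cluster
# creation command later in this tutorial to use:
#   --node-type g6.xlarge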

Install the required tools

Most of this tutorial involves interacting with your cloud provider, so make sure you have the appropriate access and permissions. In particular, GPU-powered Kubernetes clusters may require special permissions or GPU instance quota in your account.

To get started, select your cloud provider below and install the corresponding required tools.

To work with AWS, you'll need to install and configure two command-line tools. Begin by installing the AWS CLI using the AWS CLI installation guide, then install eksctl following the eksctl installation guide.

After installation, authenticate your AWS account using:

aws configure

This will prompt you for your AWS credentials. For a complete setup walkthrough, refer to the AWS authentication guide or the Amazon EKS setup documentation.
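
To confirm that your credentials work before moving on, you can ask AWS to identify the account and role you're authenticated as (this check is an optional addition, not a required step):

# Prints the account ID and ARN of the identity your CLI is using.
aws sts get-caller-identity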

You'll also need kubectl and Helm installed: the rest of this tutorial uses kubectl to manage the cluster and Helm to deploy the MAX container.

Now that you have the prerequisites out of the way, you can create a Kubernetes cluster with GPU nodes on your preferred cloud provider.

Create a Kubernetes cluster with GPU nodes

To get started, you'll need a Kubernetes cluster equipped with GPU nodes to handle the compute demands of LLM inference. We recommend using NVIDIA's A100 instances for their high performance and efficiency in AI workloads.

Run the following command to create a cluster with a full OpenID Connect (OIDC) provider for authentication, private networking, full Elastic Container Registry (ECR) access, and multi-zone deployment:

eksctl create cluster \
  --name max-cluster \
  --region us-east-1 \
  --node-type p4d.24xlarge \
  --nodes 1 \
  --with-oidc \
  --node-private-networking \
  --full-ecr-access

For more information on eksctl create cluster, see Create an Amazon EKS Cluster.
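
eksctl updates your kubeconfig once the cluster is ready, so you can confirm that the GPU node joined the cluster before continuing. This verification step is an optional addition; the nvidia.com/gpu capacity only shows up once the NVIDIA device plugin is running on the node:

# List nodes and check that the p4d.24xlarge node is Ready.
kubectl get nodes -o wide

# Optionally confirm the node advertises GPU capacity (requires the NVIDIA device plugin).
kubectl describe nodes | grep nvidia.com/gpu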

Set up a Kubernetes namespace

Next, we'll create a dedicated namespace:

kubectl create namespace max-openai-api-demo

Then set this namespace as our default:

kubectl config set-context --current --namespace=max-openai-api-demo
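
To double-check that the namespace is now the default for your current context, you can print it back (an optional sanity check, not part of the original steps):

# Should show: namespace: max-openai-api-demo
kubectl config view --minify | grep namespace: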

Deploy using Helm

Now we'll deploy the Llama 3.1 model graph with MAX using Helm:

helm install max-openai-api oci://registry-1.docker.io/modular/max-openai-api-chart \
  --version 24.6.0 \
  --namespace max-openai-api-demo \
  --set huggingfaceRepoId=modularai/llama-3.1 \
  --set maxServe.maxLength=512 \
  --set maxServe.maxCacheBatchSize=16 \
  --set env.HF_HUB_ENABLE_HF_TRANSFER=1 \
  --timeout 15m0s \
  --wait

When you run this command, Helm begins a multi-stage deployment process. First, it pulls the MAX container image from Docker Hub, which contains the essential components: the MAX Engine, MAX Serve, and the Llama components. Next, it downloads the Llama 3.1 GGUF model weights. Finally, it configures and launches the model as an endpoint, making it accessible on port 8000. You'll need to set up port forwarding to access this endpoint.
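
Because the install runs with --wait and a 15-minute timeout, the command returns only after the pod is ready (or the timeout expires). If you want to inspect the release afterwards, or from another terminal while it deploys, standard Helm commands work; for example:

# Show the release's status, revision, and notes.
helm status max-openai-api --namespace max-openai-api-demo

# List all releases in the namespace.
helm list --namespace max-openai-api-demo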

Verify and test the deployment

After deploying, follow these steps to verify and test your deployment:

  1. Watch the pod status to ensure it's running:
kubectl get pods -w
  2. Check the logs for any startup issues:
kubectl logs -f POD_NAME
  3. Set up port forwarding to access the service locally:

Get the name of your MAX pod:

POD_NAME=$(kubectl get pods -l "app.kubernetes.io/name=max-openai-api-chart,app.kubernetes.io/instance=max-openai-api" -o jsonpath="{.items[0].metadata.name}")

Then, retrieve the container port that MAX is listening on:

CONTAINER_PORT=$(kubectl get pod $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")

Finally, set up port forwarding to make MAX accessible on localhost:8000:

kubectl port-forward $POD_NAME 8000:$CONTAINER_PORT &
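
Before sending a full chat request, you can confirm that the endpoint is reachable through the port forward. MAX Serve exposes an OpenAI-compatible API, so the following assumes the standard /v1/models listing is available; if it isn't, the chat request in the next section is the definitive test:

# Should return a JSON list that includes modularai/llama-3.1.
curl http://localhost:8000/v1/models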

Send an inference request

Now that your deployment is verified and port forwarding is set up, you can test the model by sending it a chat request. You'll send the request to the server's OpenAI-compatible chat completions endpoint.

Open a new tab in your terminal and run the following command:

curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/llama-3.1",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }'

The following is the expected output:

The Los Angeles Dodgers won the 2020 World Series, defeating the Tampa Bay Rays in the series 4 games to 2. This was the Dodgers' first World Series title since 1988.
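
The -N flag disables curl's output buffering, which matters if you request a streamed response. Since the server follows the OpenAI chat completions API, a streaming request should work by adding the standard stream parameter; this is an assumption on my part, so omit it if your MAX version doesn't support streaming:

curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/llama-3.1",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Write a haiku about Kubernetes."}
    ]
  }'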

Monitoring

Once deployed, you can use the following optional commands to monitor your deployment's health and performance:

  • Check pod logs:
kubectl logs -f $POD_NAME
  • Monitor node resources:
kubectl top nodes
  • Monitor pod resources:
kubectl top pods
  • Monitor GPU utilization:
kubectl exec -it $(kubectl get pods --namespace max-openai-api-demo -l app.kubernetes.io/name=max-openai-api-chart -o jsonpath='{.items[0].metadata.name}') -- nvidia-smi
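
Note that kubectl top relies on the Kubernetes Metrics Server, which is not installed on an EKS cluster by default. If those commands report that metrics are unavailable, you can install it with the upstream manifest (this step is an addition to the tutorial):

# Installs the Metrics Server so kubectl top can report node and pod usage.
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml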

For more information on benchmarking and additional performance metrics, see Benchmark MAX performance.

Cleanup

When you're done testing or need to tear down the environment:

First, uninstall the Helm release:

helm uninstall max-openai-api --namespace max-openai-api-demo

Then, delete the Kubernetes namespace:

kubectl delete namespace max-openai-api-demo

Finally, delete your Kubernetes cluster. The following command deletes the Amazon EKS cluster and all associated resources in the specified region:

eksctl delete cluster --name max-cluster --region us-east-1

For more information on eksctl delete cluster, see Delete a cluster.
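
Cluster deletion can take several minutes. To confirm that everything is gone, you can list the clusters remaining in the region (an optional check):

# Should no longer list max-cluster once deletion completes.
eksctl get cluster --region us-east-1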

Next steps

You now have a GPU-powered MAX deployment running in the cloud, ready to handle LLM inference at scale with features like optimized GPU utilization, automatic scaling, and robust monitoring. Be sure to monitor performance and costs, and tailor configurations to your specific workload needs.

Keep in mind that this is just a preview of MAX on NVIDIA GPUs. We're working hard to add support for more hardware, including AMD GPUs, and optimize performance for more GenAI models.

To stay up to date with new releases, sign up for our newsletter, check out the community, and join our forum.

And if you're interested in becoming a design partner to get early access and give us feedback, please contact us.
