Deploy Llama 3 on GPU with MAX Serve

This guide walks through serving Llama 3 models with MAX Serve, from local testing to production deployment on major cloud platforms. You'll learn to automate the deployment process using Infrastructure-as-Code (IaC) and optimize performance with GPU resources.

MAX Serve provides a streamlined way to deploy large language models (LLMs) with production-ready features like GPU acceleration, automatic scaling, and monitoring capabilities. Whether you're building a prototype or preparing for production deployment, this guide will help you set up a robust serving infrastructure for Llama 3.

The tutorial is organized into the following sections:

  • Local setup: Run Llama 3 locally to verify its basic functionality.
  • Cloud deployment: Deploy Llama 3 to AWS, GCP, or Azure using IaC templates and CLI commands.

Local setup

In this section, you will set up and run Llama 3 locally to understand its capabilities and validate functionality before moving to the cloud.

1. Clone the MAX repository

To get started, let's clone the MAX repository and navigate to the appropriate directory:

git clone https://github.com/modularml/max && cd max/pipelines/python

2. Run Llama 3 locally

Next, use the magic CLI tool to interact with the Llama 3 model locally and ensure that the model runs as expected before deploying it in the cloud.

  1. Generate a response to a prompt with the following command:
magic run llama3 --prompt "What is the meaning of life?"
  2. Start the model server using magic run serve. The --huggingface-repo-id flag specifies which model to load.
magic run serve --huggingface-repo-id modularai/llama-3.1

This starts a local server where you can test Llama 3's response generation capabilities. You should see the following log:

Uvicorn running on http://0.0.0.0:8000
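
In a separate terminal, you can confirm the server is reachable before moving on. This is a quick sanity check, assuming the interactive docs page mentioned later in this tutorial returns HTTP 200 once the server is ready:

curl -s -o /dev/null -w "%{http_code}\n" http://0.0.0.0:8000/docs

A 200 response means the server is listening; a connection error means it is still starting up.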

For more information on MAX-optimized models, see the Python Pipelines README.

3. Test the local endpoint

After starting the model server, you can test its functionality by sending a curl request:

curl -N http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/llama-3.1",
    "stream": true,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the World Series in 2020?"}
    ]
  }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'

After starting your server, you can go to http://0.0.0.0:8000/docs to learn more about available endpoints and API specifications.
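
If you prefer a single, non-streamed response, you can set stream to false and read the message content directly. This is a minimal sketch assuming the server implements the standard non-streaming mode of the OpenAI chat completions schema, which the streaming example above suggests:

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/llama-3.1",
    "stream": false,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the World Series in 2020?"}
    ]
  }'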

Now that the model works locally, we'll transition to cloud deployment.

Cloud deployment paths

We will use Infrastructure-as-Code (IaC) to create, configure, and deploy Llama 3 in the cloud. The cloud deployment instructions are divided by provider: AWS, GCP, and Azure.

Cloud deployment overview

For AWS, we will use CloudFormation; for GCP, Deployment Manager; and for Azure, Resource Manager. These IaC templates handle resource provisioning, networking, and security configuration. This approach simplifies deployments and ensures they are repeatable.

The key steps are:

  • Create and deploy stack/resources: Use IaC templates for each cloud provider to deploy Llama 3.
  • Test the endpoint: Retrieve the public IP address after deployment and send a request to test the Llama 3 endpoint in the cloud.

Each cloud provider's section provides complete commands for setup, configuration, deployment, and testing.

To better understand the flow of the deployment, here is a high-level overview of the architecture:

Figure 1. Architecture diagram of the cloud stack for deploying MAX Serve.

This architecture diagram illustrates the two-phase deployment setup for serving the Llama 3 model with MAX on cloud provider infrastructure.

The deployment process is divided into two phases:

  • Phase 1: Cloud stack creation: In this initial phase, the following infrastructure is provisioned and configured to prepare for serving requests:
    • Public IP assignment: The cloud provider assigns a public IP to the virtual machine (VM), allowing it to be accessed externally.
    • Firewall/Security group configuration: Security settings, such as firewall rules or security groups, are applied to allow traffic on port 80, so that the instance accepts only HTTP requests on that port.
    • GPU compute instance setup: A GPU-enabled VM is created to handle model inference efficiently. This instance includes:
      • GPU drivers/runtime installation: Necessary GPU drivers and runtime libraries are installed to enable hardware acceleration for model processing.
      • Docker container initialization: A Docker container is launched on the VM, where it pulls the necessary images from the Docker Container Registry. This registry serves as a central repository for storing Docker images, making it easy to deploy and update the application.

Inside the container, MAX Serve is set up alongside the Llama 3 model. This setup prepares the environment for serving requests but does not yet expose the endpoint to users (a hedged sketch of such a container launch follows this list).

  • Phase 2: Serving the user endpoint: Once the cloud stack is configured and the VM is set up, the deployment enters the second phase, where it starts serving user requests:
    • HTTP endpoint exposure: With the VM and Docker container ready, the system opens an OpenAI-compatible HTTP endpoint on port 80, allowing users to interact with the deployed Llama 3 model.
    • Request handling by MAX Serve: When a user sends an HTTP request to the public IP, MAX Serve processes the incoming request within the Docker container and forwards it to the Llama 3 model for inference. The model generates a response, which is then returned to the user via the endpoint.
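
To make the container initialization step more concrete, here is a minimal sketch of how a GPU VM might launch the MAX container by hand. The port mapping and entrypoint arguments are assumptions for illustration only; the IaC templates used later in this tutorial configure this automatically, and the MAX container documentation is the authoritative reference for the actual options.

# Hypothetical manual launch of the MAX container on a GPU VM.
# The port mapping and entrypoint arguments below are illustrative assumptions.
docker run --gpus all \
  -p 80:8000 \
  -e HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN} \
  modular/max-openai-api:24.6.0 \
  --huggingface-repo-id modularai/llama-3.1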

Prerequisites

Be sure that you have the following prerequisites, as well as appropriate access and permissions for the cloud provider of your choice.

  • GPU resources: You'll need access to GPU resources in your cloud account with the following specifications:
    • Minimum GPU memory: 24GB
    • Supported GPU types: NVIDIA A100 (most optimized), A10G, L4 and L40
  • A Hugging Face user access token: A valid Hugging Face token is required to access the model. To create a Hugging Face user access token, see Access Tokens. You must make your token available in your environment with the following command:

    export HUGGING_FACE_HUB_TOKEN="<YOUR-HUGGING-FACE-HUB-TOKEN>"
  • Docker installation: Install the Docker Engine and CLI. We use a pre-configured GPU-enabled Docker container from our public repository. The container image (modular/max-openai-api:24.6.0) is available on Docker Hub. For more information, see MAX container.

  • Cloud CLI tools: Before deploying, ensure that you have the respective cloud provider CLI tools installed.

Configure the AWS CLI:

aws configure

Log in to your AWS account:

aws sso login

Check the credentials via cat ~/.aws/credentials to make sure they are set up correctly. You can also include the credentials as environment variables:

export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"
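
Once credentials are configured, you can verify that the CLI is authenticated before creating any resources:

aws sts get-caller-identity

This returns the account and IAM identity the CLI will use; an error here means the credentials above are not being picked up.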

1. Create stack/deployment

In this section, we'll walk through creating a deployment stack on AWS, GCP, and Azure. Each cloud provider has its own configuration steps, detailed below, but we simplify the setup by using Infrastructure-as-Code (IaC) templates.

Start by navigating to the max/tutorials/max-serve-cloud-configs/ directory, where the necessary IaC templates and configuration files are organized for each cloud provider. This directory includes all files required to deploy the MAX Serve setup to AWS, GCP, or Azure:

max/tutorials/max-serve-cloud-configs/
├── aws
│   ├── max-serve-aws.yaml
│   └── notify.sh
├── azure
│   ├── max-serve-azure.json
│   └── notify.sh
└── gcp
    ├── max-serve-gcp.jinja
    └── notify.sh

With these IaC templates ready, choose your preferred cloud provider and follow the step-by-step instructions specific to each platform.

The following steps walk through the AWS path. First, navigate to the AWS directory:

cd aws

Set the region in your environment:

export REGION="REGION" # example: `us-east-1`

Then, create the stack. You can explore the max-serve-aws.yaml file for AWS CloudFormation configuration information.

export STACK_NAME="max-serve-stack"

aws cloudformation create-stack --stack-name ${STACK_NAME} \
  --template-body file://max-serve-aws.yaml \
  --parameters \
    ParameterKey=InstanceType,ParameterValue=g5.4xlarge \
    ParameterKey=HuggingFaceHubToken,ParameterValue=${HUGGING_FACE_HUB_TOKEN} \
    ParameterKey=HuggingFaceRepoId,ParameterValue=modularai/llama-3.1 \
  --capabilities CAPABILITY_IAM \
  --region ${REGION}
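
Stack creation runs asynchronously. You can check its current status at any time with:

aws cloudformation describe-stacks --stack-name ${STACK_NAME} \
  --query "Stacks[0].StackStatus" \
  --output text \
  --region ${REGION}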

2. Wait for resources to be ready

In this step, we'll wait for the resources to be ready. Stack and deployment creation may take some time to complete.

aws cloudformation wait stack-create-complete \
  --stack-name ${STACK_NAME} \
  --region ${REGION}
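
If the wait command exits with an error, the stack likely failed to create. The most recent stack events usually point to the failing resource:

aws cloudformation describe-stack-events --stack-name ${STACK_NAME} \
  --region ${REGION} \
  --max-items 10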

3. Retrieve instance information

After the resources are deployed, you'll need to get the instance information, such as the public DNS name or IP address, which we will use to test the endpoint.

INSTANCE_ID=$(aws cloudformation describe-stacks --stack-name ${STACK_NAME} \
  --query "Stacks[0].Outputs[?OutputKey=='InstanceId'].OutputValue" \
  --output text \
  --region ${REGION})
PUBLIC_IP=$(aws ec2 describe-instances --instance-ids ${INSTANCE_ID} \
  --query 'Reservations[0].Instances[0].PublicIpAddress' \
  --output text \
  --region ${REGION})
echo "Instance ID: ${INSTANCE_ID}"
echo "Public IP: ${PUBLIC_IP}"
aws ec2 wait instance-running --instance-ids ${INSTANCE_ID} --region ${REGION}
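
If PUBLIC_IP comes back empty or as None, the instance may still be initializing. A small guard (a hypothetical helper, not part of the tutorial's templates) avoids sending requests to an empty address:

if [ -z "${PUBLIC_IP}" ] || [ "${PUBLIC_IP}" = "None" ]; then
  echo "Public IP not available yet; re-run the describe-instances command above."
fi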

4. Test the endpoint

We will use the public IP address obtained in the previous step to test the endpoint with the following curl request:

curl -N http://$PUBLIC_IP/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/llama-3.1",
    "stream": true,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the World Series in 2020?"}
    ]
  }'
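
The instance can take several minutes after the stack reports completion to pull the container image and load the model weights, so the endpoint may not answer right away. A simple way to wait, assuming that any HTTP response on port 80 means MAX Serve is up, is to poll until the connection succeeds:

# Poll until the server accepts HTTP connections on port 80.
until curl -s -o /dev/null http://$PUBLIC_IP/v1/chat/completions; do
  echo "Waiting for MAX Serve to come online..."
  sleep 30
done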

5. Delete the cloud resources

Cleaning up resources is critical to avoid unwanted costs. Use the following commands to safely terminate all resources created in this tutorial.

First, delete the stack:

aws cloudformation delete-stack --stack-name ${STACK_NAME} --region ${REGION}

Wait for the stack to be deleted:

aws cloudformation wait stack-delete-complete \
  --stack-name ${STACK_NAME} \
  --region ${REGION}
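
To confirm that nothing is left running, you can check the instance state directly; it should report terminated once the stack is gone (recently terminated instances remain visible to this command for a short time):

aws ec2 describe-instances --instance-ids ${INSTANCE_ID} \
  --query 'Reservations[0].Instances[0].State.Name' \
  --output text \
  --region ${REGION}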

Cost estimate

When deploying Llama 3 in a cloud environment, several cost factors come into play:

Primary cost components:

  • Compute resources: GPU instances (like AWS g5.4xlarge, GCP g2-standard-8, or Azure Standard_NV36ads_A10_v5) form the bulk of the costs
  • Network transfer: Costs associated with data ingress/egress, which is critical for high-traffic applications
  • Storage: Expenses for boot volumes and any additional storage requirements
  • Additional services: Costs for logging, monitoring, and other supporting cloud services

For detailed cost estimates specific to your use case, we recommend using each cloud provider's official pricing calculator.
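
As a rough, back-of-the-envelope illustration only (the hourly rate below is an assumption, not a quoted price), compute cost for a single always-on GPU instance scales linearly with the hourly rate:

# Illustrative only: assumes roughly $1.60/hour for a g5.4xlarge-class instance.
echo "scale=2; 1.60 * 24 * 30" | bc   # ~1152.00 USD/month for compute alone

Network, storage, and monitoring charges come on top of this, so rely on the provider's calculator for real numbers.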

Next steps

Congratulations on successfully running MAX Pipelines locally and deploying Llama 3 to the cloud! 🎉

Now that you've mastered the essentials of setting up and deploying the Llama 3 model with MAX Serve, explore the rest of the MAX documentation for more tutorials on serving and deployment.
