
# Deploy MAX on GPU with self-hosted endpoints

In this tutorial, you'll deploy a MAX inference endpoint with Llama 3 from
local testing to production on AWS, GCP, or Azure. You'll learn to serve models
with an OpenAI-compatible endpoint, automate deployment using
Infrastructure-as-Code templates, and optimize performance with GPU
resources—establishing the foundation for production-ready LLM deployments.

MAX provides a streamlined way to deploy large language models (LLMs) with
production-ready features like GPU acceleration, automatic scaling, and
monitoring capabilities. Whether you're building a prototype or preparing for
production deployment, this tutorial will help you set up a robust serving
infrastructure for Llama 3.

And although we're using Llama 3 in these instructions, you can swap it for one
of the hundreds of other LLMs from Hugging Face by browsing our
[supported models](https://docs.modular.com/max/models.md).

The tutorial is organized into the following sections:

- **[Local setup](#local-setup)**: Run Llama 3 locally to verify its basic
functionality.
- **[Cloud deployment](#cloud-deployment)**: Deploy Llama 3 to AWS, GCP, or
Azure using IaC templates and CLI commands.

Before you begin, check the
[system requirements](https://docs.modular.com/max/packages.md#system-requirements).

## Local setup

In this section, you will set up and run Llama 3 locally to understand its
capabilities and validate functionality before moving to the cloud. This part
doesn't require a GPU because MAX can also run Llama 3 on CPUs, but we
recommend using a [compatible GPU](https://docs.modular.com/max/faq.md#gpu-requirements) for the best
performance.

### 1. Set up your environment

Create a Python project to install our APIs and CLI tools:

### 2. Serve Llama 3 locally

Next, use the `max` CLI tool to start an endpoint with the Llama 3 model
locally, and ensure that the model runs as expected before deploying it in the
cloud.

:::note

If you want to try a different model, swap the
`modularai/Llama-3.1-8B-Instruct-GGUF` name in all the commands to another
Hugging Face model ID from our [supported models](https://docs.modular.com/max/models.md). Just
be aware that some Hugging Face models require access approval and might have
different memory requirements.

:::

1. Generate a response to a prompt with the following command:

    ```bash
    max generate --model modularai/Llama-3.1-8B-Instruct-GGUF \
      --prompt "What is the meaning of life?" \
      --max-length 250
    ```

2. Start the model server using `max serve`:

    ```bash
    max serve --model modularai/Llama-3.1-8B-Instruct-GGUF
    ```

    This starts a local server with an OpenAI-compatible endpoint. Next,
    we'll send it an inference request.

:::note GPU-enabled Docker containers
We also provide a pre-configured GPU-enabled Docker container that simplifies
deployment. We'll use
the MAX container in the [cloud deployment](#cloud-deployment) steps.
:::

### 3. Test the local endpoint

The endpoint is ready when you see this message in the terminal:

```output
Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

Then, you can test its functionality by sending a `curl` request from a new
terminal:

  ```bash
  curl -N http://0.0.0.0:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
          "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
          "messages": [
              {"role": "system", "content": "You are a helpful assistant."},
              {"role": "user", "content": "Who won the World Series in 2020?"}
          ]
      }' | jq -r '.choices[].message.content'
  ```

You should see output like this:

```output
The Los Angeles Dodgers won the 2020 World Series. They defeated the Tampa Bay Rays in the series 4 games to 2. This was the Dodgers' first World Series title since 1988.
```

To learn more about the supported REST body parameters, see our [API reference
for chat completion](https://docs.modular.com/max/rest-api.md#POST/v1/chat/completions).
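
If you plan to send several test prompts, it can help to build the request body
from shell variables instead of editing the JSON by hand. The following is a
minimal sketch and not part of MAX: `build_chat_payload` is a hypothetical
helper name, and it escapes only backslashes and double quotes, so prompts that
contain newlines or control characters need a real JSON tool such as `jq`.

```bash
# Hypothetical helper (not part of MAX): build a chat-completions request body.
# Usage: build_chat_payload MODEL PROMPT
build_chat_payload() {
  payload_model="$1"
  # Escape backslashes first, then double quotes, so the prompt stays valid JSON.
  payload_prompt=$(printf '%s' "$2" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}\n' \
    "$payload_model" "$payload_prompt"
}

# Print the body for the model used in this tutorial.
build_chat_payload "modularai/Llama-3.1-8B-Instruct-GGUF" "What is a GPU?"

# To send it to the local endpoint started earlier, pass it to curl:
# curl -N http://0.0.0.0:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$(build_chat_payload "modularai/Llama-3.1-8B-Instruct-GGUF" "What is a GPU?")"
```

Keeping the model name in one place also makes it easier to swap in another
supported model later.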

Now that the model works locally, we'll transition to cloud deployment.

## Cloud deployment paths {#cloud-deployment}

We will use Infrastructure-as-Code (IaC) to create, configure, and deploy Llama
3 in the cloud. The cloud deployment instructions are divided by provider: AWS,
GCP, and Azure.

### Cloud deployment overview

We will use CloudFormation for AWS, Deployment Manager for GCP, and Resource
Manager for Azure. These IaC templates handle resource provisioning,
networking, and security configuration, which simplifies deployments and makes
them repeatable.

The key steps are:

- **Create and Deploy Stack/Resources**: Use IaC templates for each cloud
provider to deploy Llama 3.
- **Test the Endpoint**: Retrieve the public IP address after deployment and
send a request to test the Llama 3 endpoint in the cloud.

Each cloud-specific section provides complete commands for setup,
configuration, deployment, and testing.

To better understand the flow of the deployment, here is a high-level overview
of the architecture:

<figure>
  <img src={require('./images/local-to-cloud/cloud-arch-light.png').default}
       className="light" alt="" width="420" />
  <img src={require('./images/local-to-cloud/cloud-arch-dark.png').default}
       className="dark" alt="" width="420" />
  <figcaption>**Figure 1.** Architecture diagram of the cloud stack for deploying MAX.</figcaption>
</figure>

This architecture diagram illustrates the two-phase deployment setup for serving
the Llama 3 model with MAX on cloud provider infrastructure.

The deployment process is divided into two phases:

- **Phase 1: Cloud stack creation**: In this initial phase, the following
infrastructure is provisioned and configured to prepare for serving requests:
  - **Public IP assignment**: The cloud provider assigns a public IP to the
  virtual machine (VM), allowing it to be accessed externally.
  - **Firewall/Security group configuration**: Security settings, such as
  firewall rules or security groups, are applied to allow inbound traffic on
  port 80, so the instance can receive HTTP requests.
  - **GPU compute instance setup**: A GPU-enabled VM is created to handle model
  inference efficiently. This instance includes:
    - **GPU drivers/runtime installation**: Necessary GPU drivers and runtime
      libraries are installed to enable hardware acceleration for model
      processing.
    - **Docker container initialization**: A Docker container is launched on the
      VM, where it pulls the necessary images from the Docker Container
      Registry. This registry serves as a central repository for storing Docker
      images, making it easy to deploy and update the application.

Inside the container, MAX is set up alongside the Llama 3 model. This
setup prepares the environment for serving requests but does not yet expose the
endpoint to users.

:::note GPU-enabled Docker containers
The pre-configured GPU-enabled Docker container includes all necessary
dependencies and configurations for running Llama 3 with GPU acceleration.

The provided IaC templates initialize the MAX container. If you don't use the
provided templates for infrastructure setup, you can initialize the container
image with the `docker run` command. For more information, see
[MAX container](https://docs.modular.com/max/container.md).
:::

- **Phase 2: Serving the user endpoint**: Once the cloud stack is configured and
the VM is set up, the deployment enters the second phase, where it starts
serving user requests:
  - **HTTP endpoint exposure**: With the VM and Docker container ready, the
  system opens an OpenAI-compatible HTTP endpoint on port 80, allowing users to
  interact with the deployed Llama 3 model.
  - **Request handling by MAX**: When a user sends an HTTP request to the
  public IP, MAX processes the incoming request within the Docker
  container and forwards it to the Llama 3 model for inference. The model
  generates a response, which is then returned to the user via the endpoint.

:::caution

For the sake of this tutorial, we expose the public IP address of the VM to the
internet. This is not recommended for direct use in production environments as
it may expose your model to security risks.

:::

### Prerequisites

Be sure that you have the following prerequisites, as well as appropriate access
and permissions for the cloud provider of your choice.

- **GPU resources**: You'll need access to GPU resources in your cloud account
with the following specifications:
  - **Minimum GPU memory**: 24GB
  - **Supported GPU types**: [See our compatible
    GPUs](https://docs.modular.com/max/packages.md#gpu-compatibility)

- This tutorial has been tested on the following NVIDIA instances: `g5.4xlarge`
  (A10G) on AWS, `g2-standard-8` (L4) on GCP, and `Standard_NV36ads_A10_v5`
  (A10) on Azure. It has also been tested on the AMD
  `Standard_ND96isr_MI300X_v5` (MI300X) Azure instance. You can alter the
  provided cloud config files to deploy MAX on any
  [compatible cloud instance or virtual machine](https://docs.modular.com/max/container.md#recommended-cloud-instances).

- **A Hugging Face user access token**: A valid Hugging Face token is required
to access the model. To create a Hugging Face user access token, see
[Access Tokens](https://huggingface.co/settings/tokens). You must make your
token available in your environment with the following command:

  ```bash
  export HF_TOKEN="hf_..."
  ```

- **Docker installation**: Install the
[Docker Engine and CLI](https://docs.docker.com/engine/install/). We use a
pre-configured GPU-enabled Docker container from our public repository. For more
information, check out all of our
[available containers](https://docs.modular.com/max/container.md#container-contents).

- **Cloud CLI tools**: Before deploying, ensure that you have the respective
cloud provider CLI tools installed.
  - [AWS CLI v2](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
  installed and configured with appropriate credentials
  - [Google Cloud SDK](https://cloud.google.com/sdk/docs/install) installed and
  initialized
  - [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli)
  installed, logged in, and configured
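
Before continuing, you may want to confirm that the tools and the token from
the list above are actually available in your current shell. This is a minimal
sketch under stated assumptions: `missing_tools` is a hypothetical helper name,
and the tool list is an example that you should adjust to your provider (`aws`,
`gcloud`, or `az`).

```bash
# Hypothetical helper (not part of MAX): print each command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" > /dev/null 2>&1 || echo "$tool"
  done
}

# Example tool list: replace `aws` with `gcloud` or `az` for other providers.
MISSING=$(missing_tools git docker curl aws)
if [ -n "$MISSING" ]; then
  echo "Missing tools:" $MISSING >&2
fi

# The deployment commands later in this tutorial read HF_TOKEN from the environment.
if [ -z "${HF_TOKEN:-}" ]; then
  echo "HF_TOKEN is not set" >&2
fi
```

The check only reports problems and doesn't stop your shell, so you can paste
it into an interactive session.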

**AWS:**

Configure the AWS CLI:

```bash
aws configure
```

Log in to your AWS account:

```bash
aws sso login
```

Check the credentials via `cat ~/.aws/credentials` to make sure they are set up
correctly. You can also provide the credentials as environment variables:

```bash
export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"
```

---

**GCP:**

Initialize the Google Cloud SDK:

```bash
gcloud init
```

Log in to your Google Cloud account:

```bash
gcloud auth login
```

---

**Azure:**

Initialize the Azure CLI:

```bash
az init
```

Log in to your Azure account:

```bash
az login
```

### 1. Create stack/deployment

In this section, we'll walk through creating a deployment stack on AWS, GCP,
and Azure. Each cloud provider has its own configuration steps, detailed below,
but we simplify the setup by using Infrastructure-as-Code (IaC) templates.

Start by cloning the MAX repository and navigating to the
`modular/examples/cloud-configs/` directory, where the necessary IaC
templates and configuration files are organized for each cloud provider.

```bash
git clone -b stable https://github.com/modular/modular && cd modular/examples/cloud-configs
```

This directory includes all files required to deploy MAX to AWS, GCP, or Azure:

```bash
modular/examples/cloud-configs/
├── aws
│   ├── max-nvidia-aws.yaml
│   └── notify.sh
├── azure
│   ├── amd
│   │   ├── max-amd-azure.json
│   │   └── notify.sh
│   └── nvidia
│       ├── max-nvidia-azure.json
│       └── notify.sh
└── gcp
    ├── max-nvidia-gcp.jinja
    └── notify.sh
```

:::note AMD GPU cloud deployment
Azure provides AMD GPU virtual machines. If you want to deploy MAX with AMD GPUs
on Azure, you can use the
`modular/examples/cloud-configs/azure/amd/max-amd-azure.json` config file. This
file defines the appropriate image and settings for AMD-based inference
workloads on an
[ND MI300X v5 series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/ndmi300xv5-series)
VM.
:::

With these IaC templates ready, choose your preferred cloud provider and follow
the step-by-step instructions specific to each platform.

:::note Preparing the deployment takes some time

Stack creation may take some time to complete and completion times differ across
cloud providers.
:::

**AWS:**

First navigate to the AWS directory:

```bash
cd aws
```

Set the region in your environment:

```bash
export REGION="REGION" # example: `us-east-1`
```

Then, create the stack. You can explore the `max-nvidia-aws.yaml` file for AWS
CloudFormation configuration information.

:::note Stack naming

The stack name must be **unique**, so be sure to change `STACK_NAME` if you
create multiple stacks.

:::

```bash
export STACK_NAME="max-serve-stack"

aws cloudformation create-stack --stack-name ${STACK_NAME} \
 --template-body file://max-nvidia-aws.yaml \
 --parameters \
   ParameterKey=InstanceType,ParameterValue=g5.4xlarge \
   ParameterKey=HuggingFaceHubToken,ParameterValue=${HF_TOKEN} \
   ParameterKey=HuggingFaceRepoId,ParameterValue=modularai/Llama-3.1-8B-Instruct-GGUF \
 --capabilities CAPABILITY_IAM \
 --region ${REGION}
```
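
If you expect to create several stacks, one way to keep the names unique is to
append a timestamp. This is only a sketch: `unique_name` is a hypothetical
helper and the suffix format is an assumption, not something the template
requires. Set the variable before you run the `create-stack` command, and keep
using the same shell session so later steps see the same value.

```bash
# Hypothetical helper (not part of MAX): append a UTC timestamp to a base name.
unique_name() {
  printf '%s-%s\n' "$1" "$(date -u +%Y%m%d%H%M%S)"
}

STACK_NAME=$(unique_name "max-serve-stack")
export STACK_NAME
echo "Stack name: ${STACK_NAME}"
```

The result contains only letters, digits, and hyphens, so the same approach
should also suit the `DEPLOYMENT_NAME` values used for the other providers.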

---

**GCP:**

:::note GCP access requirements

You must have access to `deploymentmanager.googleapis.com`,
`logging.googleapis.com`, `compute.googleapis.com` and be able to use
`gcloud compute firewall-rules` to configure inbound traffic.
:::

First, navigate to the GCP directory:

```bash
cd gcp
```

Set the project ID and zone:

```bash
export PROJECT_ID="YOUR PROJECT ID"
export ZONE="ZONE" # example `us-east1-d`
```

Enable the required APIs:

```bash
gcloud services enable deploymentmanager.googleapis.com --project=${PROJECT_ID} && \
gcloud services enable logging.googleapis.com --project=${PROJECT_ID} && \
gcloud services enable compute.googleapis.com --project=${PROJECT_ID}
```

Create the deployment with the following command. You can explore the
`max-nvidia-gcp.jinja` file for more information on the Deployment Manager
configuration.

:::note Deployment naming

The deployment name must be **unique** so please be sure to change the
`DEPLOYMENT_NAME` if you create multiple deployments.

:::

```bash
export DEPLOYMENT_NAME="max-serve-deployment"
export INSTANCE_NAME="max-serve-instance"

gcloud deployment-manager deployments create ${DEPLOYMENT_NAME} \
  --template max-nvidia-gcp.jinja \
  --properties "\
instanceName:${INSTANCE_NAME},\
zone:${ZONE},\
machineType:g2-standard-8,\
acceleratorType:nvidia-l4,\
acceleratorCount:1,\
sourceImage:common-cu123-v20240922-ubuntu-2204-py310,\
huggingFaceHubToken:${HF_TOKEN},\
huggingFaceRepoId:modularai/Llama-3.1-8B-Instruct-GGUF" \
  --project ${PROJECT_ID}
```

---

**Azure (NVIDIA):**

First, navigate to the Azure directory:

```bash
cd azure/nvidia
```

Set the region:

```bash
export REGION="REGION" # example `westus3`
```

Then, create the resource group:

:::note Resource group and deployment naming

If you receive an error about resource group location conflicts, it means the
resource group already exists in a different location.

You can either:

- Use a new resource group name
- Use the existing resource group's location

Additionally, the deployment name must be **unique** so please be sure to change
the `DEPLOYMENT_NAME` if you create multiple deployments.

:::

```bash
export RESOURCE_GROUP_NAME="maxServeResourceGroup"
export DEPLOYMENT_NAME="maxServeDeployment"
az group create --name ${RESOURCE_GROUP_NAME} --location ${REGION}
```

Check the status of the resource group:

```bash
az group show -n ${RESOURCE_GROUP_NAME} --query properties.provisioningState -o tsv
```

Create and encode the startup script:

```bash
STARTUP_SCRIPT='#!/bin/bash

sudo usermod -aG docker $USER

sudo systemctl restart docker

sleep 10

HF_TOKEN=$1
HUGGING_FACE_REPO_ID=${2:-modularai/Llama-3.1-8B-Instruct-GGUF}

sudo docker run -d \
  --restart unless-stopped \
  --env "HF_TOKEN=${HF_TOKEN}" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/.cache/max_cache:/opt/venv/share/max/.max_cache \
  --gpus 1 \
  -p 80:8000 \
  --ipc=host \
  modular/max-nvidia-full:latest \
  --model ${HUGGING_FACE_REPO_ID}'

export STARTUP_SCRIPT=$(echo "$STARTUP_SCRIPT" | base64)
```
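
As an optional sanity check, you can confirm that a base64-encoded script
decodes back to the original text before you deploy. The sketch below uses a
small sample script rather than the real one, because the step above overwrites
`STARTUP_SCRIPT` with its encoded form. `encode_one_line` is a hypothetical
helper; stripping line breaks is a precaution against GNU `base64` wrapping its
output, and whether the template needs a single-line value is an assumption.

```bash
# Hypothetical helper (not part of MAX): base64-encode stdin as a single line.
# GNU base64 wraps output at 76 columns; tr removes those line breaks.
encode_one_line() {
  base64 | tr -d '\n'
}

# Sample content only; substitute your own script text.
SAMPLE_SCRIPT='#!/bin/bash
echo "hello from the startup script"'

ENCODED=$(printf '%s' "$SAMPLE_SCRIPT" | encode_one_line)

# Decode and compare. -d is the GNU spelling; older macOS versions use -D.
DECODED=$(printf '%s' "$ENCODED" | base64 -d)
if [ "$DECODED" = "$SAMPLE_SCRIPT" ]; then
  echo "Round trip OK"
else
  echo "Round trip FAILED" >&2
fi
```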

Then, create the deployment:

:::note NVIDIA license agreement

You may be required to accept the Azure Marketplace image terms for the NVIDIA
AI Enterprise image:

```bash
az vm image terms accept --urn nvidia:nvidia-ai-enterprise:nvaie_gpu_1_gen2:latest
```

:::

:::caution Set an admin password

Replace `YOUR-SECURE-PASSWORD-123` with your own secure password. You'll need
it later to `ssh` into the VM.

:::

```bash
export VM_PASSWORD="YOUR-SECURE-PASSWORD-123"

az deployment group create \
    --name ${DEPLOYMENT_NAME} \
    --resource-group ${RESOURCE_GROUP_NAME} \
    --template-file max-nvidia-azure.json \
    --parameters \
        adminUsername="azureuser" \
        adminPassword="${VM_PASSWORD}" \
        vmSize="Standard_NV36ads_A10_v5" \
        osDiskSizeGB=128 \
        vnetAddressPrefix="10.0.0.0/16" \
        subnetAddressPrefix="10.0.0.0/24" \
        startupScript="${STARTUP_SCRIPT}" \
        location="${REGION}"
```

---

**Azure (AMD):**

First, navigate to the Azure directory:

```bash
cd azure/amd
```

Set the region:

```bash
export REGION="REGION" # example `westus3`
```

Then, create the resource group:

:::note Resource group and deployment naming

If you receive an error about resource group location conflicts, it means the
resource group already exists in a different location.

You can either:

- Use a new resource group name
- Use the existing resource group's location

Additionally, the deployment name must be **unique** so please be sure to change
the `DEPLOYMENT_NAME` if you create multiple deployments.

:::

```bash
export RESOURCE_GROUP_NAME="maxServeResourceGroup"
export DEPLOYMENT_NAME="maxServeDeployment"
az group create --name ${RESOURCE_GROUP_NAME} --location ${REGION}
```

Check the status of the resource group:

```bash
az group show -n ${RESOURCE_GROUP_NAME} --query properties.provisioningState -o tsv
```

Create and encode the startup script:

```bash
STARTUP_SCRIPT='#!/bin/bash

sudo usermod -aG docker $USER

sudo systemctl restart docker

sleep 10

HF_TOKEN=$1
HUGGING_FACE_REPO_ID=${2:-modularai/Llama-3.1-8B-Instruct-GGUF}

sudo docker run -d \
  --restart unless-stopped \
  --env "HF_TOKEN=${HF_TOKEN}" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/.cache/max_cache:/opt/venv/share/max/.max_cache \
  -p 80:8000 \
  --ipc=host \
  --device /dev/kfd \
  --device /dev/dri \
  modular/max-amd:latest \
  --model ${HUGGING_FACE_REPO_ID}'

export STARTUP_SCRIPT=$(echo "$STARTUP_SCRIPT" | base64)
```

Then, create the deployment:

:::caution Set an admin password

Replace `YOUR-SECURE-PASSWORD-123` with your own secure password. You'll need
it later to `ssh` into the VM.

:::

```bash
export VM_PASSWORD="YOUR-SECURE-PASSWORD-123"

az deployment group create \
    --name ${DEPLOYMENT_NAME} \
    --resource-group ${RESOURCE_GROUP_NAME} \
    --template-file max-amd-azure.json \
    --parameters \
        adminUsername="azureuser" \
        adminPassword="${VM_PASSWORD}" \
        vmSize="Standard_ND96isr_MI300X_v5" \
        osDiskSizeGB=256 \
        vnetAddressPrefix="10.0.0.0/16" \
        subnetAddressPrefix="10.0.0.0/24" \
        startupScript="${STARTUP_SCRIPT}" \
        location="${REGION}"
```

### 2. Wait for resources to be ready

In this step, we'll wait for the resources to be ready. Stack and deployment
creation may take some time to complete.

**AWS:**

```bash
aws cloudformation wait stack-create-complete \
--stack-name ${STACK_NAME} \
--region ${REGION}
```

---

**GCP:**

Check the state of the deployment. This command reports the current state
rather than waiting, so rerun it if the deployment hasn't finished yet:

```bash
gcloud deployment-manager deployments describe ${DEPLOYMENT_NAME} \
--project=${PROJECT_ID}
```

---

**Azure:**

Wait for the deployment to be completed and report its status:

```bash
az deployment group wait \
--name ${DEPLOYMENT_NAME} \
--resource-group ${RESOURCE_GROUP_NAME} \
--created
```

### 3. Retrieve instance information

After the resources are deployed, you'll need to get the instance information,
such as the public DNS or IP address that we will use to test the endpoint.

**AWS:**

```bash
INSTANCE_ID=$(aws cloudformation describe-stacks --stack-name ${STACK_NAME} \
  --query "Stacks[0].Outputs[?OutputKey=='InstanceId'].OutputValue" \
  --output text \
  --region ${REGION})
PUBLIC_IP=$(aws ec2 describe-instances --instance-ids ${INSTANCE_ID} \
  --query 'Reservations[0].Instances[0].PublicIpAddress' \
  --output text \
  --region ${REGION})
echo "Instance ID: ${INSTANCE_ID}"
echo "Public IP: ${PUBLIC_IP}"
aws ec2 wait instance-running --instance-ids ${INSTANCE_ID} --region ${REGION}
```

---

**GCP:**

First, check if the firewall rule already exists:

```bash
EXISTING_RULE=$(gcloud compute firewall-rules list \
  --filter="name=allow-http" \
  --format="value(name)" \
  --project=${PROJECT_ID})

if [ -z "$EXISTING_RULE" ]; then
  echo "Creating firewall rule..."
  gcloud compute firewall-rules create allow-http \
    --allow tcp:80 \
    --source-ranges 0.0.0.0/0 \
    --target-tags http-server \
    --description "Allow HTTP traffic on port 80" \
    --project=${PROJECT_ID}
else
  echo "Firewall rule 'allow-http' already exists"
fi
```

Check if the instance exists and tag it with `http-server`:

```bash
INSTANCE_EXISTS=$(gcloud compute instances list \
  --filter="name=${INSTANCE_NAME}" \
  --format="value(name)" \
  --project=${PROJECT_ID})

if [ -n "$INSTANCE_EXISTS" ]; then
  echo "Adding tags to instance ${INSTANCE_NAME}"
  gcloud compute instances add-tags "${INSTANCE_NAME}" \
    --project=${PROJECT_ID} \
    --zone "${ZONE}" \
    --tags http-server
else
  echo "Error: Instance ${INSTANCE_NAME} not found"
  exit 1
fi
```

Then, get the public IP:

```bash
PUBLIC_IP=$(gcloud compute instances describe "${INSTANCE_NAME}" \
  --zone "${ZONE}" \
  --format="get(networkInterfaces[0].accessConfigs[0].natIP)" \
  --project=${PROJECT_ID})
echo "Public IP: $PUBLIC_IP"
```

---

**Azure:**

```bash
PUBLIC_IP=$(az network public-ip show \
--resource-group ${RESOURCE_GROUP_NAME} \
--name maxServePublicIP \
--query ipAddress -o tsv)
echo "Public IP: ${PUBLIC_IP}"
```
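
Whichever provider you used, it's worth confirming that `PUBLIC_IP` actually
holds an address before you test the endpoint. For example, the AWS CLI prints
`None` in text output when a queried field is empty. This is a minimal sketch:
`looks_like_ipv4` is a hypothetical helper that checks only the dotted-quad
shape, not whether the address is reachable.

```bash
# Hypothetical helper (not part of MAX): succeed only if the argument has the
# shape of a dotted-quad IPv4 address. It does not check that octets are 0-255.
looks_like_ipv4() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

if looks_like_ipv4 "${PUBLIC_IP:-}"; then
  echo "Using endpoint http://${PUBLIC_IP}"
else
  echo "PUBLIC_IP is empty or malformed: '${PUBLIC_IP:-}'" >&2
fi
```

If the check fails, rerun the retrieval commands for your provider after the
instance has finished starting.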

### 4. Test the endpoint

1. Wait until the server is ready to test the endpoint

   It will take some time for the stack or deployment to pull the MAX Docker
   image and set it up for serving. We need to wait for the Docker logs to
   appear and then make sure that the Docker container is running on port
   `8000`.

   The server is ready when you see the following log:

    ```output
    Server ready on http://0.0.0.0:8000
    ```

   We provide a simple script to monitor the startup progress and notify you
   when the server is ready.

   **AWS:**

For AWS, you can see the logs in the AWS CloudWatch UI within the log group
   `/aws/ec2/${STACK_NAME}-logs` and log stream `instance-logs`.

   Alternatively, you can use the provided bash script to monitor the logs until
   the server is ready:

    ```bash
    bash notify.sh ${REGION} ${STACK_NAME} ${PUBLIC_IP}
    ```

---

**GCP:**

For GCP, first make sure that the Docker container is running on port `8000`.

   You can view the logs in the Compute Engine VM instances UI. Within the UI,
   choose **Observability**, then choose **Logs**.

   Alternatively, you can use the provided bash script to monitor the logs until
   the server is ready:

    ```bash
    bash notify.sh ${PROJECT_ID} ${INSTANCE_NAME} ${ZONE} ${PUBLIC_IP}
    ```

---

**Azure:**

For Azure, you can monitor the Docker container status (running on port
   `8000`) using one of the following methods:

   **Option 1: Use the monitoring script:**

    1. Install the required dependencies for the monitoring script:
       - Install
   [sshpass](https://www.cyberciti.biz/faq/noninteractive-shell-script-ssh-password-provider/)
   on your local machine to enable automated SSH password authentication

    2. Set up and run the monitoring script:

       ```bash
       bash notify.sh ${RESOURCE_GROUP_NAME} ${VM_PASSWORD} ${PUBLIC_IP}
       ```

    **Option 2: Manual SSH access:**

    1. Connect to the VM:

       ```bash
       ssh azureuser@$PUBLIC_IP
       ```

       > **Note:** Use the password you set previously when creating the deployment.

    2. View the startup logs:

       ```bash
       sudo cat /var/log/azure/custom-script/handler.log
       sudo cat /var/lib/waagent/custom-script/download/0/stdout
       sudo cat /var/lib/waagent/custom-script/download/0/stderr
       sudo docker logs $(sudo docker ps -q -f ancestor=modular/max-nvidia-full:latest)
       ```

       > **Note:** Use the image name `modular/max-amd:latest` if you deployed MAX on an AMD instance.

    Both methods will help you confirm that the server is running correctly. The
    logs will show the startup progress and any potential issues that need to be
    addressed.

2. When the server is ready, use the public IP address that we obtained from
the previous step to test the endpoint with the following `curl` request:

    :::tip

    After the server starts, there may be a brief delay before the cloud
    provider exposes the public IP address. If you receive an error, please
    wait approximately one minute and try again.

    :::

    ```bash
    curl -N http://$PUBLIC_IP/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
            "stream": true,
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Who won the World Series in 2020?"}
            ]
        }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
    ```
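
Because the first request can fail while the public IP address is still being
exposed, you might prefer to retry automatically instead of rerunning the
command by hand. The following is a minimal sketch, not part of MAX: `retry` is
a hypothetical helper, and the attempt count, the delay, and the `PAYLOAD`
variable in the commented example are assumptions you should adapt.

```bash
# Hypothetical helper (not part of MAX): run a command until it succeeds.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  retry_max="$1"
  retry_delay="$2"
  shift 2
  retry_attempt=1
  while :; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    if [ "$retry_attempt" -ge "$retry_max" ]; then
      echo "Giving up after ${retry_attempt} attempts" >&2
      return 1
    fi
    retry_attempt=$((retry_attempt + 1))
    sleep "$retry_delay"
  done
}

# Example (assumes PUBLIC_IP is set as in step 3 and PAYLOAD holds the JSON
# request body shown above): try up to 5 times, waiting 60 seconds in between.
# retry 5 60 curl -sf "http://${PUBLIC_IP}/v1/chat/completions" \
#   -H "Content-Type: application/json" -d "${PAYLOAD}"
```

The `-f` flag makes `curl` return a non-zero exit code on HTTP errors, which is
what lets the helper detect a failed attempt.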

:::note Benchmarking MAX

You can also use the public IP address of your deployed MAX endpoint to
benchmark the performance of Llama 3.1. MAX includes a benchmarking script that
allows you to evaluate throughput, latency, and GPU utilization metrics. For
more detailed instructions on benchmarking, see the
[`max benchmark` docs](https://docs.modular.com/max/cli/benchmark.md).

:::

### 5. Delete the cloud resources

To avoid unwanted costs, delete the cloud resources when you're done. Use the
following commands for your platform to remove the resources created in this
tutorial.

**AWS:**

First, delete the stack:

```bash
aws cloudformation delete-stack --stack-name ${STACK_NAME} --region ${REGION}
```

Wait for the stack to be deleted:

```bash
aws cloudformation wait stack-delete-complete \
--stack-name ${STACK_NAME} \
--region ${REGION}
```

---

**GCP:**

```bash
gcloud deployment-manager deployments delete ${DEPLOYMENT_NAME} \
--project=${PROJECT_ID}
```

---

**Azure:**

```bash
az group delete --name ${RESOURCE_GROUP_NAME}
```

### Cost estimate

When deploying Llama 3 in a cloud environment, several cost factors come into
play:

**Primary cost components:**

- **Compute Resources**: GPU instances (like AWS `g5.4xlarge`, GCP
`g2-standard-8`, or Azure `Standard_NV36ads_A10_v5`) form the bulk of the costs
- **Network Transfer**: Costs associated with data ingress/egress, which is
critical for high-traffic applications
- **Storage**: Expenses for boot volumes and any additional storage requirements
- **Additional Services**: Costs for logging, monitoring, and other supporting
cloud services

For detailed cost estimates specific to your use case, we recommend using these
official pricing calculators:

- [AWS Pricing Calculator](https://calculator.aws)
- [GCP Pricing Calculator](https://cloud.google.com/products/calculator)
- [Azure Pricing Calculator](https://azure.microsoft.com/en-us/pricing/calculator/)
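
For a quick back-of-the-envelope figure for the compute component alone, you
can multiply an hourly instance rate by the hours you expect to run. This is a
rough sketch: `estimate_cost` is a hypothetical helper, and the rate below is a
placeholder rather than a real price, so look up the current rate for your
instance type and region in one of the calculators above.

```bash
# Hypothetical helper (not part of MAX): multiply an hourly rate by hours.
# Usage: estimate_cost HOURLY_RATE HOURS
estimate_cost() {
  awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'
}

# Placeholder values, not real prices.
HOURLY_RATE="1.00"
# Roughly the number of hours in a month (8760 hours per year / 12).
HOURS_PER_MONTH=730

echo "Estimated monthly compute cost: \$$(estimate_cost "$HOURLY_RATE" "$HOURS_PER_MONTH")"
```

This leaves out network transfer, storage, and supporting services, so treat
the result as a lower bound.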

:::tip

Cloud cost optimization tips:

- Consider using spot/preemptible instances for non-critical workloads
- Implement auto-scaling to match resource allocation with demand
- Monitor and optimize network usage patterns
- Set up cost alerts and budgets to avoid unexpected charges

Remember to factor in your expected usage patterns, regional pricing
differences, and any applicable enterprise discounts when calculating total cost
of ownership (TCO).

:::

## Next steps

Congratulations on successfully running MAX Pipelines locally and deploying
Llama 3 to the cloud! 🎉

To stay up to date with new releases,
[join our community](https://www.modular.com/community). And if you're
interested in becoming a design partner to get early access and give us
feedback, please [contact us](https://www.modular.com/request-demo).
