
Serverless GPU inference on Google Cloud Run
MAX (Modular Accelerated Xecution) is a high-performance, flexible platform designed for AI workloads. This tutorial guides you through deploying the MAX container on Google Cloud Run to serve Llama 3.1 inference requests. By leveraging Cloud Run, you gain automatic scaling and serverless deployment without managing infrastructure.
Prerequisites
Before starting this tutorial, ensure that you have:
- A Google Cloud account with billing enabled
- The gcloud CLI tool installed and initialized
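If you haven't set up the gcloud CLI yet, gcloud init authenticates your account and walks you through selecting a default project interactively:
# Authenticate and configure the gcloud CLI interactively
gcloud init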
Resource requirements
Before proceeding with this tutorial, ensure that your Google Cloud project has access to the necessary quotas and system limits. For more information on compatible GPUs, see Supported GPU types.
This tutorial requires the following minimum resources:
- GPU: NVIDIA L4
- CPU: 8 vCPUs
- Memory: At least 32Gi
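GPU support is available only in a subset of Cloud Run regions, so confirm your chosen region before deploying. The command below lists the regions where Cloud Run is available; cross-check it against the Supported GPU types page for NVIDIA L4 availability:
gcloud run regions list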
Deploy MAX to Cloud Run
This section guides you through deploying the MAX container for Llama 3.1 inference on Google Cloud Run with GPU acceleration.
- Before deploying, set up the required environment variables, including your Google Cloud project ID and a supported region for Cloud Run with GPUs.
export PROJECT_ID="your-project-id"
export REGION="us-central1"
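Optionally, you can also store these values as gcloud defaults so later commands pick them up automatically; this sketch uses the project and run/region configuration properties:
gcloud config set project ${PROJECT_ID}
gcloud config set run/region ${REGION}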
- To use Cloud Run and Cloud Build, you must enable the necessary APIs:
gcloud services enable \
run.googleapis.com \
cloudbuild.googleapis.com
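To confirm that both APIs are active before continuing, you can list the enabled services and filter for them (a minimal sketch):
gcloud services list --enabled | grep -E 'run.googleapis.com|cloudbuild.googleapis.com'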
- Now, deploy the MAX container to Cloud Run using the following command:
gcloud beta run deploy max-openai-api \
--image=modular/max-openai-api \
--region=${REGION} \
--platform=managed \
--memory=32Gi \
--cpu=8 \
--timeout=1200 \
--port=8000 \
--min-instances=1 \
--max-instances=5 \
--concurrency=5 \
--cpu-boost \
--args="--model-path=modularai/Llama-3.1-8B-Instruct-GGUF" \
--allow-unauthenticated \
--gpu=1 \
--gpu-type=nvidia-l4 \
--set-env-vars=HF_HUB_ENABLE_HF_TRANSFER=1 \
--startup-probe=tcpSocket.port=8000,initialDelaySeconds=240,timeoutSeconds=240,periodSeconds=240,failureThreshold=5
This command deploys a Google Cloud Run service named max-openai-api using the modular/max-openai-api container image, allocating 32Gi of memory, 8 vCPUs, and 1 NVIDIA L4 GPU in the region you set earlier (us-central1), with autoscaling between 1 and 5 instances. The --concurrency=5 flag limits each instance to a maximum of 5 concurrent requests, triggering a new instance when that limit is exceeded.
You can adjust the maximum concurrent requests to balance throughput, latency, and cost: lower --concurrency values reduce latency but require more instances, while higher values increase per-instance throughput but may raise latency. For guidance on tuning cost and performance tradeoffs to your specific use case, see Throughput versus latency versus cost tradeoffs.
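If you want to experiment with a different concurrency setting after deploying, you can update the running service in place rather than redeploying. The following is a minimal sketch that raises the limit to an example value of 10; depending on your gcloud version, GPU-enabled services may require the beta command group used here:
# Example: change the per-instance concurrency limit on the existing service
gcloud beta run services update max-openai-api \
--region=${REGION} \
--concurrency=10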
The command also allows unauthenticated access and configures a TCP startup probe on port 8000 with generous delays and timeouts, giving the service time to download and load the large language model before it receives traffic. The model served here is modularai/Llama-3.1-8B-Instruct-GGUF.
Once the deployment is complete, Cloud Run provides a service URL where you can send inference requests to the Llama 3.1 model.
Test the deployment
After deployment completes, you can test the OpenAI-compatible endpoint.
- Get the Cloud Run service URL with the following command:
SERVICE_URL=$(gcloud run services describe max-openai-api \
--region=${REGION} \
--format='value(status.url)')
- Send a chat completion inference request to the max-openai-api service.
curl -N ${SERVICE_URL}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "modularai/Llama-3.1-8B-Instruct-GGUF",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Why is the sky blue?"}
]
}' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
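Because the server exposes an OpenAI-compatible API, a quick sanity check is to ask it which models it is serving. This assumes the standard /v1/models listing route is available in your MAX container version:
curl ${SERVICE_URL}/v1/models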
Metrics
Retrieve recent logs from your Cloud Run service with the following command:
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=max-openai-api" --limit 10
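To narrow the output to problems only, you can add a severity clause to the same query; this is a sketch using standard Cloud Logging filter syntax:
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=max-openai-api AND severity>=WARNING" --limit 10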
You can also check the Google Cloud Run console for visualizations and detailed metrics about your max-openai-api service.
For more information on metrics and telemetry specific to the MAX container, see Metrics.
Cost considerations
When deploying applications on Google Cloud Run, understanding pricing factors can help you manage costs effectively. Cloud Run follows a pay-per-use model, meaning you pay for the resources consumed while serving requests. Keep in mind that setting minimum instances, as this tutorial does with --min-instances=1, keeps capacity provisioned and billed even when no requests are being served.
Pricing factors
Cloud Run pricing is based on several key components:
- Request count: You are billed per HTTP request processed by your service.
- Resource allocation: The cost varies depending on the allocated CPU, memory, and (if applicable) GPU resources.
- Request duration: You pay for the time each request takes to execute, measured in milliseconds.
See Cloud Run pricing for full pricing details.
Cost optimization strategies
To minimize costs while maintaining performance, consider these optimization techniques:
- Right-size resources: Start with minimal CPU and memory allocations during development and testing. Avoid over-provisioning unless necessary.
- Configure scaling wisely: Set appropriate minimum and maximum instance limits to prevent unnecessary scaling and costs; see the sketch after this list for an example.
- Monitor cold starts: If cold start latency affects performance, consider keeping a small number of instances always running, but balance this with cost trade-offs.
- Use spot instances: For non-critical or batch workloads, spot instances can offer significant savings compared to standard pricing.
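As a concrete example of right-sizing scaling limits, the instance bounds from the deployment above can be changed in place. The sketch below uses example values that allow scale-to-zero and cap the service at 3 instances; note that scaling to zero trades lower idle cost for cold-start latency:
# Example: allow scale-to-zero and cap the service at 3 instances
gcloud beta run services update max-openai-api \
--region=${REGION} \
--min-instances=0 \
--max-instances=3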
Clean up
After you're done testing your service, remove the deployment and free up resources with the following command:
# Delete the Cloud Run service
gcloud run services delete max-openai-api --region=${REGION}
Next steps
MAX includes a benchmarking script that allows you to evaluate throughput, latency, and GPU utilization metrics. For more detailed instructions on benchmarking, please see Benchmark MAX Serve.
To stay up to date with new releases, sign up for our newsletter and join our community. If you're interested in becoming a design partner to get early access and give us feedback, please contact us.
You can also explore other GPU deployment options with MAX.