
Serverless GPU inference on Google Cloud Run
MAX (Modular Accelerated Xecution) is a high-performance, flexible platform designed for AI workloads. This tutorial guides you through deploying the MAX container on Google Cloud Run to serve Llama 3.1 inference requests. By leveraging Cloud Run, you gain automatic scaling and serverless deployment without managing infrastructure.
Prerequisites
Before starting this tutorial, ensure that you have:
- A Google Cloud account with billing enabled
- The gcloud CLI tool installed and initialized
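If you haven't set up the gcloud CLI yet, gcloud init authenticates your account and walks you through selecting a default project interactively:
# Authenticate and configure the gcloud CLI interactively
gcloud init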
Resource requirements
Before proceeding with this tutorial, ensure that your Google Cloud project has access to the necessary quotas and system limits. For more information on compatible GPUs, see Supported GPU types.
This tutorial requires the following minimum resources:
- GPU: NVIDIA L4
- CPU: 8 vCPUs
- Memory: At least 32Gi
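GPU support is available only in a subset of Cloud Run regions, so confirm your chosen region before deploying. The command below lists the regions where Cloud Run is available; cross-check it against the Supported GPU types page for NVIDIA L4 availability:
gcloud run regions list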
Deploy MAX to Cloud Run
This section guides you through deploying the MAX container for Llama 3.1 inference on Google Cloud Run with GPU acceleration.
- Before deploying, set up the required environment variables, including your Google Cloud project ID and a supported region for Cloud Run with GPUs.
export PROJECT_ID="your-project-id"
export REGION="us-central1"
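Optionally, you can also store these values as gcloud defaults so later commands pick them up automatically; this sketch uses the project and run/region configuration properties:
gcloud config set project ${PROJECT_ID}
gcloud config set run/region ${REGION}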
- To use Cloud Run and Cloud Build, you must enable the necessary APIs:
gcloud services enable \
run.googleapis.com \
cloudbuild.googleapis.com
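To confirm that both APIs are active before continuing, you can list the enabled services and filter for them (a minimal sketch):
gcloud services list --enabled | grep -E 'run.googleapis.com|cloudbuild.googleapis.com'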
- Now, deploy the MAX container to Cloud Run using the following command:
gcloud beta run deploy max-openai-api \
--image=modular/max-openai-api \
--region=${REGION} \
--platform=managed \
--memory=32Gi \
--cpu=8 \
--timeout=1200 \
--port=8000 \
--min-instances=1 \
--max-instances=5 \
--concurrency=5 \
--cpu-boost \
--args="--model-path=modularai/Llama-3.1-8B-Instruct-GGUF" \
--allow-unauthenticated \
--gpu=1 \
--gpu-type=nvidia-l4 \
--set-env-vars=HF_HUB_ENABLE_HF_TRANSFER=1 \
--startup-probe=tcpSocket.port=8000,initialDelaySeconds=240,timeoutSeconds=240,periodSeconds=240,failureThreshold=5
This command deploys a Google Cloud Run service named max-openai-api using the modular/max-openai-api container image, allocating 32Gi of memory, 8 vCPUs, and 1 NVIDIA L4 GPU in the region you set earlier (us-central1), with autoscaling between 1 and 5 instances. The --concurrency=5 flag limits each instance to a maximum of 5 concurrent requests, triggering a new instance when that limit is exceeded.
You can adjust the maximum concurrent requests to balance throughput, latency, and cost: lower --concurrency values reduce latency but require more instances, while higher values increase per-instance throughput but may raise latency. For guidance on tuning cost and performance tradeoffs to your specific use case, see Throughput versus latency versus cost tradeoffs.
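If you want to experiment with a different concurrency setting after deploying, you can update the running service in place rather than redeploying. The following is a minimal sketch that raises the limit to an example value of 10; depending on your gcloud version, GPU-enabled services may require the beta command group used here:
# Example: change the per-instance concurrency limit on the existing service
gcloud beta run services update max-openai-api \
--region=${REGION} \
--concurrency=10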
The command also allows unauthenticated access and configures a TCP startup probe on port 8000 with generous delays and timeouts, giving the service time to download and load the large language model before it receives traffic. The model served here is modularai/Llama-3.1-8B-Instruct-GGUF.
Once the deployment is complete, Cloud Run provides a service URL where you can send inference requests to the Llama 3.1 model.
Test the deployment
After deployment completes, you can test the OpenAI-compatible endpoint.
- Get the Cloud Run service URL with the following command:
SERVICE_URL=$(gcloud run services describe max-openai-api \
--region=${REGION} \
--format='value(status.url)')
- Send a chat completion inference request to the max-openai-api service.
curl -N ${SERVICE_URL}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "modularai/Llama-3.1-8B-Instruct-GGUF",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Why is the sky blue?"}
]
}' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
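Because the server exposes an OpenAI-compatible API, a quick sanity check is to ask it which models it is serving. This assumes the standard /v1/models listing route is available in your MAX container version:
curl ${SERVICE_URL}/v1/models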
Metrics
Retrieve recent logs from your Cloud Run service with the following command:
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=max-openai-api" --limit 10
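To narrow the output to problems only, you can add a severity clause to the same query; this is a sketch using standard Cloud Logging filter syntax:
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=max-openai-api AND severity>=WARNING" --limit 10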
You can also check the Google Cloud Run console for visualizations and detailed metrics about your max-openai-api service.
For more information on metrics and telemetry specific to the MAX container, see Metrics.
Cost considerations
When deploying applications on Google Cloud Run, understanding pricing factors can help you manage costs effectively. Cloud Run follows a pay-per-use model, meaning you pay for the resources consumed while serving requests. Keep in mind that setting minimum instances, as this tutorial does with --min-instances=1, keeps capacity provisioned and billed even when no requests are being served.
Pricing factors
Cloud Run pricing is based on several key components:
- Request count: You are billed per HTTP request processed by your service.
- Resource allocation: The cost varies depending on the allocated CPU, memory, and (if applicable) GPU resources.
- Request duration: You pay for the time each request takes to execute, measured in milliseconds.
See Cloud Run pricing for full pricing details.
Cost optimization strategies
To minimize costs while maintaining performance, consider these optimization techniques:
- Right-size resources: Start with minimal CPU and memory allocations during development and testing. Avoid over-provisioning unless necessary.
- Configure scaling wisely: Set appropriate minimum and maximum instance limits to prevent unnecessary scaling and costs; see the sketch after this list for an example.
- Monitor cold starts: If cold start latency affects performance, consider keeping a small number of instances always running, but balance this with cost trade-offs.
- Use spot instances: For non-critical or batch workloads, spot instances can offer significant savings compared to standard pricing.
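As a concrete example of right-sizing scaling limits, the instance bounds from the deployment above can be changed in place. The sketch below uses example values that allow scale-to-zero and cap the service at 3 instances; note that scaling to zero trades lower idle cost for cold-start latency:
# Example: allow scale-to-zero and cap the service at 3 instances
gcloud beta run services update max-openai-api \
--region=${REGION} \
--min-instances=0 \
--max-instances=3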
Clean up
After you're done testing your service, remove the deployment and free up resources with the following command:
# Delete the Cloud Run service
gcloud run services delete max-openai-api --region=${REGION}
Next steps
MAX includes a benchmarking script that allows you to evaluate throughput, latency, and GPU utilization metrics. For more detailed instructions on benchmarking, please see Benchmark MAX Serve.
To stay up to date with new releases, sign up for our newsletter and join our community. If you're interested in becoming a design partner to get early access and give us feedback, please contact us.
You can also explore other GPU deployment options with MAX.