
Deploy a model with Kubernetes and Helm

Dave Shevitz


Scalability is an essential part of deploying a model. You need to make sure that your application has the resources it needs to meet the demands of incoming inferencing requests.

This is where MAX comes in. MAX includes a state-of-the-art graph compiler and runtime library that executes models from PyTorch with incredible inference speed on a wide range of hardware.

In this tutorial, you'll deploy a model using Amazon Elastic Kubernetes Service (EKS), a managed Kubernetes service provided by Amazon Web Services (AWS). You'll build this deployment using Helm, a package manager for Kubernetes. By the end of the tutorial, you'll have created a complete deployment stack that combines MAX Engine with EKS.

Previous experience with Kubernetes and Helm is not required; we've created a template specifically for this tutorial. We'll guide you through each step!

Trouble?

If you experience any issues in this tutorial, please let us know on GitHub.

About Kubernetes

Kubernetes is an open-source container orchestration platform designed to automate the deployment, scaling, and management of containerized applications. Kubernetes allows you to efficiently manage clusters of containers, ensuring high availability and fault tolerance. Kubernetes provides features such as load balancing, service discovery, automated rollouts and rollbacks, and secret and configuration management, making it a powerful tool for maintaining robust and scalable microservices architectures.

About Helm

Helm is a package manager for Kubernetes, which simplifies the deployment and management of applications on Kubernetes clusters. Often referred to as the "Kubernetes package manager," Helm allows users to define, install, and upgrade even the most complex Kubernetes applications. It uses a packaging format called charts, which are collections of files that describe a related set of Kubernetes resources. Helm helps manage Kubernetes applications by streamlining the configuration process, enabling version control, and making it easier to share and reuse Kubernetes applications across different environments.
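Concretely, a chart is just a directory of files. A minimal layout looks roughly like this (the directory name demo is hypothetical; the file names follow Helm's standard scaffold, which helm create generates):

```text
demo/
  Chart.yaml        # chart metadata: name, version, description
  values.yaml       # default configuration values, overridable at install time
  templates/        # Kubernetes manifest templates
    deployment.yaml
    service.yaml
```

Running helm install <release-name> ./demo renders the templates with the values and applies the resulting manifests to your cluster.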

Prerequisites

To complete this tutorial, make sure you have the following utilities installed.

kubectl — Kubernetes command-line tool used for interacting with Kubernetes clusters.
  Homebrew: brew install kubectl
  https://kubernetes.io/docs/tasks/tools/#kubectl

awscli — Command-line interface for Amazon Web Services (AWS), enabling users to manage various AWS services.
  Homebrew: brew install awscli
  https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

eksctl — Command-line utility for managing Amazon Elastic Kubernetes Service (EKS) clusters.
  Homebrew: brew install eksctl
  https://eksctl.io/installation/

helm — Package manager for Kubernetes, facilitating the deployment and management of applications on Kubernetes clusters through charts.
  Homebrew: brew install helm
  https://helm.sh/docs/intro/install/

Get started

Your first step in deploying a model is to define your deployment environment. For this tutorial, this environment includes:

  • the name of your AWS region
  • the name of your Kubernetes cluster
  • the namespace of your Kubernetes cluster
  • the name of the service account that your deployment uses to manage resources

Let's make things easier for ourselves and create the following environment variables.

AWS_REGION=us-east-1
CLUSTER_NAME=max-deploy-demo
NAMESPACE_NAME=max-deploy-demo
SERVICE_ACCOUNT_NAME=max-deploy-demo-sa
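These are plain shell variables, so the commands that follow can interpolate them. As a quick sanity check you can run locally, here's how the model-repository path used later by the Helm install resolves from AWS_REGION:

```shell
AWS_REGION=us-east-1

# The Helm install step later interpolates AWS_REGION into the public
# S3 bucket path for the model repository:
echo "s3://max-serving-models-$AWS_REGION-public/kubernetes/bert/model-repository"
# prints: s3://max-serving-models-us-east-1-public/kubernetes/bert/model-repository
```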

Your next task is to sign in to AWS using the AWS CLI. We're using this tool because, as this is a tutorial, we aren't exposing any endpoints to the internet.

note

We recommend you use the AWS SSO token provider configuration. You can create this configuration by running aws configure sso. This command requires an SSO start URL and an SSO region; the values for these parameters depend on your AWS configuration. To learn more, see Configure the AWS CLI to use AWS IAM Identity Center.

To sign in to AWS, use the following command:

aws sso login

Configure the Kubernetes cluster

Now you're ready to create an Amazon Elastic Kubernetes Service (EKS) cluster. This resource is a Kubernetes cluster that dynamically scales as workloads and other demands require.

eksctl create cluster \
--name $CLUSTER_NAME \
--region $AWS_REGION \
--node-type c5.4xlarge \
--nodes 1

To deploy your cluster, you need to associate the OpenID Connect (OIDC) provider for the EKS cluster with AWS Identity and Access Management (IAM). This step handles the authentication needed so the pods in your EKS cluster can assume IAM roles and access AWS APIs.

eksctl utils associate-iam-oidc-provider \
--region $AWS_REGION \
--cluster $CLUSTER_NAME \
--approve

Next, create a Kubernetes namespace in your cluster. This namespace allows you to better organize the resources your cluster contains.

kubectl create namespace $NAMESPACE_NAME

Finally, create an AWS IAM role and associate it with your Kubernetes service account. With this IAM service account, your Kubernetes pods gain read-only access to Amazon S3.

eksctl create iamserviceaccount \
--name $SERVICE_ACCOUNT_NAME \
--namespace $NAMESPACE_NAME \
--cluster $CLUSTER_NAME \
--region $AWS_REGION \
--attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
--approve \
--override-existing-serviceaccounts

Deploy the model using a Helm chart

At this point, you're ready to deploy your model! You'll use Helm to install a pre-built chart.

helm install max-deploy oci://public.ecr.aws/modular/max-serving-chart \
--version 24.4.0 \
--namespace $NAMESPACE_NAME \
--set serviceAccountName=$SERVICE_ACCOUNT_NAME \
--set image.modelRepositoryPath=s3://max-serving-models-$AWS_REGION-public/kubernetes/bert/model-repository \
--wait \
--timeout 15m

This command takes between 5 and 10 minutes to complete. When the deployment finishes, you should see output similar to the following.

NAME: max-deploy
LAST DEPLOYED: Tue Apr 16 15:51:24 2024
NAMESPACE: max-deploy-demo
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the application URL by running these commands:
export POD_NAME=$(kubectl get pods --namespace $NAMESPACE_NAME -l "app.kubernetes.io/name=max-serving-chart,app.kubernetes.io/instance=max-deploy" -o jsonpath="{.items[0].metadata.name}")
export CONTAINER_PORT=$(kubectl get pod --namespace $NAMESPACE_NAME $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
echo "The application is available at the following DNS name from within your cluster:"
echo "max-deploy.max-deploy-demo.svc.cluster.local:$CONTAINER_PORT"
echo "Or use the following command to forward ports and visit it locally at http://127.0.0.1:8000"
echo "kubectl port-forward $POD_NAME 8000:$CONTAINER_PORT --namespace max-deploy-demo"

To access your deployment, set the following environment variables:

export POD_NAME=$(kubectl get pods --namespace $NAMESPACE_NAME -l "app.kubernetes.io/name=max-serving-chart,app.kubernetes.io/instance=max-deploy" -o jsonpath="{.items[0].metadata.name}")
export CONTAINER_PORT=$(kubectl get pod --namespace $NAMESPACE_NAME $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")

Now run the following command:

kubectl port-forward $POD_NAME 8000:$CONTAINER_PORT --namespace $NAMESPACE_NAME

The following message appears on your terminal.

Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000

This command uses port forwarding so you can access your cluster from your local machine.

Test your deployment

You are now ready to test your deployment. This tutorial uses NVIDIA's Triton client to send text to a BERT model.

  1. Open a new terminal window.

  2. Install the required dependencies for the test script.

    python3 -m venv venv && source venv/bin/activate
    pip install transformers tritonclient[http]
  3. Create the following Python script and save it as client.py.

    # suppress extraneous logging
    import os
    os.environ["TRANSFORMERS_VERBOSITY"] = "critical"
    os.environ["TOKENIZERS_PARALLELISM"] = "false"

    import numpy as np

    import tritonclient.http as httpclient
    from transformers import AutoTokenizer

    text = "Paris is the [MASK] of France."

    # Create a triton client
    triton_client = httpclient.InferenceServerClient(url="127.0.0.1:8000")

    # Preprocess input statement
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    inputs = tokenizer(
        text,
        return_tensors="np",
        return_token_type_ids=True,
        padding="max_length",
        truncation=True,
        max_length=128,
    )

    # Set the input data
    triton_inputs = [
        httpclient.InferInput("input_ids", inputs["input_ids"].shape, "INT32"),
        httpclient.InferInput("attention_mask", inputs["attention_mask"].shape, "INT32"),
        httpclient.InferInput("token_type_ids", inputs["token_type_ids"].shape, "INT32"),
    ]
    triton_inputs[0].set_data_from_numpy(inputs["input_ids"].astype(np.int32))
    triton_inputs[1].set_data_from_numpy(inputs["attention_mask"].astype(np.int32))
    triton_inputs[2].set_data_from_numpy(inputs["token_type_ids"].astype(np.int32))

    # Execute inference
    output = triton_client.infer("bert-base-uncased", triton_inputs)

    # Post-processing
    masked_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[1]
    logits = output.as_numpy("result0")[0, masked_index, :]
    predicted_token_ids = logits.argmax(axis=-1)
    predicted_text = tokenizer.decode(predicted_token_ids)
    output_text = text.replace("[MASK]", predicted_text)
    print(output_text)
  4. Run the example script to see its output.

    python client.py

The script sends the text Paris is the [MASK] of France. The output of the script reads:

Paris is the capital of France.
note

Feel free to try other sentences! Update the line of the script that reads text = "Paris is the [MASK] of France." and replace the string with one of your own.
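If you're curious how the post-processing in client.py works, here's a standalone toy illustration with a made-up eight-word vocabulary and fabricated logits (no server or model required; the real script uses the BERT tokenizer and the server's actual output instead):

```python
import numpy as np

# Toy stand-ins for the real tokenizer and model output (hypothetical values,
# not the actual BERT vocabulary):
vocab = ["paris", "is", "the", "capital", "of", "france", ".", "[MASK]"]
mask_token_id = 7
input_ids = np.array([[0, 1, 2, 7, 4, 5, 6]])  # "paris is the [MASK] of france ."

# Pretend model output: one row of logits per input token, one column per
# vocabulary entry. Give the masked position its highest score at "capital".
logits = np.zeros((1, 7, len(vocab)))
logits[0, 3, 3] = 10.0

# Same steps as the client script: find the mask position, take the argmax
# over the vocabulary at that position, and decode the winning token.
masked_index = (input_ids == mask_token_id).nonzero()[1]
predicted_token_ids = logits[0, masked_index, :].argmax(axis=-1)
predicted_text = " ".join(vocab[i] for i in predicted_token_ids)
print(predicted_text)  # capital
```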

Clean up

We've now wrapped up the tasks we wanted to accomplish in this tutorial! To avoid incurring additional costs for AWS resources, we recommend you delete the resources you’ve built.

To delete tutorial resources:

  1. Uninstall the max-deploy Helm release.

    helm uninstall max-deploy --namespace $NAMESPACE_NAME
  2. Delete the Kubernetes namespace.

    kubectl delete namespace $NAMESPACE_NAME
  3. Delete the service account.

    eksctl delete iamserviceaccount \
    --name $SERVICE_ACCOUNT_NAME \
    --namespace $NAMESPACE_NAME \
    --cluster $CLUSTER_NAME \
    --region $AWS_REGION
  4. Delete the Kubernetes cluster.

    eksctl delete cluster \
    --name $CLUSTER_NAME \
    --region $AWS_REGION

Next steps

In this tutorial, you've leveraged a Helm chart to deploy MAX Engine to an Amazon Elastic Kubernetes Service (EKS) cluster. This deployment used MAX Engine to handle inference requests for a BERT model: it took a text input containing a masked token and returned the model's prediction for that token.

We encourage you to use what you learned here to deploy other models, and extend this tutorial as needed to explore other MAX features.
