Deploy a model with Kubernetes and Helm

Dave Shevitz

Scalability is an essential part of deploying a model. Your application needs enough resources to meet the demand of incoming inference requests.

This is where MAX comes in. MAX includes a graph compiler and runtime library that executes PyTorch models with fast inference on a wide range of hardware.

In this tutorial, you'll deploy a model using Amazon Elastic Kubernetes Service (EKS), a managed Kubernetes service provided by Amazon Web Services (AWS). You'll build this deployment using Helm, a package manager for Kubernetes. At the end of the tutorial, you'll have a complete deployment stack that combines MAX Engine with Amazon EKS.

Previous experience with Kubernetes and Helm is not required; we've created a template specifically for this tutorial, and we'll guide you through each step.

About Kubernetes

Kubernetes is an open-source container orchestration platform designed to automate the deployment, scaling, and management of containerized applications. Kubernetes allows you to efficiently manage clusters of containers, ensuring high availability and fault tolerance. Kubernetes provides features such as load balancing, service discovery, automated rollouts and rollbacks, and secret and configuration management, making it a powerful tool for maintaining robust and scalable microservices architectures.

About Helm

Helm is a package manager for Kubernetes that simplifies deploying and managing applications on Kubernetes clusters. With Helm, you can define, install, and upgrade even complex Kubernetes applications. It uses a packaging format called charts, which are collections of files that describe a related set of Kubernetes resources. Helm streamlines configuration, enables version control, and makes it easier to share and reuse Kubernetes applications across environments.

Prerequisites

To complete this tutorial, make sure you have the following utilities installed.

| Utility | Description | Homebrew command | Link |
| --- | --- | --- | --- |
| kubectl | Kubernetes command-line tool used for interacting with Kubernetes clusters. | brew install kubectl | https://kubernetes.io/docs/tasks/tools/#kubectl |
| awscli | Command-line interface for Amazon Web Services (AWS), enabling users to manage various AWS services. | brew install awscli | https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html |
| eksctl | Command-line utility for managing Amazon Elastic Kubernetes Service (EKS) clusters. | brew install eksctl | https://eksctl.io/installation/ |
| helm | Package manager for Kubernetes, facilitating the deployment and management of applications on Kubernetes clusters through charts. | brew install helm | https://helm.sh/docs/intro/install/ |

Get started

Your first step in deploying a model is to define your deployment environment. For this tutorial, this environment includes:

  • the name of your AWS region
  • the name of your Kubernetes cluster
  • the namespace of your Kubernetes cluster
  • the name of the service account that your deployment uses to manage resources

Let's make things easier for ourselves and create the following environment variables.

AWS_REGION=us-east-1
CLUSTER_NAME=max-deploy-demo
NAMESPACE_NAME=max-deploy-demo
SERVICE_ACCOUNT_NAME=max-deploy-demo-sa
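
Because every later command reads these four variables, it can help to confirm they are set before you continue. The following is a minimal sketch rather than part of the tutorial's template: the helper name `missing_settings` is hypothetical, and it assumes you have exported the variables (for example, `export AWS_REGION`) so that child processes such as Python can see them.

```python
import os

# The four variables this tutorial defines.
REQUIRED = ("AWS_REGION", "CLUSTER_NAME", "NAMESPACE_NAME", "SERVICE_ACCOUNT_NAME")


def missing_settings(env):
    """Return the names in REQUIRED that are absent or empty in env."""
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings(os.environ)
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All tutorial variables are set.")
```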

Your next task is to sign in to AWS using the AWS CLI. We work from the command line throughout because this tutorial doesn't expose any endpoints to the internet.

To sign in to AWS, use the following command:

aws sso login

Configure the Kubernetes cluster

Now you're ready to create an Amazon Elastic Kubernetes Service (EKS) cluster. The following command creates a cluster with a single c5.4xlarge node in your chosen region.

eksctl create cluster \
--name $CLUSTER_NAME \
--region $AWS_REGION \
--node-type c5.4xlarge \
--nodes 1

Next, associate the OpenID Connect (OIDC) provider for the EKS cluster with AWS Identity and Access Management (IAM). This step sets up the authentication that lets pods in your EKS cluster assume IAM roles and access AWS APIs.

eksctl utils associate-iam-oidc-provider \
--region $AWS_REGION \
--cluster $CLUSTER_NAME \
--approve

Next, create a Kubernetes namespace in your EKS cluster. This namespace lets you organize the resources that your deployment creates.

kubectl create namespace $NAMESPACE_NAME

Last, create an IAM role and associate it with a Kubernetes service account. Through this IAM service account, your Kubernetes pods gain read-only access to Amazon S3, where the model repository used later in this tutorial is stored.

eksctl create iamserviceaccount \
--name $SERVICE_ACCOUNT_NAME \
--namespace $NAMESPACE_NAME \
--cluster $CLUSTER_NAME \
--region $AWS_REGION \
--attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
--approve \
--override-existing-serviceaccounts
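
If you want to script the four cluster-configuration steps above, one option is to assemble each command as an argument list and review it before running anything. This is a sketch under that assumption: the function name `cluster_setup_commands` is hypothetical, the flags are copied from the commands in this section, and nothing is executed unless you pass the lists to `subprocess.run` yourself.

```python
import shlex


def cluster_setup_commands(region, cluster, namespace, service_account):
    """Return the four setup commands from this section, in order, as argv lists."""
    return [
        ["eksctl", "create", "cluster",
         "--name", cluster, "--region", region,
         "--node-type", "c5.4xlarge", "--nodes", "1"],
        ["eksctl", "utils", "associate-iam-oidc-provider",
         "--region", region, "--cluster", cluster, "--approve"],
        ["kubectl", "create", "namespace", namespace],
        ["eksctl", "create", "iamserviceaccount",
         "--name", service_account, "--namespace", namespace,
         "--cluster", cluster, "--region", region,
         "--attach-policy-arn", "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
         "--approve", "--override-existing-serviceaccounts"],
    ]


if __name__ == "__main__":
    commands = cluster_setup_commands(
        "us-east-1", "max-deploy-demo", "max-deploy-demo", "max-deploy-demo-sa"
    )
    # Dry run: print each command instead of executing it.
    for argv in commands:
        print(shlex.join(argv))
```

Keeping the commands as lists avoids shell-quoting surprises if a name ever contains unusual characters.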

Deploy the model using a Helm chart

You can now deploy your model. You'll use Helm to install a pre-built chart.

helm install max-deploy oci://public.ecr.aws/modular/max-serving-chart \
--version 24.4.0 \
--namespace $NAMESPACE_NAME \
--set serviceAccountName=$SERVICE_ACCOUNT_NAME \
--set image.modelRepositoryPath=s3://max-serving-models-$AWS_REGION-public/kubernetes/bert/model-repository \
--wait \
--timeout 15m
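
The `image.modelRepositoryPath` value depends on your region, because the shell expands `$AWS_REGION` inside the bucket name. As an illustration only, the sketch below uses Python's `string.Template`, which follows similar `$NAME` substitution rules, to show the path that results for a given region. The function name `model_repository_path` is hypothetical, and the sketch does not check whether a public bucket exists for regions other than the one used in this tutorial.

```python
from string import Template

# Path template copied from the helm install command above.
REPOSITORY_TEMPLATE = Template(
    "s3://max-serving-models-$AWS_REGION-public/kubernetes/bert/model-repository"
)


def model_repository_path(region):
    """Expand the region into the model repository path."""
    return REPOSITORY_TEMPLATE.substitute(AWS_REGION=region)


if __name__ == "__main__":
    print(model_repository_path("us-east-1"))
```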

This command takes between 5 and 10 minutes to complete. When the deployment finishes, you should see output similar to the following.

NAME: max-deploy
LAST DEPLOYED: Tue Apr 16 15:51:24 2024
NAMESPACE: max-deploy-demo
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the application URL by running these commands:
export POD_NAME=$(kubectl get pods --namespace $NAMESPACE_NAME -l "app.kubernetes.io/name=max-serving-chart,app.kubernetes.io/instance=max-deploy" -o jsonpath="{.items[0].metadata.name}")
export CONTAINER_PORT=$(kubectl get pod --namespace $NAMESPACE_NAME $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
echo "The application is available at the following DNS name from within your cluster:"
echo "max-deploy.max-deploy-demo.svc.cluster.local:$CONTAINER_PORT"
echo "Or use the following command to forward ports and visit it locally at http://127.0.0.1:8000"
echo "kubectl port-forward $POD_NAME 8000:$CONTAINER_PORT --namespace max-deploy-demo"
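
The in-cluster address in these notes follows the standard Kubernetes service DNS pattern, `<service>.<namespace>.svc.cluster.local`. As a small sketch, the helper below builds that address. The name `service_address` is hypothetical, and the sketch assumes the default `cluster.local` cluster domain and a Service named after the Helm release, as in the output above.

```python
def service_address(service, namespace, port, cluster_domain="cluster.local"):
    """Build the in-cluster DNS address for a Kubernetes Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}:{port}"


if __name__ == "__main__":
    # Port 8000 is the container port reported by the port-forward step
    # later in this tutorial.
    print(service_address("max-deploy", "max-deploy-demo", 8000))
```

This address resolves only from inside the cluster, which is why the tutorial uses port forwarding to reach the deployment from your local machine.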

To access your deployment, set the following environment variables:

export POD_NAME=$(kubectl get pods --namespace $NAMESPACE_NAME -l "app.kubernetes.io/name=max-serving-chart,app.kubernetes.io/instance=max-deploy" -o jsonpath="{.items[0].metadata.name}")
export CONTAINER_PORT=$(kubectl get pod --namespace $NAMESPACE_NAME $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")

Now run the following command:

kubectl port-forward $POD_NAME 8000:$CONTAINER_PORT --namespace $NAMESPACE_NAME

The following message appears in your terminal.

Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000

This command uses port forwarding so you can access your cluster from your local machine.
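
Before you run the test script, you may want to confirm that the forwarded port accepts connections. The following sketch uses only Python's standard `socket` module. The helper name `port_is_open` is hypothetical, and a successful TCP connection only shows that something is listening on the port, not that the model has finished loading.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 127.0.0.1:8000 is the local end of the kubectl port-forward above.
    if port_is_open("127.0.0.1", 8000):
        print("Port 8000 is accepting connections.")
    else:
        print("Nothing is listening on port 8000 yet.")
```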

Test your deployment

You are now ready to test your deployment. This tutorial uses NVIDIA's Triton client to send text to a BERT model.

  1. Open a new terminal window.

  2. Install the required dependencies for the test script.

    python3 -m venv venv && source venv/bin/activate
    pip install transformers "tritonclient[http]"
  3. Create the following Python script, client.py.

    # suppress extraneous logging
    import os
    os.environ["TRANSFORMERS_VERBOSITY"] = "critical"
    os.environ["TOKENIZERS_PARALLELISM"] = "false"

    import numpy as np

    import tritonclient.http as httpclient
    from transformers import AutoTokenizer

    text = "Paris is the [MASK] of France."

    # Create a triton client
    triton_client = httpclient.InferenceServerClient(url="127.0.0.1:8000")

    # Preprocess input statement
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    inputs = tokenizer(
        text,
        return_tensors="np",
        return_token_type_ids=True,
        padding="max_length",
        truncation=True,
        max_length=128,
    )

    # Set the input data
    triton_inputs = [
        httpclient.InferInput("input_ids", inputs["input_ids"].shape, "INT32"),
        httpclient.InferInput("attention_mask", inputs["attention_mask"].shape, "INT32"),
        httpclient.InferInput("token_type_ids", inputs["token_type_ids"].shape, "INT32"),
    ]
    triton_inputs[0].set_data_from_numpy(inputs["input_ids"].astype(np.int32))
    triton_inputs[1].set_data_from_numpy(inputs["attention_mask"].astype(np.int32))
    triton_inputs[2].set_data_from_numpy(inputs["token_type_ids"].astype(np.int32))

    # Executing
    output = triton_client.infer("bert-base-uncased", triton_inputs)

    # Post-processing
    masked_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[1]
    logits = output.as_numpy("result0")[0, masked_index, :]
    predicted_token_ids = logits.argmax(axis=-1)
    predicted_text = tokenizer.decode(predicted_token_ids)
    output_text = text.replace("[MASK]", predicted_text)
    print(output_text)
  4. Run the example script to see its output.

    python client.py

The script sends the text "Paris is the [MASK] of France." and prints the sentence with the model's prediction in place of [MASK]:

Paris is the capital of France.
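
To make the script's pre- and post-processing easier to follow, here is a toy version of the same steps in plain Python, without the tokenizer or the server. It is a sketch under stated assumptions: the five-word vocabulary, the token IDs, and the logits are invented for illustration, and the real script performs these steps with NumPy arrays and BERT's full vocabulary.

```python
# Invented toy vocabulary; real BERT token IDs differ.
VOCAB = ["[PAD]", "[MASK]", "paris", "capital", "france"]
PAD_ID, MASK_ID = 0, 1


def pad_to_length(token_ids, max_length):
    """Pad token IDs to max_length and return (input_ids, attention_mask)."""
    ids = token_ids[:max_length]
    padding = [PAD_ID] * (max_length - len(ids))
    attention_mask = [1] * len(ids) + [0] * len(padding)
    return ids + padding, attention_mask


def predict_masked_token(input_ids, logits):
    """Return the highest-scoring vocabulary entry at the first [MASK] position."""
    position = input_ids.index(MASK_ID)
    scores = logits[position]
    best_id = max(range(len(scores)), key=scores.__getitem__)
    return VOCAB[best_id]


if __name__ == "__main__":
    # "paris [MASK] france", padded to a fixed length of 6.
    input_ids, attention_mask = pad_to_length([2, 1, 4], max_length=6)
    # One row of made-up scores per position; only the [MASK] row matters here.
    logits = [[0.0] * len(VOCAB) for _ in input_ids]
    logits[1] = [0.1, 0.0, 0.3, 2.5, 0.2]
    print(attention_mask)
    print(predict_masked_token(input_ids, logits))
```

The attention mask marks real tokens with 1 and padding with 0, which mirrors what `padding="max_length"` produces in client.py. The argmax over the scores at the masked position corresponds to the `logits.argmax(axis=-1)` step.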

Clean up

We've now wrapped up the tasks we wanted to accomplish in this tutorial! To avoid incurring additional costs for AWS resources, we recommend you delete the resources you've built.

To delete tutorial resources:

  1. Uninstall MAX serve.

    helm uninstall max-deploy --namespace $NAMESPACE_NAME
  2. Delete the Kubernetes namespace.

    kubectl delete namespace $NAMESPACE_NAME
  3. Delete the service account.

    eksctl delete iamserviceaccount \
    --name $SERVICE_ACCOUNT_NAME \
    --namespace $NAMESPACE_NAME \
    --cluster $CLUSTER_NAME \
    --region $AWS_REGION
  4. Delete the Kubernetes cluster.

    eksctl delete cluster \
    --name $CLUSTER_NAME \
    --region $AWS_REGION

Next steps

In this tutorial, you used a Helm chart to deploy MAX Engine to an Amazon Elastic Kubernetes Service (EKS) cluster. This deployment used MAX Engine to handle inference requests for a BERT model. The deployment took a text input containing a masked word and returned the model's prediction for that word.

We encourage you to use what you learned here to deploy other models, and extend this tutorial as needed to explore other MAX features.
