Deploy a TorchScript model locally with NVIDIA's Triton Server

Technical Writer

NVIDIA's Triton Inference Server is a common option for many organizations that want to deploy their inference engines into a production environment. In this tutorial, you'll learn how to locally deploy a Docker container of a Triton Inference Server that uses MAX Engine as its backend inference engine. This tutorial uses BERT for masked word prediction; however, you can apply this tutorial to many of the examples in the max GitHub repository.

Prerequisites

Before following the steps in this tutorial, make sure you have completed the following:

Cloned the max repository. You can clone the repository with this command:

git clone -b stable git@github.com:modular/max.git

Installed Docker.

Create a virtual environment

Using a virtual environment ensures that you have the Python version and packages that are compatible with this project. We'll use the Magic CLI to create the environment and install the required packages.

If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:
curl -ssL https://magic.modular.com/ | bash
Then run the source command that's printed in your terminal.

Navigate into the BERT Python code example.

cd max/examples/inference/bert-python-torchscript

Now start a shell in the environment and see your MAX version:

magic shell

python3 -c 'from max import engine; print(engine.__version__)'

Create your directory structure

Triton expects models to be stored in a specific layout: a model repository that contains one directory per model, each with a numbered version subdirectory that holds the model file. Your next step is to create that directory structure.

First, create an environment variable for the model repository.

MODEL_REPOSITORY=~/model-repository

You also need a directory for the BERT model.

BERT_DIR=$MODEL_REPOSITORY/bert-mlm

Then, create the version directory (named 1) that will hold the model file:

mkdir -p $BERT_DIR/1
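
If you script your setup, the same layout can be created from Python. This is a minimal sketch, assuming the model name and version number used above; the `create_model_dirs` helper is hypothetical and not part of MAX or Triton.

```python
from pathlib import Path


def create_model_dirs(repo_root: Path, model_name: str, version: int = 1) -> Path:
    """Create <repo_root>/<model_name>/<version>/ and return the version directory."""
    version_dir = repo_root / model_name / str(version)
    # parents=True mirrors `mkdir -p`; exist_ok=True makes reruns harmless.
    version_dir.mkdir(parents=True, exist_ok=True)
    return version_dir
```

For example, `create_model_dirs(Path.home() / "model-repository", "bert-mlm")` creates the same directories as the `mkdir -p` command above.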

Download the model

With the directory structure in place, you can now download the model. Because this model uses PyTorch, we need to convert the model into a TorchScript format. To make this step easier, we provide you with a script to perform these tasks.

cd  # Or the directory where you cloned max

python3 max/examples/inference/common/bert-torchscript/download-model.py \
  --output-path $BERT_DIR/1/bert-mlm.torchscript --mlm

This script handles downloading the model and converting it to TorchScript. When you work with your own PyTorch model, you'll need to convert it to TorchScript as well. You can learn how by reading PyTorch's Introduction to TorchScript.

Define the Triton configuration file

In addition to a specific directory structure, NVIDIA's Triton Inference Server also requires a model configuration file that follows a precise format. Let's create that file.

First, create a file, config.pbtxt.

touch $BERT_DIR/config.pbtxt

Next, open the file and paste the following content.

input {
  name: "input_ids"
  data_type: TYPE_INT64
  dims: [1,128]
}
input {
  name: "attention_mask"
  data_type: TYPE_INT64
  dims: [1,128]
}
input {
  name: "token_type_ids"
  data_type: TYPE_INT64
  dims: [1,128]
}
output {
  name: "result0"
  data_type: TYPE_FP32
  dims: [1, 128, 768]
}
instance_group {
  kind: KIND_CPU
}
backend: "max"
default_model_filename: "bert-mlm.torchscript"

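This file tells Triton the name, data type, and shape of each input the model expects and of the output it returns. It also specifies that model instances run on the CPU, that the MAX backend executes the model, and which file in the version directory to load.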
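
Because the three input blocks differ only by name, the file can also be generated rather than typed by hand. The following is a rough sketch that reproduces the configuration shown above; the `render_tensor` and `render_config` helpers and their parameter names are hypothetical.

```python
def render_tensor(kind: str, name: str, data_type: str, dims: list[int]) -> str:
    """Render one `input` or `output` block in protobuf text format."""
    dims_text = ", ".join(str(d) for d in dims)
    return (
        f"{kind} {{\n"
        f'  name: "{name}"\n'
        f"  data_type: {data_type}\n"
        f"  dims: [{dims_text}]\n"
        f"}}\n"
    )


def render_config(seq_len: int = 128, last_dim: int = 768) -> str:
    """Build the config.pbtxt content used in this tutorial."""
    parts = [
        render_tensor("input", name, "TYPE_INT64", [1, seq_len])
        for name in ("input_ids", "attention_mask", "token_type_ids")
    ]
    parts.append(
        render_tensor("output", "result0", "TYPE_FP32", [1, seq_len, last_dim])
    )
    parts.append("instance_group {\n  kind: KIND_CPU\n}\n")
    parts.append('backend: "max"\n')
    parts.append('default_model_filename: "bert-mlm.torchscript"\n')
    return "".join(parts)
```

Writing the result of `render_config()` to `config.pbtxt` in the model directory should give a file equivalent to the one above; whitespace inside the `dims` lists is not significant in this format.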

Verify the file and directory structure

Before we run the Docker container, we need to make sure that our file and directory structure is correct. Do that by running the following command:

tree $MODEL_REPOSITORY

The output of the tree command should be similar to the following. (The root path depends on your home directory, so it might differ on your machine.)

/home/ubuntu/model-repository
└── bert-mlm
    ├── 1
    │   └── bert-mlm.torchscript
    └── config.pbtxt

2 directories, 2 files

If your directory structure resembles the preceding example, then you are ready to run the Docker container! If not, then make the necessary adjustments until the directory structure is correct.
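
If `tree` isn't installed, a similar check can be scripted. This is a minimal sketch, assuming the layout shown above; `check_model_repository` is a hypothetical helper, not part of Triton or MAX.

```python
from pathlib import Path


def check_model_repository(
    repo_root: Path, model_name: str, model_file: str
) -> list[str]:
    """Return a list of problems found; an empty list means the layout looks right."""
    model_dir = repo_root / model_name
    if not model_dir.is_dir():
        return [f"missing model directory {model_dir}"]

    problems = []
    if not (model_dir / "config.pbtxt").is_file():
        problems.append(f"missing {model_dir / 'config.pbtxt'}")

    # Version directories are named with an integer, such as "1".
    version_dirs = sorted(
        p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()
    )
    if not version_dirs:
        problems.append(f"no numeric version directory under {model_dir}")
    for version_dir in version_dirs:
        if not (version_dir / model_file).is_file():
            problems.append(f"missing {version_dir / model_file}")
    return problems
```

Calling it with `Path.home() / "model-repository"`, `"bert-mlm"`, and `"bert-mlm.torchscript"` should return an empty list when the structure matches the example.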

Run the Docker image

In this step, you'll get the latest Docker image for running Triton on MAX and use it to serve the BERT model.

docker run -it --rm --net=host \
  -v $MODEL_REPOSITORY:/models \
  public.ecr.aws/modular/max-serving:latest \
  tritonserver --model-repository=/models

As the Docker container starts up, MAX Engine compiles the model. After that's done and the server is running, you'll see the usual Triton logs, including the endpoints that are running:

I0215 23:16:18.906473 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0215 23:16:18.906671 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0215 23:16:18.948484 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002

For now, leave this terminal running; we'll use a second terminal to make inference requests.
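
Because the model is compiled at startup, a client script may want to wait until the server is ready before sending requests. The sketch below polls Triton's HTTP readiness endpoint, `/v2/health/ready`, using only the Python standard library. It assumes the HTTP service is on port 8000, as in the log above; the `wait_until_ready` helper and its parameters are hypothetical.

```python
import time
import urllib.request


def wait_until_ready(
    base_url: str = "http://localhost:8000",
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> bool:
    """Poll /v2/health/ready until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            url = f"{base_url}/v2/health/ready"
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Covers refused connections, timeouts, and HTTP error statuses
            # (urllib's URLError and HTTPError are subclasses of OSError).
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

The function returns `True` as soon as the server reports ready and `False` if the timeout passes first, so a caller can decide whether to proceed or abort.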

Send inference requests

We'll use a second terminal to create a script to connect to the Triton Inference Server and send inference requests.

To start, open a second terminal and navigate to the bert-python-torchscript directory:

cd ~/max/examples/inference/bert-python-torchscript

Start up the magic shell environment.

magic shell

Next, create a triton-inference.py script.

touch triton-inference.py

Add the following code to the script.

#!/usr/bin/env python3

import os

# suppress extraneous logging
os.environ["TRANSFORMERS_VERBOSITY"] = "critical"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from argparse import ArgumentParser

import numpy as np
import torch
import tritonclient.http as httpclient
from transformers import BertTokenizer

BATCH = 1
SEQLEN = 128
DEFAULT_MODEL_NAME = "bert-mlm"
DESCRIPTION = "BERT model"
HF_MODEL_NAME = "bert-base-uncased"


def execute(triton_client, model_name, inputs):
    # Set the input data
    triton_inputs = [
        httpclient.InferInput("input_ids", inputs["input_ids"].shape, "INT64"),
        httpclient.InferInput(
            "attention_mask", inputs["attention_mask"].shape, "INT64"
        ),
        httpclient.InferInput(
            "token_type_ids", inputs["token_type_ids"].shape, "INT64"
        ),
    ]
    triton_inputs[0].set_data_from_numpy(inputs["input_ids"].astype(np.int64))
    triton_inputs[1].set_data_from_numpy(
        inputs["attention_mask"].astype(np.int64)
    )
    triton_inputs[2].set_data_from_numpy(
        inputs["token_type_ids"].astype(np.int64)
    )

    print("Executing model...")
    results = triton_client.infer(model_name, triton_inputs)
    print("Model executed.\n")

    return results.as_numpy("result0")


def main():
    # Parse args
    parser = ArgumentParser(description=DESCRIPTION)
    parser.add_argument(
        "--input",
        type=str,
        metavar="str",
        required=True,
        help="Text with a masked token.",
    )
    parser.add_argument(
        "--model-name",
        type=str,
        default=DEFAULT_MODEL_NAME,
        help="Model name to execute inference.",
    )
    parser.add_argument(
        "--url",
        type=str,
        required=False,
        default="localhost:8000",
        help="Inference server URL. Default is localhost:8000.",
    )
    args = parser.parse_args()
    torch.set_default_device("cpu")
    # Create a triton client
    triton_client = httpclient.InferenceServerClient(url=args.url)

    # Preprocess input statement
    print("Processing input...")
    tokenizer = BertTokenizer.from_pretrained(HF_MODEL_NAME)
    inputs = tokenizer(
        args.input,
        return_tensors="np",
        return_token_type_ids=True,
        padding="max_length",
        truncation=True,
        max_length=SEQLEN,
    )
    print("Input processed.\n")
    masked_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[1]
    outputs = execute(triton_client, args.model_name, inputs)
    logits = torch.from_numpy(outputs[0, masked_index, :])
    predicted_token_ids = logits.argmax(dim=-1)
    predicted_tokens = [
        tokenizer.decode(
            [token_id],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True,
        )
        for token_id in predicted_token_ids
    ]
    filled_mask = "".join(predicted_tokens)
    # Get the predictions for the masked token
    print(f"input text: {args.input}")
    print(f"filled mask: {args.input.replace('[MASK]', filled_mask)}")


if __name__ == "__main__":
    main()
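
The post-processing at the end of `main()` finds the position of the `[MASK]` token, takes the model's scores at that position, and picks the highest-scoring vocabulary entry. The toy sketch below illustrates that logic for a single mask with plain Python lists. The four-word vocabulary and the scores are made up for illustration and don't come from BERT.

```python
MASK = "[MASK]"
# Hypothetical toy vocabulary; the real tokenizer's vocabulary is far larger.
VOCAB = ["paris", "capital", "river", "france"]


def fill_mask(tokens: list[str], logits: list[list[float]]) -> list[str]:
    """Replace each [MASK] with the vocabulary entry that has the highest score.

    `logits[i][j]` is the score of vocabulary entry j at token position i.
    """
    filled = list(tokens)
    for position, token in enumerate(tokens):
        if token == MASK:
            scores = logits[position]
            # argmax: index of the largest score at the masked position
            best_id = max(range(len(scores)), key=scores.__getitem__)
            filled[position] = VOCAB[best_id]
    return filled


tokens = ["paris", "is", "the", MASK, "of", "france"]
# Made-up scores: one row per token position, one column per vocabulary entry.
logits = [[0.0, 0.0, 0.0, 0.0] for _ in tokens]
logits[3] = [0.1, 2.5, 0.3, 0.2]
print(" ".join(fill_mask(tokens, logits)))  # prints "paris is the capital of france"
```

The real script does the same thing with NumPy and PyTorch: `nonzero()` locates the mask, `argmax(dim=-1)` selects the token ID, and the tokenizer decodes that ID back to text.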

Now, let's return to the home directory and run the script.

cd

python3 max/examples/inference/bert-python-torchscript/triton-inference.py \
  --input "Paris is the [MASK] of France."

You should see results like this:

input text: Paris is the [MASK] of France.
filled mask: Paris is the capital of France.

That's it! You're now running a Triton Server with MAX Engine as the backend. Be sure to try other text inputs, for example:

python3 max/examples/inference/bert-python-torchscript/triton-inference.py \
  --input "This is the [MASK] tutorial ever!"
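
The `tritonclient` library used by the script talks to Triton's HTTP inference endpoint. For reference, the sketch below builds a comparable request body as plain JSON, following the KServe v2 inference protocol that Triton implements. The `build_infer_request` helper is hypothetical, and the client library may encode tensors differently on the wire (for example, as binary data).

```python
import json


def build_infer_request(
    input_ids: list[int],
    attention_mask: list[int],
    token_type_ids: list[int],
) -> str:
    """Build a JSON body for POST /v2/models/<model-name>/infer (batch size 1)."""

    def tensor(name: str, values: list[int]) -> dict:
        # The data is sent flattened; `shape` tells the server how to interpret it.
        return {
            "name": name,
            "shape": [1, len(values)],
            "datatype": "INT64",
            "data": values,
        }

    body = {
        "inputs": [
            tensor("input_ids", input_ids),
            tensor("attention_mask", attention_mask),
            tensor("token_type_ids", token_type_ids),
        ],
        "outputs": [{"name": "result0"}],
    }
    return json.dumps(body)
```

The three lists would come from the tokenizer, padded to the 128-token length declared in `config.pbtxt`, just as the script does before calling `infer()`.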

You can also fetch some metadata about the service:

Check if the server is started and ready. This endpoint returns no response body, so look for an HTTP 200 status in the verbose output:

curl -v localhost:8000/v2/health/ready

Get the model metadata (input/output parameters):

curl localhost:8000/v2/models/bert-mlm | python3 -m json.tool

Get the loaded model configuration:

curl localhost:8000/v2/models/bert-mlm/config | python3 -m json.tool
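
The same metadata can be read from Python with the standard library. This is a minimal sketch, assuming the model metadata response contains an `inputs` list whose entries have `name`, `datatype`, and `shape` fields, as in the KServe v2 protocol; the `fetch_json` and `describe_inputs` helpers are hypothetical.

```python
import json
import urllib.request


def fetch_json(url: str, timeout_s: float = 5.0) -> dict:
    """GET a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return json.load(response)


def describe_inputs(metadata: dict) -> list[str]:
    """Summarize the `inputs` entries of a model metadata response."""
    return [
        f"{tensor['name']}: {tensor['datatype']} {tensor['shape']}"
        for tensor in metadata.get("inputs", [])
    ]
```

With the server running, `describe_inputs(fetch_json("http://localhost:8000/v2/models/bert-mlm"))` should list the three inputs defined in `config.pbtxt`.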

Next steps

In this tutorial, you ran a prebuilt Docker image of NVIDIA's Triton Inference Server with MAX Engine as the inference backend and used it to serve a BERT model. You can apply what you've learned here to serve other models, either ones included with MAX or custom models you've built yourself.

Deploy a model with Amazon SageMaker and AWS CloudFormation

Learn how to deploy a model using MAX Engine and AWS SageMaker

Deploy a model with Kubernetes and Helm

Learn how to deploy a model using MAX Engine and Kubernetes
