Deploy a TorchScript model locally with NVIDIA's Triton Server

Dave Shevitz

NVIDIA's Triton Inference Server is a common option for many organizations that want to deploy their inference engines into a production environment. In this tutorial, you'll learn how to locally deploy a Docker container of a Triton Inference Server that uses MAX Engine as its backend inference engine. This tutorial uses BERT for masked word prediction; however, you can apply this tutorial to many of the examples in the max GitHub repository.

Prerequisites

Before following the steps in this topic, make sure you have completed the following:

  • Cloned the max repository. You can clone the repository with this command:

    git clone git@github.com:modularml/max.git
  • Installed Docker.

Create a virtual environment

Using a virtual environment ensures that you have the Python version and packages that are compatible with this project. We'll use the Magic CLI to create the environment and install the required packages.

  1. If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:

    curl -ssL https://magic.modular.com/ | bash

    Then run the source command that's printed in your terminal.

  2. Navigate into the BERT Python code example.

    cd max/examples/inference/bert-python-torchscript
  3. Now start a shell in the environment and check your MAX version:

    magic shell
    python3 -c 'from max import engine; print(engine.__version__)'

Create your directory structure

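Triton expects a specific layout for stored models: a model repository that contains one directory per model, and inside each model directory a config.pbtxt file plus numbered version directories that hold the model files. Your next step is to create that directory structure.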

First, create an environment variable for the model repository.

MODEL_REPOSITORY=~/model-repository

You also need a directory for the BERT model.

BERT_DIR=$MODEL_REPOSITORY/bert-mlm

Then, create the numbered version directory (1) that holds the model file itself:

mkdir -p $BERT_DIR/1
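
The same layout can also be created from Python. The following is a minimal sketch rather than part of the MAX examples: the function name `make_model_dir` is hypothetical, and it only mirrors the `mkdir -p` command above.

```python
from pathlib import Path


def make_model_dir(repository: Path, model_name: str, version: int = 1) -> Path:
    """Create <repository>/<model_name>/<version> and return the version directory."""
    version_dir = repository / model_name / str(version)
    # parents=True plus exist_ok=True behaves like `mkdir -p`:
    # missing parents are created and an existing directory is not an error.
    version_dir.mkdir(parents=True, exist_ok=True)
    return version_dir
```

Calling `make_model_dir(Path.home() / "model-repository", "bert-mlm")` would produce the same directories as the shell commands in this section.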

Download the model

With the directory structure in place, you can now download the model. Because this model uses PyTorch, we need to convert it to TorchScript format. To make this step easier, we provide a script that performs both tasks.

cd # Or whichever directory you cloned max into
python3 max/examples/inference/common/bert-torchscript/download-model.py \
--output-path $BERT_DIR/1/bert-mlm.torchscript --mlm

Define the Triton configuration file

In addition to a specific directory structure, NVIDIA's Triton Inference Server also requires a model configuration file that follows a precise format. Let's create that file.

First, create a file, config.pbtxt.

touch $BERT_DIR/config.pbtxt

Next, open the file and paste the following content.

input {
  name: "input_ids"
  data_type: TYPE_INT64
  dims: [1, 128]
}
input {
  name: "attention_mask"
  data_type: TYPE_INT64
  dims: [1, 128]
}
input {
  name: "token_type_ids"
  data_type: TYPE_INT64
  dims: [1, 128]
}
output {
  name: "result0"
  data_type: TYPE_FP32
  dims: [1, 128, 768]
}
instance_group {
  kind: KIND_CPU
}
backend: "max"
default_model_filename: "bert-mlm.torchscript"

This file tells Triton the name, data type, and shape of each input the model expects and of the output it returns. It also names the backend that runs the model (max) and the model file to load from the version directory.
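
Because the three inputs differ only by name, the file can also be generated instead of pasted. The sketch below is one way to do that, assuming the same tensor names and shapes as the configuration above; the helper names `render_tensor` and `render_config` and the `last_dim` parameter are hypothetical and not part of Triton or MAX.

```python
def render_tensor(kind: str, name: str, data_type: str, dims: list[int]) -> str:
    """Render one `input` or `output` block in protobuf text format."""
    dims_text = ", ".join(str(d) for d in dims)
    return (
        f"{kind} {{\n"
        f'  name: "{name}"\n'
        f"  data_type: {data_type}\n"
        f"  dims: [{dims_text}]\n"
        f"}}\n"
    )


def render_config(seq_len: int = 128, last_dim: int = 768) -> str:
    """Build the config.pbtxt text used in this tutorial."""
    # The three inputs share a data type and shape, so only the name varies.
    parts = [
        render_tensor("input", name, "TYPE_INT64", [1, seq_len])
        for name in ("input_ids", "attention_mask", "token_type_ids")
    ]
    parts.append(
        render_tensor("output", "result0", "TYPE_FP32", [1, seq_len, last_dim])
    )
    parts.append("instance_group {\n  kind: KIND_CPU\n}\n")
    parts.append('backend: "max"\n')
    parts.append('default_model_filename: "bert-mlm.torchscript"\n')
    return "".join(parts)
```

Writing the result of `render_config()` to `$BERT_DIR/config.pbtxt` should give a file equivalent to the one shown above.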

Verify the file and directory structure

Before we run the Docker container, we need to make sure that our file and directory structure is correct. Do that by running the following command:

tree $MODEL_REPOSITORY

The output of the tree command should be similar to the following. (The root path depends on your home directory, so it might be different on your machine.)

/home/ubuntu/model-repository
└── bert-mlm
    ├── 1
    │   └── bert-mlm.torchscript
    └── config.pbtxt

2 directories, 2 files

If your directory structure resembles the preceding example, then you are ready to run the Docker container! If not, then make the necessary adjustments until the directory structure is correct.
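
If the tree command is not installed, the same check can be scripted. This is a rough sketch under the layout described in this tutorial; the function name `check_model_dir` and its messages are hypothetical.

```python
from pathlib import Path


def check_model_dir(model_dir: Path, model_filename: str) -> list[str]:
    """Return a list of layout problems; an empty list means the layout looks right."""
    if not model_dir.is_dir():
        return ["model directory not found"]

    problems = []
    if not (model_dir / "config.pbtxt").is_file():
        problems.append("missing config.pbtxt")

    # Version directories are the subdirectories with purely numeric names.
    versions = sorted(
        p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()
    )
    if not versions:
        problems.append("no numbered version directory")
    for version in versions:
        if not (version / model_filename).is_file():
            problems.append(f"missing {version.name}/{model_filename}")
    return problems
```

For this tutorial, `check_model_dir(Path.home() / "model-repository" / "bert-mlm", "bert-mlm.torchscript")` should return an empty list once the model and configuration file are in place.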

Run the Docker image

In this step, you'll get the latest Docker image for running Triton on MAX and use it to serve the BERT model.

docker run -it --rm --net=host \
-v $MODEL_REPOSITORY:/models \
public.ecr.aws/modular/max-serving:latest \
tritonserver --model-repository=/models

As the Docker container starts up, MAX Engine compiles the model. After that's done and the server is running, you'll see the usual Triton logs, including the endpoints that are running:

I0215 23:16:18.906473 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0215 23:16:18.906671 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0215 23:16:18.948484 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002

For now, leave this terminal running. We'll use a second terminal to make inference requests.
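
Model compilation can take a while, so a script that depends on the server may want to wait for it instead of watching the logs. The sketch below polls the HTTP readiness endpoint (`/v2/health/ready` on port 8000) using only the Python standard library. The function name `wait_until_ready` and its timeout defaults are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(
    base_url: str = "http://localhost:8000",
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> bool:
    """Poll /v2/health/ready until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            url = f"{base_url}/v2/health/ready"
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # The server is not accepting connections yet, or it answered
            # with an error status; fall through and retry.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A caller could run `wait_until_ready(timeout_s=300)` from the second terminal and send requests only after it returns `True`.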

Send inference requests

We'll use a second terminal to create a script to connect to the Triton Inference Server and send inference requests.

To start, open a second terminal and navigate to the bert-python-torchscript directory:

cd ~/max/examples/inference/bert-python-torchscript

Start up the magic shell environment.

magic shell

Next, create a triton-inference.py script.

touch triton-inference.py

Add the following code to the script.

#!/usr/bin/env python3

import os

# suppress extraneous logging
os.environ["TRANSFORMERS_VERBOSITY"] = "critical"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from argparse import ArgumentParser

import numpy as np
import torch
import tritonclient.http as httpclient
from transformers import BertTokenizer

BATCH = 1
SEQLEN = 128
DEFAULT_MODEL_NAME = "bert-mlm"
DESCRIPTION = "BERT model"
HF_MODEL_NAME = "bert-base-uncased"


def execute(triton_client, model_name, inputs):
    # Set the input data
    triton_inputs = [
        httpclient.InferInput("input_ids", inputs["input_ids"].shape, "INT64"),
        httpclient.InferInput(
            "attention_mask", inputs["attention_mask"].shape, "INT64"
        ),
        httpclient.InferInput(
            "token_type_ids", inputs["token_type_ids"].shape, "INT64"
        ),
    ]
    triton_inputs[0].set_data_from_numpy(inputs["input_ids"].astype(np.int64))
    triton_inputs[1].set_data_from_numpy(
        inputs["attention_mask"].astype(np.int64)
    )
    triton_inputs[2].set_data_from_numpy(
        inputs["token_type_ids"].astype(np.int64)
    )

    print("Executing model...")
    results = triton_client.infer(model_name, triton_inputs)
    print("Model executed.\n")

    return results.as_numpy("result0")


def main():
    # Parse args
    parser = ArgumentParser(description=DESCRIPTION)
    parser.add_argument(
        "--input",
        type=str,
        metavar="str",
        required=True,
        help="Text with a masked token.",
    )
    parser.add_argument(
        "--model-name",
        type=str,
        default=DEFAULT_MODEL_NAME,
        help="Model name to execute inference.",
    )
    parser.add_argument(
        "--url",
        type=str,
        required=False,
        default="localhost:8000",
        help="Inference server URL. Default is localhost:8000.",
    )
    args = parser.parse_args()
    torch.set_default_device("cpu")
    # Create a triton client
    triton_client = httpclient.InferenceServerClient(url=args.url)

    # Preprocess input statement
    print("Processing input...")
    tokenizer = BertTokenizer.from_pretrained(HF_MODEL_NAME)
    inputs = tokenizer(
        args.input,
        return_tensors="np",
        return_token_type_ids=True,
        padding="max_length",
        truncation=True,
        max_length=SEQLEN,
    )
    print("Input processed.\n")
    masked_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[1]
    outputs = execute(triton_client, args.model_name, inputs)
    logits = torch.from_numpy(outputs[0, masked_index, :])
    predicted_token_ids = logits.argmax(dim=-1)
    predicted_tokens = [
        tokenizer.decode(
            [token_id],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True,
        )
        for token_id in predicted_token_ids
    ]
    filled_mask = "".join(predicted_tokens)
    # Get the predictions for the masked token
    print(f"input text: {args.input}")
    print(f"filled mask: {args.input.replace('[MASK]', filled_mask)}")


if __name__ == "__main__":
    main()
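
The post-processing at the end of the script is compact, so here is a toy version of the same steps: find the position of the mask token, select the model output at that position, and pick the highest-scoring token ID. The vocabulary, token IDs, and scores below are invented for illustration. The real script gets them from the tokenizer and from the server's response.

```python
import numpy as np

# Hypothetical 5-token vocabulary; ID 4 plays the role of [MASK].
VOCAB = ["paris", "is", "the", "capital", "[MASK]"]
MASK_ID = 4

# One sequence of 4 tokens, "paris is the [MASK]", with shape (1, 4).
input_ids = np.array([[0, 1, 2, 4]])

# Made-up model output with shape (batch, sequence, vocabulary) = (1, 4, 5).
logits = np.zeros((1, 4, 5), dtype=np.float32)
logits[0, 3, 3] = 9.0  # at the masked position, "capital" scores highest

# nonzero() returns (row_indices, column_indices); [1] keeps the columns,
# which are the positions of the mask token within the sequence.
masked_index = (input_ids == MASK_ID).nonzero()[1]

# Select the scores at the masked position(s), then take the best token ID.
predicted_ids = logits[0, masked_index, :].argmax(axis=-1)
predicted_tokens = [VOCAB[i] for i in predicted_ids]
print(predicted_tokens)
```

The script does the same selection on the array returned by `execute`, then decodes the winning ID with the tokenizer instead of a hand-written list.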

Now, let's return to the home directory and run the script.

cd
python3 max/examples/inference/bert-python-torchscript/triton-inference.py \
--input "Paris is the [MASK] of France."

You should see results like this:

input text: Paris is the [MASK] of France.
filled mask: Paris is the capital of France.

That's it! You're now running a Triton Server with MAX Engine as the backend. Be sure to try other text inputs, for example:

python3 max/examples/inference/bert-python-torchscript/triton-inference.py \
--input "This is the [MASK] tutorial ever!"

You can also fetch some metadata about the service:

  • Check if the server is started and ready. An HTTP 200 status means ready; the response has no body, so there is nothing to pipe to json.tool:

    curl -v localhost:8000/v2/health/ready
  • Get the model metadata (input/output parameters):

    curl localhost:8000/v2/models/bert-mlm | python3 -m json.tool
  • Get the loaded model configuration:

    curl localhost:8000/v2/models/bert-mlm/config | python3 -m json.tool
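
The metadata and configuration endpoints return JSON, so they are also easy to query from Python. The sketch below uses only the standard library and the same URLs as the curl commands above. The helper names `model_urls` and `fetch_json` are hypothetical.

```python
import json
import urllib.request


def model_urls(
    model_name: str, base_url: str = "http://localhost:8000"
) -> dict[str, str]:
    """Return the metadata and configuration URLs for one model."""
    root = f"{base_url}/v2/models/{model_name}"
    return {"metadata": root, "config": f"{root}/config"}


def fetch_json(url: str) -> dict:
    """GET a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)
```

With the server running, `fetch_json(model_urls("bert-mlm")["metadata"])` should return the same information as the second curl command, as a Python dictionary.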

Next steps

In this tutorial, you've explored how to locally deploy a Docker container of NVIDIA's Triton Inference Server, using MAX Engine as the inference backend. You can use what you've learned here to serve other models, either those included with MAX or custom models you've built yourself.
