
Serving with NVIDIA's Triton Server

NVIDIA's Triton Inference Server is a common choice for organizations that want to deploy inference engines to production. In this tutorial, you'll learn how to run a Triton Inference Server in a Docker container with MAX Engine as its backend inference engine. This tutorial uses a BERT model for masked word prediction; however, you can apply the same steps to many of the examples in the max GitHub repository.

Prerequisites

Before following the steps in this topic, make sure you have completed the following:

  • Cloned the max repository. You can clone the repository with this command:

    git clone git@github.com:modularml/max.git
  • Installed Docker.

Create a virtual environment

Using a virtual environment ensures that you have the Python version and packages that are compatible with this project. We'll use the Magic CLI to create the environment and install the required packages.

If you don't have Magic, you can install it on macOS and Ubuntu Linux with this command:

curl -ssL https://magic.modular.com | bash

Then run the source command printed in your terminal.

  1. Navigate into the BERT Python code example.

    cd max/examples/inference/bert-python-torchscript
  2. Now start a shell in the environment and see your MAX version:

    magic shell
    python3 -c 'from max import engine; print(engine.__version__)'

Create your directory structure

Triton requires models to be stored in a specific layout inside a model repository: one directory per model, containing a numbered version subdirectory that holds the model file. Your next step is to create that directory structure.

First, create a shell variable that holds the path of the model repository.

MODEL_REPOSITORY=~/model-repository

Next, define the path of the directory for the BERT model.

BERT_DIR=$MODEL_REPOSITORY/bert-mlm

Then create the model directory and, inside it, the version directory (1) that will hold the model file:

mkdir -p $BERT_DIR/1
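
If you prefer to script this step, the same layout can be created from Python. This is a minimal sketch using only the standard library; the helper name create_model_dirs is hypothetical and not part of MAX or Triton, and the demo writes to a temporary directory rather than your home directory.

```python
import tempfile
from pathlib import Path


def create_model_dirs(repository: Path, model_name: str, version: int = 1) -> Path:
    """Create <repository>/<model_name>/<version>/ and return the version directory."""
    version_dir = repository / model_name / str(version)
    # parents=True mirrors `mkdir -p`; exist_ok=True makes the call repeatable.
    version_dir.mkdir(parents=True, exist_ok=True)
    return version_dir


# Demo in a throwaway directory so nothing is written to your real home directory.
with tempfile.TemporaryDirectory() as tmp:
    created = create_model_dirs(Path(tmp) / "model-repository", "bert-mlm")
    print(created.relative_to(tmp).as_posix())  # model-repository/bert-mlm/1
```

To create the real directories, you would pass Path.home() / "model-repository" as the repository argument.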

Download the model

With the directory structure in place, you can now download the model. Because this is a PyTorch model, it must be converted to TorchScript format before MAX Engine can load it. To make this step easier, we provide a script that performs both tasks.

cd  # Or the directory where you cloned max
python3 max/examples/inference/common/bert-torchscript/download-model.py \
--output-path $BERT_DIR/1/bert-mlm.torchscript --mlm

Note: This script downloads the model and converts it to TorchScript. When you work with your own PyTorch model, you'll need to convert it to TorchScript as well. You can learn how by reading PyTorch's Introduction to TorchScript.

Define the Triton configuration file

In addition to a specific directory structure, NVIDIA's Triton Inference Server also requires a model configuration file that follows a precise format. Let's create that file.

First, create a file, config.pbtxt.

touch $BERT_DIR/config.pbtxt

Next, open the file and paste the following content.

input {
  name: "input_ids"
  data_type: TYPE_INT64
  dims: [1, 128]
}
input {
  name: "attention_mask"
  data_type: TYPE_INT64
  dims: [1, 128]
}
input {
  name: "token_type_ids"
  data_type: TYPE_INT64
  dims: [1, 128]
}
output {
  name: "result0"
  data_type: TYPE_FP32
  dims: [1, 128, 30522]
}
instance_group {
  kind: KIND_CPU
}
backend: "max"
default_model_filename: "bert-mlm.torchscript"

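Essentially, this file tells Triton the name, data type, and shape of each input the model expects and of the output it returns. It also says that inference runs on the CPU, which backend to use, and which file in the version directory contains the model.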
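
To make the structure of the file more concrete, here is a small sketch that renders the three input blocks from Python data. The helper render_tensor is hypothetical, written only for illustration; Triton itself only reads the resulting text file.

```python
from typing import Sequence


def render_tensor(kind: str, name: str, data_type: str, dims: Sequence[int]) -> str:
    """Render one `input { ... }` or `output { ... }` block of config.pbtxt."""
    dims_text = ", ".join(str(d) for d in dims)
    return "\n".join(
        [
            f"{kind} {{",
            f'  name: "{name}"',
            f"  data_type: {data_type}",
            f"  dims: [{dims_text}]",
            "}",
        ]
    )


# The three BERT inputs share one type and shape: [batch size, sequence length].
INPUT_NAMES = ["input_ids", "attention_mask", "token_type_ids"]
input_blocks = [
    render_tensor("input", name, "TYPE_INT64", [1, 128]) for name in INPUT_NAMES
]
print("\n".join(input_blocks))
```

Generating the blocks this way is optional; it mainly helps when several models share the same input signature and you want to avoid copy-and-paste mistakes.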

Verify the file and directory structure

We're almost ready to run the Docker container. But first we need to make sure that our file and directory structure is correct. We can do that by running the following command:

tree $MODEL_REPOSITORY

The output of the tree command should be similar to the following. (The root path depends on your home directory.)

/home/ubuntu/model-repository
└── bert-mlm
    ├── 1
    │   └── bert-mlm.torchscript
    └── config.pbtxt

2 directories, 2 files

If your directory structure resembles the preceding example, then you are ready to run the Docker container! If not, make the necessary adjustments until the directory structure is correct.
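
If the tree command is not installed, a similar check can be scripted. The sketch below uses only the standard library; check_model_dir is a hypothetical helper name, and the rules it checks (a config.pbtxt file plus at least one numbered version directory containing the model file) are taken from the layout shown above.

```python
from pathlib import Path
from typing import List


def check_model_dir(model_dir: Path, model_filename: str) -> List[str]:
    """Return a list of problems; an empty list means the layout looks right."""
    if not model_dir.is_dir():
        return [f"{model_dir} is not a directory"]

    problems = []
    if not (model_dir / "config.pbtxt").is_file():
        problems.append("missing config.pbtxt")

    # Version directories are the subdirectories with purely numeric names.
    version_dirs = sorted(
        p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()
    )
    if not version_dirs:
        problems.append("no numeric version directory")
    for version_dir in version_dirs:
        if not (version_dir / model_filename).is_file():
            problems.append(f"version {version_dir.name} is missing {model_filename}")
    return problems
```

For this tutorial you would call check_model_dir(Path.home() / "model-repository" / "bert-mlm", "bert-mlm.torchscript") and expect an empty list.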

Run the Docker image

We've built a Docker image that you can use locally. Let's get it running!

docker run -it --rm --net=host \
-v $MODEL_REPOSITORY:/models \
public.ecr.aws/modular/max-serving:latest \
tritonserver --model-repository=/models

As the Docker container starts up, MAX Engine compiles the model. After that’s done and the server is running, you’ll see the usual Triton logs, including the endpoints that are running:

I0215 23:16:18.906473 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0215 23:16:18.906671 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0215 23:16:18.948484 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002

For now, leave this terminal running; we'll use a second terminal to make inference requests.
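
Because model compilation can take a while, it can be convenient to wait for the server programmatically instead of watching the logs. The following is a minimal sketch that polls the HTTP readiness endpoint (/v2/health/ready on port 8000, as shown in the logs above). The function names, retry count, and delay are assumptions chosen for illustration.

```python
import http.client
import time
import urllib.request


def server_is_ready(base_url: str = "http://localhost:8000") -> bool:
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=2) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(probe=server_is_ready, attempts: int = 30, delay: float = 2.0) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

Calling wait_until_ready() from the second terminal blocks until the server responds or the attempts run out. The probe parameter exists so the retry logic can be exercised without a running server.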

Send inference requests

Now we'll create a script that connects to the Triton Inference Server and sends inference requests.

To start, open a second terminal and navigate to the bert-python-torchscript directory:

cd ~/max/examples/inference/bert-python-torchscript

Start up the magic shell environment.

magic shell

Next, create a triton-inference.py script.

touch triton-inference.py

Add the following code to the script.

#!/usr/bin/env python3

import os

# suppress extraneous logging
os.environ["TRANSFORMERS_VERBOSITY"] = "critical"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from argparse import ArgumentParser

import numpy as np
import torch
import tritonclient.http as httpclient
from transformers import BertTokenizer

BATCH = 1
SEQLEN = 128
DEFAULT_MODEL_NAME = "bert-mlm"
DESCRIPTION = "BERT model"
HF_MODEL_NAME = "bert-base-uncased"


def execute(triton_client, model_name, inputs):
    # Set the input data
    triton_inputs = [
        httpclient.InferInput("input_ids", inputs["input_ids"].shape, "INT64"),
        httpclient.InferInput(
            "attention_mask", inputs["attention_mask"].shape, "INT64"
        ),
        httpclient.InferInput(
            "token_type_ids", inputs["token_type_ids"].shape, "INT64"
        ),
    ]
    triton_inputs[0].set_data_from_numpy(inputs["input_ids"].astype(np.int64))
    triton_inputs[1].set_data_from_numpy(
        inputs["attention_mask"].astype(np.int64)
    )
    triton_inputs[2].set_data_from_numpy(
        inputs["token_type_ids"].astype(np.int64)
    )

    print("Executing model...")
    results = triton_client.infer(model_name, triton_inputs)
    print("Model executed.\n")

    return results.as_numpy("result0")


def main():
    # Parse args
    parser = ArgumentParser(description=DESCRIPTION)
    parser.add_argument(
        "--input",
        type=str,
        metavar="str",
        required=True,
        help="Text with a masked token.",
    )
    parser.add_argument(
        "--model-name",
        type=str,
        default=DEFAULT_MODEL_NAME,
        help="Model name to execute inference.",
    )
    parser.add_argument(
        "--url",
        type=str,
        required=False,
        default="localhost:8000",
        help="Inference server URL. Default is localhost:8000.",
    )
    args = parser.parse_args()
    torch.set_default_device("cpu")
    # Create a triton client
    triton_client = httpclient.InferenceServerClient(url=args.url)

    # Preprocess input statement
    print("Processing input...")
    tokenizer = BertTokenizer.from_pretrained(HF_MODEL_NAME)
    inputs = tokenizer(
        args.input,
        return_tensors="np",
        return_token_type_ids=True,
        padding="max_length",
        truncation=True,
        max_length=SEQLEN,
    )
    print("Input processed.\n")
    masked_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[1]
    outputs = execute(triton_client, args.model_name, inputs)
    logits = torch.from_numpy(outputs[0, masked_index, :])
    predicted_token_ids = logits.argmax(dim=-1)
    predicted_tokens = [
        tokenizer.decode(
            [token_id],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True,
        )
        for token_id in predicted_token_ids
    ]
    filled_mask = "".join(predicted_tokens)
    # Get the predictions for the masked token
    print(f"input text: {args.input}")
    print(f"filled mask: {args.input.replace('[MASK]', filled_mask)}")


if __name__ == "__main__":
    main()

Now, let's return to the home directory and run the script.

cd
python3 max/examples/inference/bert-python-torchscript/triton-inference.py \
--input "Paris is the [MASK] of France."

You should see results like this:

input text: Paris is the [MASK] of France.
filled mask: Paris is the capital of France.
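
To see how the script turns the model output into a word, it may help to look at the post-processing step in isolation: find the position of the [MASK] token, take the highest-scoring entry of the logits at that position, and look up the corresponding token. The sketch below reproduces that logic in plain Python with a made-up five-token vocabulary and made-up scores; the real script uses the BERT tokenizer and the logits returned by Triton.

```python
def argmax(values):
    """Return the index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical toy vocabulary; the real one comes from the BERT tokenizer.
TOY_VOCAB = ["[PAD]", "[MASK]", "paris", "capital", "france"]
MASK_ID = 1


def fill_mask(input_ids, logits, vocab=TOY_VOCAB, mask_id=MASK_ID):
    """Return the predicted token for every masked position."""
    masked_positions = [i for i, tok in enumerate(input_ids) if tok == mask_id]
    return [vocab[argmax(logits[pos])] for pos in masked_positions]


input_ids = [2, 1, 4]  # "paris [MASK] france"
logits = [  # one row of made-up scores per input position
    [0.1, 0.0, 2.0, 0.3, 0.2],
    [0.0, 0.1, 0.4, 3.5, 0.6],
    [0.2, 0.0, 0.1, 0.3, 2.2],
]
print(fill_mask(input_ids, logits))  # ['capital']
```

Only the row at the masked position matters for the prediction; the scores at the other positions are ignored.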

That's it! You're now running a Triton Server with MAX Engine as the backend. Be sure to try other text inputs, for example:

python3 max/examples/inference/bert-python-torchscript/triton-inference.py \
--input "This is the [MASK] tutorial ever!"

You can also fetch some metadata about the service:

  • Check if the server is started and ready (a ready server returns HTTP status 200 with an empty body, so there is no JSON to format):

    curl -v localhost:8000/v2/health/ready
  • Get the model metadata (input/output parameters):

    curl localhost:8000/v2/models/bert-mlm | python3 -m json.tool
  • Get the loaded model configuration:

    curl localhost:8000/v2/models/bert-mlm/config | python3 -m json.tool
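
The metadata endpoint can also be queried from Python. This is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes the metadata response is a JSON object with "inputs" and "outputs" lists whose entries carry "name", "datatype", and "shape" fields. The sample dictionary is an abbreviated, hand-written stand-in for a real response.

```python
import json
import urllib.request


def fetch_json(url: str) -> dict:
    """GET a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def describe_tensors(metadata: dict) -> list:
    """Summarize each input and output tensor as one line of text."""
    lines = []
    for section in ("inputs", "outputs"):
        for tensor in metadata.get(section, []):
            kind = section[:-1]  # "inputs" -> "input"
            lines.append(f"{kind} {tensor['name']}: {tensor['datatype']} {tensor['shape']}")
    return lines


# Abbreviated, hand-written sample; a real response lists every tensor.
sample = {"inputs": [{"name": "input_ids", "datatype": "INT64", "shape": [1, 128]}]}
print("\n".join(describe_tensors(sample)))  # input input_ids: INT64 [1, 128]
```

With the server running, describe_tensors(fetch_json("http://localhost:8000/v2/models/bert-mlm")) would summarize the live model instead of the sample.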

Next steps

In this tutorial, you've explored how to run NVIDIA's Triton Inference Server in a Docker container, using MAX Engine as the inference backend. You can use what you've learned here to serve other models, either those included with MAX or custom models you've built yourself.
