Serving with NVIDIA's Triton Server
NVIDIA's Triton Inference Server is a common option for many organizations that
want to deploy their inference engines into a production environment. In this
tutorial, you'll learn how to create a Docker container of a Triton Inference
Server that uses MAX Engine as its backend inference engine. This tutorial
focuses on a MAX Engine that uses BERT for masked word prediction; however, you
can apply this tutorial to many of the examples in the max
GitHub repository.
Prerequisites
Before following the steps in this topic, make sure you have completed the following:
- Cloned the max repository. You can clone the repository with this command:
git clone git@github.com:modularml/max.git
Create a virtual environment
Using a virtual environment ensures that you have the Python version and packages that are compatible with this project. We'll use the Magic CLI to create the environment and install the required packages.
If you don't have Magic, install it first. You can install Magic on macOS and Ubuntu Linux with this command:
curl -ssL https://magic.modular.com | bash
Then run the source command printed in your terminal.
- Navigate into the BERT Python code example:
cd max/examples/inference/bert-python-torchscript
- Now start a shell in the environment and see your MAX version:
magic shell
python3 -c 'from max import engine; print(engine.__version__)'
Create your directory structure
Triton expects models to be stored in a specific layout: a model repository that contains one directory per model, each with a numbered version subdirectory that holds the model file. Your next step is to create that directory structure.
First, create an environment variable for the model repository.
MODEL_REPOSITORY=~/model-repository
You also need a directory for the BERT model.
BERT_DIR=$MODEL_REPOSITORY/bert-mlm
Then, create the version directory (1) that will hold the model file itself:
mkdir -p $BERT_DIR/1
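If you prefer to script the setup, the same layout can be created from Python. This is a minimal sketch using only the standard library; the helper name make_model_dirs is hypothetical, and the demo writes into a temporary directory rather than ~/model-repository so that it is safe to run anywhere.

```python
import tempfile
from pathlib import Path


def make_model_dirs(repo_root: Path, model_name: str = "bert-mlm", version: int = 1) -> Path:
    """Create <repo_root>/<model_name>/<version>/ and return the version directory."""
    version_dir = repo_root / model_name / str(version)
    # parents=True mirrors `mkdir -p`; exist_ok=True makes the call repeatable.
    version_dir.mkdir(parents=True, exist_ok=True)
    return version_dir


# Demo in a throwaway directory; pass Path.home() / "model-repository"
# to match the MODEL_REPOSITORY variable used in the shell commands.
with tempfile.TemporaryDirectory() as tmp:
    created = make_model_dirs(Path(tmp))
    print(created.relative_to(tmp))
```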
Download the model
With the directory structure in place, you can now download the model. Because this model uses PyTorch, we need to convert the model into a TorchScript format. To make this step easier, we provide you with a script to perform these tasks.
cd  # Or the directory where you cloned max
python3 max/examples/inference/common/bert-torchscript/download-model.py \
--output-path $BERT_DIR/1/bert-mlm.torchscript --mlm
This script handles downloading the model and converting it to TorchScript. When you work with your own PyTorch model, you'll need to convert it to TorchScript as well. You can learn how by reading PyTorch's Introduction to TorchScript.
Define the Triton configuration file
In addition to a specific directory structure, NVIDIA's Triton Inference Server also requires a model configuration file that follows a precise format. Let's create that file.
First, create a file named config.pbtxt.
touch $BERT_DIR/config.pbtxt
Next, open the file and paste the following content.
input {
  name: "input_ids"
  data_type: TYPE_INT64
  dims: [1, 128]
}
input {
  name: "attention_mask"
  data_type: TYPE_INT64
  dims: [1, 128]
}
input {
  name: "token_type_ids"
  data_type: TYPE_INT64
  dims: [1, 128]
}
output {
  name: "result0"
  data_type: TYPE_FP32
  dims: [1, 128, 768]
}
instance_group {
  kind: KIND_CPU
}
backend: "max"
default_model_filename: "bert-mlm.torchscript"
Essentially, this file tells Triton what inputs the model should expect to receive, and what type of output it should return.
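If you maintain several models, it can be less error-prone to generate this file than to paste it by hand. The following is a rough sketch, not part of Triton or MAX: render_config is a hypothetical helper, and the tensor names, types, and dims are copied from the configuration shown in this section.

```python
def render_config(inputs, outputs, backend="max", model_file="bert-mlm.torchscript"):
    """Render a Triton model configuration in protobuf text format."""

    def tensor_block(kind, name, data_type, dims):
        dims_text = ", ".join(str(d) for d in dims)
        return (
            f"{kind} {{\n"
            f'  name: "{name}"\n'
            f"  data_type: {data_type}\n"
            f"  dims: [{dims_text}]\n"
            f"}}"
        )

    blocks = [tensor_block("input", *spec) for spec in inputs]
    blocks += [tensor_block("output", *spec) for spec in outputs]
    blocks.append("instance_group {\n  kind: KIND_CPU\n}")
    blocks.append(f'backend: "{backend}"')
    blocks.append(f'default_model_filename: "{model_file}"')
    return "\n".join(blocks) + "\n"


# (name, data_type, dims) tuples taken from the config.pbtxt in this tutorial.
INPUTS = [
    ("input_ids", "TYPE_INT64", [1, 128]),
    ("attention_mask", "TYPE_INT64", [1, 128]),
    ("token_type_ids", "TYPE_INT64", [1, 128]),
]
OUTPUTS = [("result0", "TYPE_FP32", [1, 128, 768])]

config_text = render_config(INPUTS, OUTPUTS)
print(config_text)
```

Writing config_text to $BERT_DIR/config.pbtxt should give the same configuration as pasting the content manually.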
Verify the file and directory structure
We're almost ready to run the Docker image. But first, we need to make sure that our file and directory structure is correct. We can do that by running the following command:
tree $MODEL_REPOSITORY
The output of the tree command should be similar to the following. (The root path depends on your home directory.)
/home/ubuntu/model-repository
└── bert-mlm
    ├── 1
    │   └── bert-mlm.torchscript
    └── config.pbtxt

2 directories, 2 files
If your directory structure resembles the preceding example, then you are ready to run the Docker image! If not, make the necessary adjustments until the directory structure is correct.
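If the tree command is not available on your machine, a few lines of Python can perform a similar check. This is a minimal sketch under the assumption that each model directory needs a config.pbtxt and at least one numeric version directory containing a file; check_model_repository is a hypothetical helper, and the demo builds a throwaway repository in a temporary directory.

```python
import tempfile
from pathlib import Path


def check_model_repository(repo_root: Path) -> list[str]:
    """Return a list of problems found; an empty list means the layout looks right."""
    problems = []
    model_dirs = []
    if repo_root.is_dir():
        model_dirs = [p for p in repo_root.iterdir() if p.is_dir()]
    if not model_dirs:
        problems.append(f"no model directories under {repo_root}")
    for model_dir in model_dirs:
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{model_dir.name}: missing config.pbtxt")
        # Version directories are named with an integer, such as "1".
        versions = [p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()]
        if not versions:
            problems.append(f"{model_dir.name}: no numeric version directory")
        for version_dir in versions:
            if not any(f.is_file() for f in version_dir.iterdir()):
                problems.append(f"{model_dir.name}/{version_dir.name}: no model file")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "bert-mlm" / "1").mkdir(parents=True)
    (root / "bert-mlm" / "config.pbtxt").touch()
    print(check_model_repository(root))  # model file still missing
    (root / "bert-mlm" / "1" / "bert-mlm.torchscript").touch()
    print(check_model_repository(root))  # no problems left
```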
Run the Docker image
We've built a Docker image that you can use locally. Let's get it running!
docker run -it --rm --net=host \
-v $MODEL_REPOSITORY:/models \
public.ecr.aws/modular/max-serving:latest \
tritonserver --model-repository=/models
As the Docker container starts up, MAX Engine compiles the model. After that's done and the server is running, you'll see the usual Triton logs, including the endpoints that are running:
I0215 23:16:18.906473 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0215 23:16:18.906671 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0215 23:16:18.948484 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
For now, leave this terminal running; we'll use a second terminal to make inference requests.
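Because model compilation can take a while, a client script may want to wait for the server before sending requests. Here is a rough sketch that polls the readiness endpoint on the HTTP port shown in the logs, using only the standard library. The function name wait_until_ready and the timeout values are assumptions chosen for illustration.

```python
import time
import urllib.request


def wait_until_ready(base_url="http://localhost:8000", timeout_s=120.0, interval_s=2.0):
    """Poll the readiness endpoint until it answers 200 or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            url = f"{base_url}/v2/health/ready"
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timeout, or HTTP error: not ready yet.
            pass
        time.sleep(interval_s)
    return False
```

Calling wait_until_ready() from another terminal should return True once the server is up, and False if the timeout passes first.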
Send inference requests
We'll use a second terminal to create a script to connect to the Triton Inference Server and send inference requests.
To start, open a second terminal and navigate to the bert-python-torchscript directory:
cd ~/max/examples/inference/bert-python-torchscript
Start up the magic shell environment:
magic shell
Next, create a triton-inference.py script:
touch triton-inference.py
Add the following code to the script.
#!/usr/bin/env python3
import os

# suppress extraneous logging
os.environ["TRANSFORMERS_VERBOSITY"] = "critical"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from argparse import ArgumentParser

import numpy as np
import torch
import tritonclient.http as httpclient
from transformers import BertTokenizer

BATCH = 1
SEQLEN = 128
DEFAULT_MODEL_NAME = "bert-mlm"
DESCRIPTION = "BERT model"
HF_MODEL_NAME = "bert-base-uncased"


def execute(triton_client, model_name, inputs):
    # Set the input data
    triton_inputs = [
        httpclient.InferInput("input_ids", inputs["input_ids"].shape, "INT64"),
        httpclient.InferInput(
            "attention_mask", inputs["attention_mask"].shape, "INT64"
        ),
        httpclient.InferInput(
            "token_type_ids", inputs["token_type_ids"].shape, "INT64"
        ),
    ]
    triton_inputs[0].set_data_from_numpy(inputs["input_ids"].astype(np.int64))
    triton_inputs[1].set_data_from_numpy(
        inputs["attention_mask"].astype(np.int64)
    )
    triton_inputs[2].set_data_from_numpy(
        inputs["token_type_ids"].astype(np.int64)
    )

    print("Executing model...")
    results = triton_client.infer(model_name, triton_inputs)
    print("Model executed.\n")

    return results.as_numpy("result0")


def main():
    # Parse args
    parser = ArgumentParser(description=DESCRIPTION)
    parser.add_argument(
        "--input",
        type=str,
        metavar="str",
        required=True,
        help="Text with a masked token.",
    )
    parser.add_argument(
        "--model-name",
        type=str,
        default=DEFAULT_MODEL_NAME,
        help="Model name to execute inference.",
    )
    parser.add_argument(
        "--url",
        type=str,
        required=False,
        default="localhost:8000",
        help="Inference server URL. Default is localhost:8000.",
    )
    args = parser.parse_args()

    torch.set_default_device("cpu")

    # Create a triton client
    triton_client = httpclient.InferenceServerClient(url=args.url)

    # Preprocess input statement
    print("Processing input...")
    tokenizer = BertTokenizer.from_pretrained(HF_MODEL_NAME)
    inputs = tokenizer(
        args.input,
        return_tensors="np",
        return_token_type_ids=True,
        padding="max_length",
        truncation=True,
        max_length=SEQLEN,
    )
    print("Input processed.\n")

    # Column indices of every [MASK] token in the (single) input row
    masked_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[1]
    outputs = execute(triton_client, args.model_name, inputs)
    logits = torch.from_numpy(outputs[0, masked_index, :])
    predicted_token_ids = logits.argmax(dim=-1)
    predicted_tokens = [
        tokenizer.decode(
            [token_id],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True,
        )
        for token_id in predicted_token_ids
    ]
    filled_mask = "".join(predicted_tokens)

    # Get the predictions for the masked token
    print(f"input text: {args.input}")
    print(f"filled mask: {args.input.replace('[MASK]', filled_mask)}")


if __name__ == "__main__":
    main()
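The post-processing at the end of main() finds the position of each masked token, takes the highest-scoring entry of the logits at that position, and maps it back to a word. The toy sketch below shows the same idea in plain Python so you can follow it without the tokenizer or a running server. The four-word vocabulary, the ids, and the scores are all made up for illustration.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=values.__getitem__)


def fill_masks(input_ids, logits, mask_id, vocab):
    """Pick the highest-scoring vocabulary entry at every masked position."""
    masked_positions = [i for i, tok in enumerate(input_ids) if tok == mask_id]
    return [vocab[argmax(logits[pos])] for pos in masked_positions]


# Toy data: a 4-entry vocabulary and a 3-token sequence with one mask.
vocab = ["paris", "capital", "france", "[MASK]"]
input_ids = [0, 3, 2]  # "paris [MASK] france"
logits = [
    [9.0, 0.1, 0.2, 0.0],  # scores for position 0
    [0.3, 7.5, 0.4, 0.0],  # scores for the masked position
    [0.1, 0.2, 8.0, 0.0],  # scores for position 2
]
print(fill_masks(input_ids, logits, mask_id=3, vocab=vocab))  # ['capital']
```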
Now, let's return to the home directory and run the script.
cd
python3 max/examples/inference/bert-python-torchscript/triton-inference.py \
--input "Paris is the [MASK] of France."
You should see results like this:
input text: Paris is the [MASK] of France.
filled mask: Paris is the capital of France.
That's it! You're now running a Triton Server with MAX Engine as the backend. Be sure to try other text inputs, for example:
python3 max/examples/inference/bert-python-torchscript/triton-inference.py \
--input "This is the [MASK] tutorial ever!"
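The script relies on tritonclient to build the request, but Triton's HTTP endpoint follows the KServe v2 inference protocol, so a request can also be written as plain JSON. The sketch below builds such a body for POST /v2/models/bert-mlm/infer using only the standard library. build_infer_request is a hypothetical helper, and the token ids are placeholders rather than real tokenizer output.

```python
import json


def build_infer_request(encoded, output_name="result0"):
    """Build a JSON body for POST /v2/models/<model>/infer."""
    inputs = []
    for name in ("input_ids", "attention_mask", "token_type_ids"):
        rows = encoded[name]  # nested list shaped [batch][sequence]
        inputs.append(
            {
                "name": name,
                "shape": [len(rows), len(rows[0])],
                "datatype": "INT64",
                # Tensor data is sent here as a flat, row-major list.
                "data": [value for row in rows for value in row],
            }
        )
    return {"inputs": inputs, "outputs": [{"name": output_name}]}


SEQLEN = 128
tokens = [101, 103, 102]  # placeholder ids, not real tokenizer output
padding = [0] * (SEQLEN - len(tokens))
encoded = {
    "input_ids": [tokens + padding],
    "attention_mask": [[1] * len(tokens) + padding],
    "token_type_ids": [[0] * SEQLEN],
}
body = build_infer_request(encoded)
print(json.dumps(body)[:80])
```

In practice you would still use the tokenizer from the script to produce the three input tensors; only the transport differs.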
You can also fetch some metadata about the service:
- Check if the container is started and ready (the response body is empty, so look for an HTTP 200 status in the verbose output):
curl -v localhost:8000/v2/health/ready
- Get the model metadata (input/output parameters):
curl localhost:8000/v2/models/bert-mlm | python3 -m json.tool
- Get the loaded model configuration:
curl localhost:8000/v2/models/bert-mlm/config | python3 -m json.tool
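The same endpoints can be queried from Python if you want to use the metadata in a script. This is a minimal standard-library sketch; metadata_urls and fetch_json are hypothetical helper names, and the host and model name default to the values used in this tutorial.

```python
import json
import urllib.request


def metadata_urls(host="localhost:8000", model="bert-mlm"):
    """Map a short label to each endpoint used in the curl commands."""
    base = f"http://{host}/v2"
    return {
        "ready": f"{base}/health/ready",
        "metadata": f"{base}/models/{model}",
        "config": f"{base}/models/{model}/config",
    }


def fetch_json(url, timeout_s=5.0):
    """GET a URL and decode a JSON body; an empty body yields None."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        payload = resp.read()
    return json.loads(payload) if payload else None
```

With the server running, fetch_json(metadata_urls()["metadata"]) should return the model's input and output description as a dictionary.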
Next steps
In this tutorial, you've explored how to serve a model with NVIDIA's Triton Inference Server in a Docker container, using MAX Engine as the inference backend. You can apply what you've learned here to serve other models included with MAX or custom models you've built yourself.