Deploy a TorchScript model locally with NVIDIA's Triton Server
NVIDIA's Triton Inference Server is a common option for many organizations that
want to deploy their inference engines into a production environment. In this
tutorial, you'll learn how to locally deploy a Docker container of a Triton
Inference Server that uses MAX Engine as its backend inference engine. This
tutorial uses BERT for masked word prediction; however, you can apply this
tutorial to many of the examples in the max GitHub repository.
Prerequisites
Before following the steps in this topic, make sure you have completed the following:
- Cloned the max repository. You can clone the repository with this command:
git clone git@github.com:modularml/max.git
Create a virtual environment
Using a virtual environment ensures that you have the Python version and packages that are compatible with this project. We'll use the Magic CLI to create the environment and install the required packages.
- Navigate into the BERT Python code example:
cd max/examples/inference/bert-python-torchscript
- Now start a shell in the environment and see your MAX version:
magic shell
python3 -c 'from max import engine; print(engine.__version__)'
Create your directory structure
Triton has very specific requirements when it comes to where models are stored. Your next action, then, is to create that directory structure.
First, create an environment variable for the model repository.
MODEL_REPOSITORY=~/model-repository
You also need a directory for the BERT model.
BERT_DIR=$MODEL_REPOSITORY/bert-mlm
Then, create the numbered version directory (1) that will hold the model file itself:
mkdir -p $BERT_DIR/1
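If you prefer to script this step, the same layout can be created from Python. This is a minimal sketch that mirrors the shell commands above; the helper name create_model_dir is hypothetical and is not part of MAX or Triton.

```python
from pathlib import Path


def create_model_dir(repository: Path, model_name: str, version: int = 1) -> Path:
    """Create <repository>/<model_name>/<version>/ and return the version directory."""
    version_dir = repository / model_name / str(version)
    # parents=True also creates the repository and model directories,
    # like `mkdir -p` does in the shell.
    version_dir.mkdir(parents=True, exist_ok=True)
    return version_dir
```

For the paths used in this tutorial, you would call create_model_dir(Path.home() / "model-repository", "bert-mlm").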
Download the model
With the directory structure in place, you can now download the model. Because this model uses PyTorch, we need to convert the model into a TorchScript format. To make this step easier, we provide you with a script to perform these tasks.
cd  # Or whichever directory you cloned max into
python3 max/examples/inference/common/bert-torchscript/download-model.py \
--output-path $BERT_DIR/1/bert-mlm.torchscript --mlm
Define the Triton configuration file
In addition to a specific directory structure, NVIDIA's Triton Inference Server also requires a model configuration file that follows a precise format. Let's create that file.
First, create a file named config.pbtxt:
touch $BERT_DIR/config.pbtxt
Next, open the file and paste the following content.
input {
  name: "input_ids"
  data_type: TYPE_INT64
  dims: [1, 128]
}
input {
  name: "attention_mask"
  data_type: TYPE_INT64
  dims: [1, 128]
}
input {
  name: "token_type_ids"
  data_type: TYPE_INT64
  dims: [1, 128]
}
output {
  name: "result0"
  data_type: TYPE_FP32
  dims: [1, 128, 768]
}
instance_group {
  kind: KIND_CPU
}
backend: "max"
default_model_filename: "bert-mlm.torchscript"
Essentially, this file tells Triton what inputs the model should expect to receive, and what type of output it should return.
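Because the three input blocks differ only in their names, it can be convenient to generate the file rather than type it by hand. The following is a rough sketch, not an official tool: the helper names render_tensor and render_config are hypothetical, and the values simply repeat the configuration shown above.

```python
def render_tensor(kind: str, name: str, data_type: str, dims: list) -> str:
    """Render one input or output block in protobuf text format."""
    dims_text = ", ".join(str(d) for d in dims)
    return (
        f"{kind} {{\n"
        f'  name: "{name}"\n'
        f"  data_type: {data_type}\n"
        f"  dims: [{dims_text}]\n"
        f"}}\n"
    )


def render_config(inputs, outputs, backend: str, model_filename: str) -> str:
    """Assemble a CPU-only config.pbtxt from (name, data_type, dims) tuples."""
    parts = [render_tensor("input", *spec) for spec in inputs]
    parts += [render_tensor("output", *spec) for spec in outputs]
    parts.append("instance_group {\n  kind: KIND_CPU\n}\n")
    parts.append(f'backend: "{backend}"\n')
    parts.append(f'default_model_filename: "{model_filename}"\n')
    return "".join(parts)


# The same values as the configuration in this tutorial.
BERT_CONFIG = render_config(
    inputs=[
        ("input_ids", "TYPE_INT64", [1, 128]),
        ("attention_mask", "TYPE_INT64", [1, 128]),
        ("token_type_ids", "TYPE_INT64", [1, 128]),
    ],
    outputs=[("result0", "TYPE_FP32", [1, 128, 768])],
    backend="max",
    model_filename="bert-mlm.torchscript",
)
```

Writing BERT_CONFIG to $BERT_DIR/config.pbtxt should give a file equivalent to the one you pasted by hand.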
Verify the file and directory structure
Before we run the Docker container, we need to make sure that our file and directory structure is correct. Do that by running the following command:
tree $MODEL_REPOSITORY
The output of the tree command should be similar to the following. (The root path will differ depending on your home directory.)
/home/ubuntu/model-repository
└── bert-mlm
    ├── 1
    │   └── bert-mlm.torchscript
    └── config.pbtxt

2 directories, 2 files
If your directory structure resembles the preceding example, then you are ready to run the Docker container! If not, then make the necessary adjustments until the directory structure is correct.
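If the tree command is not installed, a short Python check can confirm the same layout. This is only a sketch of the checks described in this section; the function name check_model_repository is hypothetical, and it looks only for the pieces this tutorial creates.

```python
from pathlib import Path


def check_model_repository(repository: Path, model_name: str, model_filename: str) -> list:
    """Return a list of human-readable problems; an empty list means the layout looks right."""
    model_dir = repository / model_name
    if not model_dir.is_dir():
        return [f"missing model directory: {model_dir}"]

    problems = []
    if not (model_dir / "config.pbtxt").is_file():
        problems.append(f"missing config file: {model_dir / 'config.pbtxt'}")

    # Version directories are the numerically named subdirectories, such as "1".
    version_dirs = [p for p in sorted(model_dir.iterdir()) if p.is_dir() and p.name.isdigit()]
    if not version_dirs:
        problems.append(f"no numbered version directory under {model_dir}")
    for version_dir in version_dirs:
        if not (version_dir / model_filename).is_file():
            problems.append(f"missing model file: {version_dir / model_filename}")
    return problems
```

For this tutorial, check_model_repository(Path.home() / "model-repository", "bert-mlm", "bert-mlm.torchscript") should return an empty list.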
Run the Docker image
In this step, you'll get the latest Docker image for running Triton on MAX and use it to serve the BERT model.
docker run -it --rm --net=host \
-v $MODEL_REPOSITORY:/models \
public.ecr.aws/modular/max-serving:latest \
tritonserver --model-repository=/models
As the Docker container starts up, MAX Engine compiles the model. After that's done and the server is running, you'll see the usual Triton logs, including the endpoints that are running:
I0215 23:16:18.906473 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0215 23:16:18.906671 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0215 23:16:18.948484 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
For now, leave this terminal running. We'll use a second terminal to make inference requests.
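Because model compilation can take a while, a script that depends on the server may want to wait for it instead of watching the logs. The sketch below polls the /v2/health/ready endpoint on the HTTP port shown in the logs above, using only the Python standard library. The function names, the two-minute timeout, and the polling interval are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def server_is_ready(base_url: str = "http://localhost:8000") -> bool:
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(probe=server_is_ready, timeout_s: float = 120.0, interval_s: float = 2.0) -> bool:
    """Call probe() repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling wait_until_ready() from the second terminal returns True once the server accepts requests, or False if the timeout passes first.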
Send inference requests
We'll use a second terminal to create a script to connect to the Triton Inference Server and send inference requests.
To start, open a second terminal and navigate to the bert-python-torchscript directory:
cd ~/max/examples/inference/bert-python-torchscript
Start up the magic shell environment:
magic shell
Next, create a triton-inference.py script:
touch triton-inference.py
Add the following code to the script.
#!/usr/bin/env python3
import os

# suppress extraneous logging
os.environ["TRANSFORMERS_VERBOSITY"] = "critical"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from argparse import ArgumentParser

import numpy as np
import torch
import tritonclient.http as httpclient
from transformers import BertTokenizer

BATCH = 1
SEQLEN = 128
DEFAULT_MODEL_NAME = "bert-mlm"
DESCRIPTION = "BERT model"
HF_MODEL_NAME = "bert-base-uncased"


def execute(triton_client, model_name, inputs):
    # Set the input data
    triton_inputs = [
        httpclient.InferInput("input_ids", inputs["input_ids"].shape, "INT64"),
        httpclient.InferInput(
            "attention_mask", inputs["attention_mask"].shape, "INT64"
        ),
        httpclient.InferInput(
            "token_type_ids", inputs["token_type_ids"].shape, "INT64"
        ),
    ]
    triton_inputs[0].set_data_from_numpy(inputs["input_ids"].astype(np.int64))
    triton_inputs[1].set_data_from_numpy(
        inputs["attention_mask"].astype(np.int64)
    )
    triton_inputs[2].set_data_from_numpy(
        inputs["token_type_ids"].astype(np.int64)
    )

    print("Executing model...")
    results = triton_client.infer(model_name, triton_inputs)
    print("Model executed.\n")

    return results.as_numpy("result0")


def main():
    # Parse args
    parser = ArgumentParser(description=DESCRIPTION)
    parser.add_argument(
        "--input",
        type=str,
        metavar="str",
        required=True,
        help="Text with a masked token.",
    )
    parser.add_argument(
        "--model-name",
        type=str,
        default=DEFAULT_MODEL_NAME,
        help="Model name to execute inference.",
    )
    parser.add_argument(
        "--url",
        type=str,
        required=False,
        default="localhost:8000",
        help="Inference server URL. Default is localhost:8000.",
    )
    args = parser.parse_args()

    torch.set_default_device("cpu")

    # Create a triton client
    triton_client = httpclient.InferenceServerClient(url=args.url)

    # Preprocess input statement
    print("Processing input...")
    tokenizer = BertTokenizer.from_pretrained(HF_MODEL_NAME)
    inputs = tokenizer(
        args.input,
        return_tensors="np",
        return_token_type_ids=True,
        padding="max_length",
        truncation=True,
        max_length=SEQLEN,
    )
    print("Input processed.\n")

    # Find the position of the [MASK] token in the (1, SEQLEN) input
    masked_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[1]
    outputs = execute(triton_client, args.model_name, inputs)

    # Pick the highest-scoring token at each masked position
    logits = torch.from_numpy(outputs[0, masked_index, :])
    predicted_token_ids = logits.argmax(dim=-1)
    predicted_tokens = [
        tokenizer.decode(
            [token_id],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True,
        )
        for token_id in predicted_token_ids
    ]
    filled_mask = "".join(predicted_tokens)

    # Get the predictions for the masked token
    print(f"input text: {args.input}")
    print(f"filled mask: {args.input.replace('[MASK]', filled_mask)}")


if __name__ == "__main__":
    main()
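The post-processing at the end of main() can be easier to follow on a toy example. The sketch below reproduces the same two steps in plain Python: locate the masked position, then take the highest-scoring vocabulary entry at that position. The five-token vocabulary, the mask id of 4, and the scores are invented for illustration; the real script does this with NumPy and PyTorch on the tokenizer's actual ids.

```python
def find_mask_positions(input_ids, mask_token_id):
    """Return the indices in the sequence where the mask token appears."""
    return [i for i, token_id in enumerate(input_ids) if token_id == mask_token_id]


def argmax(scores):
    """Return the index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def predict_masked_ids(input_ids, logits, mask_token_id):
    """For each masked position, pick the vocabulary id with the highest score."""
    return [argmax(logits[pos]) for pos in find_mask_positions(input_ids, mask_token_id)]


# Toy data: a 4-token sequence, a 5-entry vocabulary, and id 4 standing in for [MASK].
input_ids = [0, 2, 4, 1]
logits = [
    [0.1, 0.2, 0.3, 0.1, 0.0],
    [0.0, 0.9, 0.1, 0.0, 0.0],
    [0.2, 0.1, 0.1, 2.5, 0.3],  # scores at the masked position
    [1.0, 0.0, 0.0, 0.0, 0.0],
]
print(predict_masked_ids(input_ids, logits, mask_token_id=4))  # prints [3]
```

In the real script, the predicted ids are then passed to tokenizer.decode to turn them back into text.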
Now, let's return to the home directory and run the script.
cd
python3 max/examples/inference/bert-python-torchscript/triton-inference.py \
--input "Paris is the [MASK] of France."
You should see results like this:
input text: Paris is the [MASK] of France.
filled mask: Paris is the capital of France.
That's it! You're now running a Triton Server with MAX Engine as the backend. Be sure to try other text inputs, for example:
python3 max/examples/inference/bert-python-torchscript/triton-inference.py \
--input "This is the [MASK] tutorial ever!"
You can also fetch some metadata about the service:
- Check if the container is started and ready. The readiness endpoint returns an empty body, so look for the HTTP 200 status in the verbose output:
curl -v localhost:8000/v2/health/ready
- Get the model metadata (input/output parameters):
curl localhost:8000/v2/models/bert-mlm | python3 -m json.tool
- Get the loaded model configuration:
curl localhost:8000/v2/models/bert-mlm/config | python3 -m json.tool
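The metadata and configuration requests above can also be made from Python with only the standard library. This is a small sketch that assumes the server is reachable at localhost:8000, as in this tutorial; the helper names endpoint_urls and fetch_json are hypothetical.

```python
import json
import urllib.request


def endpoint_urls(model_name: str, base_url: str = "http://localhost:8000") -> dict:
    """Build the metadata and configuration URLs used by the curl commands above."""
    return {
        "metadata": f"{base_url}/v2/models/{model_name}",
        "config": f"{base_url}/v2/models/{model_name}/config",
    }


def fetch_json(url: str, timeout_s: float = 5.0):
    """GET a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, print(json.dumps(fetch_json(endpoint_urls("bert-mlm")["metadata"]), indent=2)) prints the same information as the second curl command.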
Next steps
In this tutorial, you've deployed NVIDIA's Triton Inference Server locally in a Docker container, using MAX Engine as the inference backend. You can apply what you've learned here to serve other models, either those included with MAX or custom models you've built yourself.