Serving with NVIDIA's Triton Server

NVIDIA's Triton Server is a serving framework that simplifies deploying AI models at scale in production environments. It provides a consistent and optimized interface for deploying models trained with various frameworks, such as PyTorch and ONNX. Triton Server offers features like model versioning, multi-model deployment, and support for various deployment environments, making it a versatile solution for AI inference deployment.

This page shows you how to try serving MAX Engine with Triton by running it in a Docker container, and then sending inference requests from a Python program.

Before you begin, be sure you install the latest version of MAX.
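
If you installed MAX with the modular CLI, getting the latest version typically looks like this (shown as a reminder; see the MAX install guide for the exact, current instructions):

modular install max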

note

MAX is currently available only for local development. It is not yet licensed for production deployment.

Deploy an example

Here’s a quick-start guide to running an inference with MAX Engine and Triton Server on a local system.

We’ve created a Docker container that includes both Triton and MAX Engine, which you can either download and run, or build yourself from a Dockerfile. After the Docker container is running, you can send inference requests from an HTTP/gRPC client via tritonclient, as you'll see in our example.

To make it as simple as possible to run, we’ve created an example script that downloads and runs the Docker container, and then sends an inference request with a Python program. All you need to do is run a bash script.

First, clone the code examples (if you haven’t already):

git clone git@github.com:modularml/max.git

Then, run one of the deploy.sh scripts. For example, here’s how to run an inference with a BERT model:

  1. Install the Python requirements:

    cd max/examples/inference/bert-python-torchscript
    python3 -m pip install -r requirements.txt
  2. Run the Docker container and send an inference request:

    bash deploy.sh

It might take a few minutes for the model to compile. But once it's done, execution is nearly instant, and subsequent loads will be faster.

You should see output like this:

input text: Paris is the [MASK] of France.
filled mask: Paris is the capital of France.

Deploy your own Docker image

If you look in the deploy.sh script, you’ll see that it uses a Docker image hosted in the following container registry, which always provides the latest version of MAX:

public.ecr.aws/modular/max-serving
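
If you just want that prebuilt image on your machine without running the script, you can pull it directly (docker defaults to the latest tag):

docker pull public.ecr.aws/modular/max-serving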

However, you might want to customize this image yourself. In that case, you can instead build the Docker image using the Dockerfile that’s included in the MAX SDK, which will always correspond to the version of MAX that you have installed.

The following sections describe how to do that, run it, and send requests.

Get the model

First, create a directory for your model that you can mount inside the container.

For example, let’s use the BERT model from our GitHub examples.

If you didn’t already clone the max repo, do that now:

git clone https://github.com/modularml/max.git

Then, with your terminal in that same location, use our Python script to download the model from Hugging Face, convert it to TorchScript format, and save it in the path that you’ll mount in the container:

MODEL_REPOSITORY=~/model-repository \
&& BERT_DIR=$MODEL_REPOSITORY/bert-mlm \
&& mkdir -p $BERT_DIR/1
python3 max/examples/inference/common/bert-torchscript/download-model.py \
--output-path $BERT_DIR/1/bert-mlm.torchscript --mlm
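
If you're curious what the script does, here's a minimal sketch of the general approach, assuming the bert-base-uncased checkpoint (the script's actual checkpoint and options may differ):

import torch
from transformers import BertForMaskedLM, BertTokenizer

# Load the model with torchscript=True so its outputs are traceable tuples.
model = BertForMaskedLM.from_pretrained("bert-base-uncased", torchscript=True)
model.eval()

# Build example inputs padded to the [1, 128] shape declared in config.pbtxt.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Paris is the [MASK] of France.", return_tensors="pt",
                   padding="max_length", max_length=128)

# Trace the model and save it in TorchScript format.
traced = torch.jit.trace(
    model,
    (inputs["input_ids"], inputs["attention_mask"], inputs["token_type_ids"]))
torch.jit.save(traced, "bert-mlm.torchscript")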

Add the Triton config

Next, you need to add a model configuration file (config.pbtxt) that tells Triton about the model's inputs, outputs, and backend.

For example, here's the configuration file for the BERT model:

input {
  name: "input_ids"
  data_type: TYPE_INT64
  dims: [1, 128]
}
input {
  name: "attention_mask"
  data_type: TYPE_INT64
  dims: [1, 128]
}
input {
  name: "token_type_ids"
  data_type: TYPE_INT64
  dims: [1, 128]
}
output {
  name: "result0"
  data_type: TYPE_FP32
  dims: [1, 128, 768]
}
instance_group {
  kind: KIND_CPU
}
backend: "max"
default_model_filename: "bert-mlm.torchscript"

You can copy this from our repo into your project with this command:

cp max/examples/inference/bert-python-torchscript/config.pbtxt $BERT_DIR

Now make sure all the files are in the model repository:

tree $MODEL_REPOSITORY
/home/ubuntu/model-repository
└── bert-mlm
├── 1
│   └── bert-mlm.torchscript
└── config.pbtxt

2 directories, 2 files

Build the Docker image

When you installed the MAX SDK, it saved the Dockerfile for MAX Engine with Triton on your system. Here's how to build the image from that Dockerfile:

MAX_INSTALL_DIR=$(modular config max.path)
docker buildx build \
--file ${MAX_INSTALL_DIR}/Dockerfile \
--tag max_serving_local \
--load \
${MAX_INSTALL_DIR}

Run the Docker image

Now start the container and Triton with this command:

docker run -it --rm --net=host \
-v $MODEL_REPOSITORY:/models \
max_serving_local \
tritonserver --model-repository=/models \
--model-control-mode=explicit \
--load-model=bert-mlm

As the Docker container starts up, MAX Engine compiles the model. Once that’s done and the server is running, you’ll see the usual Triton logs, including the endpoints that are now available:

I0215 23:16:18.906473 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0215 23:16:18.906671 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0215 23:16:18.948484 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002

Leave this terminal running.

Send inference requests

Open a second terminal on the same machine and send an inference request with this example client:

python3 max/examples/inference/bert-python-torchscript/triton-inference.py \
--text "Paris is the [MASK] of France."

You should see results like this:

input text: Paris is the [MASK] of France.
filled mask: Paris is the capital of France.

That's it! You're now running a Triton Server with MAX Engine as the backend.
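
If you'd rather write your own client than use our script, here's a minimal sketch of what triton-inference.py roughly does with the tritonclient library. The tokenizer checkpoint is an assumption for illustration, and decoding the raw output back into the filled-mask text is omitted:

import numpy as np
import tritonclient.http as httpclient
from transformers import BertTokenizer

# Tokenize the input text, padded to the [1, 128] shape declared in config.pbtxt.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
tokens = tokenizer("Paris is the [MASK] of France.", return_tensors="np",
                   padding="max_length", max_length=128)

client = httpclient.InferenceServerClient(url="localhost:8000")

# Wrap each tokenizer output as a Triton input tensor.
inputs = []
for name in ("input_ids", "attention_mask", "token_type_ids"):
    tensor = httpclient.InferInput(name, list(tokens[name].shape), "INT64")
    tensor.set_data_from_numpy(tokens[name].astype(np.int64))
    inputs.append(tensor)

# Request the "result0" output and run inference.
response = client.infer("bert-mlm", inputs,
                        outputs=[httpclient.InferRequestedOutput("result0")])
result = response.as_numpy("result0")
print(result.shape)  # (1, 128, 768), per the model config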

You can also fetch some metadata about the service:

  • Check if the container is started and ready:

    curl -v localhost:8000/v2/health/ready | python3 -m json.tool
  • Get the model metadata (input/output parameters):

    curl localhost:8000/v2/models/bert-mlm | python3 -m json.tool
  • Get the loaded model configuration:

    curl localhost:8000/v2/models/bert-mlm/config | python3 -m json.tool
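
You can fetch the same metadata from Python with tritonclient; a minimal sketch:

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print(client.is_server_ready())               # True once the server is up
print(client.get_model_metadata("bert-mlm"))  # input/output parameters
print(client.get_model_config("bert-mlm"))    # loaded model configuration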

This page has been a brief walkthrough of serving MAX Engine with NVIDIA's Triton Server. This implementation is not licensed for production deployment, but you can freely tinker with it and evaluate it for your AI use cases.

For a more turn-key solution, we'll also make MAX available on AWS Marketplace, so you can quickly deploy a managed container on AWS.

All of this and more is coming soon, when we release MAX for production workloads. Sign up for updates.