Get started with MAX Serving

Welcome to the MAX Serving trial guide!

MAX Serving is essentially a wrapper around MAX Engine to help you deploy AI models as a service. It adds the features needed to serve your AI models in a production environment and respond to inference requests from client programs. You can learn more about it in the MAX Serving introduction.

This page shows you how to try MAX Serving by running it in a Docker container and then sending inference requests from a Python program.

note

Currently, MAX Serving is available only for local development, as part of the MAX preview. It is not yet licensed for production deployment, but that's coming soon.

Deploy an example

Here's a quick-start guide to running an inference with MAX Serving on your system.

1. Install the MAX SDK

If you haven't already done so, follow the instructions to install the MAX SDK, and then return here.

2. Run a Docker example

We’ve created a Docker container that includes MAX Serving (Triton and MAX Engine), which you can either download and run, or build yourself from a Dockerfile. Once the Docker container is running, you can send inference requests from an HTTP/gRPC client via tritonclient, as you'll see in our example.

To make it as simple as possible, we've created an example script that downloads and runs the Docker container, and then sends an inference request with a Python program. All you need to do is run a bash script.

First, clone the code examples (if you haven’t already):

git clone https://github.com/modularml/max.git

Then, run one of the deploy.sh scripts. For example, here’s how to run an inference with a BERT model:

  1. Install the Python requirements:

    cd max/examples/inference/bert-python-torchscript
    python3 -m pip install -r requirements.txt
  2. Run the Docker container and send an inference request:

    bash deploy.sh

It might take a few minutes for the model to compile. But once it's done, execution is nearly instant, and subsequent loads will be faster.

You should see output like this:

input text: Paris is the [MASK] of France.
filled mask: Paris is the capital of France.

Deploy your own Docker image

If you look in the deploy.sh script, you'll see that it pulls a Docker image from the following container registry, which always provides the latest version of MAX Serving:

public.ecr.aws/modular/max-serving

However, you might want to customize this image yourself. In that case, you can instead build the Docker image using the Dockerfile that’s included in the MAX SDK, which will always correspond to the version of MAX that you have installed.

The following sections describe how to build the image, run it, and send inference requests.

1. Get the model

First, create a directory for your model that you can mount inside the container.

For example, let’s use the BERT model from our GitHub examples.

If you didn’t already clone the max repo, do that now:

git clone https://github.com/modularml/max.git

Then, from that same location, run our Python script, which downloads the model from Hugging Face, converts it to TorchScript format, and saves it in the path that you'll mount in the container:

MODEL_REPOSITORY=~/model-repository \
&& BERT_DIR=$MODEL_REPOSITORY/bert-mlm \
&& mkdir -p $BERT_DIR/1
python3 max/examples/inference/common/bert-torchscript/download-model.py \
--output-path $BERT_DIR/1/bert-mlm.torchscript --mlm
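
You don't need to read download-model.py to continue, but for reference, converting a Hugging Face BERT model to TorchScript generally looks something like the following sketch. This is not the script's exact contents, and bert-base-uncased is just an illustrative checkpoint; it assumes the torch and transformers packages are installed:

import torch
from transformers import BertForMaskedLM, BertTokenizer

# Load a pretrained masked-LM checkpoint in TorchScript-friendly mode.
model = BertForMaskedLM.from_pretrained("bert-base-uncased", torchscript=True)
model.eval()

# Build example inputs padded to a fixed sequence length of 128 tokens.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
example = tokenizer(
    "Paris is the [MASK] of France.",
    return_tensors="pt",
    padding="max_length",
    max_length=128,
)

# Trace the model with the example inputs and save the TorchScript file.
traced = torch.jit.trace(
    model,
    (example["input_ids"], example["attention_mask"], example["token_type_ids"]),
)
torch.jit.save(traced, "bert-mlm.torchscript")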

2. Add the Triton config

Currently, MAX Serving is implemented with NVIDIA Triton Inference Server, which requires a model configuration file. For example, here's the config for the BERT model:

input {
  name: "input_ids"
  data_type: TYPE_INT64
  dims: [1, 128]
}
input {
  name: "attention_mask"
  data_type: TYPE_INT64
  dims: [1, 128]
}
input {
  name: "token_type_ids"
  data_type: TYPE_INT64
  dims: [1, 128]
}
output {
  name: "result0"
  data_type: TYPE_FP32
  dims: [1, 128, 768]
}
instance_group {
  kind: KIND_CPU
}
backend: "max"
default_model_filename: "bert-mlm.torchscript"

You can copy this file from our repo into your model repository with this command:

cp max/examples/inference/bert-python-torchscript/config.pbtxt $BERT_DIR

Now make sure all the files are in the model repository:

tree $MODEL_REPOSITORY
/home/ubuntu/model-repository
└── bert-mlm
├── 1
│   └── bert-mlm.torchscript
└── config.pbtxt

2 directories, 2 files
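
Optionally, you can double-check that the TorchScript file loads and runs outside of Triton before serving it. Here's a minimal sketch, assuming the torch package is installed and that the model takes the three inputs shown in config.pbtxt, in that order:

import os
import torch

# Load the traced model from the model repository.
path = os.path.expanduser("~/model-repository/bert-mlm/1/bert-mlm.torchscript")
model = torch.jit.load(path)

# Dummy inputs matching config.pbtxt: three INT64 tensors of shape [1, 128].
input_ids = torch.zeros((1, 128), dtype=torch.int64)
attention_mask = torch.ones((1, 128), dtype=torch.int64)
token_type_ids = torch.zeros((1, 128), dtype=torch.int64)

with torch.no_grad():
    outputs = model(input_ids, attention_mask, token_type_ids)

# A traced Hugging Face model typically returns a tuple of tensors.
shapes = [t.shape for t in outputs] if isinstance(outputs, tuple) else outputs.shape
print(shapes)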

3. Build the Docker image

When you installed the MAX SDK, it saved the Dockerfile for MAX Serving on your system. Here's how to build the image from it:

MAX_INSTALL_DIR=$(modular config max.path)
docker buildx build \
--file ${MAX_INSTALL_DIR}/Dockerfile \
--tag max_serving_local \
--load \
${MAX_INSTALL_DIR}

4. Run the Docker image

Now start the container and Triton with this command:

docker run -it --rm --net=host \
-v $MODEL_REPOSITORY:/models \
max_serving_local \
tritonserver --model-repository=/models \
--model-control-mode=explicit \
--load-model=bert-mlm

As the Docker container starts up, MAX Engine will compile the model. Once that’s done and the server is running, you’ll see the usual Triton logs, including the endpoints that are running:

I0215 23:16:18.906473 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0215 23:16:18.906671 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0215 23:16:18.948484 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002

Leave this terminal running.
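
If you're scripting these steps, you can also wait for the server to become ready from Python instead of watching the logs. Here's a small polling sketch that assumes the tritonclient[http] package is installed:

import time

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Poll until both the server and the model report ready.
while True:
    try:
        if client.is_server_ready() and client.is_model_ready("bert-mlm"):
            break
    except Exception:
        pass  # The server may not be accepting connections yet.
    time.sleep(1)

print("Server and model are ready")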

5. Send inference requests

Open a second terminal on the same machine and send an inference request with this example client:

python3 max/examples/inference/bert-python-torchscript/triton-inference.py \
--text "Paris is the [MASK] of France."

You should see results like this:

input text: Paris is the [MASK] of France.
filled mask: Paris is the capital of France.

That's it! You're now running MAX Serving (a service with a MAX Engine backend).
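
The example client is also a good starting point for writing your own. A stripped-down request with the tritonclient HTTP API might look roughly like this (a sketch, not the example's exact code; it assumes the tritonclient[http], transformers, and numpy packages are installed, that bert-base-uncased tokenization matches the served model, and that the output tensor is named result0 as in the config above):

import numpy as np
import tritonclient.http as httpclient
from transformers import BertTokenizer

# Tokenize the input text, padded to the fixed sequence length (128).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer(
    "Paris is the [MASK] of France.",
    return_tensors="np",
    padding="max_length",
    max_length=128,
)

# Build one InferInput per input tensor declared in config.pbtxt.
inputs = []
for name in ("input_ids", "attention_mask", "token_type_ids"):
    tensor = encoded[name].astype(np.int64)
    infer_input = httpclient.InferInput(name, list(tensor.shape), "INT64")
    infer_input.set_data_from_numpy(tensor)
    inputs.append(infer_input)

# Send the request to the HTTP endpoint and read back the output tensor.
client = httpclient.InferenceServerClient(url="localhost:8000")
response = client.infer(model_name="bert-mlm", inputs=inputs)
print(response.as_numpy("result0").shape)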

You can also fetch some metadata about the service:

  • Check if the container is started and ready:

    curl -v localhost:8000/v2/health/ready | python3 -m json.tool
  • Get the model metadata (input/output parameters):

    curl localhost:8000/v2/models/bert-mlm | python3 -m json.tool
  • Get the loaded model configuration:

    curl localhost:8000/v2/models/bert-mlm/config | python3 -m json.tool
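
The same information is available from Python through the tritonclient HTTP client, for example (again assuming tritonclient[http] is installed):

import json

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Model metadata (input/output parameters) and the loaded model configuration.
print(json.dumps(client.get_model_metadata("bert-mlm"), indent=2))
print(json.dumps(client.get_model_config("bert-mlm"), indent=2))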

This page has been a brief walkthrough of MAX Serving, which is not yet licensed for production deployment. However, you can freely tinker with it and evaluate it for your AI use cases.

This version is also built only for Triton Inference Server, but our plan is to make MAX Serving available for all leading serving frameworks (NVIDIA Triton, Ray Serve, TF Serving, and others). We’ll also release tools and APIs that allow you to build a custom serving container.

For a more turnkey solution, we'll also make MAX Serving available on AWS Marketplace, so you can quickly deploy a managed container on AWS.

All of this and more is coming soon, when we release MAX for production workloads. Sign up for updates.