
Get started with MAX Serving

Welcome to the MAX Serving trial guide!

MAX Serving is essentially a wrapper around MAX Engine that helps you deploy AI models as a service. It adds the features required to serve your AI models in a production environment and respond to inference requests from client programs. You can learn more about it in the MAX Serving introduction.

This page shows you how to try MAX Serving by running it in a Docker container and then sending inference requests from a Python program.

note

Currently, MAX Serving is available only for local development, as part of the MAX Developer Edition. It's not yet licensed for production deployment, but that's coming soon in the Enterprise Edition.

Deploy an example

Here’s a quick-start guide to running an inference with MAX Serving on your system.

1. Install the MAX SDK

If you haven't already done so, follow the instructions to install the MAX SDK, and then return here.

2. Run a Docker example

We’ve created a Docker container that includes MAX Serving (Triton and MAX Engine), which you can either download and run, or build yourself from a Dockerfile. Once the Docker container is running, you can send inference requests from an HTTP/gRPC client via tritonclient, as you'll see in our example.

To make it as simple as possible to run, we’ve created an example script that downloads and runs the Docker container, and then sends an inference request with a Python program. All you need to do is run a bash script.

First, clone the code examples (if you haven’t already):

git clone git@github.com:modularml/max.git

Then, run one of the deploy.sh scripts. For example, here’s how to run an inference with the RoBERTa model:

  1. Install the Python requirements:

    cd max/examples/inference/roberta-python-tensorflow
    python3 -m pip install -r requirements.txt
  2. Run the Docker container and send an inference request:

    bash deploy.sh

It might take a few minutes for the model to compile. But once it's done, execution is nearly instant, and subsequent loads will be faster.

Deploy your own Docker image

If you look in the deploy.sh script, you’ll see that it runs a Docker image hosted in the following container registry, which always provides the latest version of MAX Serving:

public.ecr.aws/modular/max-serving-de

However, you might want to customize this image yourself. In that case, you can instead build the Docker image using the Dockerfile that’s included in the MAX SDK, which will always correspond to the version of MAX that you have installed.

The following sections describe how to build the image, run it, and send it inference requests.

1. Get the model

First, create a directory for your model that you can mount inside the container.

For example, let’s use the RoBERTa model from our GitHub examples.

If you didn’t already clone the max repo, do that now:

git clone https://github.com/modularml/max.git

Then, with your terminal in that same location, download the model from Hugging Face with our Python script (this also converts the model to TensorFlow SavedModel format) and save it in the path that you’ll mount in the container:

MODEL_REPOSITORY=~/model-repository \
  && ROBERTA_DIR=$MODEL_REPOSITORY/roberta \
  && mkdir -p $ROBERTA_DIR
python3 max/examples/inference/roberta-python-tensorflow/download-model.py \
  --output-dir $ROBERTA_DIR/1/model.savedmodel
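If you're curious what download-model.py does under the hood, here's a minimal sketch of the general approach (the real script in the repo may differ in checkpoint choice, preprocessing, and how it defines the serving signature): it fetches a RoBERTa checkpoint from Hugging Face and exports it as a TensorFlow SavedModel into the directory passed via --output-dir.

# Minimal sketch of a download-and-convert script; details are assumptions,
# not the exact contents of the repo's download-model.py.
import argparse

import tensorflow as tf
from transformers import TFRobertaForSequenceClassification  # requires transformers

parser = argparse.ArgumentParser()
parser.add_argument("--output-dir", required=True)
args = parser.parse_args()

# "roberta-base" is used purely for illustration; the example may use a
# fine-tuned RoBERTa checkpoint instead.
model = TFRobertaForSequenceClassification.from_pretrained("roberta-base")

# Export as a TensorFlow SavedModel into the Triton model-repository path,
# e.g. ~/model-repository/roberta/1/model.savedmodel. The real script may also
# attach an explicit serving signature so Triton knows the input tensors.
tf.saved_model.save(model, args.output_dir)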

2. Add the Triton config

Currently, MAX Serving is implemented with NVIDIA Triton Inference Server, which requires a model configuration file. For example, this is what the config looks like for the RoBERTa model:

instance_group {
  kind: KIND_CPU
}
default_model_filename: "model.savedmodel"
backend: "max"

You can simply copy this from our repo into your project like this:

cp max/examples/inference/roberta-python-tensorflow/config.pbtxt $ROBERTA_DIR

Now all your files are ready and should look like this:

tree $MODEL_REPOSITORY
/home/ubuntu/model-repository
└── roberta
    ├── 1
    │   └── model.savedmodel
    │       ├── assets
    │       ├── fingerprint.pb
    │       ├── saved_model.pb
    │       └── variables
    │           ├── variables.data-00000-of-00001
    │           └── variables.index
    └── config.pbtxt

5 directories, 5 files

3. Build the Docker image

You can find the Dockerfile for MAX Serving in the path provided by modular config max.path. Here’s how you can build the Docker image that corresponds to your installed MAX version:

MAX_INSTALL_DIR=$(modular config max.path)
docker buildx build \
  --build-arg MAX_INSTALL_DIR=${MAX_INSTALL_DIR} \
  --file ${MAX_INSTALL_DIR}/Dockerfile \
  --tag max_serving_local \
  --load \
  ${MAX_INSTALL_DIR}

4. Run the Docker image

Now start the container and Triton with this command:

docker run -it --rm --net=host \
  -v $MODEL_REPOSITORY:/models \
  max_serving_local \
  tritonserver --model-repository=/models \
  --model-control-mode=explicit \
  --load-model=roberta

As the Docker container starts up, MAX Engine will compile the model. Once that’s done and the server is running, you’ll see the usual Triton logs, including the endpoints that are running, like this:

I0215 23:16:18.906473 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0215 23:16:18.906671 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0215 23:16:18.948484 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002

Leave this terminal running.

5. Send inference requests

Open a second terminal on the same machine and send an inference request with this example client:

python3 max/examples/inference/roberta-python-tensorflow/triton-inference.py \
--input "We think LLMs will unlock creativity and productivity for the whole world."

That's it! You're now running MAX Serving (a service with a MAX Engine backend).
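If you'd rather write your own client than use the example script, the request comes down to a handful of tritonclient calls. The sketch below is illustrative only: the tensor names ("input_ids", "attention_mask", "logits"), the tokenizer, and the sequence length are assumptions, so check the model metadata (see the curl commands below) to confirm what your deployment actually expects.

# Illustrative Python client sketch (not the repo's triton-inference.py).
import numpy as np
import tritonclient.http as httpclient
from transformers import AutoTokenizer

client = httpclient.InferenceServerClient(url="localhost:8000")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # illustrative tokenizer

text = "We think LLMs will unlock creativity and productivity for the whole world."
encoded = tokenizer(text, return_tensors="np", padding="max_length", max_length=128)

inputs = []
for name in ("input_ids", "attention_mask"):  # assumed input tensor names
    tensor = httpclient.InferInput(name, list(encoded[name].shape), "INT32")
    tensor.set_data_from_numpy(encoded[name].astype(np.int32))
    inputs.append(tensor)

result = client.infer(model_name="roberta", inputs=inputs)
print(result.as_numpy("logits"))  # assumed output tensor name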

You can also fetch some metadata about the service:

  • Check if the container is started and ready:

    curl -v localhost:8000/v2/health/ready | python3 -m json.tool
  • Get the model metadata (input/output parameters):

    curl localhost:8000/v2/models/roberta | python3 -m json.tool
  • Get the loaded model configuration:

    curl localhost:8000/v2/models/roberta/config | python3 -m json.tool
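If you prefer to stay in Python, tritonclient exposes the same information. A minimal sketch, assuming the server is reachable at localhost:8000:

# Query server health and model metadata via tritonclient instead of curl.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print(client.is_server_ready())              # True once the server is up
print(client.get_model_metadata("roberta"))  # input/output tensor names and shapes
print(client.get_model_config("roberta"))    # the loaded Triton model configuration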

This has been just a brief walkthrough of MAX Serving in the Developer Edition, which is not commercially licensed for production deployment. You can freely tinker with it and evaluate it for your AI use cases.

This version is also built only for Triton Inference Server, but our plan is to make MAX Serving available for all leading serving frameworks (NVIDIA Triton, Ray Serve, TF Serving, and others). We’ll also release tools and APIs that allow you to build a custom serving container.

For a more turn-key solution, we'll also make MAX Serving available on AWS Marketplace, so you can quickly deploy a managed container on AWS.

All of this and more is coming soon, when we release the MAX Enterprise Edition. Sign up for updates.