
MAX Serving intro

MAX Serving is a deployment solution for MAX Engine that works as a drop-in replacement for your existing server-side inference runtime. Because it implements the KServe APIs, it requires no changes to your client programs: you can continue sending inference requests to the cloud with the same APIs you already use.
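For example, a KServe-style inference request is a JSON body POSTed to a versioned model endpoint. Here is a minimal sketch of such a request, assuming a hypothetical model name and input tensor (the model name, tensor name, shape, and data below are placeholders, not part of any real deployment):

```python
import json

# Hypothetical model name; substitute the name of your deployed model.
MODEL_NAME = "my_model"

# KServe v2 REST inference endpoint for that model.
ENDPOINT = f"/v2/models/{MODEL_NAME}/infer"

# Minimal KServe v2 request body: each input tensor declares a
# name, shape, datatype, and its flattened data.
request_body = {
    "inputs": [
        {
            "name": "input_1",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [0.1, 0.2, 0.3, 0.4],
        }
    ]
}

# Serialize to JSON, ready to POST to the endpoint above.
payload = json.dumps(request_body)
print(ENDPOINT)  # → /v2/models/my_model/infer
```

A client that already speaks this protocol to another KServe-compatible server can point the same request at a MAX Serving deployment unchanged.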

You can think of MAX Serving as a wrapper around MAX Engine, making it easier for you to deploy MAX Engine into production without writing all the code to handle incoming requests, load models, balance traffic, and so on.

Try it now

You can try MAX Serving today as part of the MAX Developer Edition. The commercially-licensed version for production use is coming soon in the MAX Enterprise Edition.

Why we created MAX Serving

Deploying AI at scale is complicated because the tools are fragmented, each with its own trade-offs and limitations. It sometimes seems like an impossible task to figure out which combination of tools and hardware provides the best performance-to-cost trade-off for any given model. MAX Engine solves these problems by providing a state-of-the-art compiler and inference runtime that works with TensorFlow, PyTorch, and ONNX models, and on a wide range of hardware.

From there, it's just a matter of deploying your models with MAX Engine into production, using trustworthy tools that include robust scaling, monitoring, and deployment templates. That's where MAX Serving comes in.

MAX Serving will meet you where you are and integrate with your existing serving infrastructure, so you can choose a hosting solution that fits your needs. You can still use the client APIs you know, use the serving/cloud infrastructure you trust, and get all the performance and framework optionality provided by MAX Engine.

How MAX Serving works

In addition to all the features provided by MAX Engine, MAX Serving implements KServe APIs to handle inference requests, query model metadata, check service health, and so on. It works as a drop-in replacement for Triton, TF Serving, TorchServe, Intel Model Server, and other services based on KServe.
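The KServe v2 protocol standardizes these operations as a small set of REST endpoints, which is what makes the drop-in replacement possible. A sketch of the paths involved (the model name is a placeholder):

```python
# Hypothetical model name; substitute your own deployed model.
MODEL = "my_model"

# Standard KServe v2 REST endpoints exposed by a compatible server:
SERVER_LIVE = "/v2/health/live"            # is the server process up?
SERVER_READY = "/v2/health/ready"          # is it ready to serve requests?
MODEL_READY = f"/v2/models/{MODEL}/ready"  # is this specific model loaded?
MODEL_METADATA = f"/v2/models/{MODEL}"     # model inputs, outputs, platform
INFER = f"/v2/models/{MODEL}/infer"        # run inference

print(MODEL_METADATA)  # → /v2/models/my_model
```

A client or load balancer built against these endpoints for any KServe-based server should work against MAX Serving without modification.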

Your client application can continue to use the same APIs to send requests and receive responses from your cloud-hosted AI inference service. The only difference is that your inferencing backend is now powered by MAX Engine.

To begin, we're shipping MAX Serving in a Docker image that uses NVIDIA Triton Inference Server, but our plan is to make MAX Serving available for all leading serving frameworks (NVIDIA Triton, Ray Serve, TF Serving, and others). We'll also release tools and APIs that allow you to build a custom serving container.

For a more turn-key solution, we'll also make MAX Serving available on AWS Marketplace, so you can quickly deploy a managed container on AWS.

All of this and more is coming soon, when we release the MAX Enterprise Edition. Sign up for updates.

Until then, you can try MAX Serving in the MAX Developer Edition (our non-commercially licensed developer kit). We've built a Docker image that includes Triton and MAX Serving, along with an example Python client.

Try MAX Serving now