Intro to MAX Serving

MAX Serving is a deployment solution for MAX Engine that works as a drop-in replacement for your existing server-side inferencing runtime. MAX Serving implements the KServe APIs, so you can send inference requests to the cloud with the same APIs you already use, without any changes to your client programs.

Think of MAX Serving as a wrapper around MAX Engine that makes it easier for you to deploy MAX Engine into production without writing all the code to handle incoming requests, load models, balance traffic, and so on.

Try it now

MAX Serving is available today for individual developers. Contact our sales team for information on commercial licenses for production use.

Why we created MAX Serving

Deploying AI at scale is complicated because the tools are fragmented, each with its own trade-offs and limitations. It can seem impossible to figure out which combination of tools and hardware provides the best performance-to-cost trade-off for a given model. MAX Engine solves these problems by providing a state-of-the-art compiler and inference runtime that works with a variety of models and hardware types.

With MAX Engine as part of your AI development workflow, the next step is to deploy your models into production, using trustworthy tools that include robust scaling, monitoring, and deployment templates. That's where MAX Serving comes in.

MAX Serving meets you where you are and integrates with your existing serving infrastructure, so you can choose a hosting solution that fits your needs. You can still use the client APIs you know, use the serving/cloud infrastructure you trust, and get all the performance and framework optionality that MAX Engine provides.

How MAX Serving works

In addition to all the features provided by MAX Engine, MAX Serving implements the KServe APIs to handle inference requests, query model metadata, check service health, and so on. MAX Serving works as a drop-in replacement for Triton, TorchServe, Intel Model Server, and other services based on KServe.
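
For example, here's a minimal sketch of a health check and a model metadata query using the standard KServe v2 HTTP endpoints. The host, port, and model name are placeholders you would replace with your own deployment's values:

import requests

# Placeholders; substitute your deployment's host, port, and model name.
SERVER = "http://localhost:8000"
MODEL = "my-model"

# KServe v2 health check: returns HTTP 200 when the server is ready.
ready = requests.get(f"{SERVER}/v2/health/ready")
print("Server ready:", ready.status_code == 200)

# KServe v2 model metadata: name, versions, platform, and input/output tensor specs.
metadata = requests.get(f"{SERVER}/v2/models/{MODEL}").json()
print("Model inputs:", metadata["inputs"])
print("Model outputs:", metadata["outputs"])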

Your client application can continue to use the same APIs to send requests and receive responses from your cloud-hosted AI inference service. The only difference is that your inferencing backend is now powered by MAX Engine.
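
As an illustration, the sketch below sends an inference request using the KServe v2 REST protocol, the same call a client would make to Triton or any other KServe-compatible server. The model name, input tensor name, shape, datatype, and data are hypothetical and would need to match your deployed model's metadata:

import requests

SERVER = "http://localhost:8000"  # placeholder endpoint
MODEL = "my-model"                # hypothetical model name

# KServe v2 inference request: the tensor name, shape, and datatype
# must match the deployed model's metadata.
payload = {
    "inputs": [
        {
            "name": "input_1",        # hypothetical input tensor name
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [0.1, 0.2, 0.3, 0.4],
        }
    ]
}

response = requests.post(f"{SERVER}/v2/models/{MODEL}/infer", json=payload)
response.raise_for_status()

# The response follows the same KServe v2 schema regardless of the backend,
# so the client code doesn't change when the backend is powered by MAX Engine.
for output in response.json()["outputs"]:
    print(output["name"], output["data"])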

What to expect from MAX Serving

To begin, we're shipping MAX Serving in a Docker image that uses NVIDIA Triton Inference Server, but our plan is to make MAX Serving available for all leading serving frameworks.

We’ll also release tools and APIs that allow you to build a custom serving container.

For a more turnkey solution, we'll also make MAX Serving available on AWS Marketplace, so you can quickly deploy a managed container on AWS.

To stay up to date on our progress, sign up for our newsletter.

Try MAX Serving now and let us know what you think!