
MAX Serve

MAX simplifies the process of owning your AI endpoint with a ready-to-deploy inference server called MAX Serve. It's a Python-based serving layer that executes large language models (LLMs) and provides an OpenAI-compatible REST endpoint, both locally and in the cloud.

We designed MAX Serve to deliver consistent, reliable performance at scale for LLMs using sophisticated batching and scheduling techniques. It supports native MAX models (models built with MAX Graph) when you want a high-performance GenAI deployment, and off-the-shelf PyTorch LLMs from Hugging Face when you want to explore and experiment.

How it works

We built MAX Serve as a Python library that can run a local endpoint with the max-pipelines tool, and deploy to the cloud with our MAX container. In either case, it provides an OpenAI REST endpoint to handle incoming requests for your LLM, and a Prometheus-formatted metrics endpoint to track your model's performance.
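For example, once a local endpoint is running, any OpenAI-compatible client can talk to it. The following is a minimal sketch; the base URL, port, and model name are assumptions for illustration, not values confirmed by this page:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local MAX Serve endpoint.
# The base URL and model name below are assumptions for illustration;
# use whatever host, port, and Hugging Face model you actually serve.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # a local endpoint typically doesn't validate the key
)

response = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",  # hypothetical model name
    messages=[{"role": "user", "content": "Summarize what MAX Serve does."}],
)
print(response.choices[0].message.content)
```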

MAX Serve provides a low-latency service using a combination of performance-focused designs, including a multi-process HTTP/model-worker architecture (for maximum CPU core utilization), continuous heterogeneous batching (no waiting for a batch to fill), and multi-step scheduling (more inference steps run in parallel for better GPU utilization).
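Continuous batching matters most when many requests arrive at once. The sketch below illustrates that traffic pattern from the client side only (it is not MAX Serve's internal scheduler): it issues several requests concurrently against an assumed local endpoint, leaving the server free to batch and schedule them as they stream in.

```python
import asyncio
from openai import AsyncOpenAI

# Assumed local endpoint and model name, for illustration only.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="modularai/Llama-3.1-8B-Instruct-GGUF",  # hypothetical
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Give me fact #{i} about GPUs." for i in range(8)]
    # The requests go out concurrently; the serving layer can batch them
    # together rather than processing them one at a time.
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for answer in answers:
        print(answer)

asyncio.run(main())
```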

Under the hood, MAX Serve wraps MAX Engine, which is our next-generation graph compiler and runtime that accelerates your models on both CPUs and GPUs.

Figure 1. A simplified diagram of how MAX Serve handles inference requests from your client app.

The MAX container illustrated in figure 1 is pre-configured for compatibility with several different NVIDIA GPU architectures (and AMD GPU support is in the works). All you need to do is specify the model you want from Hugging Face (read about our model support).

To get started, check out our quickstart guide, or try one of our tutorials.

Supported endpoints

Our OpenAI API-compatible endpoints mean you don't have to rewrite your client application to use MAX Serve. We support text-generation, multimodal, and embedding models through the corresponding OpenAI-style endpoint APIs.
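For instance, an embedding model served this way can be queried with the standard OpenAI embeddings API. The endpoint URL and model name below are illustrative assumptions:

```python
from openai import OpenAI

# Assumed local endpoint; swap in your actual host, port, and model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

embedding = client.embeddings.create(
    model="sentence-transformers/all-mpnet-base-v2",  # hypothetical embedding model
    input="MAX Serve exposes an OpenAI-compatible embeddings endpoint.",
)
print(len(embedding.data[0].embedding))  # dimensionality of the returned vector
```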

Additionally, MAX Serve provides a /metrics endpoint for Prometheus-formatted metrics data.
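Prometheus normally scrapes that endpoint on a schedule, but you can also fetch it ad hoc. Here's a minimal sketch, assuming the metrics are exposed on the same host and port as the REST API (an assumption, not a documented default):

```python
import requests

# Prometheus-formatted metrics are plain text: one `metric_name{labels} value`
# sample per line, with HELP/TYPE lines as comments.
resp = requests.get("http://localhost:8000/metrics", timeout=5)
resp.raise_for_status()

for line in resp.text.splitlines():
    if line and not line.startswith("#"):  # skip HELP/TYPE comment lines
        print(line)
```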
