MAX Serve
MAX simplifies the process of owning your AI endpoint with a ready-to-deploy inference server called MAX Serve. It's a Python-based serving layer that runs large language models (LLMs) and provides an OpenAI-compatible REST endpoint, both locally and in the cloud.
We designed MAX Serve to deliver consistent, reliable performance at scale for LLMs using sophisticated batching and scheduling techniques. It supports native MAX models (models built with MAX Graph) when you want a high-performance GenAI deployment, and off-the-shelf PyTorch LLMs from Hugging Face when you want to explore and experiment.
How it works
We built MAX Serve as a Python library that can run a local endpoint with the max-pipelines tool, and deploy to the cloud with our MAX container. In either case, it provides an OpenAI REST endpoint to handle incoming requests for your LLM, and a Prometheus-formatted metrics endpoint to track your model's performance.
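For example, once a local endpoint is running, any OpenAI-compatible client can talk to it. Here's a minimal sketch using the openai Python package; the base URL (port 8000) and the model name are illustrative assumptions, so substitute whatever your own server actually uses:

```python
from openai import OpenAI

# Point the client at the local MAX Serve endpoint instead of api.openai.com.
# The base URL and model name below are assumptions for illustration; use the
# values your own server reports at startup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",  # placeholder model ID
    messages=[{"role": "user", "content": "What is an inference server?"}],
)
print(response.choices[0].message.content)
```

Because the API shape matches OpenAI's, an existing client typically only needs a new base URL and model name.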
MAX Serve provides a low-latency service using a combination of performance-focused designs, including:
- A multi-process HTTP/model worker architecture, for maximum CPU core utilization
- Continuous heterogeneous batching, so requests don't wait for a batch to fill
- Multi-step scheduling, which parallelizes more inference steps for better GPU utilization
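To make the batching idea concrete, here is a deliberately simplified Python sketch of continuous batching. It is not MAX's scheduler, just an illustration of why requests never wait for a batch to fill or drain: new requests join the active batch between decode steps, and finished requests free their slots immediately.

```python
from collections import deque

class ToyRequest:
    """A stand-in for a generation request; decoding is simulated."""
    def __init__(self, prompt: str, tokens_to_generate: int):
        self.prompt = prompt
        self.remaining = tokens_to_generate

    def decode_one_token(self):
        self.remaining -= 1  # stand-in for a real model forward pass

    def is_done(self) -> bool:
        return self.remaining <= 0

class ContinuousBatcher:
    """Toy continuous batcher: requests enter and leave the active batch
    between decode steps instead of waiting for the whole batch to drain."""
    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.waiting = deque()  # submitted but not yet admitted
        self.active = []        # currently being decoded

    def submit(self, request: ToyRequest):
        self.waiting.append(request)

    def step(self):
        # Admit waiting requests into any free batch slots.
        while self.waiting and len(self.active) < self.max_batch_size:
            self.active.append(self.waiting.popleft())
        # Decode one token for every active request.
        for request in self.active:
            request.decode_one_token()
        # Retire finished requests right away, freeing slots for the next step.
        self.active = [r for r in self.active if not r.is_done()]

# A short request shares the batch with a long one and exits early,
# freeing its slot without stalling the long request.
batcher = ContinuousBatcher(max_batch_size=2)
batcher.submit(ToyRequest("long prompt", tokens_to_generate=8))
batcher.submit(ToyRequest("short prompt", tokens_to_generate=2))
while batcher.active or batcher.waiting:
    batcher.step()
```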
Under the hood, MAX Serve wraps MAX Engine, which is our next-generation graph compiler and runtime that accelerates your models on both CPUs and GPUs.
Figure 1. The MAX container.
The MAX container illustrated in figure 1 is pre-configured for compatibility with several different NVIDIA GPU architectures (and AMD GPU support is in the works). All you need to do is specify the model you want from Hugging Face (read about our model support).
To get started, check out our quickstart guide, or try one of our tutorials.
Supported endpoints
Our OpenAI API-compatible endpoint means you don't have to rewrite your client application to use MAX Serve. We support text generation, multimodal, and embedding models with the following endpoint APIs:
- /v1/chat/completions
- /v1/completions
- /v1/embeddings
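As a sketch of the embeddings path, the same openai client pattern works against /v1/embeddings. The base URL and model name are again placeholder assumptions:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The model name is a placeholder; use an embedding model your server loaded.
result = client.embeddings.create(
    model="sentence-transformers/all-mpnet-base-v2",
    input=["MAX Serve exposes an OpenAI-compatible API."],
)
print(len(result.data[0].embedding))  # dimensionality of the returned vector
```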
Additionally, MAX Serve provides a /metrics endpoint for Prometheus-formatted metrics data.
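Because the output is Prometheus's plain-text exposition format, a simple HTTP GET is enough to inspect it. This sketch assumes the same local address as above:

```python
import requests

# Prometheus metrics are plain text, one metric per line; lines starting
# with "#" are HELP/TYPE annotations. The address is an assumption.
metrics = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in metrics.splitlines():
    if not line.startswith("#"):
        print(line)
```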