What is MAX
The Modular Accelerated Xecution (MAX) platform is a unified set of APIs and tools that simplify the process of building and deploying your own high-performance AI endpoint. MAX provides complete flexibility, so you can use your own data and your own model on the hardware of your choice, with the best performance-to-cost tradeoff.
The foundation of MAX is our next-generation graph compiler and runtime—MAX Engine—that delivers unparalleled speed for your PyTorch and GenAI models. Importantly, MAX Engine is built without vendor-specific hardware libraries, enabling it to scale effortlessly across a wide range of CPUs and GPUs. But MAX is much more than a fast and portable runtime. It also includes a quick-to-deploy serving layer—MAX Serve—that orchestrates inference inputs and outputs between your model and client application.
And in the near future, MAX will also include a fully programmable interface for model development and GPU programming, so you can build fully customized and highly-tuned AI models.
We built MAX because there wasn't a single solution that could support all the needs of AI developers today while also helping them scale into the future. Developers need a solution that supports their full inference workflow, from exploration of new use cases to deployment of high-performance cloud services. This requires a tool that provides world-class out-of-the-box performance and portability, that's tightly integrated with AI ecosystem tools such as Python, PyTorch, and Hugging Face, and that's fully extensible for new ideas.
What MAX offers
- Unparalleled GenAI performance: When you need to scale your workloads and reduce your costs, MAX provides unparalleled out-of-the-box speed-ups for PyTorch and GenAI models on CPUs and GPUs.
- Hardware portability: We built MAX from the ground up to be independent of vendor-specific hardware libraries, enabling it to scale effortlessly across a wide range of CPUs and GPUs so you can select the best hardware for your use case.
- Model extensibility: When off-the-shelf models don't provide the performance you need, MAX includes a Python API to build high-performance GenAI models such as large language models (LLMs) and to extend other models with custom operations (coming soon).
- Seamless deployment: MAX integrates with existing tools and cloud infrastructure to minimize migration effort. Our optional serving library, MAX Serve, is available in a ready-to-deploy container, provides an OpenAI API endpoint, and works with Kubernetes.
MAX enables all of this with a rich set of Python APIs, backed by vendor-agnostic GPU kernels written in Mojo. Mojo is a new GPU programming language that looks and feels like Python and integrates with Python code, but it provides the performance, control, and safety of languages like C++, Rust, and Swift.
How to use MAX
If you want to run inference on MAX Engine using our Python APIs on your local system, you can install MAX as a conda package. We made it especially simple to get started with the magic command-line tool, which creates a unique virtual environment for your project where you can install MAX and any other conda or PyPI packages.
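As a rough illustration, here is what local inference can look like with the MAX Engine Python API. This is a minimal sketch, not a definitive recipe: the `max.engine` module path, the `InferenceSession` methods, and the model file name are assumptions based on the published API and may differ in the version you install.

```python
# Minimal sketch of local inference with the MAX Engine Python API.
# Assumes the `max` package is installed in your magic/conda environment.
# The model path and input name below are hypothetical placeholders.
import numpy as np

from max import engine

# Create an inference session and compile the model for your hardware.
session = engine.InferenceSession()
model = session.load("model.onnx")  # hypothetical path to your model file

# Run inference; the keyword name must match the model's input signature.
outputs = model.execute(input=np.random.rand(1, 3, 224, 224).astype(np.float32))
print(outputs)
```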
To try MAX Serve and create an OpenAI-compatible API endpoint for an LLM, you have a couple of options. You can use magic to immediately start a local service for your LLM, or use our MAX container to deploy MAX Serve to the cloud provider of your choice. Either way, you can select an LLM from a variety of PyTorch models on Hugging Face or use one of our high-performance AI models built with MAX Graph for the best performance on GPUs.
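Because the endpoint is OpenAI-compatible, you can query it with the standard OpenAI client libraries. The sketch below assumes a MAX Serve instance running locally on port 8000 and uses a placeholder model name; substitute the address and model for your own deployment.

```python
# Query an OpenAI-compatible MAX Serve endpoint with the standard
# OpenAI Python client. The base_url and model name are assumptions;
# replace them with the values for your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local MAX Serve address
    api_key="EMPTY",  # a local endpoint typically ignores the API key
)

response = client.chat.completions.create(
    model="modularai/llama-3.1",  # placeholder model name
    messages=[{"role": "user", "content": "What is the MAX platform?"}],
)
print(response.choices[0].message.content)
```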
To optimize your model's performance, you can write custom ops that MAX can analyze, optimize, and fuse into the graph (coming soon). Or, you can build your model with the MAX Graph Python API and load pre-trained weights from Hugging Face to unlock even more performance for GenAI models.
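To make the MAX Graph workflow concrete, here is a minimal sketch that builds and runs a tiny graph. The module paths (`max.graph`, `max.dtype`) and call signatures are assumptions based on the published Python API and may not match your installed version exactly.

```python
# Minimal sketch: build and run a tiny graph with the MAX Graph Python API.
# Module paths and signatures are assumptions; check the MAX API docs for
# the version you have installed.
import numpy as np

from max import engine
from max.dtype import DType
from max.graph import Graph, TensorType, ops

# Define a graph that adds two float32 vectors of length 4.
input_type = TensorType(dtype=DType.float32, shape=(4,))
with Graph("vector_add", input_types=(input_type, input_type)) as graph:
    lhs, rhs = graph.inputs
    graph.output(ops.add(lhs, rhs))

# Compile and execute the graph with MAX Engine.
session = engine.InferenceSession()
model = session.load(graph)
result = model.execute(
    np.ones(4, dtype=np.float32),
    np.full(4, 2.0, dtype=np.float32),
)
print(result)
```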
Get started