Model support
MAX allows you to pick the perfect GenAI model for your project from Hugging Face. You just provide the name of the model you want, and MAX takes care of the rest. It builds the model as a high-performance graph and starts a serving endpoint that runs the model on either a CPU or a GPU.
This page explains how this works out of the box with models from Hugging Face, and introduces how you can customize an existing model or create your own.
Model configs
To understand how MAX accelerates hundreds of GenAI models from Hugging Face, you should first know a little about Hugging Face model configurations.
Nowadays, the definitive place to find AI models is the Hugging Face Model Hub. Although models on Hugging Face might be built and trained with different machine learning frameworks, they all include a config.json file, which is like a model blueprint. This file contains all the information you need to understand the model architecture and its configuration, such as the number of layers, the embedding size, and other hyperparameters.
By reading the model configuration, MAX can reconstruct a supported Hugging Face model as a MAX model.
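For example, here's one way to peek at a model's configuration before serving it. This sketch uses the Hugging Face Transformers library (not MAX) and assumes you have access to the meta-llama/Llama-3.2-1B-Instruct repo; the printed fields come straight from its config.json:

```python
from transformers import AutoConfig

# Download and parse config.json for the model (requires access to the repo).
config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

print(config.architectures)      # model architecture, e.g. ['LlamaForCausalLM']
print(config.num_hidden_layers)  # number of transformer layers
print(config.hidden_size)        # embedding size
```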
MAX models
A MAX model is a high-performance inferencing model built with our MAX Python API. It's a unique model format that allows the MAX graph compiler to optimize the model for inference on a wide range of hardware and deliver state-of-the-art performance you normally see only from model-specific inference libraries written in C or C++.
You can build these models yourself with our Python API, but you don't have to. All you have to do is specify the GenAI model you want from Hugging Face (such as meta-llama/Llama-3.2-1B-Instruct), and MAX will programmatically reconstruct it as a MAX model.
This works because we have already built a library of base model architectures with the MAX Python API. When you ask MAX to start an inference server with a Hugging Face model, MAX pulls the corresponding pre-built architecture from our library and makes the appropriate changes based on the configuration from Hugging Face.
This all happens automatically when you start a serving endpoint with max-pipelines or with the MAX container. For example, here's how to start an endpoint using Meta's Llama 3.2 Instruct model as a MAX model:
max-pipelines serve --model-path=meta-llama/Llama-3.2-1B-Instruct
When you run the max-pipelines serve command, MAX pulls the model configuration and weights from Hugging Face and builds it as a MAX model. Then it starts up an endpoint with MAX Serve to handle inference requests.
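Once the endpoint is running, you can send it requests with any OpenAI-compatible client. Here's a minimal sketch using the official openai Python package; the base URL and port (localhost:8000) and the placeholder API key are assumptions about a default local setup, so adjust them to match your deployment:

```python
from openai import OpenAI

# Point the client at the local serving endpoint. A local server typically
# doesn't check the API key, but the client library requires a value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
)
print(response.choices[0].message.content)
```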
Customize a model
If you want to load a different set of weights for a given model, you can pass them in GGUF or Safetensors format using the --weight-path argument. This accepts either a local path or a Hugging Face repo with the weights.
For example, here's how to run Llama-3.2-1B-Instruct on a CPU with quantized weights (from bartowski):
max-pipelines serve --model-path=meta-llama/Llama-3.2-1B-Instruct \
--weight-path=bartowski/Llama-3.2-1B-Instruct-GGUF/Llama-3.2-1B-Instruct-Q6_K.gguf
For help creating your own weights in GGUF format, see the Bring your own fine-tuned model tutorial.
Build your own model
Although our model-building APIs are still under heavy development while we implement the most popular architectures for MAX Serve, you can try building your own models with the APIs today.
To build your own inference model with the MAX Python API, the process generally looks like this (see the sketch after the list):
- Instantiate a Graph by specifying the input shape as a TensorType.
- Build the graph by chaining ops functions. Each function takes and returns a Value object.
- Add the final Value to the graph using the output() method.
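Here's a minimal sketch of that flow for a graph that adds two tensors. Treat the exact module paths and constructor signatures (max.graph.Graph, TensorType, ops.add, and the DType import) as assumptions that may differ slightly between MAX releases; the tutorial linked below has the authoritative version.

```python
from max.dtype import DType
from max.graph import Graph, TensorType, ops

# Describe the input shape and dtype for each graph input.
input_type = TensorType(dtype=DType.float32, shape=(2, 2))

with Graph("simple_add", input_types=(input_type, input_type)) as graph:
    lhs, rhs = graph.inputs      # symbolic Values for each input
    result = ops.add(lhs, rhs)   # chain ops functions to build the graph
    graph.output(result)         # mark the final Value as the graph output
```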
For more information, see our tutorial to get started with MAX Graph in Python.
PyTorch eager mode
As you might suspect, MAX doesn't have a pre-built architecture to match every model on Hugging Face. But that's fine, because MAX Serve also supports eager-mode execution for all other PyTorch LLMs (using the Hugging Face Transformers API).
If MAX doesn't have a pre-built model architecture for the Hugging Face model you pass in, it falls back to running the model with Hugging Face Transformers. That means the model won't be compiled and accelerated with MAX, but you'll still get a MAX Serve endpoint with an OpenAI-compatible API.
However, this is an increasingly unlikely situation for popular GenAI models, because most of them are based on a handful of architectures that we've implemented as MAX models. For example, there are thousands of models based on the LlamaForCausalLM architecture.
You can see the most popular models that work with MAX today (either as MAX models or with eager mode) in the MAX model repository.