
Model support

MAX allows you to pick the right GenAI model for your project from Hugging Face. You just provide the name of the model you want, and MAX takes care of the rest: it builds the model as a high-performance graph and starts a serving endpoint that runs the model on either a CPU or a GPU.

This page explains how this works out of the box with models from Hugging Face, and introduces how you can customize an existing model or create your own.

Model configs

To understand how MAX accelerates hundreds of GenAI models from Hugging Face, you should first know a little about Hugging Face model configurations.

Nowadays, the definitive place to find AI models is the Hugging Face Model Hub. Although models on Hugging Face might be built and trained with different machine learning frameworks, they all include a config.json file, which is like a model blueprint. This file contains all the information you need to understand the model architecture and its configuration, such as the number of layers, the embedding size, and other hyperparameters.

By reading the model configuration, we can reconstruct a Hugging Face model as a MAX model.
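You can inspect this configuration yourself. The following sketch uses the Hugging Face transformers library (not part of MAX) and assumes you have access to the meta-llama/Llama-3.2-1B-Instruct repo; the printed fields are just a few of the ones that describe the architecture:

# Illustrative only: uses the Hugging Face transformers library, not MAX.
from transformers import AutoConfig

# Download and parse the model's config.json from the Hugging Face Hub.
config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

print(config.architectures)      # e.g. ["LlamaForCausalLM"]
print(config.num_hidden_layers)  # number of transformer layers
print(config.hidden_size)        # embedding size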

MAX models

A MAX model is a high-performance inferencing model built with our MAX Python API. It's a unique model format that allows the MAX graph compiler to optimize the model for inference on a wide range of hardware and deliver state-of-the-art performance you normally see only from model-specific inference libraries written in C or C++.

You can build these models yourself with our Python API, but you don't have to. All you have to do is specify the GenAI model you want from Hugging Face (such as meta-llama/Llama-3.2-1B-Instruct), and MAX will programmatically reconstruct it as a MAX model.

This works because we have already built a library of base model architectures with the MAX Python API. When you ask MAX to start an inference server with a Hugging Face model, MAX pulls the corresponding pre-built architecture from our library and makes the appropriate changes based on the configuration from Hugging Face.

This all happens automatically when you start a serving endpoint with max-pipelines or with the MAX container. For example, here's how to start an endpoint using Meta's Llama 3.2 Instruct model as a MAX model:

max-pipelines serve --model-path=meta-llama/Llama-3.2-1B-Instruct

When you run the max-pipelines serve command, MAX pulls the model configuration and weights from Hugging Face and builds it as a MAX model. Then it starts up an endpoint with MAX Serve to handle inference requests.
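Once the endpoint is running, you can send it requests with any OpenAI-compatible client. Here's a minimal sketch using the openai Python package; the base URL below assumes the server is listening locally on port 8000, so adjust it to match the address your server reports:

from openai import OpenAI

# Point the OpenAI client at the local MAX Serve endpoint.
# The port is an assumption; use the address your server actually reports.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
)
print(response.choices[0].message.content)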

Customize a model

If you want to load a different set of weights for a given model, you can pass them in GGUF or Safetensors format using the --weight-path argument. This accepts either a local path or a Hugging Face repo with the weights.

For example, here's how to run Llama-3.2-1B-Instruct on a CPU with quantized weights (from bartowski):

max-pipelines serve --model-path=meta-llama/Llama-3.2-1B-Instruct \
--weight-path=bartowski/Llama-3.2-1B-Instruct-GGUF/Llama-3.2-1B-Instruct-Q6_K.gguf

For help creating your own weights in GGUF format, see the tutorial to Bring your own fine-tuned model.
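Because --weight-path also accepts a local path, you can download the weights yourself first. Here's a sketch using the huggingface_hub library (not part of MAX) to fetch the same quantized GGUF file and get a local path you can pass to --weight-path:

from huggingface_hub import hf_hub_download

# Download the quantized GGUF weights and return the local file path.
local_path = hf_hub_download(
    repo_id="bartowski/Llama-3.2-1B-Instruct-GGUF",
    filename="Llama-3.2-1B-Instruct-Q6_K.gguf",
)
print(local_path)  # pass this path to --weight-path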

Build your own model

Although our model-building APIs are still under heavy development while we implement the most popular architectures for MAX Serve, you can try building your own models with the APIs today.

To build your own inferencing model with the MAX Python API, the process generally looks like this:

  1. Instantiate a Graph by specifying the input shape as a TensorType.

  2. Build the graph by chaining ops functions. Each function takes and returns a Value object.

  3. Add the final Value to the graph using the output() method.

For more information, see our tutorial to get started with MAX Graph in Python.
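Here's a minimal sketch of those three steps, building a trivial graph that adds two tensors. Treat the module paths and signatures as assumptions: the MAX Graph API is still evolving, so follow the tutorial above for the current API.

import numpy as np
from max import engine
from max.dtype import DType
from max.graph import Graph, TensorType, ops

# 1. Instantiate a Graph, specifying the input shapes as TensorTypes.
input_type = TensorType(DType.float32, (2,))
with Graph("simple_add", input_types=(input_type, input_type)) as graph:
    # 2. Build the graph by chaining ops functions on Value objects.
    lhs, rhs = graph.inputs
    result = ops.add(lhs, rhs)
    # 3. Add the final Value to the graph with output().
    graph.output(result)

# Compile the graph with an inference session; the loaded model can then
# be executed with NumPy inputs (see the tutorial for the execute API).
session = engine.InferenceSession()
model = session.load(graph)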

PyTorch eager mode

As you might suspect, MAX doesn't have a pre-built architecture to match every model on Hugging Face. But that's fine, because MAX Serve also supports eager-mode execution for all other PyTorch LLMs (using the Hugging Face Transformers API).

If MAX doesn't have a pre-built model architecture for the Hugging Face model you pass in, it falls back to running the model with Hugging Face Transformers. That means the model won't be compiled and accelerated with MAX, but you'll still get a MAX Serve endpoint with an OpenAI-compatible API.

However, this fallback is increasingly unlikely for popular GenAI models, because most of them are based on a handful of architectures that we've already implemented as MAX models. For example, there are thousands of models based on the LlamaForCausalLM architecture.

You can see the most popular models that work with MAX today (either as MAX models or with eager mode) in the MAX model repository.

Get started