Model support
MAX allows you to pick the perfect GenAI model for your project from Hugging Face. You just provide the name of the model you want, and MAX takes care of the rest. It builds the model as a high-performance graph and starts a serving endpoint that runs the model on either a CPU or a GPU.
This page explains how this works out of the box with models from Hugging Face, and introduces how you can customize an existing model or create your own.
Model configs
To understand how MAX accelerates hundreds of GenAI models from Hugging Face, you should first know a little about Hugging Face model configurations.
Nowadays, the definitive place to find AI models is the Hugging Face Model Hub. Although models on Hugging Face might be built and trained with different machine learning frameworks, they all include a config.json file, which is like a model blueprint. This file contains all the information you need to understand the model architecture and its configuration, such as the number of layers, the embedding size, and other hyperparameters.
By reading the model configuration, MAX can reconstruct a supported Hugging Face model as a MAX model.
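For example, here's one way to peek at a model's configuration before serving it. This sketch uses the Hugging Face Transformers library (not MAX) and assumes you have access to the meta-llama/Llama-3.2-1B-Instruct repo; the printed fields come straight from its config.json:

```python
from transformers import AutoConfig

# Download and parse config.json for the model (requires access to the repo).
config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

print(config.architectures)      # model architecture, e.g. ['LlamaForCausalLM']
print(config.num_hidden_layers)  # number of transformer layers
print(config.hidden_size)        # embedding size
```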
MAX models
A MAX model is a high-performance inferencing model built with our MAX Python API. It's a unique model format that allows the MAX graph compiler to optimize the model for inference on a wide range of hardware and deliver state-of-the-art performance you normally see only from model-specific inference libraries written in C or C++.
You can build these models yourself with our Python API, but you don't have to. All you have to do is specify the GenAI model you want from Hugging Face (such as meta-llama/Llama-3.2-1B-Instruct), and MAX will programmatically reconstruct it as a MAX model.
This works because we have already built a library of base model architectures with the MAX Python API. When you ask MAX to start an inference server with a Hugging Face model, MAX pulls the corresponding pre-built architecture from our library and makes the appropriate changes based on the configuration from Hugging Face.
This all happens automatically when you start a serving endpoint with max-pipelines or with the MAX container. For example, here's how to start an endpoint using Meta's Llama 3.2 Instruct model as a MAX model:
max-pipelines serve --model-path=meta-llama/Llama-3.2-1B-Instruct
When you run the max-pipelines serve command, MAX pulls the model configuration and weights from Hugging Face and builds it as a MAX model. Then it starts up an endpoint with MAX Serve to handle inference requests.
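Once the endpoint is running, you can send it requests with any OpenAI-compatible client. Here's a minimal sketch using the official openai Python package; the base URL and port (localhost:8000) and the placeholder API key are assumptions about a default local setup, so adjust them to match your deployment:

```python
from openai import OpenAI

# Point the client at the local serving endpoint. A local server typically
# doesn't check the API key, but the client library requires a value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
)
print(response.choices[0].message.content)
```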
Customize a model
If you want to load a different set of weights for a given model, you can pass them in GGUF or Safetensors format using the --weight-path argument. This accepts either a local path or a Hugging Face repo with the weights.
For example, here's how to run Llama-3.2-1B-Instruct on a CPU with quantized weights (from bartowski):
max-pipelines serve --model-path=meta-llama/Llama-3.2-1B-Instruct \
--weight-path=bartowski/Llama-3.2-1B-Instruct-GGUF/Llama-3.2-1B-Instruct-Q6_K.gguf
For help creating your own weights in GGUF format, see the Bring your own fine-tuned model tutorial.
Build your own model
Although our model-building APIs are still under heavy development while we implement the most popular architectures for MAX Serve, you can try building your own models with the APIs today.
To build your own inference model with the MAX Python API, the process generally looks like this (see the sketch after the list):
- Instantiate a Graph by specifying the input shape as a TensorType.
- Build the graph by chaining ops functions. Each function takes and returns a Value object.
- Add the final Value to the graph using the output() method.
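Here's a minimal sketch of that flow for a graph that adds two tensors. Treat the exact module paths and constructor signatures (max.graph.Graph, TensorType, ops.add, and the DType import) as assumptions that may differ slightly between MAX releases; the tutorial linked below has the authoritative version.

```python
from max.dtype import DType
from max.graph import Graph, TensorType, ops

# Describe the input shape and dtype for each graph input.
input_type = TensorType(dtype=DType.float32, shape=(2, 2))

with Graph("simple_add", input_types=(input_type, input_type)) as graph:
    lhs, rhs = graph.inputs      # symbolic Values for each input
    result = ops.add(lhs, rhs)   # chain ops functions to build the graph
    graph.output(result)         # mark the final Value as the graph output
```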
For more information, see our tutorial to get started with MAX Graph in Python.
PyTorch eager mode
As you might suspect, MAX doesn't have a pre-built architecture to match every model on Hugging Face. But that's fine, because MAX Serve also supports eager-mode execution for all other PyTorch LLMs (using the Hugging Face Transformers API).
If MAX doesn't have a pre-built model architecture for the Hugging Face model you pass in, it falls back to running the model with Hugging Face Transformers. That means the model won't be compiled and accelerated with MAX, but you'll still get a MAX Serve endpoint with an OpenAI-compatible API.
However, this is an increasingly unlikely situation for popular GenAI models, because most of them are based on a handful of architectures that we've implemented as MAX models. For example, there are thousands of models based on the LlamaForCausalLM architecture.
You can see the most popular models that work with MAX today (either as MAX models or with eager mode) in the MAX model repository.