MAX container
The MAX container is our official Docker container that simplifies the process of deploying a GenAI model with an OpenAI-compatible endpoint. The container includes the latest version of MAX and integrates with orchestration tools like Kubernetes.
Alternatively, you can experiment with MAX on a local endpoint using the max serve command. The result is essentially the same, because the MAX container is a containerized environment that runs max serve to create the endpoint, which you can interact with using our OpenAI-compatible REST API.
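For reference, here's a minimal sketch of the equivalent local command, assuming you've installed the max CLI and your machine meets the system requirements; it uses the same model path as the container examples below:

# Serve the same Llama 3.1 model directly with the max CLI (no container).
max serve --model-path modularai/Llama-3.1-8B-Instruct-GGUF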
Get started
First, make sure you're on a Linux system (or Windows with WSL) with a compatible GPU.
Then start an endpoint with the MAX container:
- Make sure you have Docker installed.

- Start the container and an endpoint for Llama 3:
NVIDIA:

docker run --gpus=1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  modular/max-nvidia-full:latest \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF

AMD:

docker run \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --group-add keep-groups \
  --device /dev/kfd \
  --device /dev/dri \
  modular/max-amd:latest \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF

It can take a few minutes to pull the container and then download and compile the model.
When the endpoint is ready, you'll see a message that says this:
Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
- Open a new terminal and send a request using the openai Python API or curl:

Python:

- Create a new virtual environment:

mkdir quickstart && cd quickstart

python3 -m venv .venv/quickstart \
  && source .venv/quickstart/bin/activate

- Install the OpenAI Python API:

pip install openai
- Create the following file to send an inference request:

generate-text.py:

from openai import OpenAI

client = OpenAI(
    base_url="http://0.0.0.0:8000/v1",
    api_key="EMPTY",
)

completion = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[
        {
            "role": "user",
            "content": "Who won the world series in 2020?"
        },
    ],
)

print(completion.choices[0].message.content)

- Run it and you should see results like this:

python generate-text.py

The Los Angeles Dodgers won the 2020 World Series. They defeated the Tampa Bay Rays in the series 4 games to 2. This was the Dodgers' first World Series title since 1988.
cURL:

- Run this command:

curl -N http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Who won the World Series in 2020?"}
    ]
  }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'

You should see results like this:

The Los Angeles Dodgers won the 2020 World Series. They defeated the Tampa Bay Rays in the series 4 games to 2. This was the Dodgers' first World Series title since 1988.
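If you're scripting against the endpoint (for example, in a deployment pipeline), you can wait for the server to come up before sending requests. Here's a minimal sketch that polls the OpenAI-compatible /v1/models route until it responds; the five-minute timeout is an arbitrary value you can adjust:

# Poll the endpoint until the server responds (give up after ~5 minutes).
for i in $(seq 1 60); do
  if curl -sf http://0.0.0.0:8000/v1/models > /dev/null; then
    echo "Endpoint is ready"
    break
  fi
  echo "Waiting for the endpoint... ($i)"
  sleep 5
done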
For details about the OpenAI-compatible endpoint, see our Serve API docs.
To run a different model, change the --model-path to something else from our model repository.

For information about the available containers, see the Modular Docker Hub repositories.
Container options
The docker run command above includes only the bare-minimum options, but there are other docker options you might consider, plus several options that control features of the endpoint.
Docker options
- --gpus: If your system includes a compatible NVIDIA GPU, you must add the --gpus option so the container can access it. It doesn't hurt to include this even if your system doesn't have a GPU compatible with MAX.

- --devices: When deploying MAX on multiple GPUs, you must specify the IDs of the GPUs to use. For example, to use four available GPUs, include --devices gpu:0,1,2,3 (see the sketch after this list). When you don't specify a --devices option, MAX defaults to the first available GPU it discovers (equivalent to --devices gpu:0). You can also optionally specify --devices cpu.

- -v: We use the -v option to save a cache of Hugging Face models to your local disk so it can be reused across containers.

- -p: We use the -p option to specify the exposed port for the endpoint.
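For example, here's a sketch of a multi-GPU run, assuming a host with four NVIDIA GPUs; the device IDs are illustrative and should match the GPUs you actually want to use:

# Expose all host GPUs to the container, then tell MAX which ones to use.
docker run --gpus=all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  modular/max-nvidia-full:latest \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --devices gpu:0,1,2,3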
You also might need some environment variables (set with --env):
- HF_TOKEN: This is required to access gated models on Hugging Face (after your account is granted access). For example:

docker run \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<YOUR_HF_TOKEN>" \
  -p 8000:8000 \
  modular/max-nvidia-full:latest \
  --model-path mistralai/Mistral-7B-Instruct-v0.2

Learn more about HF_TOKEN and how to create Hugging Face access tokens.
- HF_HUB_ENABLE_HF_TRANSFER: Set this to 1 to enable faster model downloads from Hugging Face. For example:

docker run \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
  modular/max-nvidia-full:latest \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF

Learn more about HF_HUB_ENABLE_HF_TRANSFER.
MAX options
Following the container name in the docker run command, you must specify a model with --model-path, but there are other options you might need in order to configure the max serve behavior. To see all available options, see the max CLI page, because the MAX container is essentially a wrapper around that tool.
- --model-path: This is required to specify the model you want to deploy. To find other GenAI models that are compatible with MAX, check out our list of models on MAX Builds.

- --max-length: Specifies the maximum length of the text sequence (including the input tokens). We mention this one here because it's often necessary to adjust the max length when you have trouble running a large model on a machine with limited memory (see the sketch after this list).
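For example, here's a sketch that caps the sequence length; the value 4096 is only an illustrative number to reduce memory use, not a recommendation for any particular model or GPU:

# Limit the maximum sequence length (prompt plus generated tokens) to reduce memory use.
docker run --gpus=1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  modular/max-nvidia-full:latest \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --max-length 4096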
For the rest of the max serve options, see the max CLI page.
Container contents
There are multiple MAX container options, including:
Full container
The full MAX container (max-full) is a hardware-agnostic container that's built to deploy the latest version of MAX on both AMD and NVIDIA GPUs.

You can run the container on either NVIDIA or AMD as follows:
NVIDIA:

docker run --gpus=1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
  --env "HF_TOKEN=<YOUR_HF_TOKEN>" \
  -p 8000:8000 \
  modular/max-full:latest \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF

AMD:

docker run \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
  --env "HF_TOKEN=<YOUR_HF_TOKEN>" \
  --group-add keep-groups \
  --device /dev/kfd \
  --device /dev/dri \
  -p 8000:8000 \
  modular/max-full:latest \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF
The max-full container includes the following:
- Ubuntu 22.04
- Python 3.12
- MAX 25.4
- PyTorch (GPU) 2.6.0
- ROCm
- cuDNN
- CUDA 12.8
- NumPy
- Hugging Face Transformers
For more information, see the full MAX container on Docker Hub.
AMD container
The AMD MAX container (max-amd) is great if you want an AMD-specific deployment without NVIDIA or CUDA dependencies.

You can run the AMD container as follows:
docker run \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
  --env "HF_TOKEN=<YOUR_HF_TOKEN>" \
  --group-add keep-groups \
  --device /dev/kfd \
  --device /dev/dri \
  -p 8000:8000 \
  modular/max-amd:latest \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF
For more information, see the AMD MAX container on Docker Hub.
NVIDIA container
The NVIDIA MAX container is available in two flavors:
- max-nvidia-full includes all CUDA and PyTorch GPU dependencies
- max-nvidia-base includes minimal dependencies, PyTorch CPU, and the NVIDIA Driver (excludes CUDA)
You can run the NVIDIA container as follows:
docker run --gpus=1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
  --env "HF_TOKEN=<YOUR_HF_TOKEN>" \
  -p 8000:8000 \
  modular/max-nvidia-full:latest \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF
Or, to use the base container, replace max-nvidia-full with max-nvidia-base.
For more information, see the full NVIDIA container or base NVIDIA container on Docker Hub.
Recommended cloud instances
For best performance and compatibility with the available models on MAX Builds, we recommend that you deploy the MAX container on a cloud instance with a GPU that meets the MAX system requirements.
The following are some cloud-based GPU instances and virtual machines that we recommend.
AWS instances:
- P5 instance family (H100 GPU)
- P4d instance family (A100 GPU)
- G5 instance family (A10G GPU)
- G6 instance family (L4 GPU)
- G6e instance family (L40S GPU)
GCP instances:
Azure instances:
- NCads_H100_v5-series virtual machine
- NCCads_H100_v5-series virtual machine
- ND_H100_v5-series virtual machine
- NC_A100_v4-series virtual machine
- NDm_A100_v4-series virtual machine
- ND_A100_v4-series virtual machine
- NVads-A10 v5-series virtual machine
- ND_MI300X_v5-series virtual machine (AMD GPU)
Logs
The MAX container writes logs to stdout in JSON format, which you can consume and view via your cloud provider's platform (for example, with AWS CloudWatch).
The console log level is INFO by default. You can modify the log level using the MAX_SERVE_LOGS_CONSOLE_LEVEL environment variable. It accepts the following log levels (in order of increasing verbosity): CRITICAL, ERROR, WARNING, INFO, DEBUG. For example:
docker run \
  --env MAX_SERVE_LOGS_CONSOLE_LEVEL=DEBUG \
  modular/max-nvidia-full:latest \
  ...
Logs default to structured JSON, but if you'd like a more readable format in your console, you can disable structured logs by setting the MODULAR_STRUCTURED_LOGGING=0 environment variable. For example:
docker run --gpus=1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --env "MODULAR_STRUCTURED_LOGGING=0" \
  modular/max-nvidia-full:latest \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF
Metrics
The MAX container exposes a /metrics endpoint that follows the Prometheus text format. You can scrape the metrics listed below using Prometheus or another collection service.
These are raw metrics, and it's up to you to compute the desired time series and aggregations. For example, we provide a count of output tokens (maxserve_num_output_tokens_total), which you can use to calculate the output tokens per second (OTP/s); see the sketch after the following list.

Here are all the available metrics:
- maxserve_request_time_milliseconds: Histogram of time spent handling each request (total inference time, or TIT), in milliseconds.
- maxserve_input_processing_time_milliseconds: Histogram of input processing time (IPT), in milliseconds.
- maxserve_output_processing_time_milliseconds: Histogram of output generation time (OGT), in milliseconds.
- maxserve_time_to_first_token_milliseconds: Histogram of time to first token (TTFT), in milliseconds.
- maxserve_num_input_tokens_total: Total number of input tokens processed so far.
- maxserve_num_output_tokens_total: Total number of output tokens processed so far.
- maxserve_request_count_total: Total requests since start.
- maxserve_num_requests_running: Number of requests currently running.
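For example, here's a rough sketch of how you might estimate output tokens per second from two scrapes of the /metrics endpoint; the 10-second sampling window and the endpoint address are assumptions you can adjust:

# Sample the output-token counter twice, 10 seconds apart, and compute the rate.
T1=$(curl -s http://0.0.0.0:8000/metrics | awk '/^maxserve_num_output_tokens_total/ {print $2}')
sleep 10
T2=$(curl -s http://0.0.0.0:8000/metrics | awk '/^maxserve_num_output_tokens_total/ {print $2}')
# Strip any decimal part before doing integer arithmetic.
echo "Output tokens per second: $(( (${T2%.*} - ${T1%.*}) / 10 ))"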
Telemetry
In addition to sharing these metrics via the /metrics endpoint, the MAX container actively sends the metrics to Modular via push telemetry (using OpenTelemetry).
This telemetry is anonymous and helps us quickly identify problems and build better products for you. Without it, we would have to rely solely on user-submitted bug reports, which would severely limit our performance insights.
However, if you don't want to share this data with Modular, you can disable telemetry by setting the MAX_SERVE_DISABLE_TELEMETRY environment variable when you start your MAX container. For example:
docker run \
  --env MAX_SERVE_DISABLE_TELEMETRY=1 \
  modular/max-nvidia-full:latest \
  ...
Deployment and user ID
Again, the telemetry is completely anonymous by default. But if you'd like to share some information to help our team assist you in understanding your deployment performance, you can add some identity information to the telemetry with these environment variables:
- MAX_SERVE_DEPLOYMENT_ID: Your application name.
- MODULAR_USER_ID: Your company name.
For example:
docker run \
  --env MAX_SERVE_DEPLOYMENT_ID='Project name' \
  --env MODULAR_USER_ID='Example Inc.' \
  modular/max-nvidia-full:latest \
  ...
License
The NVIDIA MAX container is released under the NVIDIA Deep Learning Container license.