MAX container
The MAX container is our official Docker container that simplifies the process of deploying a GenAI model to an endpoint with MAX Serve. The container includes the latest version of MAX and integrates with orchestration tools like Kubernetes.
Alternatively, you can experiment with MAX Serve on a local endpoint using the max-pipelines serve command. The result is essentially the same, because the MAX container is just a containerized environment that runs max-pipelines serve to create the endpoint.
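For example, here's a minimal sketch of the equivalent local command, assuming you have the max-pipelines tool installed (the model path mirrors the Docker example below):

max-pipelines serve --model-path modularai/Llama-3.1-8B-Instruct-GGUF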
Get started
First, make sure you have Docker installed.
Then start the container by specifying a model with a Hugging Face repo ID:
docker run --gpus=1 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
docker.modular.com/modular/max-openai-api:nightly \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF
It can take a few minutes to pull the container and then download and compile the model. When the endpoint is ready, you'll see a message like this:
Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
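Once the server is ready, you can send it a test request. Here's a quick sketch using curl; it assumes the container exposes an OpenAI-compatible chat completions route on the mapped port 8000 and registers the model under its Hugging Face repo ID (details not shown on this page):

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'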
To try a different GenAI model, check out our list of models on MAX Builds.
For information about the available container versions/tags, see the Modular Docker Hub repository.
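If you want to pull the image ahead of time (for example, to pre-warm a node before deploying), you can do so explicitly with the same tag used above; this is standard Docker usage rather than a MAX-specific step:

docker pull docker.modular.com/modular/max-openai-api:nightly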
Container options
The docker run command above includes the bare minimum commands and options, but there are other docker options you might consider, plus several options to control features of the MAX Serve endpoint.
Docker options
- --gpus: If your system includes a compatible GPU, you must add the --gpus option in order for the container to access it. It doesn't hurt to include this even if your system doesn't have a GPU compatible with MAX. Currently, MAX supports just one GPU at a time.
- -v: We use the -v option to save a cache of Hugging Face models to your local disk that we can reuse across containers.
- -p: We use the -p option to specify the exposed port for the endpoint.
You also might need some environment variables (set with --env):
- HF_TOKEN: This is required to access gated models on Hugging Face (after your account is granted access). For example:

docker run \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<YOUR_HF_TOKEN>" \
-p 8000:8000 \
docker.modular.com/modular/max-openai-api:nightly \
--model-path mistralai/Mistral-7B-Instruct-v0.2

Learn more about HF_TOKEN and how to create Hugging Face access tokens.
- HF_HUB_ENABLE_HF_TRANSFER: Set this to 1 to enable faster model downloads from Hugging Face. For example:

docker run \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env "HF_HUB_ENABLE_HF_TRANSFER=1" \
docker.modular.com/modular/max-openai-api:nightly \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF

Learn more about HF_HUB_ENABLE_HF_TRANSFER.
MAX Serve options
Following the container name in the docker run command, you must specify a model with --model-path, but there are several other options you can use to configure the MAX Serve behavior. Here are a couple of important ones:
To see all available options, see the max-pipelines page, because the MAX container is basically a wrapper around that tool.
- --model-path: This is required to specify the model you want to deploy. To find other GenAI models that are compatible with MAX, check out our list of models on MAX Builds.
- --max-length: Specifies the maximum length of the text sequence (including the input tokens). We mention it here because it's often necessary to adjust the max length when you have trouble running a large model on a machine with limited memory (see the example below).
For the rest of the MAX Serve options, see the max-pipelines page.
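For example, here's a sketch of how you might pass --max-length when starting the container; the value 2048 is just an illustrative choice, not a recommendation from this guide:

docker run --gpus=1 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
docker.modular.com/modular/max-openai-api:nightly \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF \
--max-length 2048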
Container contents
The MAX container is based on the NVIDIA CUDA Deep Learning Container version 12.5.0 (Ubuntu 22.04 base) and includes the following:
- Ubuntu 22.04
- Python 3.12
- MAX 24.6
- PyTorch 2.4.1
- NumPy
- Hugging Face Transformers
Recommended cloud instances
For best performance and compatibility with the available models on MAX Builds, we recommend that you deploy the MAX container on a cloud instance with a GPU that meets the MAX system requirements.
The following are some cloud-based GPU instances and virtual machines that we recommend.
AWS instances:
- P4d instance family (A100 GPU)
- G5 instance family (A10G GPU)
- G6 instance family (L4 GPU)
- G6e instance family (L40S GPU)
GCP instances:
Azure instances:
- NC_A100_v4-series virtual machine
- NDm_A100_v4-series virtual machine
- ND_A100_v4-series virtual machine
- NVads-A10 v5-series virtual machine
Logs
The MAX container writes logs to stdout, which you can consume and view via your cloud provider's platform (for example, with AWS CloudWatch).
Console log level is INFO by default. You can modify the log level using the MAX_SERVE_LOGS_CONSOLE_LEVEL environment variable. It accepts the following log levels (in order of increasing verbosity): CRITICAL, ERROR, WARNING, INFO, DEBUG. For example:
docker run docker.modular.com/modular/max-openai-api:nightly \
-e MAX_SERVE_LOGS_CONSOLE_LEVEL=DEBUG \
...
For readability, logs default to unstructured text, but you can emit them as structured JSON by adding the MODULAR_STRUCTURED_LOGGING=1 environment variable.
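For example, this follows the same pattern as the log-level example above, just with the structured-logging variable instead:

docker run docker.modular.com/modular/max-openai-api:nightly \
-e MODULAR_STRUCTURED_LOGGING=1 \
...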
Metrics
The MAX container exposes a /metrics endpoint that follows the Prometheus text format. You can scrape the metrics listed below using Prometheus or another collection service.
These are raw metrics and it's up to you to compute the desired time series and aggregations. For example, we provide a count for output tokens (maxserve_num_output_tokens_total), which you can use to calculate the output tokens per second (OTP/s).
Here are all the available metrics:
- maxserve_request_time_milliseconds: Histogram of time spent handling each request (total inference time, or TIT), in milliseconds.
- maxserve_input_processing_time_milliseconds: Histogram of input processing time (IPT), in milliseconds.
- maxserve_output_processing_time_milliseconds: Histogram of output generation time (OGT), in milliseconds.
- maxserve_time_to_first_token_milliseconds: Histogram of time to first token (TTFT), in milliseconds.
- maxserve_num_input_tokens_total: Total number of input tokens processed so far.
- maxserve_num_output_tokens_total: Total number of output tokens processed so far.
- maxserve_request_count_total: Total requests since start.
- maxserve_num_requests_running: Number of requests currently running.
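To take a quick look at the raw metrics without setting up a scraper, you can fetch the endpoint directly; this assumes the container's port 8000 is mapped to localhost as in the docker run examples above:

curl http://localhost:8000/metrics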
Telemetry
In addition to sharing these metrics via the /metrics endpoint, the MAX container actively sends the metrics to Modular via push telemetry (using OpenTelemetry).
This telemetry is anonymous and helps us quickly identify problems and build better products for you. Without it, we would rely solely on user-submitted bug reports, which would severely limit our performance insights.
However, if you don't want to share this data with Modular, you can disable telemetry in your container. To do so, set the MAX_SERVE_DISABLE_TELEMETRY environment variable when you start your MAX container. For example:
docker run docker.modular.com/modular/max-openai-api:nightly \
-e MAX_SERVE_DISABLE_TELEMETRY=1 \
...
Deployment and user ID
Again, the telemetry is completely anonymous by default. But if you'd like to share some information to help our team assist you in understanding your deployment performance, you can add some identity information to the telemetry with these environment variables:
- MAX_SERVE_DEPLOYMENT_ID: Your application name
- MODULAR_USER_ID: Your company name
For example:
docker run docker.modular.com/modular/max-openai-api:nightly \
-e MAX_SERVE_DEPLOYMENT_ID='Project name' \
-e MODULAR_USER_ID='Example Inc.' \
...
License
The MAX container is released under the NVIDIA Deep Learning Container license.