Skip to main content
Log in

MAX container

The MAX container is our official Docker container for convenient MAX deployment. It includes the latest MAX version with GPU support, several AI libraries, and integrates with orchestration tools like Kubernetes.

The MAX container image is available in the Modular Docker Hub repository.

docker run --runtime nvidia --gpus 1 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 8000:8000 \
docker.modular.com/modular/max-openai-api:24.6.0 \
--huggingface-repo-id modularai/llama-3.1
docker run --runtime nvidia --gpus 1 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 8000:8000 \
docker.modular.com/modular/max-openai-api:24.6.0 \
--huggingface-repo-id modularai/llama-3.1

If you are running the MAX container image in a cloud environment, you can optionally add the following argument for faster model download speeds:

docker run docker.modular.com/modular/max-openai-api:24.6.0 \
--env "HF_HUB_ENABLE_HF_TRANSFER=1" \
...
docker run docker.modular.com/modular/max-openai-api:24.6.0 \
--env "HF_HUB_ENABLE_HF_TRANSFER=1" \
...

Container contents

The MAX container is based on the NVIDIA CUDA Deep Learning Container version 12.5.0 base Ubuntu 22.04 and includes the following:

  • Ubuntu 22.04
  • Python 3.12
  • MAX 24.6
  • PyTorch 2.4.1
  • NumPy
  • Hugging Face Transformers

The MAX container is compatible with any cloud instance that meets the MAX system requirements (NVIDIA A100, A10, L4, and L40 GPUs). The following are some cloud-based GPU instances and virtual machines that we recommend.

AWS instances:

  • P4d instance family (A100 GPU)
  • G5 instance family (A10G GPU)
  • G6 instance family (L4 GPU)
  • G6e instance family (L40S GPU)

GCP instances:

  • A2 machine series (A100 GPU)
  • G2 machine series (L4 GPU)

Azure instances:

Logs

The MAX container writes logs to stdout, which you can consume and view via your cloud provider's platform (for example, with AWS CloudWatch).

Console log level is INFO by default. You can modify the log level using the MAX_SERVE_LOGS_CONSOLE_LEVEL environment variable. It accepts the following log levels (in order of increasing verbosity): CRITICAL, ERROR, WARNING, INFO, DEBUG. For example:

docker run docker.modular.com/modular/max-openai-api:24.6.0 \
-e MAX_SERVE_LOGS_CONSOLE_LEVEL=DEBUG \
...
docker run docker.modular.com/modular/max-openai-api:24.6.0 \
-e MAX_SERVE_LOGS_CONSOLE_LEVEL=DEBUG \
...

For readability, logs default to unstructured text, but you can emit them with structured JSON by adding the MODULAR_STRUCTURED_LOGGING=1 environment variable.

Metrics

The MAX container exposes a /metrics endpoint that follows the Prometheus text format. You can scrape the metrics listed below using Prometheus or another collection service.

These are raw metrics and it's up to you to compute the desired time series and aggregations. For example, we provide a count for output tokens (maxserve_num_output_tokens_total), which you can use to calculate the output tokens per second (OTP/s).

Here are all the available metrics:

  • maxserve_request_time_milliseconds: Histogram of time spent handling each request (total inference time, or TIT), in milliseconds.
  • maxserve_input_processing_time_milliseconds: Histogram of input processing time (IPT), in milliseconds.
  • maxserve_output_processing_time_milliseconds: Histogram of output generation time (OGT), in milliseconds.
  • maxserve_time_to_first_token_milliseconds: Histogram of time to first token (TTFT), in milliseconds.
  • maxserve_num_input_tokens_total: Total number of input tokens processed so far.
  • maxserve_num_output_tokens_total: Total number of output tokens processed so far.
  • maxserve_request_count_total: Total requests since start.
  • maxserve_num_requests_running: Number of requests currently running.

Telemetry

In addition to sharing these metrics via the /metrics endpoint, the MAX container actively sends the metrics to Modular via push telemetry (using OpenTelemetry).

This telemetry is anonymous and helps us quickly identify problems and build better products for you. Without this telemetry, we would rely solely on user-submitted bug reports, which are limited and would severely limit our performance insights.

However, if you don't want to share this data with Modular, you can disable telemetry in your container. To disable telemetry, enable the MAX_SERVE_DISABLE_TELEMETRY environment variable when you start your MAX container. For example:

docker run docker.modular.com/modular/max-openai-api:24.6.0 \
-e MAX_SERVE_DISABLE_TELEMETRY=1 \
...
docker run docker.modular.com/modular/max-openai-api:24.6.0 \
-e MAX_SERVE_DISABLE_TELEMETRY=1 \
...

Deployment and user ID

Again, the telemetry is completely anonymous by default. But if you'd like to share some information to help our team assist you in understanding your deployment performance, you can add some identity information to the telemetry with these environment variables:

  • MAX_SERVE_DEPLOYMENT_ID : Your application name
  • MODULAR_USER_ID Your company name

For example:

docker run docker.modular.com/modular/max-openai-api:24.6.0 \
-e MAX_SERVE_DEPLOYMENT_ID='Project name' \
-e MODULAR_USER_ID='Example Inc.' \
...
docker run docker.modular.com/modular/max-openai-api:24.6.0 \
-e MAX_SERVE_DEPLOYMENT_ID='Project name' \
-e MODULAR_USER_ID='Example Inc.' \
...

License

The MAX container is released under the NVIDIA Deep Learning Container license.

Get started