
Benchmark MAX Serve on an NVIDIA A100 GPU

Ehsan M. Kermani
Judy Heflin

Performance optimization is a key challenge in deploying AI inference workloads, especially when balancing factors like accuracy, latency, and cost. In this tutorial, we'll show you how to benchmark MAX Serve on an NVIDIA A100 GPU, using a Python script to evaluate key metrics, including the following:

  • Request throughput
  • Input and output token throughput
  • Time-to-first-token (TTFT)
  • Time per output token (TPOT)
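As a rough illustration of how the last two metrics relate to client-side timestamps (this is not the script's exact implementation), TTFT is the gap between sending a request and receiving its first streamed token, while TPOT averages the gaps between the remaining tokens:

# Illustrative only: deriving TTFT and TPOT from client-side timestamps.
# request_start is when the request was sent; token_times are the arrival
# times of each streamed output token (all in seconds).
def ttft_and_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    ttft = token_times[0] - request_start
    # TPOT excludes the first token, matching the report shown later.
    tpot = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
    return ttft, tpot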

Our script (benchmark_serving.py) is adapted from vLLM with additional features, such as client-side GPU metric collection to ensure consistent and comprehensive performance measurement that's tailored to MAX Serve.
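For context, the client-side GPU stats reported at the end of a run (utilization and memory) are the kind of numbers you can sample with NVML. The following is a minimal sketch using the pynvml bindings, not the actual code in benchmark_serving.py:

# Minimal NVML sketch for client-side GPU stats; illustrative only.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)        # first visible GPU
util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu is a percentage
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # .used / .free in bytes
print(f"GPU Utilization (%): {util.gpu}")
print(f"GPU Memory Used (MiB): {mem.used / 1024**2:.2f}")
print(f"GPU Memory Available (MiB): {mem.free / 1024**2:.2f}")
pynvml.nvmlShutdown()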

Before we start the benchmark script, we'll start an endpoint running Llama 3.1 with MAX Serve. Then we'll use the benchmark_serving.py script to send a bunch of inference requests and measure the performance.

Prerequisites

To get started with this tutorial, you need the following:

  • A system with an NVIDIA A100 GPU and a GPU-enabled Docker installation
  • A Hugging Face access token, exported as the HUGGING_FACE_HUB_TOKEN environment variable
  • Git and the magic CLI (used for the magic shell command below) installed on that system

Set up the environment

From here on, you should be running commands on the system with the NVIDIA GPU. If you haven't already, open a shell to that system now.

Clone the MAX repository, navigate to the benchmarking folder, and install the dependencies in a virtual environment with the following commands:

git clone https://github.com/modularml/max.git

cd max/pipelines/benchmarking

magic shell

Download the ShareGPT dataset, which provides LLM prompts that the benchmarking script will send to the endpoint:

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
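If you want to confirm the download and see what the prompts look like, a few lines of Python will do. This assumes the standard ShareGPT schema, where each record holds a "conversations" list of turns:

# Quick inspection of the ShareGPT dataset (assumes the standard schema).
import json

with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    data = json.load(f)

print(f"{len(data)} conversations")
first_turn = data[0]["conversations"][0]
print(first_turn["from"], ":", first_turn["value"][:200])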

Start MAX Serve

We provide a pre-configured GPU-enabled Docker container that simplifies MAX Serve deployment. For more information, see MAX container.

To pull and run the MAX container that serves Llama 3.1 as an endpoint, run this command:

docker run --rm --gpus=1 \
--ipc=host \
-p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" \
--env "HF_HUB_ENABLE_HF_TRANSFER=1" \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
modular/max-openai-api:24.6.0 \
--huggingface-repo-id modularai/llama-3.1 \
--max-num-steps 10 \
--max-cache-batch-size 248 \
--max-length 2048

You'll know that the server is running when you see the following log:

Uvicorn running on http://0.0.0.0:8000
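Optionally, you can send a single request to confirm the endpoint responds before running the full benchmark. This is a quick check against the OpenAI-compatible /v1/completions route that the benchmark script uses; the exact response fields may differ from what's shown here:

# Optional sanity check against the OpenAI-compatible completions route.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "modularai/llama-3.1",
        "prompt": "Hello, world!",
        "max_tokens": 16,
    },
)
resp.raise_for_status()
print(resp.json())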

Start benchmarking

To benchmark MAX Serve with 500 prompts from the ShareGPTv3 dataset, run this command:

python benchmark_serving.py \
--backend modular \
--base-url http://localhost:8000 \
--endpoint /v1/completions \
--model modularai/llama-3.1 \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 500 \
--collect-gpu-stats

For more information on available arguments, see the MAX benchmarking reference.

Interpret the results

The output should look similar to the following:

============ Serving Benchmark Result ============
Successful requests: 500
Failed requests: 0
Benchmark duration (s): 46.27
Total input tokens: 100895
Total generated tokens: 106511
Request throughput (req/s): 10.81
Input token throughput (tok/s): 2180.51
Output token throughput (tok/s): 2301.89
---------------Time to First Token----------------
Mean TTFT (ms): 15539.31
Median TTFT (ms): 15068.37
P99 TTFT (ms): 33034.17
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 34.23
Median TPOT (ms): 28.47
P99 TPOT (ms): 138.55
---------------Inter-token Latency----------------
Mean ITL (ms): 26.76
Median ITL (ms): 5.42
P99 ITL (ms): 228.45
-------------------Token Stats--------------------
Max input tokens: 933
Max output tokens: 806
Max total tokens: 1570
--------------------GPU Stats---------------------
GPU Utilization (%): 94.74
Peak GPU Memory Used (MiB): 37228.12
GPU Memory Available (MiB): 3216.25
==================================================

For more information about each metric, see the MAX benchmarking key metrics.
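The throughput figures follow directly from the request and token counts divided by the benchmark duration. Using the numbers above (small differences come from the unrounded duration used internally):

# Reproducing the throughput numbers from the report above.
duration_s = 46.27
num_requests, input_tokens, output_tokens = 500, 100895, 106511

print(num_requests / duration_s)    # ~10.81 req/s  (request throughput)
print(input_tokens / duration_s)    # ~2180.6 tok/s (input token throughput)
print(output_tokens / duration_s)   # ~2302.0 tok/s (output token throughput)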

Measure latency with finite request rates

Latency metrics like time-to-first-token (TTFT) and time per output token (TPOT) matter most when the server isn't overloaded. An overloaded server queues requests, which massively inflates latency; the measured values then reflect the queue depth more than the server's actual latency, and larger benchmarks simply produce deeper queues.

If you'd like to vary the size of the queue, you can adjust the request rate with the --request-rate <N> flag. This creates a stochastic request load with an average rate of N requests per second.
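For example, the following repeats the earlier run but paces requests at an average of 10 per second:

python benchmark_serving.py \
--backend modular \
--base-url http://localhost:8000 \
--endpoint /v1/completions \
--model modularai/llama-3.1 \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 500 \
--request-rate 10 \
--collect-gpu-stats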

Comparing to alternatives

You can run the benchmarking script using the Modular, vLLM, or TensorRT-LLM backends to compare performance with alternative LLM serving frameworks. When using the TensorRT-LLM backend, be sure to change the --endpoint to /v2/models/ensemble/generate_stream. MAX achieves competitive throughput on most workloads and will further improve with upcoming optimizations.
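As a rough illustration, a TensorRT-LLM comparison run might look like the command below. Note that the backend identifier, port, and model value shown here are assumptions (check the MAX benchmarking reference for the exact names); only the endpoint path comes from the note above:

python benchmark_serving.py \
--backend trt-llm \
--base-url http://localhost:8000 \
--endpoint /v2/models/ensemble/generate_stream \
--model modularai/llama-3.1 \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 500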

Next steps

Now that you have detailed benchmarking results for Llama 3.1 on MAX Serve using an NVIDIA A100 GPU, here are some other topics to explore next:

To read more about our performance methodology, check out our blog post, MAX GPU: State of the Art Throughput on a New GenAI platform.

You can also share your experience on the Modular Forum and in our Discord Community. Be sure to stay up to date with all the performance improvements coming soon by signing up for our newsletter.
