
Benchmark MAX on NVIDIA or AMD GPUs
Performance optimization is a key challenge in deploying AI inference
workloads, especially when balancing factors like accuracy, latency, and cost.
In this tutorial, we'll show you how to benchmark a Gemma 3 endpoint using the max benchmark command. This tool reports key metrics for evaluating the performance of the model server, including:
- Request throughput
- Input and output token throughput
- Time-to-first-token (TTFT)
- Time per output token (TPOT)
Our benchmarking script is adapted from vLLM, with additional features such as client-side GPU metric collection, to provide consistent and comprehensive performance measurement tailored to MAX. You can see the benchmark script source here.
System requirements:
- Linux (or WSL on Windows)
- NVIDIA or AMD GPU
- Docker
Get access to the model
From here on, you should be running commands on the system with your GPU. If you haven't already, open a shell to that system now.
You'll first need to authorize your Hugging Face account to access the Gemma model:
- Obtain a Hugging Face access token and set it as an environment variable:
  export HF_TOKEN="hf_..."
- Agree to the Gemma 3 license on Hugging Face.
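Optionally, you can verify that the token and license grant work before starting the container. This is a minimal check, assuming you have the huggingface_hub CLI installed (for example, via pip install -U huggingface_hub):

# Confirm the token is valid
huggingface-cli whoami
# Confirm you can access the gated Gemma 3 repo (downloads only the small config file)
huggingface-cli download google/gemma-3-27b-it config.json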
Start the model endpoint
We provide a pre-configured, GPU-enabled Docker container that simplifies the process of deploying an endpoint with MAX. For more information, see MAX container.
Use this command to pull the MAX container and start the model endpoint:
NVIDIA:

docker run --rm --gpus=all \
  --ipc=host \
  -p 8000:8000 \
  --env "HF_TOKEN=${HF_TOKEN}" \
  --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  modular/max-nvidia-full:latest \
  --model-path google/gemma-3-27b-it

AMD:

docker run \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add video \
  --ipc=host \
  -p 8000:8000 \
  --env "HF_TOKEN=${HF_TOKEN}" \
  --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  modular/max-amd:latest \
  --model-path google/gemma-3-27b-it
If you want to try a different model, see our model repository.
The server is running when you see the following message in the terminal (note that the container prints logs in JSON format by default):
🚀 Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
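Before benchmarking, you can optionally confirm the endpoint responds by sending a quick request from another terminal. This is a minimal sketch that assumes the default OpenAI-compatible chat completions API on port 8000, matching the settings used above:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-27b-it",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'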
Start benchmarking
Open a second terminal and install the modular package to get the max CLI tool we'll use to perform benchmarking.
Set up your environment
Choose one of the following package managers: pixi, uv, pip, or conda.

pixi:

- If you don't have it, install pixi:
  curl -fsSL https://pixi.sh/install.sh | sh
  Then restart your terminal for the changes to take effect.
- Create a project:
  pixi init max-benchmark \
    -c https://conda.modular.com/max-nightly/ -c conda-forge \
    && cd max-benchmark
- Install the modular conda package (nightly or stable):
  Nightly:
  pixi add modular
  Stable:
  pixi add "modular=25.6"
- Start the virtual environment:
  pixi shell

uv:

- If you don't have it, install uv:
  curl -LsSf https://astral.sh/uv/install.sh | sh
  Then restart your terminal to make uv accessible.
- Create a project:
  uv init max-benchmark && cd max-benchmark
- Create and start a virtual environment:
  uv venv && source .venv/bin/activate
- Install the modular Python package (nightly or stable):
  Nightly:
  uv pip install modular \
    --index-url https://dl.modular.com/public/nightly/python/simple/ \
    --prerelease allow
  Stable:
  uv pip install modular \
    --extra-index-url https://modular.gateway.scarf.sh/simple/

pip:

- Create a project folder:
  mkdir max-benchmark && cd max-benchmark
- Create and activate a virtual environment:
  python3 -m venv .venv/max-benchmark \
    && source .venv/max-benchmark/bin/activate
- Install the modular Python package (nightly or stable):
  Nightly:
  pip install --pre modular \
    --index-url https://dl.modular.com/public/nightly/python/simple/
  Stable:
  pip install modular \
    --extra-index-url https://modular.gateway.scarf.sh/simple/

conda:

- If you don't have it, install conda. A common choice is with brew:
  brew install miniconda
- Initialize conda for shell interaction:
  conda init
  If you're on a Mac, instead use:
  conda init zsh
  Then restart your terminal for the changes to take effect.
- Create a project:
  conda create -n max-benchmark
- Start the virtual environment:
  conda activate max-benchmark
- Install the modular conda package (nightly or stable):
  Nightly:
  conda install -c conda-forge -c https://conda.modular.com/max-nightly/ modular
  Stable:
  conda install -c conda-forge -c https://conda.modular.com/max/ modular
Benchmark the model
To benchmark MAX with the sonnet
dataset, use this command:
max benchmark \
--model google/gemma-3-27b-it \
--backend modular \
--endpoint /v1/chat/completions \
--dataset-name sonnet \
--num-prompts 500 \
--sonnet-input-len 550 \
--output-lengths 256 \
--sonnet-prefix-len 200
When you want to save your own benchmark configurations, you can define them in
a YAML file and pass it to the --config-file
option. For example, copy our
gemma-3-27b-sonnet-decode-heavy-prefix200.yaml
file from GitHub, and you can benchmark the same model with this command:
max benchmark --config-file gemma-3-27b-sonnet-decode-heavy-prefix200.yaml
For more information, including other datasets and configuration options, see
the max benchmark
documentation.
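You can also print the full list of flags directly from the CLI, which is handy for checking the dataset names and option defaults available in your installed version:

max benchmark --help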
Use your own dataset
The command above uses the sonnet
dataset from Hugging Face, but you can
also provide a path to your own dataset.
For example, you can download the ShareGPT dataset with this command:
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
You can then use the local dataset with the --dataset-path
argument:
max benchmark \
  ... \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json
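For instance, a complete command for the ShareGPT file might look like the following. Treat this as a sketch: it assumes the sharegpt dataset name is supported (as in the upstream vLLM benchmark script) and reuses the server settings from earlier, so adjust the flags for your setup.

max benchmark \
  --model google/gemma-3-27b-it \
  --backend modular \
  --endpoint /v1/chat/completions \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 500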
Interpret the results
Of course, your results depend on your hardware, but the structure of the output should look like this:
============ Serving Benchmark Result ============
Successful requests: 50
Failed requests: 0
Benchmark duration (s): 25.27
Total input tokens: 12415
Total generated tokens: 11010
Total nonempty serving response chunks: 11010
Input request rate (req/s): inf
Request throughput (req/s): 1.97837
------------Client Experience Metrics-------------
Max Concurrency: 50
Mean input token throughput (tok/s): 282.37
Std input token throughput (tok/s): 304.38
Median input token throughput (tok/s): 140.81
P90 input token throughput (tok/s): 9.76
P95 input token throughput (tok/s): 7.44
P99 input token throughput (tok/s): 4.94
Mean output token throughput (tok/s): 27.31
Std output token throughput (tok/s): 8.08
Median output token throughput (tok/s): 30.64
P90 output token throughput (tok/s): 12.84
P95 output token throughput (tok/s): 9.11
P99 output token throughput (tok/s): 4.71
---------------Time to First Token----------------
Mean TTFT (ms): 860.54
Std TTFT (ms): 228.57
Median TTFT (ms): 809.41
P90 TTFT (ms): 1214.68
P95 TTFT (ms): 1215.34
P99 TTFT (ms): 1215.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 46.72
Std TPOT (ms): 39.77
Median TPOT (ms): 32.63
P90 TPOT (ms): 78.24
P95 TPOT (ms): 111.87
P99 TPOT (ms): 216.31
---------------Inter-token Latency----------------
Mean ITL (ms): 31.16
Std ITL (ms): 91.79
Median ITL (ms): 1.04
P90 ITL (ms): 176.93
P95 ITL (ms): 272.52
P99 ITL (ms): 276.72
-------------Per-Request E2E Latency--------------
Mean Request Latency (ms): 7694.01
Std Request Latency (ms): 6284.40
Median Request Latency (ms): 5667.19
P90 Request Latency (ms): 16636.07
P95 Request Latency (ms): 21380.10
P99 Request Latency (ms): 25251.18
For more information about each metric, see the MAX benchmarking key metrics.
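As a quick sanity check, the headline throughput follows directly from the counts at the top of the report: 50 successful requests over a 25.27-second run gives 50 / 25.27 ≈ 1.98 req/s, which matches the reported request throughput.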
Measure latency with finite request rates
Latency metrics like time-to-first-token (TTFT) and time per output token (TPOT) matter most when the server isn't overloaded. An overloaded server queues requests, which inflates latency in a way that depends more on the size of the benchmark than on the server's actual responsiveness: benchmarks with a larger number of prompts produce a deeper queue.
If you'd like to vary the size of the queue, you can adjust the request rate
with the --request-rate
flag. This creates a stochastic request load with
an average rate of N
requests per second.
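For example, the following repeats the sonnet benchmark but spreads requests out at an average of 5 per second rather than sending them all at once (the rate of 5 is only an illustration; pick a value your hardware can sustain):

max benchmark \
  --model google/gemma-3-27b-it \
  --backend modular \
  --endpoint /v1/chat/completions \
  --dataset-name sonnet \
  --num-prompts 500 \
  --sonnet-input-len 550 \
  --output-lengths 256 \
  --sonnet-prefix-len 200 \
  --request-rate 5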
Comparing to alternatives
You can run the benchmarking script using the Modular, vLLM, or TensorRT-LLM backends to compare performance with alternative LLM serving frameworks. Before running the benchmark, make sure you set up and launch the corresponding inference engine so the script can send requests to it.
When using the TensorRT-LLM backend, be sure to change the --endpoint to /v2/models/ensemble/generate_stream. MAX achieves competitive throughput on most workloads and will further improve with upcoming optimizations.
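For example, to benchmark a vLLM server you've started on the same port, the command mirrors the one above with a different backend. This is a sketch that assumes the backend is named vllm, following the upstream vLLM benchmark script's naming; check max benchmark --help for the exact values your version accepts.

max benchmark \
  --model google/gemma-3-27b-it \
  --backend vllm \
  --endpoint /v1/chat/completions \
  --dataset-name sonnet \
  --num-prompts 500 \
  --sonnet-input-len 550 \
  --output-lengths 256 \
  --sonnet-prefix-len 200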
Next steps
Now that you have detailed benchmarking results for Gemma 3 on MAX using an NVIDIA or AMD GPU, here are some other topics to explore next:
- Deploy Llama 3 on GPU with MAX
- Deploy Llama 3 on GPU-powered Kubernetes clusters using MAX and NVIDIA GPUs
To read more about our performance methodology, check out our blog post, MAX GPU: State of the Art Throughput on a New GenAI platform.
You can also share your experience on the Modular Forum and in our Discord Community. Be sure to stay up to date with all the performance improvements coming soon by signing up for our newsletter.