
Benchmark MAX on NVIDIA or AMD GPUs

Ehsan M. Kermani
Judy Heflin

Performance optimization is a key challenge in deploying AI inference workloads, especially when balancing factors like accuracy, latency, and cost. In this tutorial, we'll show you how to benchmark a Gemma 3 endpoint using the max benchmark command. This tool provides key metrics to evaluate the performance of the model server, including:

  • Request throughput
  • Input and output token throughput
  • Time-to-first-token (TTFT)
  • Time per output token (TPOT)

Our benchmarking script is adapted from vLLM, with additional features such as client-side GPU metric collection for consistent, comprehensive measurements tailored to MAX. The benchmark script source is available on GitHub.

System requirements: a Linux system with a supported NVIDIA or AMD GPU, Docker installed, and a Hugging Face account.

Get access to the model

From here on, you should be running commands on the system with your GPU. If you haven't already, open a shell to that system now.

You'll first need to authorize your Hugging Face account to access the Gemma model:

  1. Obtain a Hugging Face access token and set it as an environment variable:

    export HF_TOKEN="hf_..."
  2. Agree to the Gemma 3 license on Hugging Face.
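
Before starting the server, you can optionally confirm that the token works. The check below is a minimal sketch that calls Hugging Face's whoami API with curl; it assumes HF_TOKEN is exported in your current shell:

# Query the Hugging Face whoami endpoint with your token;
# a successful response includes your account name.
curl -s -H "Authorization: Bearer ${HF_TOKEN}" \
  https://huggingface.co/api/whoami-v2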

Start the model endpoint

We provide a pre-configured, GPU-enabled Docker container that simplifies deploying an endpoint with MAX. For more information, see MAX container.

Use this command to pull the MAX container and start the model endpoint:

docker run --rm --gpus=all \
  --ipc=host \
  -p 8000:8000 \
  --env "HF_TOKEN=${HF_TOKEN}" \
  --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  modular/max-nvidia-full:latest \
  --model-path google/gemma-3-27b-it

If you want to try a different model, see our model repository.
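
The command above pulls the NVIDIA image. If you're on an AMD GPU, the container is started with ROCm device mappings instead of --gpus=all. The sketch below is an assumption based on the NVIDIA command: the AMD image name (modular/max-amd-full) and exact flags may differ, so check the MAX container documentation for the supported AMD image before using it.

# Hypothetical AMD variant: exposes the ROCm devices (/dev/kfd, /dev/dri)
# instead of using --gpus=all, and swaps in an assumed AMD image name.
docker run --rm \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --ipc=host \
  -p 8000:8000 \
  --env "HF_TOKEN=${HF_TOKEN}" \
  --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  modular/max-amd-full:latest \
  --model-path google/gemma-3-27b-it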

The server is running when you see the following message in the terminal (note that logs may be emitted as JSON by default, so this message can appear inside a JSON log entry):

🚀 Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
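
Optionally, from another terminal, you can send a single request to confirm the endpoint responds before you start benchmarking. This is a quick smoke test against the OpenAI-compatible chat completions route the server exposes:

# Send one short chat completion request to the local endpoint.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-27b-it",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'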

Start benchmarking

Open a second terminal and install the modular package, which provides the max CLI tool we'll use for benchmarking.

Set up your environment

  1. If you don't have it, install pixi:
    curl -fsSL https://pixi.sh/install.sh | sh

    Then restart your terminal for the changes to take effect.

  2. Create a project:
    pixi init max-benchmark \
      -c https://conda.modular.com/max-nightly/ -c conda-forge \
      && cd max-benchmark
  3. Install the modular conda package:
    pixi add modular
  4. Start the virtual environment:
    pixi shell
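
Inside the pixi shell, you can confirm that the max CLI is available and see the supported benchmark options by printing the command's help output (this also doubles as a check that the modular package installed correctly):

# List the available benchmarking options.
max benchmark --help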

Benchmark the model

To benchmark MAX with the sonnet dataset, use this command:

max benchmark \
  --model google/gemma-3-27b-it \
  --backend modular \
  --endpoint /v1/chat/completions \
  --dataset-name sonnet \
  --num-prompts 500 \
  --sonnet-input-len 550 \
  --output-lengths 256 \
  --sonnet-prefix-len 200

To save your own benchmark configurations, you can define them in a YAML file and pass it to the --config-file option. For example, copy our gemma-3-27b-sonnet-decode-heavy-prefix200.yaml file from GitHub, then benchmark the same model with this command:

max benchmark --config-file gemma-3-27b-sonnet-decode-heavy-prefix200.yaml
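
If you'd rather write a config from scratch, the sketch below creates a minimal file. The key names here are an assumption (written to mirror the CLI flags used above); treat the gemma-3-27b-sonnet-decode-heavy-prefix200.yaml reference file from GitHub as the authoritative schema.

# Hypothetical minimal config; key names assume a one-to-one mapping
# to the CLI flags shown above. Compare against the reference YAML.
cat > my-sonnet-benchmark.yaml <<'EOF'
model: google/gemma-3-27b-it
backend: modular
endpoint: /v1/chat/completions
dataset_name: sonnet
num_prompts: 500
sonnet_input_len: 550
output_lengths: 256
sonnet_prefix_len: 200
EOF

max benchmark --config-file my-sonnet-benchmark.yaml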

For more information, including other datasets and configuration options, see the max benchmark documentation.

Use your own dataset

The command above uses the sonnet dataset from Hugging Face, but you can also provide a path to your own dataset.

For example, you can download the ShareGPT dataset with this command:

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

You can then use the local dataset with the --dataset-path argument:

max benchmark \
  ...
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
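
For example, a complete invocation with the downloaded file might look like the following. Note that --dataset-name sharegpt is an assumption about how the tool selects the prompt format for this file; verify the exact option values with max benchmark --help or the max benchmark documentation.

# Hypothetical full command for a local ShareGPT-style dataset;
# the "sharegpt" dataset name is an assumption, so verify it with --help.
max benchmark \
  --model google/gemma-3-27b-it \
  --backend modular \
  --endpoint /v1/chat/completions \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 500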

Interpret the results

Of course, your results depend on your hardware, but the structure of the output should look like this:

============ Serving Benchmark Result ============
Successful requests:                     50
Failed requests:                         0
Benchmark duration (s):                  25.27
Total input tokens:                      12415
Total generated tokens:                  11010
Total nonempty serving response chunks:  11010
Input request rate (req/s):              inf
Request throughput (req/s):              1.97837
------------Client Experience Metrics-------------
Max Concurrency:                         50
Mean input token throughput (tok/s):     282.37
Std input token throughput (tok/s):      304.38
Median input token throughput (tok/s):   140.81
P90 input token throughput (tok/s):      9.76
P95 input token throughput (tok/s):      7.44
P99 input token throughput (tok/s):      4.94
Mean output token throughput (tok/s):    27.31
Std output token throughput (tok/s):     8.08
Median output token throughput (tok/s):  30.64
P90 output token throughput (tok/s):     12.84
P95 output token throughput (tok/s):     9.11
P99 output token throughput (tok/s):     4.71
---------------Time to First Token----------------
Mean TTFT (ms):                          860.54
Std TTFT (ms):                           228.57
Median TTFT (ms):                        809.41
P90 TTFT (ms):                           1214.68
P95 TTFT (ms):                           1215.34
P99 TTFT (ms):                           1215.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          46.72
Std TPOT (ms):                           39.77
Median TPOT (ms):                        32.63
P90 TPOT (ms):                           78.24
P95 TPOT (ms):                           111.87
P99 TPOT (ms):                           216.31
---------------Inter-token Latency----------------
Mean ITL (ms):                           31.16
Std ITL (ms):                            91.79
Median ITL (ms):                         1.04
P90 ITL (ms):                            176.93
P95 ITL (ms):                            272.52
P99 ITL (ms):                            276.72
-------------Per-Request E2E Latency--------------
Mean Request Latency (ms):               7694.01
Std Request Latency (ms):                6284.40
Median Request Latency (ms):             5667.19
P90 Request Latency (ms):                16636.07
P95 Request Latency (ms):                21380.10
P99 Request Latency (ms):                25251.18
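
As a quick sanity check, some of these figures can be derived from one another. Request throughput is simply successful requests divided by benchmark duration: 50 / 25.27 s ≈ 1.98 req/s, matching the reported value. Likewise, total generated tokens divided by duration (11010 / 25.27 s ≈ 436 tok/s) gives the aggregate generation rate across all in-flight requests over the run.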

For more information about each metric, see the MAX benchmarking key metrics.

Measure latency with finite request rates

Latency metrics like time-to-first-token (TTFT) and time per output token (TPOT) are most meaningful when the server isn't overloaded. An overloaded server queues requests, which inflates latency in a way that reflects the size of the benchmark more than the server's actual responsiveness: the more prompts a benchmark sends at once, the deeper the queue.

If you'd like to vary the size of the queue, you can adjust the request rate with the --request-rate flag. This creates a stochastic request load with an average rate of N requests per second.
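
For example, the command below repeats the sonnet benchmark but paces request arrivals at an average of 5 per second instead of sending them all at once (the all-at-once behavior corresponds to the "Input request rate (req/s): inf" line in the sample output above):

# Same sonnet workload, but pace arrivals at ~5 requests per second
# so latency metrics reflect the server rather than a deep queue.
max benchmark \
  --model google/gemma-3-27b-it \
  --backend modular \
  --endpoint /v1/chat/completions \
  --dataset-name sonnet \
  --num-prompts 500 \
  --sonnet-input-len 550 \
  --output-lengths 256 \
  --sonnet-prefix-len 200 \
  --request-rate 5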

Compare to alternatives

You can run the benchmarking script using the Modular, vLLM, or TensorRT-LLM backends to compare performance with alternative LLM serving frameworks. Before running the benchmark, make sure you set up and launch the corresponding inference engine so the script can send requests to it.

When using the TensorRT-LLM backend, be sure to change the --endpoint to /v2/models/ensemble/generate_stream. MAX achieves competitive throughput on most workloads and will further improve with upcoming optimizations.
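
For example, to benchmark a vLLM or TensorRT-LLM server running on the same port, you'd switch the --backend value (and, for TensorRT-LLM, the endpoint). The backend identifiers below (vllm, trt-llm) are assumptions; confirm the exact values in the max benchmark documentation.

# Hypothetical vLLM comparison run; the backend identifier is an assumption.
max benchmark \
  --model google/gemma-3-27b-it \
  --backend vllm \
  --endpoint /v1/chat/completions \
  --dataset-name sonnet \
  --num-prompts 500

# Hypothetical TensorRT-LLM comparison run; note the different endpoint.
max benchmark \
  --model google/gemma-3-27b-it \
  --backend trt-llm \
  --endpoint /v2/models/ensemble/generate_stream \
  --dataset-name sonnet \
  --num-prompts 500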

Next steps

Now that you have detailed benchmarking results for Gemma 3 on MAX using an NVIDIA or AMD GPU, here are some other topics to explore next:

To read more about our performance methodology, check out our blog post, MAX GPU: State of the Art Throughput on a New GenAI platform.

You can also share your experience on the Modular Forum and in our Discord Community. Be sure to stay up to date with all the performance improvements coming soon by signing up for our newsletter.
