
Benchmark MAX Serve on an NVIDIA A100 GPU

Ehsan M. Kermani
Judy Heflin

Performance optimization is a key challenge in deploying AI inference workloads, especially when balancing factors like accuracy, latency, and cost. In this tutorial, we'll show you how to benchmark MAX Serve on an NVIDIA A100 GPU, using a Python script to evaluate key metrics, including the following:

  • Request throughput
  • Input and output token throughput
  • Time-to-first-token (TTFT)
  • Time per output token (TPOT)
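As a rough illustration of how the last two metrics relate to client-side timestamps (this is not the script's exact implementation), TTFT is the gap between sending a request and receiving its first streamed token, while TPOT averages the gaps between the remaining tokens:

# Illustrative only: deriving TTFT and TPOT from client-side timestamps.
# request_start is when the request was sent; token_times are the arrival
# times of each streamed output token (all in seconds).
def ttft_and_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    ttft = token_times[0] - request_start
    # TPOT excludes the first token, matching the report shown later.
    tpot = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
    return ttft, tpot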

Our script (benchmark_serving.py) is adapted from vLLM with additional features, such as client-side GPU metric collection to ensure consistent and comprehensive performance measurement that's tailored to MAX Serve.
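For context, the client-side GPU stats reported at the end of a run (utilization and memory) are the kind of numbers you can sample with NVML. The following is a minimal sketch using the pynvml bindings, not the actual code in benchmark_serving.py:

# Minimal NVML sketch for client-side GPU stats; illustrative only.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)        # first visible GPU
util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu is a percentage
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # .used / .free in bytes
print(f"GPU Utilization (%): {util.gpu}")
print(f"GPU Memory Used (MiB): {mem.used / 1024**2:.2f}")
print(f"GPU Memory Available (MiB): {mem.free / 1024**2:.2f}")
pynvml.nvmlShutdown()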

Before we start the benchmark script, we'll start an endpoint running Llama 3.1 with MAX Serve. Then we'll use the benchmark_serving.py script to send a bunch of inference requests and measure the performance.

Prerequisites

To get started with this tutorial, you need the following:

  • A system with an NVIDIA A100 GPU and a GPU-enabled Docker installation
  • A Hugging Face access token, exported as the HUGGING_FACE_HUB_TOKEN environment variable
  • Git and the magic CLI (used for the magic shell command below) installed on that system

Set up the environment

From here on, you should be running commands on the system with the NVIDIA GPU. If you haven't already, open a shell to that system now.

Clone the MAX repository, navigate to the benchmarking folder, and install the dependencies in a virtual environment with the following commands:

git clone https://github.com/modularml/max.git

cd max/pipelines/benchmarking

magic shell

Download the ShareGPT dataset, which provides LLM prompts that the benchmarking script will send to the endpoint:

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
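If you want to confirm the download and see what the prompts look like, a few lines of Python will do. This assumes the standard ShareGPT schema, where each record holds a "conversations" list of turns:

# Quick inspection of the ShareGPT dataset (assumes the standard schema).
import json

with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    data = json.load(f)

print(f"{len(data)} conversations")
first_turn = data[0]["conversations"][0]
print(first_turn["from"], ":", first_turn["value"][:200])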

Start MAX Serve

We provide a pre-configured GPU-enabled Docker container that simplifies MAX Serve deployment. For more information, see MAX container.

To pull and run the MAX container that serves Llama 3.1 as an endpoint, run this command:

docker run --rm --gpus=1 \
--ipc=host \
-p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" \
--env "HF_HUB_ENABLE_HF_TRANSFER=1" \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
modular/max-openai-api:24.6.0 \
--huggingface-repo-id modularai/llama-3.1 \
--max-num-steps 10 \
--max-cache-batch-size 248 \
--max-length 2048

You'll know that the server is running when you see the following log:

Uvicorn running on http://0.0.0.0:8000
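Optionally, you can send a single request to confirm the endpoint responds before running the full benchmark. This is a quick check against the OpenAI-compatible /v1/completions route that the benchmark script uses; the exact response fields may differ from what's shown here:

# Optional sanity check against the OpenAI-compatible completions route.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "modularai/llama-3.1",
        "prompt": "Hello, world!",
        "max_tokens": 16,
    },
)
resp.raise_for_status()
print(resp.json())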

Start benchmarking

To benchmark MAX Serve with 500 prompts from the ShareGPTv3 dataset, run this command:

python benchmark_serving.py \
--backend modular \
--base-url http://localhost:8000 \
--endpoint /v1/completions \
--model modularai/llama-3.1 \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 500 \
--collect-gpu-stats

For more information on available arguments, see the MAX benchmarking reference.

Interpret the results

The output should look similar to the following:

============ Serving Benchmark Result ============
Successful requests: 500
Failed requests: 0
Benchmark duration (s): 46.27
Total input tokens: 100895
Total generated tokens: 106511
Request throughput (req/s): 10.81
Input token throughput (tok/s): 2180.51
Output token throughput (tok/s): 2301.89
---------------Time to First Token----------------
Mean TTFT (ms): 15539.31
Median TTFT (ms): 15068.37
P99 TTFT (ms): 33034.17
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 34.23
Median TPOT (ms): 28.47
P99 TPOT (ms): 138.55
---------------Inter-token Latency----------------
Mean ITL (ms): 26.76
Median ITL (ms): 5.42
P99 ITL (ms): 228.45
-------------------Token Stats--------------------
Max input tokens: 933
Max output tokens: 806
Max total tokens: 1570
--------------------GPU Stats---------------------
GPU Utilization (%): 94.74
Peak GPU Memory Used (MiB): 37228.12
GPU Memory Available (MiB): 3216.25
==================================================

For more information about each metric, see the MAX benchmarking key metrics.
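The throughput figures follow directly from the request and token counts divided by the benchmark duration. Using the numbers above (small differences come from the unrounded duration used internally):

# Reproducing the throughput numbers from the report above.
duration_s = 46.27
num_requests, input_tokens, output_tokens = 500, 100895, 106511

print(num_requests / duration_s)    # ~10.81 req/s  (request throughput)
print(input_tokens / duration_s)    # ~2180.6 tok/s (input token throughput)
print(output_tokens / duration_s)   # ~2302.0 tok/s (output token throughput)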

Measure latency with finite request rates

Latency metrics like time-to-first-token (TTFT) and time per output token (TPOT) matter most when the server isn't overloaded. An overloaded server queues requests, which massively inflates latency; the measured values then reflect the queue depth more than the server's actual latency, and larger benchmarks simply produce deeper queues.

If you'd like to vary the size of the queue, you can adjust the request rate with the --request-rate <N> flag. This creates a stochastic request load with an average rate of N requests per second.
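For example, the following repeats the earlier run but paces requests at an average of 10 per second:

python benchmark_serving.py \
--backend modular \
--base-url http://localhost:8000 \
--endpoint /v1/completions \
--model modularai/llama-3.1 \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 500 \
--request-rate 10 \
--collect-gpu-stats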

Comparing to alternatives

You can run the benchmarking script using the Modular, vLLM, or TensorRT-LLM backends to compare performance with alternative LLM serving frameworks. When using the TensorRT-LLM backend, be sure to change the --endpoint to /v2/models/ensemble/generate_stream. MAX achieves competitive throughput on most workloads and will further improve with upcoming optimizations.
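As a rough illustration, a TensorRT-LLM comparison run might look like the command below. Note that the backend identifier, port, and model value shown here are assumptions (check the MAX benchmarking reference for the exact names); only the endpoint path comes from the note above:

python benchmark_serving.py \
--backend trt-llm \
--base-url http://localhost:8000 \
--endpoint /v2/models/ensemble/generate_stream \
--model modularai/llama-3.1 \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 500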

Next steps

Now that you have detailed benchmarking results for Llama 3.1 on MAX Serve using an NVIDIA A100 GPU, here are some other topics to explore next:

To read more about our performance methodology, check out our blog post, MAX GPU: State of the Art Throughput on a New GenAI platform.

You can also share your experience on the Modular Forum and in our Discord Community. Be sure to stay up to date with all the performance improvements coming soon by signing up for our newsletter.
