max benchmark
Runs comprehensive benchmark tests on an active model server to measure performance metrics including throughput, latency, and resource utilization. For a complete walkthrough, see the tutorial to benchmark MAX on a GPU.
Before running this command, make sure the model server is running via `max serve`.
For example, here's how to benchmark the google/gemma-3-27b-it model
already running on localhost:
```sh
max benchmark \
  --model google/gemma-3-27b-it \
  --backend modular \
  --endpoint /v1/chat/completions \
  --num-prompts 50 \
  --dataset-name arxiv-summarization \
  --arxiv-summarization-input-len 12000 \
  --max-output-len 1200
```

When it's done, you'll see the results printed to the terminal.
By default, it sends inference requests to localhost:8000, but you can change
that with the --host and --port arguments.
If you want to save the results, add the `--save-result` option, which creates a JSON file in the local path with the following naming convention:

```
{backend}-{request_rate}qps-{model_name}-{timestamp}.json
```

You can specify a different file name with `--result-filename` and change the directory with `--result-dir`.
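For example, the following run combines these options to benchmark a server on another machine and save the results. This is only a sketch: the host, port, directory, and file names are placeholders, not values from the MAX docs.

```sh
# The host, port, directory, and file name below are placeholders.
max benchmark \
  --model google/gemma-3-27b-it \
  --backend modular \
  --endpoint /v1/chat/completions \
  --host 10.0.0.12 \
  --port 8000 \
  --num-prompts 50 \
  --save-result \
  --result-dir ./benchmark-results \
  --result-filename gemma-3-27b-run1.json
```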
Instead of passing all these benchmark options on the command line, you can pass a configuration file. See Configuration file below.
Usage
```sh
max benchmark [OPTIONS]
```

Options
This list of options is not exhaustive. For more information, run max benchmark --help or see the benchmarking script source
code.
- Backend configuration:

  - `--backend`: Choose from `modular` (MAX `v1/completions` endpoint), `modular-chat` (MAX `v1/chat/completions` endpoint), or `vllm` (vLLM)
  - `--model`: Hugging Face model ID or local path
- Load generation:

  - `--num-prompts`: Number of prompts to process (int, default: `500`)
  - `--request-rate`: Request rate in requests/second (int, default: `inf`)
  - `--seed`: The random seed used to sample the dataset (int, default: `0`)
- Serving options:

  - `--base-url`: Base URL of the API service
  - `--endpoint`: Specific API endpoint (`/v1/completions` or `/v1/chat/completions`)
  - `--tokenizer`: Hugging Face tokenizer to use (can be different from the model)
  - `--dataset-name`: (Required; default: `sharegpt`) Specifies which type of benchmark dataset to use. This determines the dataset class and processing logic. See Datasets below.
  - `--dataset-path`: Path to a local dataset file that overrides the default dataset source for the specified `dataset-name`. The file format must match the expected format for the specified `dataset-name` (such as JSON for `axolotl`, JSONL for `obfuscated-conversations`, plain text for `sonnet`).
- Additional options:

  - `--collect-gpu-stats`: Report GPU utilization and memory consumption for both NVIDIA and AMD GPUs. Only works when running `max benchmark` on the same instance as the server.
  - `--save-result`: Saves results to a local JSON file.
- LoRA benchmarking options:

  The benchmark script supports testing LoRA adapter performance for supported models and target modules (see the example invocation after this list):

  - `--num-loras`: Number of LoRA adapters to test. If greater than 0, test LoRA adapters will be generated.
  - `--lora-rank`: LoRA rank (the `r` parameter) for generated adapters. Controls the dimension of the low-rank decomposition.
  - `--lora-output-dir`: Directory to save generated LoRA adapters. Defaults to `/tmp/loras`.
  - `--lora-paths`: Paths to existing LoRA adapters to use instead of generating new ones.
  - `--lora-request-ratio`: Ratio of requests to send with LoRA adapters (0.0-1.0). For example, 0.5 means 50% of requests use LoRA.
  - `--max-num-loras`: Maximum number of LoRA adapters cached on GPU. This should match the server configuration.
  - `--lora-target-modules`: List of module names to apply LoRA to when generating random test adapters (for example, `q_proj`, `k_proj`, `v_proj`, `o_proj`). Only used when `--num-loras` > 0 and generating adapters (not when using existing `--lora-paths`).

- `--config-file`: Path to a YAML file containing benchmark configuration. The file contains key-value pairs for all your benchmark configurations, as a replacement for individual command-line options. See Configuration file below.
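For example, a LoRA benchmarking run might look like the following. This is only a sketch: the flag values are illustrative, and it assumes the server was started with LoRA support and a matching maximum number of cached adapters.

```sh
# Illustrative values only; adjust to match your server's LoRA configuration.
max benchmark \
  --model google/gemma-3-27b-it \
  --backend modular \
  --num-prompts 100 \
  --num-loras 4 \
  --lora-rank 16 \
  --lora-request-ratio 0.5 \
  --max-num-loras 4
```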
Datasets
The --dataset-name option supports several dataset names/formats you can
use for benchmarking:
- `arxiv-summarization`: Research paper summarization dataset containing academic papers with abstracts for training summarization models, from Hugging Face Datasets.
- `axolotl`: Local dataset in Axolotl format with conversation segments labeled as human/assistant text, from Hugging Face Datasets.
- `code_debug`: Long-context code debugging dataset containing code with multiple-choice debugging questions for testing long-context understanding, from Hugging Face Datasets.
- `obfuscated-conversations`: Local dataset with obfuscated conversation data. You must pair this with the `--dataset-path` option to specify the local JSONL file.
- `random`: Synthetically generated random dataset that creates random token sequences with configurable input/output lengths and distributions.
- `sharegpt`: Conversational dataset containing human-AI conversations for chat model evaluation, from Hugging Face Datasets.
- `sonnet`: Poetry dataset using local text files containing poem lines, from Hugging Face Datasets.
- `vision-arena`: Vision-language benchmark dataset containing images with associated questions for multimodal model evaluation, from Hugging Face Datasets.
You can override the default dataset source for any of these using the
--dataset-path option (except for generated datasets like random), but you
must always specify a --dataset-name so the tool knows how to process the
dataset format.
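For example, because obfuscated-conversations always needs a local file, a run against it might look like this. The JSONL path is a placeholder for your own file, not something that ships with MAX.

```sh
# The dataset path below is a placeholder for your own JSONL file.
max benchmark \
  --model google/gemma-3-27b-it \
  --backend modular \
  --dataset-name obfuscated-conversations \
  --dataset-path ./my-conversations.jsonl \
  --num-prompts 100
```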
Configuration file
The `--config-file` option lets you specify a YAML file containing all your benchmark configurations, as a replacement for individual command-line options. Define the configuration options (corresponding to the `max benchmark` command-line options) in the file, nested under the `benchmark_config` key.
For instance, instead of specifying all configurations in the command line like this:
```sh
max benchmark \
  --model google/gemma-3-27b-it \
  --backend modular \
  --endpoint /v1/chat/completions \
  --host localhost \
  --port 8000 \
  --num-prompts 50 \
  --dataset-name arxiv-summarization \
  --arxiv-summarization-input-len 12000 \
  --max-output-len 1200
```

Create this configuration file:
```yaml
benchmark_config:
  model: google/gemma-3-27b-it
  backend: modular
  endpoint: /v1/chat/completions
  host: localhost
  port: 8000
  num_prompts: 50
  dataset_name: arxiv-summarization
  arxiv_summarization_input_len: 12000
  max_output_len: 1200
```

And then run the benchmark by passing that file:
```sh
max benchmark --config-file gemma-benchmark.yaml
```

For more config file examples, see our benchmark configs on GitHub.
For a walkthrough of setting up an endpoint and running a benchmark, see the quickstart guide.
Output
Here's an explanation of the most important metrics printed upon completion:
- Request throughput: Number of complete requests processed per second
- Input token throughput: Number of input tokens processed per second
- Output token throughput: Number of tokens generated per second
- TTFT: Time to first token—the time from request start to first token generation
- TPOT: Time per output token—the average time taken to generate each output token
- ITL: Inter-token latency—the average time between consecutive token or token-chunk generations
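To make these latency metrics concrete, here's a minimal Python sketch, not part of the benchmark tool, that computes TTFT, ITL, and TPOT for a single request from hypothetical per-token arrival timestamps:

```python
# Hypothetical per-token arrival times (seconds) for one request;
# these numbers are made up purely for illustration.
request_start = 0.0
token_times = [0.42, 0.47, 0.55, 0.61, 0.66]

# Time to first token: delay before the first token arrives.
ttft = token_times[0] - request_start

# Inter-token latency: average gap between consecutive token arrivals.
gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
itl = sum(gaps) / len(gaps)

# Time per output token: average time per token after the first one.
# (With per-token streaming this equals ITL; the two can differ when
# tokens arrive in chunks.)
tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)

print(f"TTFT={ttft:.3f}s  ITL={itl:.3f}s  TPOT={tpot:.3f}s")
```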
If --collect-gpu-stats is set, you'll also see these:
- GPU utilization: Percentage of time during which at least one GPU kernel is being executed
- Peak GPU memory used: Peak memory usage during benchmark run