max benchmark

Runs comprehensive benchmark tests on an active model server to measure performance metrics including throughput, latency, and resource utilization. For a complete walkthrough, see the tutorial to benchmark MAX on a GPU.

Before running this command, start the model server with max serve and confirm that it is running.

For example, here's how to benchmark the google/gemma-3-27b-it model already running on localhost:

max benchmark \
  --model google/gemma-3-27b-it \
  --backend modular \
  --endpoint /v1/chat/completions \
  --num-prompts 50 \
  --dataset-name arxiv-summarization \
  --arxiv-summarization-input-len 12000 \
  --max-output-len 1200

When it's done, you'll see the results printed to the terminal.

By default, max benchmark sends inference requests to localhost:8000. To target a different server, set --host and --port, or set --base-url, which overrides both.

To save the results to a JSON file, set --result-filename to the path you want (the value can include a directory, which is created if needed):

max benchmark ... --result-filename results/gemma-run.json
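Once a run has written its results, a few lines of Python are enough to take a first look at the file. The following is a minimal sketch that assumes only that the file contains a single JSON object. The field names inside the file are not documented on this page, so the code lists whatever top-level keys it finds instead of assuming any. The function name summarize_results is hypothetical.

```python
import json
from pathlib import Path


def summarize_results(path):
    """Return the sorted top-level keys of a benchmark result file.

    Assumes the file holds a single JSON object. No specific field
    names are assumed because the schema is not documented here.
    """
    data = json.loads(Path(path).read_text())
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object in {path}")
    return sorted(data)


if __name__ == "__main__":
    # Path taken from the --result-filename example above.
    result_path = Path("results/gemma-run.json")
    if result_path.exists():
        for key in summarize_results(result_path):
            print(key)
    else:
        print(f"{result_path} not found; run the benchmark first")
```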

Instead of passing all these benchmark options, you can pass a configuration file. See Configuration file below.

Usage

Run max benchmark with one or more options:

max benchmark [OPTIONS]

Options

The full option list is long, so this section groups the most useful options by purpose. For everything else, run max benchmark --help or see the benchmarking script source code.

  • Backend configuration:

    • --backend: Server type to benchmark. Choices: modular, modular-chat, vllm, vllm-chat, sglang, sglang-chat, trtllm, trtllm-chat. Default: modular.

    • --model: Hugging Face model ID or local path.

    • --endpoint: Specific API endpoint, such as /v1/completions or /v1/chat/completions. Default: /v1/chat/completions.

    • --base-url: Base URL of the API service. Overrides --host and --port when set.

    • --host: Server host. Default: localhost.

    • --port: Server port. Default: 8000.

    • --tokenizer: Hugging Face tokenizer to use. Defaults to the model's tokenizer.

  • Load generation:

    • --num-prompts: Number of prompts to process. Default: unset (driven by the dataset and duration).

    • --request-rate: Requests per second. Accepts a single value or a comma-separated sweep (such as 1,2,4,8). Default: inf (no rate limit).

    • --max-concurrency: Maximum concurrent requests. Accepts a single integer or a comma-separated sweep.

    • --seed: Random seed used to sample the dataset. Default: 0.

  • Dataset selection:

    • --dataset-name: Dataset to benchmark on. Determines the dataset class and processing logic. Default: sharegpt. See Datasets below.

    • --dataset-path: Path to a local dataset file that overrides the default source for the chosen --dataset-name.

  • Output control:

    • --max-output-len: Maximum output length per request, in tokens.

    • --temperature, --top-p, --top-k: Sampling parameters forwarded to the server.

  • LoRA traffic:

    • --lora: Optional LoRA name to send with each request.

    • --lora-paths: Paths to existing LoRA adapters. Each entry is either path or name=path.

    • --lora-uniform-traffic-ratio: Probability (between 0.0 and 1.0) that any given request targets a randomly selected LoRA instead of the base model. Default: 0.0.

    • --per-lora-traffic-ratio: Per-adapter traffic ratios, in the same order as --lora-paths. Sum must not exceed 1.0; the remainder goes to the base model. Overrides --lora-uniform-traffic-ratio when set.

    • --max-concurrent-lora-ops: Maximum concurrent LoRA load and unload operations. Default: 1.

  • Result saving:

    • --result-filename: Path to a JSON file for benchmark results. When unset, no file is written. The path may include directories that the command creates if they don't exist.

    • --metadata: Key-value pairs (such as --metadata version=0.3.3 tp=1) recorded alongside the run in the result JSON.

    • --log-dir: Directory for log output. Default: <backend>-latency-Y.m.d-H.M.S.

  • Stats collection:

    • --collect-gpu-stats / --no-collect-gpu-stats: Report GPU utilization and memory consumption (NVIDIA only). Enabled by default. Only works when max benchmark runs on the same instance as the server.

    • --collect-cpu-stats / --no-collect-cpu-stats: Report CPU stats. Enabled by default.

    • --collect-server-stats / --no-collect-server-stats: Report server stats. Enabled by default.

  • Configuration file:

    • --config-file: Path to a YAML file containing all benchmark options. Replaces individual command line flags. See Configuration file below.
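
The --per-lora-traffic-ratio rule listed under LoRA traffic (one ratio per --lora-paths entry, a sum of at most 1.0, and the remainder going to the base model) can be illustrated with a short Python sketch. This is not the benchmark's own implementation. The function names build_traffic_table and pick_target and the adapter names are hypothetical, and None is used here to stand for the base model.

```python
import random


def build_traffic_table(lora_names, ratios):
    """Pair each adapter with its ratio and give the remainder to the base model."""
    if len(lora_names) != len(ratios):
        raise ValueError("need one ratio per LoRA adapter")
    if any(ratio < 0 for ratio in ratios):
        raise ValueError("ratios must be non-negative")
    total = sum(ratios)
    if total > 1.0 + 1e-9:
        raise ValueError(f"ratios sum to {total}, which exceeds 1.0")
    table = list(zip(lora_names, ratios))
    # None stands for the base model (no LoRA attached to the request).
    table.append((None, max(0.0, 1.0 - total)))
    return table


def pick_target(table, rng):
    """Choose the target for one request according to the table's weights."""
    targets = [name for name, _ in table]
    weights = [weight for _, weight in table]
    return rng.choices(targets, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Hypothetical adapters with ratios 0.5 and 0.25; 0.25 is left for the base model.
    table = build_traffic_table(["adapter-a", "adapter-b"], [0.5, 0.25])
    rng = random.Random(0)
    counts = {}
    for _ in range(10_000):
        target = pick_target(table, rng)
        counts[target] = counts.get(target, 0) + 1
    print(counts)
```

Over many simulated requests, the printed counts should land close to the configured split, with the None entry absorbing whatever share the adapters leave unclaimed.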

Datasets

The --dataset-name option supports the following datasets. For any dataset that has configurable flags, those flags are listed inline.

You can override the default data source for most datasets with --dataset-path. Even then, you must still set --dataset-name so the tool knows how to process the file.

Text

  • sharegpt (default): Conversational dataset with human-AI exchanges, from Hugging Face Datasets.

  • axolotl: Local dataset in Axolotl format with human/assistant conversation segments. Pair with --dataset-path.

  • obfuscated-conversations: Local obfuscated conversation dataset. Pair with --dataset-path to point at a local JSONL file.

    • --obfuscated-conversations-average-output-len: Average output length when per-request output lengths are not provided. Default: 175.
    • --obfuscated-conversations-coefficient-of-variation: Coefficient of variation for output length. Default: 0.1.
    • --obfuscated-conversations-shuffle / --no-obfuscated-conversations-shuffle: Shuffle the dataset. Disabled by default.
  • arxiv-summarization: Research paper summarization dataset, from Hugging Face Datasets.

    • --arxiv-summarization-input-len: Input tokens per request. Default: 15000.
  • sonnet: Poetry dataset using local text files of poem lines.

    • --sonnet-input-len: Input tokens per request. Default: 550.
    • --sonnet-prefix-len: Shared prefix tokens per request. Default: 200.
  • random: Synthetically generated dataset with configurable token distributions.

    • --random-input-len: Input tokens per request. Accepts a constant or a distribution string: N(mean,std), U(lower,upper), DU(lower,upper), NB(n,p), G(shape,scale), or LN(mean,std). Use ; to set separate distributions for the first and subsequent turns (for example, N(2048,200);N(512,50)). Default: 1024.
    • --random-output-len: Output tokens per request. Same format as --random-input-len. Default: 128.
    • --random-num-turns: Turns per session. Same format as --random-input-len. Default: 1.
    • --random-sys-prompt-ratio: Fraction of the input length to use as a system prompt. Range: 0.0–1.0. Default: 0.0.
    • --random-max-num-unique-sys-prompt: Maximum number of distinct system prompts to generate. Default: 1.
    • --warm-shared-prefix / --no-warm-shared-prefix: Send each unique shared prefix as a single-token request before the run to prime prefix-cache KV entries. Requires --random-sys-prompt-ratio > 0. Disabled by default.
    • --random-image-count: Images to attach per request (enables vision mode on this dataset). Default: 0.
    • --random-image-size: Pixel dimensions of generated images (for example, 512x512). Used with --random-image-count.
  • synthetic: Synthetic text generation workload with multiturn support. Also supports --warm-shared-prefix (see random above).
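
The distribution strings accepted by --random-input-len, --random-output-len, and --random-num-turns are easier to reason about with a small parser in hand. The sketch below is illustrative only and is not the tool's parser. It splits a value on ; into first-turn and subsequent-turn parts, recognizes the six documented codes, and samples from just a few of them. Reading N as normal, U as continuous uniform, and DU as discrete uniform is an assumption, as are the function names parse_length_spec and sample_length and the clamp to a minimum of one token.

```python
import random
import re

# Distribution codes listed for --random-input-len. The sampling
# semantics below are assumptions made for illustration only.
_PATTERN = re.compile(r"^(N|U|DU|NB|G|LN)\(([^,()]+),([^,()]+)\)$")


def parse_length_spec(spec):
    """Parse 'first;subsequent' into a list of (kind, params) tuples."""
    parsed = []
    for part in spec.split(";"):
        part = part.strip()
        match = _PATTERN.match(part)
        if match:
            kind, first, second = match.groups()
            parsed.append((kind, (float(first), float(second))))
        else:
            # Anything else must be a bare integer, treated as a constant.
            parsed.append(("const", (int(part),)))
    return parsed


def sample_length(kind, params, rng):
    """Draw one token count. Only a few kinds are sketched here."""
    if kind == "const":
        return params[0]
    if kind == "N":
        mean, std = params
        return max(1, round(rng.gauss(mean, std)))
    if kind == "U":
        lower, upper = params
        return max(1, round(rng.uniform(lower, upper)))
    if kind == "DU":
        lower, upper = params
        return rng.randint(int(lower), int(upper))
    raise NotImplementedError(f"sampling for {kind} is not sketched here")


if __name__ == "__main__":
    rng = random.Random(0)
    first_turn, later_turns = parse_length_spec("N(2048,200);N(512,50)")
    print("first turn:", first_turn, sample_length(*first_turn, rng))
    print("later turns:", later_turns, sample_length(*later_turns, rng))
    print("constant:", parse_length_spec("1024"))
```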

Code

  • instruct-coder: Instruction-following coding dataset with multiturn support.

  • agentic-code: Multiturn agentic coding workload with tool-call turns.

  • code_debug: Long-context code debugging dataset with multiple-choice questions, from Hugging Face Datasets.

Vision

  • batch-job: Batch image workload.

    • --batch-job-image-dir: Directory where the server can access images (file reference mode). When unset, images are embedded as base64.
  • local-image: Local images for vision benchmarks. Pair with --dataset-path.

  • vision-arena: Vision-language benchmark dataset with images and associated questions for multimodal model evaluation, from Hugging Face Datasets.

  • synthetic-pixel: Synthetic pixel-generation workload for image-output backends.

Configuration file

The --config-file option takes the path to a YAML file that holds the benchmark options in place of individual command-line flags. Define each option under a top-level benchmark_config key, using the flag name without the leading dashes and with hyphens replaced by underscores (for example, --num-prompts becomes num_prompts).

For example, instead of specifying configurations on the command line like this:

max benchmark \
  --model google/gemma-3-27b-it \
  --backend modular \
  --endpoint /v1/chat/completions \
  --host localhost \
  --port 8000 \
  --num-prompts 50 \
  --dataset-name arxiv-summarization \
  --arxiv-summarization-input-len 12000 \
  --max-output-len 1200

Create this configuration file:

gemma-benchmark.yaml
benchmark_config:
  model: google/gemma-3-27b-it
  backend: modular
  endpoint: /v1/chat/completions
  host: localhost
  port: 8000
  num_prompts: 50
  dataset_name: arxiv-summarization
  arxiv_summarization_input_len: 12000
  max_output_len: 1200

Then run the benchmark by passing that file:

max benchmark --config-file gemma-benchmark.yaml
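
If you already have a long command line, the flag-to-key mapping shown in the example above (drop the leading dashes, turn hyphens into underscores) is mechanical enough to script. The following Python sketch shows one way to do it under a few assumptions: every flag takes exactly one value, boolean flags such as --no-collect-gpu-stats are not handled, and values are simple enough to write as unquoted YAML scalars. The function names flags_to_config and to_yaml are hypothetical and are not part of max.

```python
def flags_to_config(argv):
    """Convert ['--num-prompts', '50', ...] into a benchmark_config mapping."""
    options = {}
    index = 0
    while index < len(argv):
        token = argv[index]
        if not token.startswith("--"):
            raise ValueError(f"expected a flag, got {token!r}")
        if index + 1 >= len(argv) or argv[index + 1].startswith("--"):
            raise ValueError(f"{token} has no value; boolean flags are not handled")
        # --num-prompts -> num_prompts
        key = token[2:].replace("-", "_")
        options[key] = argv[index + 1]
        index += 2
    return {"benchmark_config": options}


def to_yaml(config):
    """Render the two-level mapping as YAML using unquoted scalars only."""
    lines = []
    for section, options in config.items():
        lines.append(f"{section}:")
        for key, value in options.items():
            lines.append(f"  {key}: {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    argv = [
        "--model", "google/gemma-3-27b-it",
        "--backend", "modular",
        "--endpoint", "/v1/chat/completions",
        "--num-prompts", "50",
        "--dataset-name", "arxiv-summarization",
        "--arxiv-summarization-input-len", "12000",
        "--max-output-len", "1200",
    ]
    print(to_yaml(flags_to_config(argv)), end="")
```

Values that contain characters with special meaning in YAML, such as a colon followed by a space or a # sign, would need quoting, which this sketch does not attempt.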

For more config file examples, see our benchmark configs on GitHub.

For a walkthrough of setting up an endpoint and running a benchmark, see the quickstart guide.

Output

Each run prints the following metrics on completion:

  • Request throughput: number of complete requests processed per second.
  • Input token throughput: number of input tokens processed per second.
  • Output token throughput: number of tokens generated per second.
  • TTFT (time to first token): time from request start to first token generation.
  • TPOT (time per output token): average time taken to generate each output token.
  • ITL (inter-token latency): average time between consecutive token or token-chunk generations.
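
To make the latency definitions above concrete, the following Python sketch computes TTFT, TPOT, and ITL for a single request from hypothetical timestamps. It is a worked illustration and not the benchmark's own code. In particular, dividing the decode time by the number of output tokens minus one for TPOT is a common convention that is assumed here, and the function names latency_metrics and throughput are hypothetical.

```python
def latency_metrics(request_start, chunk_times, num_output_tokens):
    """Compute TTFT, TPOT, and mean ITL (in seconds) for one request.

    chunk_times holds the arrival time of each streamed token or token chunk.
    num_output_tokens is the total number of tokens across all chunks.
    """
    if not chunk_times:
        raise ValueError("need at least one streamed chunk")
    ttft = chunk_times[0] - request_start
    decode_time = chunk_times[-1] - chunk_times[0]
    # Assumption: TPOT excludes the first token, which TTFT already covers.
    tpot = decode_time / (num_output_tokens - 1) if num_output_tokens > 1 else 0.0
    gaps = [later - earlier for earlier, later in zip(chunk_times, chunk_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft": ttft, "tpot": tpot, "itl": itl}


def throughput(count, duration_seconds):
    """Requests or tokens per second over the whole benchmark run."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return count / duration_seconds


if __name__ == "__main__":
    # Hypothetical request: sent at t=0.0, three chunks carrying five tokens in total.
    metrics = latency_metrics(0.0, [0.5, 1.0, 1.5], num_output_tokens=5)
    print(metrics)
    # Hypothetical run: 50 requests and 60000 output tokens in 120 seconds.
    print(throughput(50, 120.0), throughput(60000, 120.0))
```

When the server streams several tokens per chunk, as in this example, TPOT and ITL diverge: TPOT averages over tokens, while ITL averages over the gaps between chunks.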

When --collect-gpu-stats is enabled, the run also reports:

  • GPU utilization: percentage of time during which at least one GPU kernel is executing.
  • Peak GPU memory used: peak memory usage during the benchmark run.