max CLI
The max CLI tool accelerates GenAI tasks by creating optimized inference pipelines with OpenAI-compatible endpoints. It supports models from Hugging Face as well as MAX Graph-optimized versions of models.

You can generate text or start an OpenAI-compatible endpoint with a single max command.
Install
Create a Python project to install our APIs and the max CLI.
Use one of the following package managers: pixi, uv, pip, or conda.
pixi

- If you don't have it, install pixi:

  ```
  curl -fsSL https://pixi.sh/install.sh | sh
  ```

  Then restart your terminal for the changes to take effect.

- Create a project:

  ```
  pixi init example-project \
    -c https://conda.modular.com/max-nightly/ -c conda-forge \
    && cd example-project
  ```

- Install the modular conda package:

  Nightly:

  ```
  pixi add modular
  ```

  Stable:

  ```
  pixi add "modular==25.6"
  ```

- Start the virtual environment:

  ```
  pixi shell
  ```
uv

- If you don't have it, install uv:

  ```
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

  Then restart your terminal to make uv accessible.

- Create a project:

  ```
  uv init example-project && cd example-project
  ```

- Create and start a virtual environment:

  ```
  uv venv && source .venv/bin/activate
  ```

- Install the modular Python package:

  Nightly:

  ```
  uv pip install modular \
    --index-url https://dl.modular.com/public/nightly/python/simple/ \
    --prerelease allow
  ```

  Stable:

  ```
  uv pip install modular \
    --extra-index-url https://modular.gateway.scarf.sh/simple/
  ```
pip

- Create a project folder:

  ```
  mkdir example-project && cd example-project
  ```

- Create and activate a virtual environment:

  ```
  python3 -m venv .venv/example-project \
    && source .venv/example-project/bin/activate
  ```

- Install the modular Python package:

  Nightly:

  ```
  pip install --pre modular \
    --index-url https://dl.modular.com/public/nightly/python/simple/
  ```

  Stable:

  ```
  pip install modular \
    --extra-index-url https://modular.gateway.scarf.sh/simple/
  ```
conda

- If you don't have it, install conda. A common choice is with brew:

  ```
  brew install miniconda
  ```

- Initialize conda for shell interaction:

  ```
  conda init
  ```

  If you're on a Mac, instead use:

  ```
  conda init zsh
  ```

  Then restart your terminal for the changes to take effect.

- Create a project:

  ```
  conda create -n example-project
  ```

- Start the virtual environment:

  ```
  conda activate example-project
  ```

- Install the modular conda package:

  Nightly:

  ```
  conda install -c conda-forge -c https://conda.modular.com/max-nightly/ modular
  ```

  Stable:

  ```
  conda install -c conda-forge -c https://conda.modular.com/max/ modular
  ```
When you install the modular package, you'll get access to the max CLI tool
automatically. You can check your version like this:
```
max --version
```

Run your first model
Now that you have max installed, you can run your first model:
```
max generate --model google/gemma-3-12b-it \
  --prompt "Generate a story about a robot"
```

Commands
max provides the following commands.
You can also print the available commands and documentation with --help.
For example:
```
max --help
max serve --help
```

max benchmark
Runs comprehensive benchmark tests on an active model server to measure performance metrics including throughput, latency, and resource utilization.
```
max benchmark [OPTIONS]
```

Before running this command, make sure the model server is running via max serve.
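For example, to serve the model that's benchmarked below (a sketch; it assumes you have access to this model on Hugging Face and hardware that can run it):

```
max serve --model google/gemma-3-27b-it
```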
Example
Benchmark the google/gemma-3-27b-it model already running on localhost:
```
max benchmark \
  --model google/gemma-3-27b-it \
  --backend modular \
  --endpoint /v1/chat/completions \
  --num-prompts 50 \
  --dataset-name arxiv-summarization \
  --arxiv-summarization-input-len 12000 \
  --max-output-len 1200
```

When it's done, you'll see the results printed to the terminal.
By default, it sends inference requests to localhost:8000, but you can change
that with the --host and --port arguments.
If you want to save the results, add the --save-result option, which creates
a JSON file in the local path with the following naming convention:
```
{backend}-{request_rate}qps-{model_name}-{timestamp}.json
```

But you can specify the file name with --result-filename and change the
directory with --result-dir.
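For example, here's a sketch that benchmarks a server on another machine and saves the results to a custom location (the host address and file names are placeholders):

```
max benchmark \
  --model google/gemma-3-27b-it \
  --host 10.0.0.5 \
  --port 8000 \
  --save-result \
  --result-filename gemma-results.json \
  --result-dir ./benchmark-results
```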
Instead of passing all these benchmark options on the command line, you can pass a configuration file. See Configuration file below.
Options
This list of options is not exhaustive. For more information, run max benchmark --help or see the benchmarking script source
code.
- Backend configuration:
  - --backend: Choose from modular (MAX v1/completions endpoint), modular-chat (MAX v1/chat/completions endpoint), vllm (vLLM), or trt-llm (TensorRT-LLM)
  - --model: Hugging Face model ID or local path
- Load generation:
  - --num-prompts: Number of prompts to process (int, default: 500)
  - --request-rate: Request rate in requests/second (int, default: inf)
  - --seed: The random seed used to sample the dataset (int, default: 0)
- Serving options:
  - --base-url: Base URL of the API service
  - --endpoint: Specific API endpoint (/v1/completions or /v1/chat/completions)
  - --tokenizer: Hugging Face tokenizer to use (can be different from the model)
  - --dataset-name: (Required; default: sharegpt) Specifies which type of benchmark dataset to use. This determines the dataset class and processing logic. See Datasets below.
  - --dataset-path: Path to a local dataset file that overrides the default dataset source for the specified dataset-name. The file format must match the expected format for the specified dataset-name (such as JSON for axolotl, JSONL for obfuscated-conversations, plain text for sonnet).
- Additional options:
  - --collect-gpu-stats: Report GPU utilization and memory consumption. Only works when running max benchmark on the same instance as the server, and only on NVIDIA GPUs.
  - --save-results: Saves results to a local JSON file.
  - --config-file: Path to a YAML file containing key-value pairs for all your benchmark configurations, as a replacement for individual command-line options. See Configuration file below.
Output
Here's an explanation of the most important metrics printed upon completion:
- Request throughput: Number of complete requests processed per second
- Input token throughput: Number of input tokens processed per second
- Output token throughput: Number of tokens generated per second
- TTFT: Time to first token—the time from request start to first token generation
- TPOT: Time per output token—the average time taken to generate each output token
- ITL: Inter-token latency—the average time between consecutive token or token-chunk generations
If --collect-gpu-stats is set, you'll also see these:
- GPU utilization: Percentage of time during which at least one GPU kernel is being executed
- Peak GPU memory used: Peak memory usage during benchmark run
Datasets
The --dataset-name option supports several dataset names/formats you can
use for benchmarking:
- arxiv-summarization: Research paper summarization dataset containing academic papers with abstracts for training summarization models, from Hugging Face Datasets.
- axolotl: Local dataset in Axolotl format with conversation segments labeled as human/assistant text, from Hugging Face Datasets.
- code_debug: Long-context code debugging dataset containing code with multiple choice debugging questions for testing long-context understanding, from Hugging Face Datasets.
- obfuscated-conversations: Local dataset with obfuscated conversation data. You must pair this with the --dataset-path option to specify the local JSONL file.
- random: Synthetically generated random dataset that creates random token sequences with configurable input/output lengths and distributions.
- sharegpt: Conversational dataset containing human-AI conversations for chat model evaluation, from Hugging Face Datasets.
- sonnet: Poetry dataset using local text files containing poem lines, from Hugging Face Datasets.
- vision-arena: Vision-language benchmark dataset containing images with associated questions for multimodal model evaluation, from Hugging Face Datasets.
You can override the default dataset source for any of these using the
--dataset-path option (except for generated datasets like random), but you
must always specify a --dataset-name so the tool knows how to process the
dataset format.
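For example, here's a sketch that benchmarks against a local ShareGPT-format file (the path is a placeholder; the file must match the ShareGPT JSON format):

```
max benchmark \
  --model google/gemma-3-27b-it \
  --dataset-name sharegpt \
  --dataset-path /path/to/sharegpt.json
```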
Configuration file
The --config-file option allows you to specify a YAML file containing all
your benchmark configurations, as a replacement for individual command line
options. Simply define all the configuration options (corresponding to the max benchmark command line options) in a YAML file, all nested under the
benchmark_config key.
For example, without a configuration file, you must specify all configurations with command line options like this:
```
max benchmark \
  --model google/gemma-3-27b-it \
  --backend modular \
  --endpoint /v1/chat/completions \
  --host localhost \
  --port 8000 \
  --num-prompts 50 \
  --dataset-name arxiv-summarization \
  --arxiv-summarization-input-len 12000 \
  --max-output-len 1200
```

Instead, you can create a configuration file:
```
benchmark_config:
  model: google/gemma-3-27b-it
  backend: modular
  endpoint: /v1/chat/completions
  host: localhost
  port: 8000
  num_prompts: 50
  dataset_name: arxiv-summarization
  arxiv_summarization_input_len: 12000
  max_output_len: 1200
```

And then run the benchmark by passing that file:
```
max benchmark --config-file gemma-benchmark.yaml
```

For more information about running benchmarks, see the benchmarking tutorial.
max encode
Converts input text into embeddings for semantic search, text similarity, and NLP applications.
```
max encode [OPTIONS]
```

Example
Basic embedding generation:
```
max encode \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --prompt "Convert this text into embeddings"
```

max generate
Performs text generation based on a provided prompt.
```
max generate [OPTIONS]
```

Examples
Text generation:
```
max generate \
  --model google/gemma-3-12b-it \
  --prompt "Generate a story about a robot"
```

Text generation with controls:
```
max generate \
  --model google/gemma-3-12b-it \
  --max-length 1024 \
  --max-new-tokens 500 \
  --top-k 40 \
  --temperature 0.7 \
  --seed 42 \
  --prompt "Explain quantum computing"
```

Process an image using a vision-language model given a URL to an image:
Llama 3.2 Vision
Llama Vision models take prompts with <|image|> and <|begin_of_text|> tokens.
For more information, see the Llama 3.2 Vision
documentation.
```
max generate \
  --model meta-llama/Llama-3.2-11B-Vision-Instruct \
  --prompt "<|image|><|begin_of_text|>What is in this image?" \
  --image_url https://upload.wikimedia.org/wikipedia/commons/5/53/Almendro_en_flor_Sierras_de_Tejeda%2C_Almijara_y_Alhama.jpg \
  --max-new-tokens 100 \
  --max-batch-size 1 \
  --max-length 108172
```

Pixtral
Pixtral models take prompts with [IMG] tokens. For more information, see the
Pixtral
documentation.
```
max generate \
  --model mistral-community/pixtral-12b \
  --max-length 6491 \
  --image_url https://upload.wikimedia.org/wikipedia/commons/5/53/Almendro_en_flor_Sierras_de_Tejeda%2C_Almijara_y_Alhama.jpg \
  --prompt "<s>[INST]Describe the images.\n[IMG][/INST]"
```

For more information on how to use the generate command with vision models,
see Image to text.
max list
Displays available model architectures and configurations, including:
- Hugging Face model repositories
- Supported encoding types
- Available cache strategies
```
max list
```

max serve
Launches an OpenAI-compatible REST API server for production deployments. For more detail, see the Serve API docs.
```
max serve [OPTIONS]
```

Examples
CPU serving:
```
max serve --model modularai/Llama-3.1-8B-Instruct-GGUF
```

Optimized GPU serving:
```
max serve \
  --model google/gemma-3-12b-it \
  --devices gpu \
  --quantization-encoding bfloat16 \
  --max-batch-size 4
```

Production setup:
```
max serve \
  --model google/gemma-3-12b-it \
  --devices gpu:0 \
  --max-batch-size 8 \
  --device-memory-utilization 0.9
```
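Once the server is running, you can send requests to its OpenAI-compatible endpoint. Here's a minimal sketch with curl, assuming the server above is listening on the default localhost:8000 and the request body follows the standard OpenAI chat completions schema:

```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-12b-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```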
Custom architectures

The max CLI supports loading custom model architectures through the
--custom-architectures flag. This allows you to extend MAX's capabilities with
your own model implementations:
```
max serve \
  --model google/gemma-3-12b-it \
  --custom-architectures path/to/module1:module1 \
  --custom-architectures path/to/module2:module2
```

max warm-cache
Preloads and compiles the model to optimize initialization time by:
- Pre-compiling models before deployment
- Warming up the Hugging Face cache
This command is useful to run before serving a model.
```
max warm-cache [OPTIONS]
```

Example
Basic cache warming:
```
max warm-cache \
  --model google/gemma-3-12b-it
```

Model configuration
Core settings for model loading and execution.
| Option | Description | Default | Values |
|---|---|---|---|
| --custom-architectures | Load custom pipeline architectures | | Module path format: folder/path/to/import:my_module |
| --model TEXT | Model ID or local path | | Hugging Face repo ID (e.g. mistralai/Mistral-7B-v0.1) or a local path |
| --model-path TEXT | Model ID or local path (alternative to --model) | | Hugging Face repo ID (e.g. mistralai/Mistral-7B-v0.1) or a local path |
| --quantization-encoding | Weight encoding type | | float32, bfloat16, q4_k, q4_0, q6_k, or gptq |
| --served-model-name | Override the default model name reported to clients (serve command only) | | Any string identifier |
| --weight-path PATH | Custom model weights path | | Valid file path (supports multiple paths via repeated flags) |
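For example, here's a sketch that serves a model with an explicit weight encoding and a custom name reported to clients (the served name is a placeholder):

```
max serve \
  --model google/gemma-3-12b-it \
  --quantization-encoding bfloat16 \
  --served-model-name my-gemma
```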
Device configuration
Controls hardware placement and memory usage.
| Option | Description | Default | Values |
|---|---|---|---|
| --devices | Target devices | | cpu, gpu, or gpu:{id} (e.g. gpu:0,1) |
| --device-specs | Specific device configuration | CPU | DeviceSpec format (e.g. DeviceSpec(id=-1, device_type='cpu')) |
| --device-memory-utilization | Device memory fraction | 0.9 | Float between 0.0 and 1.0 |
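For example, here's a sketch that pins serving to two specific GPUs and caps their memory use (it assumes a machine with at least two GPUs):

```
max serve \
  --model google/gemma-3-12b-it \
  --devices gpu:0,1 \
  --device-memory-utilization 0.8
```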
Performance tuning
Optimization settings for batch processing, caching, and sequence handling.
| Option | Description | Default | Values |
|---|---|---|---|
| --cache-strategy | Cache strategy | | naive or continuous |
| --kv-cache-page-size | Token count per KVCache page | 128 | Positive integer |
| --max-batch-size | Maximum cache size per batch | 1 | Positive integer |
| --max-ce-batch-size | Maximum context encoding batch size | 32 | Positive integer |
| --max-length | Maximum input sequence length | The Hugging Face model's default max length is used. | Positive integer (must be less than model's max config) |
| --max-new-tokens | Maximum tokens to generate | -1 | Integer (-1 for model max) |
| --data-parallel-degree | Number of devices for data parallelism | 1 | Positive integer |
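For example, here's a sketch that raises the batch size and bounds sequence lengths for a serving deployment (the values are illustrative, not tuned recommendations):

```
max serve \
  --model google/gemma-3-12b-it \
  --max-batch-size 16 \
  --max-ce-batch-size 32 \
  --max-length 4096
```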
Model state control
Options for saving or loading model states and handling external code.
| Option | Description | Default | Values |
|---|---|---|---|
| --force-download | Force re-download cached files | false | true or false |
| --trust-remote-code | Allow custom Hugging Face code | false | true or false |
| --allow-safetensors-weights-fp32-bf6-bidirectional-cast | Allow automatic bidirectional dtype casts between fp32 and bfloat16 | false | true or false |
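For example, here's a sketch that serves a model repository that ships custom Python code, while forcing a fresh download of cached files (the model ID is a placeholder; only use --trust-remote-code with repositories you trust):

```
max serve \
  --model some-org/custom-model \
  --trust-remote-code \
  --force-download
```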
Generation parameters
Controls for generation behavior.
| Option | Description | Default | Values |
|---|---|---|---|
| --enable-constrained-decoding | Enable constrained generation | false | true or false |
| --enable-echo | Enable model echo | false | true or false |
| --image_url | URLs of images to include with prompt. Ignored if model doesn't support image inputs | [] | List of valid URLs |
| --rope-type | RoPE type for GGUF weights | | none, normal, or neox |
| --seed | Random seed for generation reproducibility | | Integer value |
| --temperature | Sampling temperature for generation randomness | 1.0 | Float value (0.0 to 2.0) |
| --top-k | Limit sampling to top K tokens | 255 | Positive integer (1 for greedy sampling) |
| --chat-template | Custom chat template for the model | | Valid chat template string |
Server configuration
Network settings for server deployment.
| Option | Description | Default | Values | 
|---|---|---|---|
| --host | Host address to bind the server to | localhost | IP address or hostname | 
| --port | Port number to bind the server to | 8000 | Port number | 
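For example, here's a sketch that exposes the server on all network interfaces and a non-default port (make sure this is appropriate for your network before exposing the server publicly):

```
max serve \
  --model google/gemma-3-12b-it \
  --host 0.0.0.0 \
  --port 8080
```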