max CLI
The max CLI tool accelerates GenAI tasks by creating optimized inference pipelines with OpenAI-compatible endpoints. It supports models from Hugging Face as well as MAX Graph-optimized versions of models such as Llama 3.1, Mistral, and Replit Code. With a single command, you can generate text or start an OpenAI-compatible endpoint.
Install
Create a Python project to install our APIs and the max CLI.
Choose one of the following package managers: pip, uv, conda, or pixi.
pip
- Create a project folder:
mkdir example-project && cd example-project
- Create and activate a virtual environment:
python3 -m venv .venv/example-project \
&& source .venv/example-project/bin/activate
- Install the modular Python package:
Nightly:
pip install modular \
--extra-index-url https://download.pytorch.org/whl/cpu \
--index-url https://dl.modular.com/public/nightly/python/simple/
Stable:
pip install modular \
--extra-index-url https://download.pytorch.org/whl/cpu \
--extra-index-url https://modular.gateway.scarf.sh/simple/
uv
- If you don't have it, install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
Then restart your terminal to make uv accessible.
- Create a project:
uv init example-project && cd example-project
- Create and start a virtual environment:
uv venv && source .venv/bin/activate
- Install the modular Python package:
Nightly:
uv pip install modular \
--extra-index-url https://download.pytorch.org/whl/cpu \
--index-url https://dl.modular.com/public/nightly/python/simple/ \
--index-strategy unsafe-best-match
Stable:
uv pip install modular \
--extra-index-url https://download.pytorch.org/whl/cpu \
--extra-index-url https://modular.gateway.scarf.sh/simple/ \
--index-strategy unsafe-best-match
conda
- If you don't have it, install conda. A common choice is with brew:
brew install miniconda
- Initialize conda for shell interaction:
conda init
If you're on a Mac, instead use:
conda init zsh
Then restart your terminal for the changes to take effect.
- Create a project:
conda create -n example-project
- Start the virtual environment:
conda activate example-project
- Install the modular conda package:
Nightly:
conda install -c conda-forge -c https://conda.modular.com/max-nightly/ modular
Stable:
conda install -c conda-forge -c https://conda.modular.com/max/ modular
pixi
- If you don't have it, install pixi:
curl -fsSL https://pixi.sh/install.sh | sh
Then restart your terminal for the changes to take effect.
- Create a project:
pixi init example-project \
-c https://conda.modular.com/max-nightly/ -c conda-forge \
&& cd example-project
- Install the modular conda package:
Nightly:
pixi add modular
Stable:
pixi add "modular==25.3"
- Start the virtual environment:
pixi shell
When you install the modular package, you'll get access to the max CLI tool automatically. You can check your version like this:
max --version
Run your first model
Now that you have max installed, you can run your first model:
max generate --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
--prompt "Generate a story about a robot"
Commands
max provides the following commands. You can also print the available commands and documentation with --help. For example:
max --help
max serve --help
encode
Converts input text into embeddings for semantic search, text similarity, and NLP applications.
max encode [OPTIONS]
Example
Basic embedding generation:
max encode \
--model-path sentence-transformers/all-MiniLM-L6-v2 \
--prompt "Convert this text into embeddings"
generate
Performs text generation based on a provided prompt.
max generate [OPTIONS]
Examples
Text generation:
max generate \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF \
--max-length 1024 \
--max-new-tokens 100 \
--prompt "Generate a story about a robot"
Text generation with controls:
max generate \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF \
--max-length 1024 \
--max-new-tokens 500 \
--top-k 40 \
--quantization-encoding q4_k \
--cache-strategy paged \
--prompt "Explain quantum computing"
Process an image using a vision-language model given a URL to an image:
Llama 3.2 Vision
Llama Vision models take prompts with <|image|> and <|begin_of_text|> tokens. For more information, see the Llama 3.2 Vision documentation.
max generate \
--model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
--prompt "<|image|><|begin_of_text|>What is in this image?" \
--image_url https://upload.wikimedia.org/wikipedia/commons/5/53/Almendro_en_flor_Sierras_de_Tejeda%2C_Almijara_y_Alhama.jpg \
--max-new-tokens 100 \
--max-batch-size 1 \
--max-length 108172
Pixtral
Pixtral models take prompts with [IMG] tokens. For more information, see the Pixtral documentation.
max generate \
--model-path mistral-community/pixtral-12b \
--max-length 6491 \
--image_url https://upload.wikimedia.org/wikipedia/commons/5/53/Almendro_en_flor_Sierras_de_Tejeda%2C_Almijara_y_Alhama.jpg \
--prompt "<s>[INST]Describe the images.\n[IMG][/INST]"
For more information on how to use the generate command with vision models, see Generate image descriptions with Llama 3.2 Vision.
list
Displays available model architectures and configurations, including:
- Hugging Face model repositories
- Supported encoding types
- Available cache strategies
max list
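Because the output is plain text, you can filter it with standard shell tools, for example to check whether a particular model family is listed (the grep pattern here is only illustrative):
max list | grep -i llama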
serve
Launches an OpenAI-compatible REST API server for production deployments. For more detail, see the Serve API docs.
max serve [OPTIONS]
Examples
CPU serving:
max serve \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF
Optimized GPU serving:
max serve \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF \
--devices gpu \
--quantization-encoding bfloat16 \
--max-batch-size 4 \
--cache-strategy paged
Production setup:
max serve \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF \
--devices gpu:0,1 \
--max-batch-size 8 \
--device-memory-utilization 0.9
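Once the server is running, any OpenAI-compatible client can talk to it. For example, here is a chat completion request with curl; it assumes the server's default address of localhost:8000 and the standard OpenAI /v1/chat/completions route:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "modularai/Llama-3.1-8B-Instruct-GGUF", "messages": [{"role": "user", "content": "Write a haiku about GPUs"}]}'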
Custom architectures
The max CLI supports loading custom model architectures through the --custom-architectures flag. This allows you to extend MAX's capabilities with your own model implementations:
max serve \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF \
--custom-architectures path/to/module1:module1 \
--custom-architectures path/to/module2:module2
warm-cache
Preloads and compiles the model to optimize initialization time by:
- Pre-compiling models before deployment
- Warming up the Hugging Face cache
This command is useful to run before serving a model.
max warm-cache [OPTIONS]
Example
Basic cache warming:
max warm-cache \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF
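Because warm-cache is meant to run before serving, a simple deployment script can chain the two commands; this just combines the commands shown above:
max warm-cache --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
&& max serve --model-path modularai/Llama-3.1-8B-Instruct-GGUF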
Model configuration
Core settings for model loading and execution.
Option | Description | Default | Values |
---|---|---|---|
--custom-architectures | Load custom pipeline architectures | | Module path format: folder/path/to/import:my_module |
--engine | Backend engine | max | max, huggingface |
--model-path TEXT | (required) Path to model | | Any valid path or Hugging Face repo ID (e.g. mistralai/Mistral-7B-v0.1) |
--quantization-encoding | Weight encoding type | | float32, bfloat16, q4_k, q4_0, q6_k, gptq |
--weight-path PATH | Custom model weights path | | Valid file path (supports multiple paths via repeated flags) |
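For example, the table above lists huggingface as an alternative --engine value. The following sketch runs generation through that backend, reusing the example repo ID from the --model-path row; whether a given model works with this backend depends on its format:
max generate \
--model-path mistralai/Mistral-7B-v0.1 \
--engine huggingface \
--prompt "Generate a story about a robot"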
Device configuration
Controls hardware placement and memory usage.
Option | Description | Default | Values |
---|---|---|---|
--devices | Target devices | | cpu, gpu, or gpu:{id} (e.g. gpu:0,1) |
--device-specs | Specific device configuration | CPU | DeviceSpec format (e.g. DeviceSpec(id=-1, device_type='cpu') ) |
--device-memory-utilization | Device memory fraction | 0.9 | Float between 0.0 and 1.0 |
Performance tuning
Optimization settings for batch processing, caching, and sequence handling.
Option | Description | Default | Values |
---|---|---|---|
--cache-strategy | Cache strategy | | naive, continuous |
--kv-cache-page-size | Token count per KVCache page | 128 | Positive integer |
--max-batch-size | Maximum cache size per batch | 1 | Positive integer |
--max-ce-batch-size | Maximum context encoding batch size | 32 | Positive integer |
--max-length | Maximum input sequence length | The Hugging Face model's default max length is used. | Positive integer (must be less than model's max config) |
--max-new-tokens | Maximum tokens to generate | -1 | Integer (-1 for model max) |
--pad-to-multiple-of | Input tensor padding multiple | 2 | Positive integer |
Model state control
Options for saving or loading model states and handling external code.
Option | Description | Default | Values |
---|---|---|---|
--force-download | Force re-download cached files | false | true, false |
--trust-remote-code | Allow custom Hugging Face code | false | true, false |
Generation parameters
Controls for generation behavior.
Option | Description | Default | Values |
---|---|---|---|
--enable-constrained-decoding | Enable constrained generation | false | true, false |
--enable-echo | Enable model echo | false | true, false |
--image_url | URLs of images to include with prompt. Ignored if model doesn't support image inputs | [] | List of valid URLs |
--rope-type | RoPE type for GGUF weights | | none, normal, neox |
--top-k | Limit sampling to top K tokens | 1 | Positive integer (1 for greedy sampling) |