max-pipelines
The max-pipelines CLI tool accelerates GenAI tasks by creating optimized inference pipelines with OpenAI-compatible endpoints. It supports both PyTorch models from Hugging Face and MAX Graph optimized versions of models like Llama 3.1, Mistral, and Replit Code.
Generate text or start an OpenAI-compatible endpoint with a single command using the max-pipelines CLI tool. While standard PyTorch models are supported, MAX Graph variants provide enhanced performance.
Get started
- If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:
curl -ssL https://magic.modular.com/ | bash
Then run the source command that's printed in your terminal.
- Install the max-pipelines CLI tool:
magic global install max-pipelines==25.1.1
- Run your first model:
max-pipelines generate --huggingface-repo-id=modularai/Llama-3.1-8B-Instruct-GGUF \
  --max-length 14 \
  --prompt "What's blue and rhymes with shoe?"
Update
To make sure you always have the latest version of max-pipelines, run this command:
magic global update
Commands
max-pipelines provides the following commands.
You can also print the available commands and documentation with --help. For example:
max-pipelines --help
max-pipelines serve --help
encode
Converts input text into embeddings for semantic search, text similarity, and NLP applications.
max-pipelines encode [OPTIONS]
Examples
Basic embedding generation:
max-pipelines encode \
  --huggingface-repo-id sentence-transformers/all-MiniLM-L6-v2 \
  --prompt "Convert this text into embeddings"
Batch processing with GPU:
max-pipelines encode \
  --huggingface-repo-id modularai/e5-large \
  --devices gpu \
  --max-ce-batch-size 16 \
  --quantization-encoding bfloat16 \
  --prompt "Process multiple texts efficiently"
generate
Performs text generation based on a provided prompt.
max-pipelines generate [OPTIONS]
Examples
Text generation:
max-pipelines generate \
  --huggingface-repo-id modularai/Llama-3.1-8B-Instruct-GGUF \
  --prompt "Write a story about a robot" \
  --max-new-tokens 100
Text generation with controls:
max-pipelines generate \
  --huggingface-repo-id unsloth/phi-4-GGUF \
  --prompt "Explain quantum computing" \
  --max-new-tokens 500 \
  --top-k 40 \
  --quantization-encoding q4_k \
  --cache-strategy paged
Process an image using a vision-language model given a URL to an image:
Llama 3.2 Vision
Llama Vision models take prompts with <|image|> and <|begin_of_text|> tokens. For more information, see the Llama 3.2 Vision documentation.
max-pipelines generate \
  --huggingface-repo-id meta-llama/Llama-3.2-11B-Vision-Instruct \
  --max-length 6491 \
  --image_url https://en.wikipedia.org/wiki/Template:POTD/2025-01-01 \
  --prompt="<|image|><|begin_of_text|>What is in this image?"
Pixtral
Pixtral models take prompts with [IMG] tokens. For more information, see the Pixtral documentation.
max-pipelines generate \
  --huggingface-repo-id mistral-community/pixtral-12b \
  --max-length 6491 \
  --image_url https://en.wikipedia.org/wiki/Template:POTD/2025-01-01 \
  --prompt="What is in this image? [IMG]"
list
Displays available model architectures and configurations, including:
- Hugging Face model repositories
- Supported encoding types
- Available cache strategies
max-pipelines list
serve
Launches an OpenAI-compatible REST API server for production deployments.
max-pipelines serve [OPTIONS]
Examples
CPU serving:
max-pipelines serve \
  --huggingface-repo-id modularai/Llama-3.1-8B-Instruct-GGUF
Optimized GPU serving:
max-pipelines serve \
  --huggingface-repo-id modularai/Llama-3.1-8B-Instruct-GGUF \
  --devices gpu \
  --quantization-encoding bfloat16 \
  --max-batch-size 4 \
  --cache-strategy paged
Production setup with serialized model:
max-pipelines serve \
  --serialized-model-path ./model.mef \
  --devices gpu-0,gpu-1 \
  --max-batch-size 8 \
  --device-memory-utilization 0.9
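Query the running server:
Once the server is up, you can send requests with any OpenAI-compatible client. The following is a minimal sketch using curl, assuming the server listens at its default local address of http://localhost:8000 and was started with the Llama 3.1 repository shown above:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'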
warm-cache
Preloads and compiles the model to optimize initialization time by:
- Pre-compiling models before deployment
- Warming up the Hugging Face cache
- Creating serialized model files
This command is useful to run before serving a model.
max-pipelines warm-cache [OPTIONS]
Examples
Basic cache warming:
max-pipelines warm-cache \
  --huggingface-repo-id modularai/Llama-3.1-8B-Instruct-GGUF
Save serialized model:
max-pipelines warm-cache \
  --huggingface-repo-id modularai/Llama-3.1-8B-Instruct-GGUF \
  --save-to-serialized-model-path ./model.mef \
  --quantization-encoding bfloat16
Configuration options
Model configuration
Core settings for model loading and execution.
Option | Description | Default | Values |
---|---|---|---|
--engine | Backend engine | max | max, huggingface |
--huggingface-repo-id TEXT | (required) Hugging Face model repository ID | | Any valid Hugging Face repo ID (e.g. mistralai/Mistral-7B-v0.1) |
--quantization-encoding | Weight encoding type | | float32, bfloat16, q4_k, q4_0, q6_k, gptq |
--weight-path PATH | Custom model weights path | | Valid file path (supports multiple paths via repeated flags) |
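For example, here is a quick sketch of supplying custom weights by repeating the --weight-path flag alongside the required repository ID (the local file paths shown are hypothetical placeholders):
max-pipelines generate \
  --huggingface-repo-id modularai/Llama-3.1-8B-Instruct-GGUF \
  --weight-path ./weights/model-00001.gguf \
  --weight-path ./weights/model-00002.gguf \
  --prompt "Hello"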
Device configuration
Controls hardware placement and memory usage.
Option | Description | Default | Values |
---|---|---|---|
--devices | Target devices | | cpu, gpu, gpu-{id} (e.g. gpu-0,gpu-1) |
--device-specs | Specific device configuration | CPU | DeviceSpec format (e.g. DeviceSpec(id=-1, device_type='cpu')) |
--device-memory-utilization | Device memory fraction | 0.9 | Float between 0.0 and 1.0 |
Performance tuning
Optimization settings for batch processing, caching, and sequence handling.
Option | Description | Default | Values |
---|---|---|---|
--cache-strategy | Cache strategy | | naive, continuous, paged |
--kv-cache-page-size | Token count per KVCache page | 128 | Positive integer |
--max-batch-size | Maximum cache size per batch | 1 | Positive integer |
--max-ce-batch-size | Maximum context encoding batch size | 32 | Positive integer |
--max-length | Maximum input sequence length | | Positive integer (must be less than the model's max config) |
--max-new-tokens | Maximum tokens to generate | -1 | Integer (-1 for model max) |
--pad-to-multiple-of | Input tensor padding multiple | 2 | Positive integer |
Model state control
Options for saving or loading model states and handling external code.
Option | Description | Default | Values |
---|---|---|---|
--force-download | Force re-download cached files | false | true, false |
--save-to-serialized-model-path | Path to save the serialized model to | | Valid file path |
--serialized-model-path | Path to load a serialized model from | | Valid file path |
--trust-remote-code | Allow custom Hugging Face code | false | true, false |
Generation parameters
Controls for text generation behavior.
Option | Description | Default | Values |
---|---|---|---|
--enable-constrained-decoding | Enable constrained generation | false | true, false |
--enable-echo | Enable model echo | false | true, false |
--image_url | URLs of images to include with the prompt. Ignored if the model doesn't support image inputs | [] | List of valid URLs |
--rope-type | RoPE type for GGUF weights | | none, normal, neox |
--top-k | Limit sampling to top K tokens | 1 | Positive integer (1 for greedy sampling) |
Common use cases
Fine-tuned generation
max-pipelines generate \
  --huggingface-repo-id modularai/Llama-3.1-8B-Instruct-GGUF \
  --quantization-encoding bfloat16 \
  --max-new-tokens 200 \
  --top-k 50 \
  --prompt "Write a technical blog post about machine learning"
Optimized production setup
First, warm the cache:
max-pipelines warm-cache \
  --huggingface-repo-id modularai/Llama-3.1-8B-Instruct-GGUF \
  --save-to-serialized-model-path ./model.mef
Then serve with optimized settings:
max-pipelines serve \
  --serialized-model-path ./model.mef \
  --devices gpu \
  --max-batch-size 8 \
  --device-memory-utilization 0.9 \
  --cache-strategy paged