
max-pipelines

The max-pipelines CLI tool accelerates GenAI tasks by creating optimized inference pipelines with OpenAI-compatible endpoints. It supports both PyTorch models from Hugging Face and MAX Graph optimized versions of models like Llama 3.1, Mistral, and Replit Code.

With a single command, you can generate text or start an OpenAI-compatible endpoint. Standard PyTorch models are supported, while MAX Graph variants provide enhanced performance.

Get started

  1. If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:

    curl -ssL https://magic.modular.com/ | bash

    Then run the source command that's printed in your terminal.

  2. Install the max-pipelines CLI tool:

    magic global install max-pipelines
  3. Run your first model:

    max-pipelines generate --model-path=modularai/Llama-3.1-8B-Instruct-GGUF \
    --prompt "Generate a story about a robot"

Manage versions

To make sure you always have the latest version of max-pipelines, run the following command:

magic global update

You can also install a specific version of the max-pipelines package:

magic global install max-pipelines==25.1.0

Uninstall

To remove max-pipelines, delete the binary:

rm ~/.modular/bin/max-pipelines

Commands

max-pipelines provides the following commands.

You can also print the available commands and documentation with --help. For example:

max-pipelines --help
max-pipelines serve --help

encode

Converts input text into embeddings for semantic search, text similarity, and NLP applications.

max-pipelines encode [OPTIONS]

Example

Basic embedding generation:

max-pipelines encode \
--model-path sentence-transformers/all-MiniLM-L6-v2 \
--prompt "Convert this text into embeddings"

generate

Performs text generation based on a provided prompt.

max-pipelines generate [OPTIONS]

Examples

Text generation:

max-pipelines generate \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF \
--max-length 1024 \
--max-new-tokens 100 \
--prompt "Write a story about a robot"

Text generation with controls:

max-pipelines generate \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF \
--max-length 1024 \
--max-new-tokens 500 \
--top-k 40 \
--quantization-encoding q4_k \
--cache-strategy paged \
--prompt "Explain quantum computing"

Process an image with a vision-language model by passing an image URL:

Llama 3.2 Vision

Llama Vision models take prompts with <|image|> and <|begin_of_text|> tokens. For more information, see the Llama 3.2 Vision documentation.

max-pipelines generate \
--model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
--prompt "<|image|><|begin_of_text|>What is in this image?" \
--image_url https://upload.wikimedia.org/wikipedia/commons/5/53/Almendro_en_flor_Sierras_de_Tejeda%2C_Almijara_y_Alhama.jpg \
--max-new-tokens 100 \
--max-batch-size 1 \
--max-length 108172

Pixtral

Pixtral models take prompts with [IMG] tokens. For more information, see the Pixtral documentation.

max-pipelines generate \
--model-path mistral-community/pixtral-12b \
--max-length 6491 \
--image_url https://upload.wikimedia.org/wikipedia/commons/5/53/Almendro_en_flor_Sierras_de_Tejeda%2C_Almijara_y_Alhama.jpg \
--prompt="<s>[INST]Describe the images.\n[IMG][/INST]"

For more information on how to use the generate command with vision models, see Generate image descriptions with Llama 3.2 Vision.

list

Displays available model architectures and configurations, including:

  • Hugging Face model repositories
  • Supported encoding types
  • Available cache strategies

max-pipelines list

serve

Launches an OpenAI-compatible REST API server for production deployments.

max-pipelines serve [OPTIONS]

Examples

CPU serving:

max-pipelines serve \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF

Optimized GPU serving:

max-pipelines serve \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF \
--devices gpu \
--quantization-encoding bfloat16 \
--max-batch-size 4 \
--cache-strategy paged

Production setup with serialized model:

max-pipelines serve \
--serialized-model-path ./model.mef \
--devices gpu:0,1 \
--max-batch-size 8 \
--device-memory-utilization 0.9
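
Once the server is running, you can call it with any OpenAI-compatible client. The following is a minimal sketch using curl; it assumes the server is listening on the default local port 8000 and exposes the standard /v1/chat/completions route, so adjust the host and port if your setup differs:

# Hypothetical request against a locally running max-pipelines serve instance
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [{"role": "user", "content": "Write a haiku about robots"}]
  }'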

warm-cache

Preloads and compiles the model to optimize initialization time by:

  • Pre-compiling models before deployment
  • Warming up the Hugging Face cache
  • Creating serialized model files

This command is useful to run before serving a model.

max-pipelines warm-cache [OPTIONS]

Examples

Basic cache warming:

max-pipelines warm-cache \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF

Save serialized model:

max-pipelines warm-cache \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF \
--save-to-serialized-model-path ./model.mef \
--quantization-encoding bfloat16
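
After warm-cache has written the serialized model, a follow-up serve run can load it directly. This pairing is a sketch based on the flags documented in the serve section; ./model.mef is just the example path used above:

max-pipelines serve \
--serialized-model-path ./model.mef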

Configuration options

Model configuration

Core settings for model loading and execution.

  • --engine: Backend engine. Default: max. Values: max|huggingface
  • --huggingface-repo-id TEXT: (deprecated) Hugging Face model repository ID. Values: any valid Hugging Face repo ID (e.g. mistralai/Mistral-7B-v0.1)
  • --model-path TEXT: (required) Path to the model. Values: any valid path or Hugging Face repo ID (e.g. mistralai/Mistral-7B-v0.1)
  • --quantization-encoding: Weight encoding type. Values: float32|bfloat16|q4_k|q4_0|q6_k|gptq
  • --weight-path PATH: Custom model weights path. Values: valid file path; supports multiple paths via repeated flags (see the sketch after this list)
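
As noted for --weight-path, the flag can be repeated to supply several weight files. A minimal sketch, assuming hypothetical local GGUF files (the file names are illustrative):

max-pipelines generate \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF \
--weight-path ./weights/model-part1.gguf \
--weight-path ./weights/model-part2.gguf \
--prompt "Hello"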

Device configuration

Controls hardware placement and memory usage.

  • --devices: Target devices. Values: cpu|gpu|gpu:{id} (e.g. gpu:0,1)
  • --device-specs: Specific device configuration. Default: CPU. Values: DeviceSpec format (e.g. DeviceSpec(id=-1, device_type='cpu'))
  • --device-memory-utilization: Fraction of device memory to use. Default: 0.9. Values: float between 0.0 and 1.0

Performance tuning

Optimization settings for batch processing, caching, and sequence handling.

  • --cache-strategy: Cache strategy. Values: naive|continuous
  • --kv-cache-page-size: Token count per KVCache page. Default: 128. Values: positive integer
  • --max-batch-size: Maximum cache size per batch. Default: 1. Values: positive integer
  • --max-ce-batch-size: Maximum context encoding batch size. Default: 32. Values: positive integer
  • --max-length: Maximum input sequence length. Default: the Hugging Face model's default max length. Values: positive integer (must be less than the model's max config)
  • --max-new-tokens: Maximum number of tokens to generate. Default: -1. Values: integer (-1 for model max)
  • --pad-to-multiple-of: Input tensor padding multiple. Default: 2. Values: positive integer
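
These options can be combined on a single serve invocation. The values below are illustrative rather than recommendations, and assume a model and hardware where the larger batch sizes fit in memory:

max-pipelines serve \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF \
--max-batch-size 8 \
--max-ce-batch-size 16 \
--kv-cache-page-size 128 \
--max-length 2048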

Model state control

Options for saving or loading model states and handling external code.

  • --force-download: Force re-download of cached files. Default: false. Values: true|false
  • --save-to-serialized-model-path: Path to save a serialized model to. Values: valid file path
  • --serialized-model-path: Path to load a serialized model from. Values: valid file path
  • --trust-remote-code: Allow custom Hugging Face code. Default: false. Values: true|false

Generation parameters

Controls for text generation behavior.

  • --enable-constrained-decoding: Enable constrained generation. Default: false. Values: true|false
  • --enable-echo: Enable model echo. Default: false. Values: true|false
  • --image_url: URLs of images to include with the prompt; ignored if the model doesn't support image inputs. Default: [] (empty). Values: list of valid URLs
  • --rope-type: RoPE type for GGUF weights. Values: none|normal|neox
  • --top-k: Limit sampling to the top K tokens. Default: 1. Values: positive integer (1 for greedy sampling)