
max-pipelines

The max-pipelines CLI tool accelerates GenAI tasks by creating optimized inference pipelines with OpenAI-compatible endpoints. It supports both PyTorch models from Hugging Face and MAX Graph-optimized versions of models such as Llama 3.1, Mistral, and Replit Code.

With a single command, you can generate text or start an OpenAI-compatible endpoint. Standard PyTorch models work out of the box, but the MAX Graph variants provide better performance.

Get started

  1. If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:

    curl -ssL https://magic.modular.com/ | bash

    Then run the source command that's printed in your terminal.

  2. Install the max-pipelines CLI tool:

    magic global install max-pipelines==25.1.1
  3. Run your first model:

    max-pipelines generate --huggingface-repo-id=modularai/Llama-3.1-8B-Instruct-GGUF \
    --max-length 14 \
    --prompt "What's blue and rhymes with shoe?"

Update

To make sure you always have the latest version of max-pipelines, run this command:

magic global update

Commands

max-pipelines provides the following commands.

You can also print the available commands and documentation with --help. For example:

max-pipelines --help
max-pipelines serve --help

encode

Converts input text into embeddings for semantic search, text similarity, and NLP applications.

max-pipelines encode [OPTIONS]

Examples

Basic embedding generation:

max-pipelines encode \
--huggingface-repo-id sentence-transformers/all-MiniLM-L6-v2 \
--prompt "Convert this text into embeddings"

Batch processing with GPU:

max-pipelines encode \
--huggingface-repo-id modularai/e5-large \
--devices gpu \
--max-ce-batch-size 16 \
--quantization-encoding bfloat16 \
--prompt "Process multiple texts efficiently"

generate

Performs text generation based on a provided prompt.

max-pipelines generate [OPTIONS]

Examples

Text generation:

max-pipelines generate \
--huggingface-repo-id modularai/Llama-3.1-8B-Instruct-GGUF \
--prompt "Write a story about a robot" \
--max-new-tokens 100

Text generation with controls:

max-pipelines generate \
--huggingface-repo-id unsloth/phi-4-GGUF \
--prompt "Explain quantum computing" \
--max-new-tokens 500 \
--top-k 40 \
--quantization-encoding q4_k \
--cache-strategy paged

Process an image with a vision-language model by passing an image URL:

Llama 3.2 Vision

Llama 3.2 Vision models take prompts with <|image|> and <|begin_of_text|> tokens. For more information, see the Llama 3.2 Vision documentation.

max-pipelines generate \
--huggingface-repo-id meta-llama/Llama-3.2-11B-Vision-Instruct \
--max-length 6491 \
--image_url https://en.wikipedia.org/wiki/Template:POTD/2025-01-01 \
--prompt="<|image|><|begin_of_text|>What is in this image?"

Pixtral

Pixtral models take prompts with [IMG] tokens. For more information, see the Pixtral documentation.

max-pipelines generate \
--huggingface-repo-id mistral-community/pixtral-12b \
--max-length 6491 \
--image_url https://en.wikipedia.org/wiki/Template:POTD/2025-01-01 \
--prompt="What is in this image? [IMG]"

list

Displays available model architectures and configurations, including:

  • Hugging Face model repositories
  • Supported encoding types
  • Available cache strategies
max-pipelines list

serve

Launches an OpenAI-compatible REST API server for production deployments.

max-pipelines serve [OPTIONS]

Examples

CPU serving:

max-pipelines serve \
--huggingface-repo-id modularai/Llama-3.1-8B-Instruct-GGUF

Optimized GPU serving:

max-pipelines serve \
--huggingface-repo-id modularai/Llama-3.1-8B-Instruct-GGUF \
--devices gpu \
--quantization-encoding bfloat16 \
--max-batch-size 4 \
--cache-strategy paged

Production setup with serialized model:

max-pipelines serve \
--serialized-model-path ./model.mef \
--devices gpu-0,gpu-1 \
--max-batch-size 8 \
--device-memory-utilization 0.9
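
Once the server is running, you can call it through its OpenAI-compatible REST API from any OpenAI client or with curl. The request below is a minimal sketch: it assumes the server's default address of http://localhost:8000 and uses the Hugging Face repo ID as the model name, so adjust both to match your deployment.

# Assumes the default host and port; change localhost:8000 if your server listens elsewhere.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'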

warm-cache

Preloads and compiles the model to optimize initialization time by:

  • Pre-compiling models before deployment
  • Warming up the Hugging Face cache
  • Creating serialized model files

This command is useful to run before serving a model.

max-pipelines warm-cache [OPTIONS]

Examples

Basic cache warming:

max-pipelines warm-cache \
--huggingface-repo-id modularai/Llama-3.1-8B-Instruct-GGUF

Save serialized model:

max-pipelines warm-cache \
--huggingface-repo-id modularai/Llama-3.1-8B-Instruct-GGUF \
--save-to-serialized-model-path ./model.mef \
--quantization-encoding bfloat16

Configuration options

Model configuration

Core settings for model loading and execution.

Option | Description | Default | Values
------ | ----------- | ------- | ------
--engine | Backend engine | max | max or huggingface
--huggingface-repo-id TEXT | (required) Hugging Face model repository ID | | Any valid Hugging Face repo ID (e.g. mistralai/Mistral-7B-v0.1)
--quantization-encoding | Weight encoding type | | float32, bfloat16, q4_k, q4_0, q6_k, or gptq
--weight-path PATH | Custom model weights path | | Valid file path (repeat the flag to pass multiple paths)
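
For example, the following command (an illustrative sketch, not a required configuration) selects the Hugging Face backend instead of the default MAX engine; the repository ID is the sample one from the table above.

# Hypothetical invocation: run a model through the Hugging Face backend.
max-pipelines generate \
--engine huggingface \
--huggingface-repo-id mistralai/Mistral-7B-v0.1 \
--prompt "Summarize the benefits of weight quantization."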

Device configuration

Controls hardware placement and memory usage.

Option | Description | Default | Values
------ | ----------- | ------- | ------
--devices | Target devices | | cpu, gpu, or gpu-{id} (e.g. gpu-0,gpu-1)
--device-specs | Specific device configuration | CPU | DeviceSpec format (e.g. DeviceSpec(id=-1, device_type='cpu'))
--device-memory-utilization | Device memory fraction | 0.9 | Float between 0.0 and 1.0
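
As an illustrative sketch, the following command assumes a host with two GPUs (gpu-0 and gpu-1) and caps each device at 80% of its memory; adjust the device IDs and fraction for your hardware.

# Hypothetical two-GPU setup; requires gpu-0 and gpu-1 to exist on the host.
max-pipelines serve \
--huggingface-repo-id modularai/Llama-3.1-8B-Instruct-GGUF \
--devices gpu-0,gpu-1 \
--device-memory-utilization 0.8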

Performance tuning

Optimization settings for batch processing, caching, and sequence handling.

Option | Description | Default | Values
------ | ----------- | ------- | ------
--cache-strategy | Cache strategy | | naive, continuous, or paged
--kv-cache-page-size | Token count per KV cache page | 128 | Positive integer
--max-batch-size | Maximum cache size per batch | 1 | Positive integer
--max-ce-batch-size | Maximum context encoding batch size | 32 | Positive integer
--max-length | Maximum input sequence length | | Positive integer (must be less than the model's maximum configured length)
--max-new-tokens | Maximum tokens to generate | -1 | Integer (-1 for the model maximum)
--pad-to-multiple-of | Input tensor padding multiple | 2 | Positive integer
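
For example, a throughput-oriented server might raise the batch sizes and bound the input length, as in the sketch below; the numbers are illustrative starting points rather than tuned recommendations.

# Illustrative values only; tune batch sizes and lengths for your hardware and workload.
max-pipelines serve \
--huggingface-repo-id modularai/Llama-3.1-8B-Instruct-GGUF \
--cache-strategy paged \
--kv-cache-page-size 128 \
--max-batch-size 8 \
--max-ce-batch-size 32 \
--max-length 2048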

Model state control

Options for saving or loading model states and handling external code.

Option | Description | Default | Values
------ | ----------- | ------- | ------
--force-download | Force re-download of cached files | false | true or false
--save-to-serialized-model-path | Path to save the serialized model | | Valid file path
--serialized-model-path | Path to load a serialized model | | Valid file path
--trust-remote-code | Allow custom Hugging Face code | false | true or false
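
For example, to refresh a model whose cached files may be stale, you can force a re-download during cache warming. This sketch assumes the flag form of the boolean option.

# Hypothetical refresh: re-downloads the model files even if they are already cached.
max-pipelines warm-cache \
--huggingface-repo-id modularai/Llama-3.1-8B-Instruct-GGUF \
--force-download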

Generation parameters

Controls for text generation behavior.

Option | Description | Default | Values
------ | ----------- | ------- | ------
--enable-constrained-decoding | Enable constrained generation | false | true or false
--enable-echo | Enable model echo | false | true or false
--image_url | URLs of images to include with the prompt (ignored if the model doesn't support image inputs) | [] | List of valid URLs
--rope-type | RoPE type for GGUF weights | | none, normal, or neox
--top-k | Limit sampling to the top K tokens | 1 | Positive integer (1 for greedy sampling)

Common use cases

Fine-tuned generation

max-pipelines generate \
--huggingface-repo-id modularai/Llama-3.1-8B-Instruct-GGUF \
--quantization-encoding bfloat16 \
--max-new-tokens 200 \
--top-k 50 \
--prompt "Write a technical blog post about machine learning"

Optimized production setup

First, warm the cache:

max-pipelines warm-cache \
--huggingface-repo-id modularai/Llama-3.1-8B-Instruct-GGUF \
--save-to-serialized-model-path ./model.mef

Then serve with optimized settings:

max-pipelines serve \
--serialized-model-path ./model.mef \
--devices gpu \
--max-batch-size 8 \
--device-memory-utilization 0.9 \
--cache-strategy paged