# v25.1 (2025-02-13)

## ✨ Highlights
- **Custom ops for GPUs**

  Our new custom op API allows you to extend MAX Engine with new graph operations written in Mojo that execute on either CPU or GPU, providing full composability and extensibility for your models. See the GPU programming section below.
- **Enhanced support for agentic workflows**

  MAX Serve now supports function calling, which allows you to instruct your model to interact with other systems, such as retrieving data and executing external tasks. Learn more about function calling and tool use.

  MAX Serve now supports structured output (also known as constrained decoding) for MAX models on GPU. This allows you to enforce a model's output format using an input schema that defines the output structure. Learn more about structured output.
- **Extended model architecture support**

  - MAX Serve now supports multimodal models that take both text and image inputs. For example, see how to deploy Llama 3.2 Vision.

  - MAX Serve now supports text embedding models. Learn how to deploy a text embedding model.
## New

- **`max-pipelines` CLI tool**

  Instead of cloning our GitHub repo to access our latest GenAI models, you can install the `max-pipelines` CLI tool and quickly run inference or deploy an endpoint.
## Documentation

New tutorials:

- Build custom ops for GPUs
- Deploy Llama 3.2 Vision
- Deploy a text embedding model
- Generate image descriptions with Llama 3.2 Vision

Other docs:

- `max-pipelines` CLI
## MAX Serve

- The `/v1/completions` REST endpoint now supports:

  - Pre-tokenized prompts.

  - Image inputs for multimodal models such as `Llama-3.2-11B-Vision-Instruct`. For an example, see how to generate image descriptions with Llama 3.2 Vision.

    Known issue: You might receive faulty results because some parts of the text prompt get ignored for certain input combinations. We've identified the problem and will have a fix in a subsequent nightly release.

  - Function calling and tool use, which allows you to instruct your model to interact with other systems, such as retrieving data and executing external tasks. Learn more about function calling and tool use. (A minimal request sketch follows this list.)

  - Structured output (also known as constrained decoding), which allows you to enforce the output format from a model using a JSON schema and the `response_format` field. To enable constrained decoding, pass `--enable-structured-output` when running the server. This feature currently works only for MAX models on GPU; support for PyTorch models and CPU is in progress. Learn more about structured output. (A sketch follows this list.)

- Added support for the `/v1/embeddings` API endpoint, allowing you to generate vector representations using embedding models. See how to deploy a text embedding model. (A sketch follows this list.)

- MAX Serve can now evict requests when the number of available pages in the PagedAttention KV cache runs low. Previously, the KV manager would throw an out-of-memory (OOM) error when a batch that couldn't fit in the cache was scheduled.
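As an illustration of function calling, here's a minimal sketch using the OpenAI Python client against a locally served model. The port, model name, tool definition, and the chat completions route are all assumptions for illustration, not values from this release; see the function calling docs for the supported request shapes.

```python
# Function-calling sketch against a local, OpenAI-compatible MAX Serve
# endpoint. The base URL, model name, and tool below are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# A hypothetical tool the model may choose to call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
# If the model decided to call the tool, the structured call arrives here.
print(response.choices[0].message.tool_calls)
```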
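Structured output follows the same pattern, with a JSON schema supplied in the `response_format` field. Again a hedged sketch: the schema and model name are invented for illustration, and the server must have been started with `--enable-structured-output`.

```python
# Structured-output sketch: constrain generation to a JSON schema.
# Assumes a server started with --enable-structured-output; the schema
# and model name are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Describe a user named Jane as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user_profile",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        },
    },
)
print(response.choices[0].message.content)  # JSON conforming to the schema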
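And for `/v1/embeddings`, a short sketch using the MPNet model shown under MAX models below; the local port is an assumption.

```python
# Embeddings sketch: request vector representations from /v1/embeddings.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

result = client.embeddings.create(
    model="sentence-transformers/all-mpnet-base-v2",
    input=["Encode this sentence."],
)
print(len(result.data[0].embedding))  # embedding dimensionality
```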
## MAX models

- Added the `max-pipelines` CLI tool, which simplifies running inference with GenAI models (specified with a Hugging Face repo ID) and deploying them to a local endpoint with MAX Serve.

  Previously, running or serving these models required cloning the modular/max GitHub repo and then running commands such as `magic run llama3`. These model-specific commands, like `llama3` and `replit`, have been removed; they're now standardized and subsumed by flags like `--model-path` in the `max-pipelines` tool. Arguments such as `--max-length` and `--weight-path` are also still supported by `max-pipelines`.

  To view a list of supported model architectures from Hugging Face, run `max-pipelines list`.

- Added support for PagedAttention, which improves memory efficiency by partitioning the KV cache into smaller blocks, reducing fragmentation and enabling larger inference batches. You can enable it with `--cache-strategy=paged` and `--kv-cache-page-size` set to a multiple of 128.

- Added support for prefix caching in all cases where PagedAttention is supported. This allows more efficient use of the KV cache and improves prefill performance for workloads with common prefixes. You can enable it by setting `--enable-prefix-caching`. For more information, see Prefix caching with PagedAttention.

- Batch size and max length are now inferred from available memory and from the Hugging Face model's default max length, respectively. If a configuration leads to an OOM error, we provide recommendations (to the best of our ability) to help fit the model into memory.

- Added support for heterogeneous KV caches for multimodal models, such as Llama Vision, which cache different KV states for self-attention and cross-attention layers.

- Added support for embedding models, starting with MPNet. For example:

  ```sh
  max-pipelines generate \
    --model-path=sentence-transformers/all-mpnet-base-v2 \
    --prompt="Encode this sentence."
  ```

  Also see how to deploy a text embedding model.

- Added support for image and text multimodal models:

  - `max-pipelines generate` now accepts image input with `--image_url`.

  - Added an experimental Pixtral pipeline you can run as follows:

    ```sh
    max-pipelines generate \
      --model-path=mistral-community/pixtral-12b \
      --prompt="What is in this image? [IMG]" \
      --image_url=http://picsum.photos/1024/1024
    ```

    The pipeline is automatically used for all models implementing the `LlavaForConditionalGeneration` architecture.

    The implementation currently has a limit of one image. We plan to support an arbitrary number of images of mixed sizes soon.

  - Added an experimental Llama Vision pipeline you can run as follows:

    ```sh
    max-pipelines generate \
      --model-path=meta-llama/Llama-3.2-11B-Vision-Instruct \
      --prompt="<|image|><|begin_of_text|>What is in this image?" \
      --image_url=http://picsum.photos/1024/1024
    ```

    The pipeline is automatically used for all models implementing the `MllamaForConditionalGeneration` architecture.

    Note: This model is gated and requires that you set the `HF_TOKEN` environment variable. See Llama-3.2-11B-Vision-Instruct.

  - See how to generate image descriptions with Llama 3.2 Vision.

- Added support for the `Qwen2ForCausalLM` model architecture (such as `Qwen/Qwen2.5-7B-Instruct`). For example:

  ```sh
  max-pipelines generate \
    --model-path=Qwen/Qwen2.5-7B-Instruct \
    --prompt="Write bubble sort in python" \
    --quantization-encoding bfloat16
  ```

- Added support for offline batched inference for text-based LLMs, allowing you to load a model and run inference with a batch of inputs directly from Python, instead of relying on an HTTP interface. For an example, see examples/offline-inference/basic.py, or the sketch following this list.

- The `--max-cache-batch-size` flag has been deprecated in favor of `--max-batch-size`. Using `--max-cache-batch-size` now emits a deprecation warning, and it will stop working in a future release.

- The `--use-gpu` flag has been deprecated in favor of `--devices=cpu`, `--devices=gpu`, or `--devices=gpu-0,gpu-1,...`. If the device isn't specified, the model runs on the first available GPU, or on CPU if no GPUs are available.
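Here's a rough sketch of what offline batched inference looks like, modeled on the shape of examples/offline-inference/basic.py. The module paths, class names, method signatures, and model repo below are assumptions and may not match the shipped example exactly; treat the repo file as the source of truth.

```python
# Offline batched inference sketch (no HTTP server involved).
# Assumptions: the imports, constructor arguments, and model repo below
# are illustrative and may differ from the shipped example.
from max.entrypoints.llm import LLM
from max.pipelines import PipelineConfig

def main():
    # Configure a text LLM by Hugging Face repo ID.
    config = PipelineConfig(model_path="meta-llama/Llama-3.1-8B-Instruct")
    llm = LLM(config)

    # Run a whole batch of prompts in one call.
    prompts = [
        "What is the capital of France?",
        "Write a haiku about GPUs.",
    ]
    responses = llm.generate(prompts, max_new_tokens=64)
    for prompt, response in zip(prompts, responses):
        print(f"{prompt}\n=> {response}\n")

if __name__ == "__main__":
    main()
```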
## MAX Engine

- Improved internal kernel compilation speed by 1.5-4x across different models.

  We've revamped our GPU compilation process so that all kernels in a program are compiled together into a single LLVM module, then split into separate kernels afterward. This ensures shared code between kernel entry points is compiled only once. For example, we observe a 3.7x speedup in GPU startup time for Llama3.1-8b.

- Improved initial model execution speed on NVIDIA GPUs.

  Instead of compiling to PTX and performing just-in-time compilation at runtime, we now generate CUBIN binaries directly. While this increases initial compilation time, it significantly improves execution speed.

- Further tuned kernels for performance on NVIDIA A100 GPUs.
## Graph APIs

- You can now write custom operations (ops) in Mojo and add them to a graph constructed in Python using `custom()` and `inplace_custom()`. For more detail, see the GPU programming section below.

- Cached compiled MAX graphs that use custom operations are now invalidated when the implementation of the custom operations changes.

- `Graph.add_weight()` now takes an explicit `device` argument. This enables explicitly passing GPU-resident weights to `session.load()` via the weights registry to initialize the model.

- `max.graph.Weight` now inherits from `TensorValue`, allowing you to call `weight.cast()` or `weight.T`. As a result, `TensorValue` no longer accepts `Weight` for its `value` argument.
## Pipeline APIs

- `TextTokenizer.new_context()` now supports tool definitions passed through its `request` argument (via `TokenGeneratorRequest.tools`).

  - It also now supports JSON schemas passed through its `request` argument (via `TokenGeneratorRequest.response_format`).

- Removed the default `num_steps` value for `TokenGenerator.next_token()`, ensuring users pass a value explicitly and reducing the potential for silent errors.

- `KVCacheStrategy` now defaults to `MODEL_DEFAULT`. Whereas the previous setting always used the "continuous" caching strategy, the KV caching strategy now defaults on an architecture-specific basis to ensure the most optimized caching strategy is used.

- The `Linear` layer now has a `create()` class method that automatically creates specializations of `Linear` for non-quantized, k-quant, or GPTQ layers.

- Added `nn.Conv1D` for audio models like Whisper.
## GPU programming

This release includes all-new APIs for programming GPUs. The way to write code for GPUs is to create custom operations with GPU functions that you can load into a MAX graph. This foundational API includes a few key components:

- Mojo APIs to write custom op functions:

  - The `@compiler.register` decorator is applied to a Mojo struct that implements a custom op in an `execute()` function (for either CPU or GPU) and a `shape()` function that defines the custom op's output tensor.

  - The `max.tensor` package adds essential Mojo APIs for writing custom ops, such as:

    - The `foreach()` function, which efficiently executes an element-wise computation in parallel on either a GPU or CPU.

    - The `ManagedTensorSlice` type, which defines the input and output tensors for the custom op.

- Python APIs to load custom ops into a model:

  - The `custom()` and `inplace_custom()` functions allow you to add the previously defined Mojo custom op to a MAX graph written in Python. (A Python sketch follows this list.)

  - The `InferenceSession` constructor accepts the custom op implementation as a Mojo package in the `custom_extensions` argument.

For more detail, see the tutorial to build custom ops for GPUs.
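To make the Python side concrete, here's a minimal sketch of adding a Mojo custom op to a graph and loading it with `custom_extensions`. The op name `add_one`, the package path, and the tensor shape are assumptions for illustration; the tutorial shows the real end-to-end flow, including the Mojo implementation.

```python
# Sketch: build a graph around a Mojo custom op and load it for inference.
# Assumptions: a compiled Mojo package (operations.mojopkg) implementing an
# op registered as "add_one"; names, paths, and shapes are illustrative.
from pathlib import Path

import numpy as np
from max.dtype import DType
from max.engine import InferenceSession
from max.graph import Graph, TensorType, ops

rows, columns = 5, 10
dtype = DType.float32

# Build a one-op graph whose body is the custom "add_one" op.
graph = Graph(
    "addition",
    forward=lambda x: ops.custom(
        name="add_one",
        values=[x],
        out_types=[TensorType(dtype, shape=[rows, columns])],
    )[0],
    input_types=[TensorType(dtype, shape=[rows, columns])],
)

# Load the graph along with the Mojo package that implements the op.
session = InferenceSession(custom_extensions=Path("operations.mojopkg"))
model = session.load(graph)

x = np.random.rand(rows, columns).astype(np.float32)
print(model.execute(x)[0])
```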
Additionally, we've added a new `gpu` package to the Mojo standard library that provides low-level programming constructs for working with GPUs. These APIs let you do things that you can't currently do with the high-level `foreach()` abstraction described above. The Mojo `gpu` APIs allow you to manually manage interaction between the CPU host and GPU device, manage memory between devices, synchronize threads, and more. For some examples, see vector_addition.mojo and top_k.mojo.
## Mojo

Mojo is a crucial component of the MAX stack that enables all of MAX's performance-oriented code across hardware. For all the updates to the Mojo language, standard library, and tools, see the Mojo changelog.