What's new

Here's everything you should know about what's changed in each release.

v25.5 nightly

This version is still a work in progress.

See how to install the nightly release.

MAX models

The implementation for the OpenAI API's logprobs and echo request parameters has been rewritten to eliminate an expensive device transfer. The --enable-echo flag, which previously incurred a significant performance penalty, is now 9-12x faster.
Idefics3 model is now supported in MAX

MAX framework

Inference server

Added /health endpoint for service readiness checks, used by tools like lm-eval to determine when the service is ready to accept requests.
Added support for file:// URIs in image inputs for multimodal models. Local file access is controlled via the MAX_SERVE_ALLOWED_IMAGE_ROOTS environment variable, which specifies a list of allowed root directories. Files are read asynchronously using aiofiles for better performance under high load.
Removed --ignore-eos cli argument. The full set of OpenAI chat and completion sampling parameters are now supported in the http requests. As such, the parameter can just be set via the http payload.
Improved tool calling support to more reliably extract JSON tool calling responses for Llama models.

`max` CLI

Pipelines: Added --vision-config-overrides CLI option to override vision model configuration parameters. For example, to decrease InternVL's maximum dynamic patches from 12 to 6:

max serve --model-path OpenGVLab/InternVL3-38B-Instruct \
  --vision-config-overrides '{"max_dynamic_patch": 6}'
max serve --model-path OpenGVLab/InternVL3-38B-Instruct \
  --vision-config-overrides '{"max_dynamic_patch": 6}'

Python API

Removed Model.execute_legacy(). Instead use the standard execute() or __call__() methods for model execution.
Removed TorchScript-related helper functions and APIs from the Engine API, including support for .pt TorchScript files in custom extensions.
Added ops.scatter_nd operation for scattering updates into a tensor at specified indices.
New max.interfaces module added. This module should serve as a relatively import free module to hold all shared interfaces across the MAX stack. Slowly we will be moving common interfaces to this module.
Added max.torch.graph_op interface to make it simple to embed larger MAX computations and models inside PyTorch. These can use max.nn modules internally and may be used within torch.nn modules, allowing the use of MAX subcomponents for access to our high performance graph compiler and Mojo kernel library.

import torch
import numpy as np
import max
from max.dtype import DType
from max.graph import ops

@max.torch.graph_op
def max_grayscale(pic: max.graph.TensorValue):
    scaled = pic.cast(DType.float32) * np.array([0.21, 0.71, 0.07])
    grayscaled = ops.sum(scaled, axis=-1).cast(pic.dtype)
    # max reductions don't remove the dimension, need to squeeze
    return ops.squeeze(grayscaled, axis=-1)

@torch.compile
def grayscale(pic: torch.Tensor):
    output = pic.new_empty(pic.shape[:-1])  # Remove color channel dimension
    max_grayscale(output, pic)  # Call as destination-passing style
    return output

img = (torch.rand(64, 64, 3, device=device) * 255).to(torch.uint8)
result = grayscale(img)
import torch
import numpy as np
import max
from max.dtype import DType
from max.graph import ops

@max.torch.graph_op
def max_grayscale(pic: max.graph.TensorValue):
    scaled = pic.cast(DType.float32) * np.array([0.21, 0.71, 0.07])
    grayscaled = ops.sum(scaled, axis=-1).cast(pic.dtype)
    # max reductions don't remove the dimension, need to squeeze
    return ops.squeeze(grayscaled, axis=-1)

@torch.compile
def grayscale(pic: torch.Tensor):
    output = pic.new_empty(pic.shape[:-1])  # Remove color channel dimension
    max_grayscale(output, pic)  # Call as destination-passing style
    return output

img = (torch.rand(64, 64, 3, device=device) * 255).to(torch.uint8)
result = grayscale(img)

TextGenerationStatus, TextResponse, TextGenerationResponse, InputContext, PipelineTask have been moved to max.interfaces.
TextGenerationStatus has been renamed to GenerationStatus
TextResponse and TextGenerationResponse have been replaced with TextGenerationOutput
EmbeddingsResponse has been renamed to EmbeddingsOutput.
Removed the ability to pass Python objects into models and have them returned as Mojo PythonObject types in the kernels.
Removed RandomWeights.

✨ Highlights

AMD GPUs are officially supported!

You can now deploy MAX with acceleration on AMD MI300X and MI325X GPUs, using the same code and container that works on NVIDIA GPUs. For the first time, you can build portable, high-performance GenAI deployments that run on any platform without vendor lock-in or platform-specific optimizations.

For more details, including benchmarks, see our Modular + AMD blog post.
Now accepting GPU kernel contributions

Last month, we open-sourced the code for the CPU and GPU kernels that power the MAX framework, and now we're accepting contributions! For information about how to contribute and the sort of kernels most interesting to us, see the MAX AI kernels contributing guide.
Preview: Mojo interoperability from Python

This release includes an early version of a new Python-to-Mojo interoperability API. You can now write just the performance-critical parts your code in Mojo and call it from Python just like you're importing another Python library. Check out our docs to call Mojo from Python.

Documentation

We've redesigned builds.modular.com and docs.modular.com with a unified top navigation bar that so you can more easily discover all the available docs and code resources.

New docs:

GPU Puzzles: Several new puzzles, including: 1D convolution op, softmax op, attention op, embedding op, kernel fusion, custom backward pass, GPU functional programming patterns, and warp fundamentals.
Using AI coding assistants guide: Learn how to use large language models (LLMs) and coding assistants (such as Cursor and Claude Code) to accelerate your development with Modular.
Build an MLP block as a graph module tutorial: Learn how to create reusable Module components in your MAX graphs.
Write custom ops for PyTorch tutorial (Beta feature): Learn to write high-performance GPU kernels for your PyTorch models with Mojo.
Profile MAX kernel performance: Learn how to set up Nsight Compute to profile your Mojo-based kernels on NVIDIA GPUs.

Major updates:

Build custom ops for GPUs tutorial: Now includes how to write hardware-specific functions for CPUs and GPUs.
Optimize a matrix multiply custom op tutorial: Migrated from a Recipe with revisions to help you improve the performance of your GPU custom ops.

MAX models

Added the OLMo 2 model architecture (olmo2).

Try OLMo 2 now.
Added Google's Gemma 3 multimodal model architecture (gemma3multimodal).

Try Gemma3 now.
Added the Qwen 3 model architecture (qwen3).

Try Qwen3 now.
Added the InternVL3 model architecture (internvl). This is still a work in progress.
GGUF quantized Llamas (q4_0, q4_k, and q6_k) are now supported with paged KVCache strategy.

MAX framework

Inference server

Inflight batching no longer requires chunked prefill.
Expanded token sampling logic, including top_k, min_p, min_new_tokens, temperature.
Extended sampling configuration to be per-request, e.g. different requests can ask for different sampling hyperparameters.
Removed support for TorchScript and torch MLIR models.

`max` CLI

Added the --use-subgraphs flag to max generate to allow for the use of subgraphs in the model.
Added the --port option to specify the port number with the max serve command.

Python API

Lots of new APIs in the max.nn package.
Added max.mojo.importer module to import Mojo code into Python. See the docs for calling Mojo from Python.
Added Graph.add_subgraph() to allow for the addition of a subgraph to a graph.
Added Module.build_subgraph() to allow for the creation of a subgraph for a layer that inherits from Module.
Added the call op which allows for the execution of a subgraph.
Added the fold op for combining sliding blocks into a larger tensor.
Added KernelLibrary as an argument type for the Graph constructor.
Added QuantizationConfig to specify quantization parameters for ops such as qmatmul().
Added the strict argument to the Module.load_state_dict() method. When strict=True (default), an error is raised if the state_dict contains unused keys. When strict=False, extra keys are ignored. This helps model developers identify missing implementations in their models.
Added audio generator APIs for text-to-speech models (such as AudioGenerator, PipelineAudioTokenizer, TTSContext, and others). This is still a work in progress.

The ops.masked_scatter() function now requires naming the out_dim explicitly as it is data-dependent. For example:

ops.masked_scatter(
    inputs_embeds, video_mask, video_embeds, out_dim="unmasked_inputs"
)
ops.masked_scatter(
    inputs_embeds, video_mask, video_embeds, out_dim="unmasked_inputs"
)

Deprecated the CONTINUOUS KVCache strategy (KVCacheStrategy). Please use PAGED KVCache strategy instead.
Removed the Settings argument from LLM constructor. The server is now automatically configured in the background without consuming an HTTP port.
Removed Graph.unique_symbolic_dim().
Removed max_to_torch_type() and torch_to_max_type() and replaced them with DType.to_torch() and DType.from_torch(), respectively. This aligns with the corresponding NumPy methods.
Removed stats_report property and reset_stats_report method from InferenceSession. This functionality was primarily used for internal PyTorch debugging and is no longer needed.
Removed the naive KVCache (nn.kv_cache.naive_cache).
Removed nn.attention and nn.naive_attention_with_rope.
Renamed ops.select to ops.where. This matches the name of the similar operation in torch and numpy.

Mojo API

LayoutTensor now has a size method to get the total number of elements.
Following our previous deprecation of the Mojo max.driver, max.graph and max.engine APIs, we've removed them from the package and API docs.

As a result, we've also removed Mojo max.tensor APIs (including Tensor, TensorShape, and TensorSpec). You can replace any use with LayoutTensor.

Custom ops

Improved error messages when custom op parameters are provided with values that don't have the proper type.
The ops.custom() function now requires a device argument to specify where the operation should execute. This avoids the need for custom ops to infer their execution device, which can be error-prone.

Added the max.torch module with the CustomOpLibrary class for using custom Mojo kernels from PyTorch. For example, with a custom grayscale operation written in Mojo:

@register("grayscale")
struct Grayscale:
    @staticmethod
    fn execute[
        # The kind of device this is running on: "cpu" or "gpu"
        target: StaticString,
    ](
        img_out: OutputTensor[dtype = DType.uint8, rank=2],
        img_in: InputTensor[dtype = DType.uint8, rank=3],
        ctx: DeviceContextPtr,
    ) raises:
        ...
@register("grayscale")
struct Grayscale:
    @staticmethod
    fn execute[
        # The kind of device this is running on: "cpu" or "gpu"
        target: StaticString,
    ](
        img_out: OutputTensor[dtype = DType.uint8, rank=2],
        img_in: InputTensor[dtype = DType.uint8, rank=3],
        ctx: DeviceContextPtr,
    ) raises:
        ...

You can load it with PyTorch like so:

from max.torch import CustomOpLibrary

op_library = CustomOpLibrary("path/to/custom.mojopkg")

@torch.compile(backend=backend)
def grayscale(pic):
    result = pic.new_empty(pic.shape[:-1])
    op_library.grayscale(result, pic)
    return result

img = (torch.rand(64, 64, 3) * 255).to(torch.uint8)
result = grayscale(img)
from max.torch import CustomOpLibrary

op_library = CustomOpLibrary("path/to/custom.mojopkg")

@torch.compile(backend=backend)
def grayscale(pic):
    result = pic.new_empty(pic.shape[:-1])
    op_library.grayscale(result, pic)
    return result

img = (torch.rand(64, 64, 3) * 255).to(torch.uint8)
result = grayscale(img)

See our tutorial to write custom ops for PyTorch, and out PyTorch custom operation examples, which range from a very basic "hello world" to the replacement of a layer in a full model.

GPU programming

Full support for AMD CDNA3 datacenter GPUs is now available! Specifically, MI300X and MI325X.
Added initial support for programming on AMD RDNA3 consumer GPUs. Basic tuning parameters have been specified for AMD Radeon 780m integrated GPUs. (AMD RDNA3 support is for GPU programming only; AI models are still missing some GPU kernels for this architecture.) For details, see the GPU requirements.
Now accepting CPU and GPU kernel contributions. See the MAX AI kernels contributing guide.

Mojo language

For all the updates to the Mojo language, standard library, and tools, see the Mojo changelog.

v25.3 (2025-05-06)

Highlights
Documentation
max CLI
MAX models
MAX Serve
MAX Engine & Graph
Kernels
GPU programming
Mojo language

✨ Highlights

You can now install Modular APIs and tools with pip:
```
pip install modular \
  --index-url https://download.pytorch.org/whl/cpu
```
```
pip install modular \
  --index-url https://download.pytorch.org/whl/cpu
```
This installs the max CLI, max Python library, mojo CLI, and Mojo libraries. However, the Mojo LSP and debugger are currently not included.

We use the --index-url argument to ensure that torch installs its CPU dependencies only, thus avoiding a lot of unnecessary GPU packages. This is a temporary workaround until we can remove our dependency on torch.
We open-sourced the MAX AI kernels and the rest of the Mojo standard library!

The MAX AI kernels library is a new Mojo API for writing high-performance and portable programs across CPU and GPU, but it's also the source code for our CPU/GPU kernels. You can now see the Mojo code we use in MAX to power GenAI workloads on CPUs and GPUs.

Just like the Mojo standard library, these kernels are open source under the Apache 2.0 License with LLVM exceptions. Plus, the rest of the Mojo standard library is also now open source in GitHub.
Learn to program GPUs with Mojo GPU Puzzles!

This is a brand new site that offers a hands-on guide to mastering GPU programming with Mojo. Starting from basic concepts, you'll learn step-by-step how to program for GPUs by solving increasingly challenging puzzles.

Documentation

We've restructured the documentation to unify MAX and Mojo documentation under the Modular Platform. We believe this improves content discovery with a simplified navigation and helps unify the platform story as a whole.

We've also added the following new docs:

REST API reference: Although it's not a new API (our serving library has supported OpenAI APIs for the last few versions), this now shows precisely which endpoints and body parameters we support.
Speculative decoding: An introduction to using speculative decoding to reduce latency for LLMs. This feature is still in development.
Offline inference: An introduction to our Python API for running inference with an LLM locally (without sending requests to a serving endpoint).
Introduction to layouts: A guide to working with dense multidimensional arrays on CPUs and GPUs, using new Mojo layout types that abstract-away complex memory layout patterns.

`max` CLI

Renamed the max-pipelines CLI tool to max. We recommend re-installing it as shown in the max CLI docs.
Remove previously deprecated --use-gpu, --serialized_model_path, --save_to_serialized_model_path, --max_cache_batch_size and --huggingface-repo-id options.
Move InputContext, TextContext, and TextAndVisionContext from max.pipelines to max.pipelines.context.

MAX models

Added Llama4ForConditionalGeneration support, featuring new MoE layers. Currently, it is limited to text inputs. Run the model by calling:

max generate --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --devices 0,1,2,3
max generate --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --devices 0,1,2,3

Added support for running text generations using the Mistral 3 24B model. Run the model with:

max generate --model-path mistralai/Mistral-Small-3.1-24B-Instruct-2503 --devices 0
max generate --model-path mistralai/Mistral-Small-3.1-24B-Instruct-2503 --devices 0

Fixed empty textual outputs for certain Mistral models (MAX issue 4193).
Added support for loading a custom pipeline architecture by module. Using --custom-architectures=folder/path/to/import:my_module will lead to loading architectures from the file. The architectures must be exposed via an ARCHITECTURES variable in the file. Once loaded, a model can be run using the new architectures. The flag can be specified multiple times to load more modules.

MAX Serve

Moved from radix trie to hash based prefix caching implementation which has smaller CPU overheads. This improves performance particularly in workloads with high cache reuse rates.
Added experimental support for offloading KVCache to host memory via the --enable-kvcache-swapping-to-host and --host-kvcache-swap-space-gb flags. This allows for superior KVCache reuse through prefix caching in workloads where the reusable KVCache amount exceeds GPU VRAM.
Fixed the usage.prompt_tokens field in the OpenAI API Usage Info response. Previously this field was always set to Null, but now it correctly contains the number of prompt tokens in the request.
Switched from Python Multiprocessing Queue to ZeroMQ. This reduces latencies between frontend server process and model worker process related to networking.
Stray model workers on Linux now terminate more reliably when the parent process is killed.

MAX Engine & Graph

Python API

We now raise an error if there's a mismatch between the expected device of a weight on a graph and the device of the actual tensor data specified in InferenceSession.load().
Removed output_device argument from Model.execute().
Removed the copy_inputs_to_device argument in Model.execute to improve predictability of the API. Now execute() raises a TypeError if arguments are passed whose devices don't match the model.
Swapped the order of the dtype and shape fields of driver.Tensor. Previously, the arguments are ordered as (shape, dtype). They are now swapped to (dtype, shape) to be in line with other tensor-like types.
Replaced some instances of Tensor.zeros with Tensor.__init__ when the engine did not depend on the tensor being zero initialized. This elides the unnecessary memset to provide a minor performance improvement.
Added a new experimental Tensor.inplace_copy_from(). This allows users to copy the contents of one Tensor into another.
Made the default behavior of Weight as expecting the initial allocation on host. A transfer is then inserted to the target device and this value is returned when weights generate an MLIR value. This is done due to current conservative ownership around external weights.
Added the irfft op, which computes the inverse real fast fourier transform (FFT).
Added the argmax op, which returns the index of the maximum value in an array or sequence.
Added the GroupNorm layer.
Switched layer names so that max.nn layers that are implemented with the deprecated Layer class are marked as "V1", and layers that are implemented with the new max.nn.Module are the default. That is, max.nn.LinearV2 is now max.nn.Linear, and the previous max.nn.Linear is now max.nn.LinearV1.
DeviceRefs in types/layers are in general expected to be explicit rather than implicit.

Mojo API

Removed some functionality from tensor.Tensor:
- Serializing Tensor to disk (Tensor.tofile(path) and Tensor.save(path)).
- Reading the serialized data back from disk (Tensor.load(path) and Tensor.fromfile(path).
- rand and randn methods have been removed. Use the ones in the Mojo standard library if you still need access for constructing a new Tensor with random elements based on a particular TensorShape.
Deprecated the Mojo Driver, Graph, and Engine APIs

These APIs are not currently used internally. Instead, we build graphs using the Python APIs, and our engineering efforts have been focused on making that experience as robust and user-friendly as possible. As a result, the Mojo versions of these APIs have not kept pace with new features and language improvements. These APIs will be open sourced for the community before being removed.

Custom ops API

You can now pass Mojo source package paths as Graph custom extensions. The Mojo code will be compiled automatically, no need to run mojo package manually as a prior step. Previously, only pre-compiled .mojopkg paths were accepted, requiring the Mojo code to be built as a prerequisite step before running a Graph with a custom op.

Given a project structure like:
```
project
|-- main.py
\-- kernels
    |-- __init__.mojo
    \-- my_custom_op.mojo
```
```
project
|-- main.py
\-- kernels
    |-- __init__.mojo
    \-- my_custom_op.mojo
```
You can construct a Graph in main.py using Mojo custom op kernels simply using:
```
g = Graph(
  ...,
  custom_extensions = [Path(__file__).parent / "kernels"]
)
```
```
g = Graph(
  ...,
  custom_extensions = [Path(__file__).parent / "kernels"]
)
```
A change to your Mojo source code defining a custom op will be reflected immediately the next time the Graph is constructed.
New image_pipeline example that demonstrates sequencing custom ops together which modify an image, leaving data on the GPU for each op, before writing it back to CPU and disk.

Kernels

More compute overlap is now enabled for Hopper GPUs. This allows finer-grained scheduling of kernel operations by analyzing producer-consumer patterns within a compute kernel. As a result, there is more kernel compute overlap, especially for compute-heavy kernels with data-dependent execution paths.

GPU programming

CUDA driver requirement reduced to version 12.4 and the NVIDIA driver to be version 550. Requiring these earlier driver versions allows MAX to be more easily deployed on AWS and GCP, since these are the default versions used by those cloud providers.
Added support for programming NVIDIA Jetson Orin GPUs (sm_87).

Also see the Mojo changelog of GPU changes.

Mojo language

We recently open-sourced the rest of the Mojo standard library, including the algorithm, benchmark, buffer, compile, complex, gpu, and layout packages. See it all in GitHub.
We've also open sourced all our MAX AI kernels. This new library includes kv_cache, layout, linalg, nn, nvml, and quantization.

For all the updates to the Mojo language, standard library, and tools, see the Mojo changelog.

v25.2 (2025-03-25)

Highlights
MAX Serve
MAX models
- max-pipelines CLI
MAX Engine
GPU programming
Mojo
Documentation

✨ Highlights

Support for NVIDIA Hopper GPUs

MAX has been optimized to run on Hopper GPUs. For more information on MAX and NVIDIA's hardware, see the MAX container documentation.
Multi-GPU support

MAX uses tensor parallelism to distribute work across multiple GPUs so you can run LLMs like Llama-3.3-70B-Instruct, even with long context window.
Expanded library of MAX models

We're rapidly growing our library of base model architectures that MAX can accelerate with MAX Serve (including Phi3ForCausalLM, OlmoForCausalLM, and GraniteForCausalLM). We also now support GTPQ for the Llama models. For more information, check out our MAX model repository.
Advanced E2E optimizations for long context window

In flight batching, chunked prefill, and copy-on-write optimize the execution for prefix heavy and long context window scenario.
GPU programming with Mojo

Lots of new APIs are now available to enable both low-level GPU programming and abstracted programming patterns that simplify the code required to write GPU kernels for your AI models.

MAX Serve

Extended MAX Serve batch scheduling to account for the prefix cache. The scheduler can now create larger batches when many prompt tokens are already cached, improving throughput up to 10% in some benchmarks.
Added support for in-flight batching, allowing token generation requests to be scheduled alongside context encoding requests to reduce inter-token latency. This behavior can be controlled by CLI argument --enable-in-flight-batch.
Added support for copy-on-write on KV blocks when using PagedAttention with Prefix Caching. This improves the prefix cache hit rate and prefill performance in some scenarios.
MAX Serve now supports transformers v.4.49.0, with a patch to avoid graph breaks when using torch.compile() on Llama models.
Added support for recording HTTP traffic out to a file for diagnostics or later replay.

MAX models

Added support for executing LlamaForCausalLM architecture models on multiple GPUs. The model uses tensor parallelism automatically when passing multiple device IDs to the --devices CLI argument. Try running meta-llama/Llama-3.3-70B-Instruct on 4 GPUs with the following example:

max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
  --quantization-encoding bfloat16 \
  --devices gpu:0,1,2,3 \
  --prompt="Design a
    self-sustaining colony on Neptune's moon Triton with a myth/science
    fusion name, three quantum tech breakthroughs, one ethical debate, a
    neon-lit cultural ritual, and a hidden flaw—presented in bullet points."
max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
  --quantization-encoding bfloat16 \
  --devices gpu:0,1,2,3 \
  --prompt="Design a
    self-sustaining colony on Neptune's moon Triton with a myth/science
    fusion name, three quantum tech breakthroughs, one ethical debate, a
    neon-lit cultural ritual, and a hidden flaw—presented in bullet points."

Added support for the Phi3ForCausalLM model architecture (such as microsoft/phi-4). For example:

max-pipelines generate \
  --model-path microsoft/phi-4 \
  --prompt "Write bubble sort in mojo"
max-pipelines generate \
  --model-path microsoft/phi-4 \
  --prompt "Write bubble sort in mojo"

Added support for the OlmoForCausalLM model architecture (such as allenai/OLMo-1B-0724-hf). For example:

max-pipelines generate \
  --model-path allenai/OLMo-1B-0724-hf \
  --prompt "Write bubble sort in mojo"
max-pipelines generate \
  --model-path allenai/OLMo-1B-0724-hf \
  --prompt "Write bubble sort in mojo"

Added support for the GraniteForCausalLM model architecture (such as ibm-granite/granite-3.1-8b-instruct). For example:

max-pipelines generate \
  --model-path ibm-granite/granite-3.1-8b-instruct \
  --prompt "Write bubble sort in mojo"
max-pipelines generate \
  --model-path ibm-granite/granite-3.1-8b-instruct \
  --prompt "Write bubble sort in mojo"

Added support for:

We now support GPTQ quantization for models that run on the GPU. This is handled transparently when the model weights are specified. For example, this runs Llama 3.1 8B using int4-quantized GPTQ weights:

max-pipelines generate \
  --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
  --prompt "Why is the sky blue?" \
  --max-batch-size 1 \
  --max-length 10000
max-pipelines generate \
  --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
  --prompt "Why is the sky blue?" \
  --max-batch-size 1 \
  --max-length 10000

This reduces the total memory consumption of this model from ~16 GB to ~5 GB, allowing the model to fit in the RAM smaller GPUs.

Model weights are now downloaded in parallel.
Added constraints on whitespace during Structured Output. This reduces tokens counts and improves model adherence.
Added jump ahead decoding during Structured Output. This auto-completes tokens when a singular path forward is identified, improving single completion times by up to ~20% for long prompts.
In the event of an unhandled exception, we now use the standard Python traceback format instead of using pretty-printed Rich tracebacks.
We now need to explicitly import LLM from max.entrypoints.llm rather than the previous max.entrypoints import.
The max.pipelines.dataprocessing.tokenizer and max.pipelines.dataprocessing.gguf_utils modules have been removed.
The previously deprecated PipelineConfig.architecture field and its corresponding --architecture CLI argument have been removed.

`max-pipelines` CLI

The --devices CLI argument now supports a comma-separated list of GPU IDs prefixed with gpu: like --devices=gpu:0,1,2,3. We no longer support the previous --devices=gpu-<N> format.

max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
  --quantization-encoding bfloat16 \
  --devices gpu:0,1,2,3 \
  --prompt="Design a self-sustaining colony on Neptune's moon Triton with a myth/science fusion name, three quantum tech breakthroughs, one ethical debate, a neon-lit cultural ritual, and a hidden flaw—presented in bullet points."
max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
  --quantization-encoding bfloat16 \
  --devices gpu:0,1,2,3 \
  --prompt="Design a self-sustaining colony on Neptune's moon Triton with a myth/science fusion name, three quantum tech breakthroughs, one ethical debate, a neon-lit cultural ritual, and a hidden flaw—presented in bullet points."

Removed --huggingface-repo-id PipelineConfig option and CLI argument in favor of --model-path.
We consolidated --model-path and -weight-path. Valid --weight-path values now override --model-path, which handles both local and remote (Hugging Face) cases. If we cannot derive the weights from the --weight-path, we now fall back to the --model-path, which you must set explicitly.
Added --huggingface-revision option, to allow selecting a non-default branch or a specific commit in a Hugging Face model repository.

MAX Engine

The MAX graph compiler now has kernel caching. This is a significant improvement to our compilation pipeline. Here are some of the highlights:
Up to 28% faster compilation times when making iterative changes to models
Improved caching between different but similar models (up to 27% faster)
Lays foundation for future caching optimizations

What does this mean for you? Faster development cycles! When you're working on model pipelines and making changes to the graph, the graph compiler will now intelligently reuse kernels that haven't changed, significantly reducing compilation times.

The improvements are particularly noticeable during iterative development, with compilation times dropping from ~80s to ~57s in some cases of compiling Llama3.1-8B for 4 GPUs. Even when compiling different models from the same family (like Llama/Granite variants), you'll see significant speedups on subsequent compilations.

Driver APIs

Added Accelerator.can_access(other: Device) -> bool method to check if one device can directly access memory of another device.
Fixed a bug in max.driver.tensor.load_max_tensor() for bfloat16 dtype, which would cause an error about mmap size being too large.
max.driver.Tensor.item() now works on any single-element tensor (previously restricted to rank-0 tensors).
Added Device.synchronize(), which ensures all operations on the device complete before returning.

Removed MojoCallContextPtr in favor of DeviceContextPtr. MojoCallContextPtr only contained a DeviceContextPtr, so this change directly exposes the DeviceContextPtr. Custom ops using MojoCallContextPtr now directly take a DeviceContextPtr argument:

    @staticmethod
    fn execute[
        type: DType, rank: Int
    ](
        output: OutputTensor[type=type, rank=rank],
        input: InputTensor[type=type, rank=rank],
        ctx: MojoCallContextPtr,
    ):
    @staticmethod
    fn execute[
        type: DType, rank: Int
    ](
        output: OutputTensor[type=type, rank=rank],
        input: InputTensor[type=type, rank=rank],
        ctx: MojoCallContextPtr,
    ):

becomes

    @staticmethod
    fn execute[
        type: DType, rank: Int
    ](
        output: OutputTensor[type=type, rank=rank],
        input: InputTensor[type=type, rank=rank],
        ctx: DeviceContextPtr,
    ):
    @staticmethod
    fn execute[
        type: DType, rank: Int
    ](
        output: OutputTensor[type=type, rank=rank],
        input: InputTensor[type=type, rank=rank],
        ctx: DeviceContextPtr,
    ):

You can now skip compiling a GPU kernel first before enqueueing it, and pass a function directly to ctx.enqueue_function[func](...):

fn func():
    print("Hello from GPU")

@register("custom_op")
struct CustomOp:

    @staticmethod
    fn execute(ctx: DeviceContextPtr) raises:
        var dev_ctx = ctx.get_device_context()
        dev_ctx.enqueue_function[func](grid_dim=1, block_dim=1)
fn func():
    print("Hello from GPU")

@register("custom_op")
struct CustomOp:

    @staticmethod
    fn execute(ctx: DeviceContextPtr) raises:
        var dev_ctx = ctx.get_device_context()
        dev_ctx.enqueue_function[func](grid_dim=1, block_dim=1)

However, if you're reusing the same function and parameters multiple times, this incurs some overhead of around 50-500 nanoseconds per enqueue. So you can still compile the function first and pass it to ctx.enqueue_function in this scenario:

var compiled_func = ctx.compile_function[func]()
# Multiple kernel launches with the same function/parameters
ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
var compiled_func = ctx.compile_function[func]()
# Multiple kernel launches with the same function/parameters
ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)

Changed Accelerator and CPU from factory methods that created Device objects in Python (which were accelerators and CPUs in the C++ implementation) to actual Python types. This change elevates the Accelerator and CPU type concepts to Python, making them types rather than methods.

This allows type annotations in Python. For example, a list of accelerators used to be defined like this:
```
graph_devices: list[DeviceRef]
```
```
graph_devices: list[DeviceRef]
```
Now it can be defined like this:
```
graph_devices: list[Accelerator]
```
```
graph_devices: list[Accelerator]
```
Elementwise operations (e.g. __add__) have been removed from Tensor (that is, tensor_internal.Tensor). This Tensor type is being phased out; please reduce usage in favor of LayoutTensor.

Graph APIs

The nn package is now max.nn.
Added ops.chunk) to support chunking tensors along an axis.
Added support for while loops with ops.while_loop.
Added support for conditional execution with ops.cond.
Added axis reduction overloads for ops.min and ops.max. For example; ops.min(tensor, axis=-1).
The gelu() function now accepts an approximate keyword. The keyword controls the gelu approximation with none, tanh, and fast approximations accepted.
Removed the roundeven() operation from the Python API. The round() operation now has the same behavior as roundeven(), so there is no need for both to exist.
Added helpers to create analogous tensors from buffer types and vice versa.
Added max.nn.Module, a base class for writing layers and constructing networks of layers (e.g. using max.nn.Sequential). Currently, this class supports graph building by ensuring that all weight names are unique and systematically generated. This class also supports managing the weight values with the module.state_dict() and module.load_state_dict() methods. More functionality and documentation will be added in future releases.

Custom ops

Changes have been made to the way that custom ops are registered: rather than using the num_dps_outputs attribute on @compiler.register to specify the number of outputs, that number is now inferred from the signature of the custom operation. Inputs to the operation now use the InputTensor type and outputs from the operation use OutputTensor, instead of the previous ManagedTensorSlice for both. This eliminates the need for a manual num_dps_outputs attribute, and makes it safer to work with these inputs and outputs by preventing accidental writes to input tensors. The new interface looks something like the following:

@compiler.register("add_one_custom")
struct AddOneCustom:
    @staticmethod
    fn execute[
        target: StringLiteral,
    ](
        out: OutputTensor,
        x: InputTensor[type = out.type, rank = out.rank],
        ctx: DeviceContextPtr,
    ) raises:
        @parameter
        @always_inline
        fn elementwise_add_one[
            width: Int
        ](idx: IndexList[x.rank]) -> SIMD[x.type, width]:
            return x.load[width](idx) + 1

        foreach[elementwise_add_one, target=target](out, ctx)
@compiler.register("add_one_custom")
struct AddOneCustom:
    @staticmethod
    fn execute[
        target: StringLiteral,
    ](
        out: OutputTensor,
        x: InputTensor[type = out.type, rank = out.rank],
        ctx: DeviceContextPtr,
    ) raises:
        @parameter
        @always_inline
        fn elementwise_add_one[
            width: Int
        ](idx: IndexList[x.rank]) -> SIMD[x.type, width]:
            return x.load[width](idx) + 1

        foreach[elementwise_add_one, target=target](out, ctx)

The foreach function now raises to be able to handle errors within an elementwise calculation.

Hopper kernels

State-of-the-Art Kernels in Mojo for H100/H200 GPUs

Hopper Architecture Matrix Multiplication Kernels: The implementation achieved performance comparable to NVIDIA's highly optimized cuBLAS library. These kernels take full advantage of the Tensor Cores in Hopper architecture GPUs to accelerate the fundamental matrix multiplication operations that underpin deep learning workloads.
Multi-GPU AllReduce Implementation: The AllReduce operation is critical for distributed inference across multiple GPUs, as it efficiently aggregates gradients. The Mojo implementation surpassed NVIDIA's NCCL library in performance benchmarks. This improvement reduces communication overhead during distributed inference.
MAX Attention Kernel with Flash Attention 3: This implementation incorporates the latest Flash Attention 3 algorithm and extends it, which significantly accelerates the computation of attention mechanisms in transformer models. The MAX attention kernel optimizes memory access patterns and computational steps, reducing both the memory footprint and execution time of attention operations. This is particularly important for LLMs where attention calculations represent a substantial portion of the computational workload.

GPU programming

Added the Mojo max.driver API to enable dispatching GPU functions from Mojo.

Check out examples for GPU programming in Mojo, which use this new API.

Mojo

Mojo is a crucial component of the MAX stack that enables all of MAX's performance-oriented code across hardware. For all the updates to the Mojo language, standard library, and tools, see the Mojo changelog.

Documentation

New examples for writing custom ops:

fused_attention demonstrates complex GPU programming using MAX abstractions for a practical use in AI model development.
matrix_multiplication includes a series of progressive optimizations for matrix multiplications on GPUs.
histogram shows how to implement the histogram pattern as a custom op.
New examples for GPU programming in Mojo using the new MAX Driver API

These use a Mojo programming model that should look familiar to CUDA C programmers, showing how to define and dispatch GPU functions within a single Mojo file. These examples recreate the first three samples from the popular textbook "Programming Massively Parallel Processors", showing how basic concepts translate from CUDA into Mojo. Additionally, a Mandelbrot set calculation example that parallels a similar one in the existing custom ops examples.
New MAX containers available. For more information on the base and full MAX containers, see Container contents.

v25.1.1 (2025-02-19)

Fix performance issues in autoregressive models with paged attention by setting sensible default values for --max-num-steps that are platform-specific.

v25.1 (2025-02-13)

Highlights
Documentation
MAX Serve
MAX models
MAX Engine
Mojo

✨ Highlights

Custom ops for GPUs

Our new custom op API allows you to extend MAX Engine with new graph operations written in Mojo that execute on either CPU or GPU, providing full composability and extensibility for your models. See more in the section about GPU programming.
Enhanced support for agentic workflows

MAX Serve now supports function calling, which allows you to instruct your model to interact with other systems, such as retrieve data and execute external tasks. Learn more about function calling and tool use.

MAX Serve now supports structured output (also known as constrained decoding) for MAX models on GPU. This allows you to enforce the output format from a model using an input schema that defines the output structure. Learn more about structured output.
Extended model architecture support
- MAX Serve now supports multimodal models that take both text and image inputs. For example, see how to deploy Llama 3.2 Vision.
- MAX Serve now supports text embedding models. Learn how to deploy a text embedding model.
New max-pipelines CLI tool

Instead of cloning our GitHub repo to access our latest GenAI models, you can instead install the max-pipelines CLI tool and quickly run an inference or deploy an endpoint. Learn more in the max-pipelines docs.

Documentation

New tutorials:

Other docs:

MAX Serve

The /v1/completions REST endpoint now supports:
- Pre-tokenized prompts.
- Image inputs for multimodal models such as Llama-3.2-11B-Vision-Instruct. For an example, see how to generate image descriptions with Llama 3.2 Vision.
  
  Known issue: You might receive faulty results because some parts of the text prompt get ignored for certain input combinations. We've identified the problem and will have a fix in a subsequent nightly release.
- Function calling and tool use, which allows you to instruct your model to interact with other systems, such as retrieve data and execute external tasks. Learn more about function calling and tool use.
- Structured output (also known as constrained decoding), which allows you to enforce the output format from a model using a JSON schema and the response_format field. To enable constrained decoding pass --enable-structured-output when running the server. However, this feature currently works for MAX models on GPU only (support for PyTorch models and CPU is in progress). Learn more about structured output.
Added support for the /v1/embeddings API endpoint, allowing you to generate vector representations using embedding models. See how to deploy a text embedding model.
Max Serve can evict requests when the number of available pages in the PagedAttention KVCache is limited. Before, the KV manager would throw an OOM error when a batch that cannot fit in the cache was scheduled.

MAX models

Added the max-pipelines CLI tool that simplifies the process to run inference with GenAI models (specified with a Hugging Face repo ID) and deploy them to a local endpoint with MAX Serve.

Previously, running or serving these models required cloning the modular/max GitHub repo and then running commands such as magic run llama3.

These model-specific commands like llama3 and replit commands have been removed. They're now standardized and subsumed by flags like --model-path in the max-pipelines tool. Arguments such as --max-length and --weight-path are also still supported by max-pipelines.

To view a list of supported model architectures from Hugging Face, run max-pipelines list.
Added support for PagedAttention, which improves memory efficiency by partitioning the KV cache into smaller blocks, reducing fragmentation and enabling larger inference batches. You can enable it with --cache-strategy=paged and --kv-cache-page-size with a value that's a multiple of 128.
Added support for prefix caching in all cases where PagedAttention is supported. This allows for more efficient usage of KVCache and improved prefill performance for workloads with common prefixes. You can enable it by setting --enable-prefix-caching. For more information, see Prefix caching with PagedAttention.
Batch size and max length are now inferred from available memory and the HF Models' default values for max length, respectively. If a configuration leads to an OOM, then we provide recommendations (to the best of our ability) to the user to fit the model into memory.
Added support for heterogeneous KV caches for multi-modal models, such as Llama Vision, which cache different KV states for self and cross attention layers.

Added support for embedding models, starting with MPNet. For example:

max-pipelines generate \
  --model-path=sentence-transformers/all-mpnet-base-v2 \
  --prompt="Encode this sentence."
max-pipelines generate \
  --model-path=sentence-transformers/all-mpnet-base-v2 \
  --prompt="Encode this sentence."

Also see how to deploy a text embedding model.

Added support for image and text multimodal models:
- max-pipelines generate now accepts image input with --image_url.
- Added an experimental Pixtral pipeline you can run as follows:
  max-pipelines generate \ --model-path=mistral-community/pixtral-12b \ --prompt="What is in this image? [IMG]" \ --image_url=/images/artwork/max-serve-cloud.png
  max-pipelines generate \ --model-path=mistral-community/pixtral-12b \ --prompt="What is in this image? [IMG]" \ --image_url=/images/artwork/max-serve-cloud.png
  The pipeline is automatically used for all models implementing the LlavaForConditionalGeneration architecture.
  
  The implementation currently has a limit of one image. We plan support an arbitrary number of images of mixed sizes soon.
- Added an experimental Llama Vision pipeline you can run as follows:
  max-pipelines generate \ --model-path=meta-llama/Llama-3.2-11B-Vision-Instruct \ --prompt="<|image|><|begin_of_text|>What is in this image?" \ --image_url=/images/artwork/max-serve-cloud.png
  max-pipelines generate \ --model-path=meta-llama/Llama-3.2-11B-Vision-Instruct \ --prompt="<|image|><|begin_of_text|>What is in this image?" \ --image_url=/images/artwork/max-serve-cloud.png
  The pipeline is automatically used for all models implementing the MllamaForConditionalGeneration architecture.
  
  Note: This model is gated and requires that you set the HF_TOKEN environment variable. See Llama-3.2-11B-Vision-Instruct.
- See how to generate image descriptions with Llama 3.2 Vision.

Added support for the Qwen2ForCausalLM model architecture (such as Qwen/Qwen2.5-7B-Instruct). For example:

max-pipelines generate \
  --model-path=Qwen/Qwen2.5-7B-Instruct \
  --prompt="Write bubble sort in python" \
  --quantization-encoding bfloat16
max-pipelines generate \
  --model-path=Qwen/Qwen2.5-7B-Instruct \
  --prompt="Write bubble sort in python" \
  --quantization-encoding bfloat16

Added support for offline batched inference for text-based LLMs, allowing you to load a model and run inference with a batch of inputs directly from Python, instead of relying on an HTTP interface. For an example, see examples/offline-inference/basic.py.
The --max-cache-batch-size flag has been deprecated in favor of --max-batch-size. Using --max-cache-batch-size now emits a deprecation warning and will stop working in a future release.
The --use-gpu flag has been deprecated in favor of --devices=cpu, --devices=gpu, or --devices=gpu-0,gpu-1,.... If the device isn't specified, the model runs on the first available GPU, or CPU if no GPUs are available.

MAX Engine

Improved internal kernel compilation speed 1.5 - 4X across different models.

We've revamped our GPU compilation process so that all kernels in a program are compiled together into a single LLVM module, then split into separate kernels afterward. This ensures shared code between kernel entry points is only compiled once. For example, we observe a 3.7x speed up for Llama3.1-8b GPU startup time.
Improved initial model execution speed on NVIDIA GPUs.

Instead of compiling to PTX and performing just-in-time compilation during runtime, we now generate CUBIN binaries directly. While this increases initial compilation time, it significantly improves execution speed.
The kernels have been further tuned for performance on NVIDIA A100 GPUs.

Graph APIs

You can now write custom operations (ops) in Mojo, and add them to a graph constructed in Python, using custom() and inplace_custom().

For more detail, see the section below about GPU programming.
Cached compiled MAX graphs that make use of custom operations now get invalidated when the implementation of the custom operations change.
Graph.add_weight() now takes an explicit device argument. This enables explicitly passing GPU-resident weights to session.load() via the weights registry to initialize the model.
max.graph.Weight now inherits from TensorValue, allowing you to call weight.cast() or weight.T. As such, the TensorValue no longer accepts Weight for the value argument.

Pipeline APIs

TextTokenizer.new_context() now supports tool definitions passed through its request argument (via TokenGeneratorRequest.tools).

It also now supports JSON schemas passed through its request argument (via TokenGeneratorRequest.response_format).
Removed the default num_steps value for TokenGenerator.next_token(), ensuring users pass a value, reducing the potential for silent errors.
KVCacheStrategy now defaults to MODEL_DEFAULT.

As opposed to the previous setting which always used the "continuous" caching strategy, KV caching strategy is now defaulted on an architecture-specific basis to ensure the most optimized caching strategy is used.
The Linear layer now has a create() class method that automatically creates specializations of Linear for non-quantized, k-quant, or GPTQ layers.
Added nn.Conv1D for audio models like Whisper.

GPU programming

This release includes all new APIs to program on GPUs. The way to write code for GPUs is to create custom operations with GPU functions that you can load into a MAX graph. This foundational API includes a few key components:

Mojo APIs to write custom op functions:
- The @compiler.register decorator is applied to a Mojo struct that implements a custom op in an execute() function—for either CPU or GPU—and a shape() function that defines the custom op's output tensor.
- The max.tensor package adds essential Mojo APIs for writing custom ops, such as:
  - The foreach() function, which efficiently executes an element-wise computation in parallel on either a GPU or CPU.
  - The ManagedTensorSlice type defines the input and output tensors for the custom op.
Python APIs to load custom ops into a model:
- The custom() and inplace_custom() functions allow you to add the previously-defined Mojo custom op to a MAX graph written in Python.
- The InferenceSession constructor accepts the custom op implementation as a Mojo package in the custom_extensions argument.

For more detail, see the tutorial to build custom ops for GPUs, or check out this simple example of a custom op.

Additionally, we've added a new gpu package to the Mojo standard library that provides low-level programming constructs for working with GPUs. These APIs let you do things that you can't currently do with the high-level foreach() abstraction above. The Mojo gpu APIs allow you to manually manage interaction between the CPU host and GPU device, manage memory between devices, synchronize threads, and more. For some examples, see vector_addition.mojo and top_k.mojo.

Mojo

v24.6 (2024-12-17)

This is a huge update that offers a first look at our serving library for MAX on GPUs!

Highlights
Documentation
MAX Serve
MAX models
MAX Engine
Mojo

Also check out our blog post introducing MAX 24.6.

✨ Highlights

MAX Engine on GPUs preview

We're excited to share a preview of MAX Engine on GPUs. We've created a few tutorials that demonstrate MAX's ability to run GenAI models with our next-generation MAX graph compiler on NVIDIA GPU architectures (including A100, A10, L4, and L40 GPUs). You can experience it today by deploying Llama 3 on an A100 GPU.
MAX Serve preview

This release also includes an all-new serving interface called MAX Serve. It's a Python-based serving layer that supports both native MAX models when you want a high-performance deployment, and off-the-shelf PyTorch LLMs from Hugging Face when you want to explore and experiment—all with GPU support. It provides an OpenAI-compatible REST endpoint for inference requests, and a Prometheus-compatible metrics endpoint. You can use a magic command to start a local server , or use our ready-to-deploy MAX container to start an endpoint in the cloud. Try it now with an LLM from Hugging Face.
Upgraded MAX models

As we continue to build our Python-based MAX Graph API that allows you to build high-performance GenAI models, we've made a ton of performance improvements to the existing models and added a few new models to our GitHub repo. All the Python-based MAX models now support GPUs and broad model architectures. For example, llama3 adds compatibility for the LlamaForCausalLM family, which includes over 20,000 model variants and weights on Hugging Face.

Documentation

New tutorials:

Other new docs:

Also, our documentation is now available for MAX nightly builds! If you're building with a nightly release, you can switch to see the nightly docs using a toggle to the right of the search bar.

MAX Serve

This release includes a preview of our Python-based serving library called MAX Serve. It simplifies the process to deploy your own inference server with consistent and reliable performance.

MAX Serve currently includes the following features:

Deploys locally and to the cloud with our MAX container image, or with the magic CLI.
An OpenAI-compatible server with streaming /chat/completion and /completion endpoints for LLM inference requests.
Prometheus-compatible metrics endpoint with LLM KPIs (TTFT and ITL) for monitoring and evaluating performance.
Supports most TextGeneration Hugging Face Hub models.
Multiprocess HTTP/model worker architecture to maximize CPU core utilization by distributing multiple incoming requests across multiple processes, ensuring both high throughput and responsiveness.
Continuous heterogeneous batching to combine multiple incoming requests into a single inference (no waiting to fill a batch size) and improve total throughput.

There's much more still in the works for MAX Serve, but you can try it today with our tutorials to Deploy Llama 3 on GPU with MAX Serve.

Known issues:

While this release is enough to support typical chatbot applications, this release does not yet support the function-calling portion of the OpenAI API specification needed to enable robust agentic workflows.
Sampling is still limited and doesn't currently respect temperature or other sampling-related API request input.
Structured generation is not supported.
Support for multi-modal models is still nascent.

MAX models

All of our Python-based GenAI models on GitHub now support GPUs!

As we add more models, we're also building a robust set of libraries and infrastructure that make it easier to build and deploy a growing library of LLMs. Some of which is available in a new max.pipelines package and some of it is alongside the models on GitHub. Here are just some of the highlights:

Deep integration with the Hugging Face ecosystem for a quick-to-deploy experience, such as using HF Model Hub tools to fetch config files, support for weights in safetensor format, support for HF tokenizers, and more. (We also support GGUF weight formats.)
Expanded set of model abstractions for use by different LLM architectures:
- Attention layers (including highly optimized implementations with configurable masking, like AttentionWithRope). The optimized attention layers include variants that accept an attention mask. More memory-efficient variants that don't take a mask instead take a "mask functor" argument to the kernel, which implements masking without materializing a mask by computing a mask value from input coordinates on the fly.
- Transformers such as Transformer and TransformerBlock. These include an initial implementation of ragged tensors—tensors for which each dimension can have a different size, avoiding the use of padding tokens by flattening a batch of sequences of differing lengths.
- Common layers such as RMSNorm , Embedding, and Sequential.
- KV cache management helpers, like ContinuousBatchingKVCacheManager.
- Low-level wrappers over optimized kernels like fused_qk_ragged_rope. These are custom fused kernels that update the KV cache in place. Although they are custom, they reuse the underlying kernel implementation by passing in lambda functions used to retrieve inputs and write to outputs in place.
Added generalized interfaces for text generation such as TokenGenerator and PipelineModel, which provide modularity within the models and serving infrastructure. Also added a plug-in mechanism (PipelineRegistry) to more quickly define new models, tokenizers, and other reusable components. For example, anything that conforms to TokenGenerator can be served using the LLM infrastructure within MAX Serve. We then used this interface to create the following:
- An optimized TextGenerationPipeline that can be combined with any compatible graph and has powerful performance features like graph-based multi-step scheduling, sampling, KV cache management, ragged tensor support, and more.
- A generic HFTextGenerationPipeline that can run any Hugging Face model for which we don't yet have an optimized implementation in eager mode.
Models now accept weights via a weights registry, which is passed to the session.load() method's weights_registry argument. The decoupling of weights and model architecture allows implementing all of the different fine-tunes for a given model with the same graph. Furthermore, because the underlying design is decoupled, we can later expose the ability to compile a model once and swap weights out on the fly, without re-compiling the model.
Added generic implementations of common kernels, which allow you to plug-in different batching strategies (ragged or padded), KV cache management approaches (continuous batching), masking (causal, sliding window, etc.), and position encoding (RoPE or ALIBI) without having to re-write any kernel code. (More about this in a future release.)
Multi-step scheduling to run multiple token-generation steps on GPU before synchronizing to the CPU.

Updated models:

Significant performance upgrades for Llama 3, and expanded compatibility with the LlamaForCausalLM models family. For example, it also supports Llama 3.2 1B and 3B text models.

New models:

Mistral NeMo (and other MistralForCausalLM models)
Replit Code V1.5 3B

Known issues:

The Q4 quantized models currently work on CPU only.
Using a large setting for top-k with the Llama 3.1 model may lead to segmentation faults for certain workloads when run on NVIDIA GPUs. This should be resolved in the latest nightly MAX builds.
The models currently use a smaller default context window than the max_seq_len specified in the Hugging Face configuration files for a given model. This can be manually adjusted by setting the --max-length parameter to the desired context length when serving a model.
Some variants of the supported core models (like LlamaForCausalLM with different number of heads, head sizes, etc.) might not be fully optimized yet. We plan to fully generalize our implementations in a future release.

MAX Engine

MAX Engine includes a lot of the core infrastructure that enables MAX to accelerate AI models on any hardware, such as the graph compiler, runtime, kernels, and the APIs to interact with it all, and it all works without external dependencies such as PyTorch or CUDA.

This release includes a bunch of performance upgrades to our graph compiler and runtime. We've added support for NVIDIA GPU architectures (including A100, A10, L4, and L40 GPUs), and built out new infrastructure so we can quickly add support for other GPU hardware.

Engine API changes:

InferenceSession now accepts a custom_extensions constructor argument, same as load(), to specify model extension libraries.
The Model object is now callable to run an inference.

Breaking changes:

Model.execute() signature changed to support GPUs.
- The execute() function currently doesn't accept keyword arguments. Instead you can pass tensors as a driver.Tensor, int, float, bool, np.generic, or DLPackArray (DLPack). Note that both PyTorch and NumPy arrays implement the DLPack protocol, which means you can also pass either of those types to execute().
- execute_legacy() preserves the semantics of execute() with support for keyword arguments to help with migration, but will be removed in a future release. execute_legacy() doesn't support GPUs.
- Calling execute() with positional arguments still works the same.

Driver APIs

MAX Driver (the max.driver module) is a new component of MAX Engine that's still a work in progress. It provides primitives for working with heterogeneous hardware systems (GPUs and CPUs), such as to allocate on-device memory, transfer data between host and device, query device stats, and more. It's a foundation on which other components of MAX Engine operate (for example, InferenceEngine now uses driver.Tensor to handle model inputs and outputs).

Driver API changes:

Added CUDA() device to open an NVIDIA GPU.
Added support for fp16 and bfloat16 dtypes.
Expanded functionality for max.driver.Device, with new class methods and properties. We are still working on building this out to support more accelerator features.
driver.Tensor (and the InferenceSession.load() argument weights_registry ) now supports zero-copy interoperability with NumPy arrays and PyTorch tensors, using DLPack / DLPackArray.
driver.Tensor has new methods, such as from_dlpack(), element_size() , to(), to_numpy(), view(), zeros(), and more.

MAX Driver APIs are still changing rapidly and not yet ready for general use. We'll publish more documentation in a future release.

Known issues:

MAX Driver is currently limited to managing just one NVIDIA GPU at a time (it does not yet support multi-GPU). It also does not yet support remote devices.
DLPack support is not complete. For example, streams are not yet supported.

Graph compiler

When you load a model into MAX Engine, the graph compiler is the component that inspects and optimizes all graph operations (ops) to deliver the best run time performance on each device.

This release includes various graph compiler improvements:

Major extensions to support NVIDIA GPUs (and other devices in the future), including async copies and caching of JIT'd kernels.
The runtime now performs scheduling to enable GPU compute overlap with the CPU.
New transformations to the Mojo kernels to enable a number of optimizations, including specialization on tensor dimensions, specialization on target hardware, specialization on non-tensor dimension input to kernels, automatic kernel fusion between operators, and more.
New algebraic simplifications and algorithms for ops such as horizontal fusion of matrix multiplications.
New CPU-side primitives for device management that are automatically transformed and optimized to reduce overhead (MAX does not need to use things like CUDA Graphs).
Updated memory planning to preallocate device memory (hoist computation from inference runtime to initialization time) and reduce per-inference overhead.

Graph APIs

The graph compiler is also exposed through the MAX Graph APIs (the max.graph package), which allow you to build high-performance GenAI models in Python.

Graph API changes:

Python stack traces from model execution failures now include a trace to the original op-creation, allowing for easier debugging during development.
The max.graph APIs now include preliminary support for symbolic algebraic expressions using AlgebraicDim, enabling more powerful support for checked dynamic shapes. This allows -Dim("x") - 4. Furthermore, the algebraic expressions simplify to a canonical form, so that for example -Dim("x") - 4 == -(Dim("x") + 4) holds.
More advanced dtype promotion now allows TensorValue math operators to just work when used with NumPy arrays and python primitives.
TensorValue has new methods, such as broadcast_to(), cast(), flatten(), permute(), and more.
Added BufferValue, which allows for device-resident tensors that are read and mutated within the graph.
DType has new methods/properties, align, size_in_bytes, and is_float().
Value constructor accepts more types for value.
TensorValue constructor accepts more types for value.
TensorValue.rebind() accepts a new message argument.

Breaking changes:

Graph.add_weight() now accepts Weight and returns TensorValue. Weight is essentially a named placeholder for a tensor that knows its name, dtype, shape, and optionally device and quantization encoding. Graph.add_weight() stages an op in the graph that is populated by a named weight in the weights registry passed to session.load.
The Weight constructor arguments changed; added align , dtype , and shape; removed assign , filepath, offset, and value.
The ops.scalar() method was removed along with the is_static() and is_symbolic() methods from all graph.type objects.
- Instead of ops.scalar(), use ops.constant().
- Instead of is_static() and is_symbolic(), use isinstance(dim, SymbolicDim) and isinstance(dim, StaticDim).

The MAX Graph APIs are not ready for general use but you can experiment with it now by following this tutorial. We'll add more documentation when we finish some API redesigns.

Custom op registration

Although the APIs to write custom operators (ops) isn't ready for general use, this release includes a significant redesign that lays the groundwork. You might notice some associated APIs in this release and more APIs in the nightlies, so here's a little about the work in progress:

The custom op APIs will allow you to extend MAX Engine with new ops written in Mojo, providing full composability and extensibility for your models. It's the exact same API we use to write MAX Engine's built-in ops such as matmul. That means your custom ops can benefit from all our compiler optimization features such as kernel fusion—your ops are treated the same as all the ops included "in the box."
The new API requires far less adornment at the definition site to enable the MAX model compiler to optimize custom ops along with the rest of the graph (compared to our previous version that used NDBuffer).
Custom ops support "destination passing style" for tensors.
The design composes on top of Mojo's powerful meta programming, as well as the kernel libraries abstractions for composable kernels.

We'll publish more documentation when the custom op API is ready for general use. Check out the MAX repo's nightly branch to see the latest custom op examples.

Known issues:

Custom ops don't have type or lifetime checking. They also don't reason about mutability. Expect lots of sharp corners and segfaults if you hold them wrong while we improve this!

Numeric kernels

The GPU kernels for MAX Engine are built from the ground up in Mojo with no dependencies on external vendor code or libraries. This release includes the following kernel improvements:

AttenGen: a novel way to express attention pattern that's able to express different attention masks, score functions, as well as caching strategies.
State-of-the-art matrix multiplication algorithms with optimizations such as the following:
- Pipelining and double-buffering to overlap data transfer and computation and to hide memory access latency (for both global and shared memory).
- Thread swizzling to avoid shared memory bank conflicts associated with tensor core layouts.
- Block swizzling to increase L2 cache locality.
SplitK/StreamK GEMM algorithms: divides the computation along the shared K dimension into smaller matrices which can then be executed independently on streaming multiprocessors (such as CUDA cores). These algorithms are ideal for matrices with large K dimension but small M dimension.
Large context length MHA: uses SplitK/StreamK to implement the attention mechanism and eliminate the need of a huge score matrix, which drastically reduces memory usage/traffic to enable large context length.
DualGemm: accelerates the multi-layer perceptron (MLP) layers where the left-hand side (LHS) is shared between two matrix multiplications.

Known issues:

The MAX kernels are optimized for bfloat16 on GPUs.
Convolution on GPU is not performance optimized yet.
Although v24.6 technically runs on H100, it doesn't include performance-optimized kernels for that device yet and it isn't recommended.

Mojo

v24.5 (2024-09-13)

✨ Highlights

Mojo and MAX are magical! We've created a new package and virtual environment manager, magic, for MAX and Mojo.
New Llama3.1 pipeline built with the new MAX Graph Python API.
We have not one, but two new Python APIs that we're introducing in this release:
- MAX Graph Python API
- MAX Driver Python API

⭐️ New

Added repeat_interleave graph op.
Added caching for MAX graph models. This means that graph compilation is cached and the executable model is retrieved from cache on the 2nd and subsequent runs. Note that the model cache is architecture specific and isn't portable across different targets.
Support for Python 3.12.

MAX Graph Python API

This Python API will ultimately provide the same low-level programming interface for high-performance inference graphs as the Mojo API. As with the Mojo API, it's an API for graph-building only, and it does not implement support for training.

You can take a look at how the API works in the MAX Graph Python API reference.

MAX Driver Python API

The MAX Driver API allows you to interact with devices (such as CPUs and GPUs) and allocate memory directly onto them. With this API, you interact with this memory as tensors.

Note that this API is still under development, with support for non-host devices, such as GPUs, planned for a future release.

To learn more, check out the MAX Driver Python APIreference.

MAX C API

New APIs for adding torch metadata libraries:

M_setTorchMetadataLibraryPath
M_setTorchMetadataLibraryPtr

🦋 Changed

MAX Engine performance

Compared to v24.4, MAX Engine v24.5 generates tokens for Llama an average of 15%-48% faster.

MAX C API

Simplified the API for adding torch library paths, which now only takes one path per API call, but can be called multiple times to add paths to the config:

M_setTorchLibraries -> M_setTorchLibraryPath

⚠️ Deprecated

The max command line tool is no longer supported and will be removed in a future release.

❌ Removed

Dropped support for Ubuntu 20.04. If you're using Ubuntu, we currently support Ubuntu 22.04 LTS only.
Dropped support for Python 3.8.
Removed built-in PyTorch libraries from the max package. See the FAQ for information on supported torch versions.

v24.4 (2024-06-07)

🔥 Legendary

MAX is now available on macOS! Try it now.
New quantization APIs for MAX Graph. You can now build high-performance graphs in Mojo that use the latest quantization techniques, enabling even faster performance and more system compatibility for large models.

Learn more in the guide to quantize your graph weights.

⭐️ New

MAX Mojo APIs

Added AI pipeline examples in the max repo, with Mojo implementations for common transformer layers, including quantization support.
- New Llama3 pipeline built with MAX Graph.
- New Replit Code pipeline built with MAX Graph.
- New TinyStories pipeline (based on TinyLlama) that offers a simple demo of the MAX Graph quantization API.
Added max.graph.checkpoint package to save and load model weights.

All weights are stored in a TensorDict. You can save and load a TensorDict to disk with save() and load() functions.
Added MAX Graph quantization APIs:
- Added quantization encodings BFloat16Encoding, Q4_0Encoding, Q4_KEncoding, and Q6_KEncoding.
- Added the QuantizationEncoding trait so you can build custom quantization encodings.
- Added Graph.quantize() to create a quantized tensor node.
- Added qmatmul() to perform matrix-multiplication with a float32 and a quantized matrix.
Added some MAX Graph ops:

Added a layer() context manager and current_layer() function to aid in debugging during graph construction. For example:

with graph.layer("foo"):
    with graph.layer("bar"):
        print(graph.current_layer())  # prints "foo.bar"
        x = graph.constant[DType.int64](1)
        graph.output(x)
with graph.layer("foo"):
    with graph.layer("bar"):
        print(graph.current_layer())  # prints "foo.bar"
        x = graph.constant[DType.int64](1)
        graph.output(x)

This adds a path foo.bar to the added nodes, which will be reported during errors.

Added format_system_stack() function to format the stack trace, which we use to print better error messages from error().
Added TensorMap.keys() to get all the tensor key names.

MAX C API

Miscellaneous new APIs:

M_cloneCompileConfig()
M_copyAsyncTensorMap()
M_tensorMapKeys() and M_deleteTensorMapKeys()
M_setTorchLibraries()

🦋 Changed

MAX Mojo API

EngineNumpyView.data() and EngineTensorView.data() functions that return a type-erased pointer were renamed to unsafe_ptr().
TensorMap now conforms to CollectionElement trait to be copyable and movable.
custom_nv() was removed, and its functionality moved into custom() as an function overload, so it can now output a list of tensor symbols.

v24.3 (2024-05-02)

🔥 Legendary

You can now write custom ops for your models with Mojo!

Learn more about MAX extensibility.

🦋 Changed

Added support for named dynamic dimensions. This means you can specify when two or more dimensions in your model's input are dynamic but their sizes at run time must match each other. By specifying each of these dimension sizes with a name (instead of using None to indicate a dynamic size), the MAX Engine compiler can perform additional optimizations. See the notes below for the corresponding API changes that support named dimensions.
Simplified all the APIs to load input specs for models, making them more consistent.

MAX Engine performance

Compared to v24.2, MAX Engine v24.3 shows an average speedup of 10% on PyTorch models, and an average 20% speedup on dynamically quantized ONNX transformers.

MAX Graph API

The max.graph APIs are still changing rapidly, but starting to stabilize.

AnyMoType renamed to Type, MOTensor renamed to TensorType, and MOList renamed to ListType.
Removed ElementType in favor of using DType.
Removed TypeTuple in favor of using List[Type].
Removed the Module type so you can now start building a graph by directly instantiating a Graph.
Some new ops in max.ops, including support for custom ops.

See how to create a custom op in MAX Graph.

MAX Engine Python API

Redesigned InferenceSession.load() to replace the confusing options argument with a custom_ops_path argument.

As a result, CommonLoadOptions, TorchLoadOptions, and TensorFlowLoadOptions have all been removed.
TorchInputSpec now supports named dynamic dimensions (previously, dynamic dimension sizes could be specified only as None). This lets you tell MAX which dynamic dimensions are required to have the same size, which helps MAX better optimize your model.

MAX Engine Mojo API

InferenceSession.load_model() was renamed to load().
Redesigned InferenceSession.load() to replace the confusing config argument with a custom_ops_path argument for use when loading a custom op, and an input_specs argument for use when loading TorchScript models.

Doing so removed LoadOptions and introduced the new InputSpec type to define the input shape/type of a model (instead of LoadOptions).
New ShapeElement type to allow for named dynamic dimensions (in InputSpec).
max.engine.engine module was renamed to max.engine.info.

MAX Engine C API

M_newTorchInputSpec() now supports named dynamic dimensions (via new dimNames argument).

❌ Removed

Removed TensorFlow support in the MAX SDK, so you can no longer load a TensorFlow SavedModel for inference. However, TensorFlow is still available for enterprise customers.

We removed TensorFlow because industry-wide TensorFlow usage has declined significantly, especially for the latest AI innovations. Removing TensorFlow also cuts our package size by over 50% and accelerates the development of other customer-requested features. If you have a production use-case for a TensorFlow model, please contact us.
Removed the Python CommonLoadOptions, TorchLoadOptions, and TensorFlowLoadOptions classes. See note above about InferenceSession.load() changes.
Removed the Mojo LoadOptions type. See the note above about InferenceSession.load() changes.

v24.2.1 (2024-04-11)

You can now import more MAX Graph functions from max.graph.ops instead of using max.graph.ops.elementwise. For example:

from max.graph import ops

var relu = ops.relu(matmul)
from max.graph import ops

var relu = ops.relu(matmul)

v24.2 (2024-03-28)

MAX Engine now supports TorchScript models with dynamic input shapes.

No matter what the input shapes are, you still need to specify the input specs for all TorchScript models.
The Mojo standard library is now open source!

Read more about it in this blog post.
And, of course, lots of Mojo updates, including implicit traits, support for keyword arguments in Python calls, a new List type (previously DynamicVector), some refactoring that might break your code, and much more.

For details, see the Mojo changelog.

v24.1.1 (2024-03-18)

This is a minor release that improves error reports.

v24.1 (2024-02-29)

The first release of the MAX platform is here! 🚀

This is a preview version of the MAX platform. That means it is not ready for production deployment and designed only for local development and evaluation.

Because this is a preview, some API libraries are still in development and subject to change, and some features that we previously announced are not quite ready yet. But there is a lot that you can do in this release!

This release includes our flagship developer tools, currently for Linux only:

MAX Engine: Our state-of-the-art graph compiler and runtime library that executes models from PyTorch and ONNX, with incredible inference speed on a wide range of hardware.
- API libraries in Python, C, and Mojo to run inference with your existing models. See the API references.
- The max benchmark tool, which runs MLPerf benchmarks on any compatible model without writing any code.
- The max visualize tool, which allows you to visualize your model in Netron after partially lowering in MAX Engine.
- An early look at the MAX Graph API, our low-level library for building high-performance inference graphs.
MAX Serving: A preview of our serving wrapper for MAX Engine that provides full interoperability with existing AI serving systems (such as Triton) and that seamlessly deploys within existing container infrastructure (such as Kubernetes).
- A Docker image that runs MAX Engine as a backend for NVIDIA Triton Inference Server.
Mojo: The world's first programming language built from the ground-up for AI developers, with cutting-edge compiler technology that delivers unparalleled performance and programmability for any hardware.
- The latest version of Mojo, the standard library, and the mojo command line tool. These are always included in MAX, so you don't need to download any separate packages.
- The Mojo changes in each release are often quite long, so we're going to continue sharing those in the existing Mojo changelog.

Additionally, we've started a new GitHub repo for MAX, where we currently share a bunch of code examples for our API libraries, including some large model pipelines. You can also use this repo to report issues with MAX.

Model Architecture Support

Added support for the following model architectures:
- OlmoForCausalLM (such as allenai/OLMo-1B-0724-hf)
- GraniteForCausalLM (such as ibm-granite/granite-3.1-8b-instruct)
- Phi3ForCausalLM (for Microsoft Phi-3 models)
- Qwen2ForCausalLM (such as Qwen2 models)
Example usage:
```
max-pipelines generate \
  --model-path allenai/OLMo-1B-0724-hf \
  --prompt "Write bubble sort in mojo"
```
```
max-pipelines generate \
  --model-path allenai/OLMo-1B-0724-hf \
  --prompt "Write bubble sort in mojo"
```
The max.pipelines.dataprocessing.tokenizer and max.pipelines.dataprocessing.gguf_utils modules have been removed.
The previously deprecated PipelineConfig.architecture field and its corresponding --architecture CLI argument have been removed.

`max-pipelines` CLI

The --devices CLI argument now supports a comma-separated list of GPU IDs prefixed with gpu: like --devices=gpu:0,1,2,3. We no longer support the previous --devices=gpu-<N> format.

max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
  --quantization-encoding bfloat16 \
  --devices gpu:0,1,2,3 \
  --prompt="Design a self-sustaining colony on Neptune's moon Triton with a myth/science fusion name, three quantum tech breakthroughs, one ethical debate, a neon-lit cultural ritual, and a hidden flaw—presented in bullet points."
max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
  --quantization-encoding bfloat16 \
  --devices gpu:0,1,2,3 \
  --prompt="Design a self-sustaining colony on Neptune's moon Triton with a myth/science fusion name, three quantum tech breakthroughs, one ethical debate, a neon-lit cultural ritual, and a hidden flaw—presented in bullet points."

Removed --huggingface-repo-id PipelineConfig option and CLI argument in favor of --model-path.
Consolidated -model-path and -weight-path. If valid -weight-path(s) are provided, they'll now override --model-path, which in turn handles both local and remote (Hugging Face) cases. If we cannot derive the weights from the --weight-path(s), we'll now fall back to the --model-path, which has to be set explicitly by the user.
Added --huggingface-revision option, to allow selecting a non-default branch or a specific commit in a Hugging Face model repository.

v25.5 nightly​

MAX models​

MAX framework​

Inference server​

max CLI​

Python API​

v25.4 (2025-06-18)​

✨ Highlights​

Documentation​

MAX models​

MAX framework​

Inference server​

max CLI​

Python API​

Mojo API​

Custom ops​

GPU programming​

Mojo language​

v25.3 (2025-05-06)​

✨ Highlights​

Documentation​

max CLI​

MAX models​

MAX Serve​

MAX Engine & Graph​

Python API​

Mojo API​

Custom ops API​

Kernels​

GPU programming​

Mojo language​

v25.2 (2025-03-25)​

✨ Highlights​

MAX Serve​

MAX models​

max-pipelines CLI​

MAX Engine​

Driver APIs​

Graph APIs​

Custom ops​

Hopper kernels​

GPU programming​

Mojo​

Documentation​

v25.1.1 (2025-02-19)​

v25.1 (2025-02-13)​

✨ Highlights​

Documentation​

MAX Serve​

MAX models​

MAX Engine​

Graph APIs​

Pipeline APIs​

GPU programming​

Mojo​

v24.6 (2024-12-17)​

✨ Highlights​

Documentation​

MAX Serve​

MAX models​

MAX Engine​

Driver APIs​

Graph compiler​

Graph APIs​

Custom op registration​

Numeric kernels​

Mojo​

v24.5 (2024-09-13)​

✨ Highlights​

⭐️ New​

MAX Graph Python API​

MAX Driver Python API​

MAX C API​

🦋 Changed​

MAX Engine performance​

MAX C API​

⚠️ Deprecated​

❌ Removed​

v24.4 (2024-06-07)​

🔥 Legendary​

v25.5 nightly

MAX models

MAX framework

Inference server

`max` CLI

Python API

v25.4 (2025-06-18)

✨ Highlights

Documentation

MAX models

MAX framework

Inference server

`max` CLI

Python API

Mojo API

Custom ops

GPU programming

Mojo language

v25.3 (2025-05-06)

✨ Highlights

Documentation

`max` CLI

MAX models

MAX Serve

MAX Engine & Graph

Python API

Mojo API

Custom ops API

Kernels

GPU programming

Mojo language

v25.2 (2025-03-25)

✨ Highlights

MAX Serve

MAX models

`max-pipelines` CLI

MAX Engine

Driver APIs

Graph APIs

Custom ops

Hopper kernels

GPU programming

Mojo

Documentation

v25.1.1 (2025-02-19)

v25.1 (2025-02-13)

✨ Highlights

Documentation

MAX Serve

MAX models

MAX Engine

Graph APIs

Pipeline APIs

GPU programming

Mojo

v24.6 (2024-12-17)

✨ Highlights

Documentation

MAX Serve

MAX models

MAX Engine

Driver APIs

Graph compiler

Graph APIs

Custom op registration

Numeric kernels

Mojo

v24.5 (2024-09-13)

✨ Highlights

⭐️ New

MAX Graph Python API

MAX Driver Python API

MAX C API

🦋 Changed

MAX Engine performance

MAX C API

⚠️ Deprecated

❌ Removed

v24.4 (2024-06-07)

🔥 Legendary