v25.2 (2025-03-25)
✨ Highlights
- Support for NVIDIA Hopper GPUs: MAX has been optimized to run on Hopper GPUs. For more information on MAX and NVIDIA's hardware, see the MAX container documentation.
- Multi-GPU support: MAX uses tensor parallelism to distribute work across multiple GPUs, so you can run LLMs like Llama-3.3-70B-Instruct, even with long context windows.
- Expanded library of MAX models: We're rapidly growing our library of base model architectures that MAX can accelerate with MAX Serve (including `Phi3ForCausalLM`, `OlmoForCausalLM`, and `GraniteForCausalLM`). We also now support GPTQ quantization for the Llama models. For more information, check out our MAX model repository.
- Advanced E2E optimizations for long context windows: In-flight batching, chunked prefill, and copy-on-write optimize execution for prefix-heavy and long-context-window scenarios.
- GPU programming with Mojo: Lots of new APIs are now available to enable both low-level GPU programming and abstracted programming patterns that simplify the code required to write GPU kernels for your AI models.
MAX Serve
- Extended MAX Serve batch scheduling to account for the prefix cache. The scheduler can now create larger batches when many prompt tokens are already cached, improving throughput by up to 10% in some benchmarks.
- Added support for in-flight batching, allowing token generation requests to be scheduled alongside context encoding requests to reduce inter-token latency. This behavior can be controlled with the `--enable-in-flight-batch` CLI argument.
- Added support for copy-on-write on KV blocks when using PagedAttention with Prefix Caching. This improves the prefix cache hit rate and prefill performance in some scenarios.
- MAX Serve now supports `transformers` v4.49.0, with a patch to avoid graph breaks when using `torch.compile()` on Llama models.
- Added support for recording HTTP traffic to a file for diagnostics or later replay.
MAX models
- Added support for executing `LlamaForCausalLM` architecture models on multiple GPUs. The model uses tensor parallelism automatically when you pass multiple device IDs to the `--devices` CLI argument. Try running `meta-llama/Llama-3.3-70B-Instruct` on 4 GPUs with the following example:

  ```sh
  max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
    --quantization-encoding bfloat16 \
    --devices gpu:0,1,2,3 \
    --prompt="Design a self-sustaining colony on Neptune's moon Triton with a myth/science fusion name, three quantum tech breakthroughs, one ethical debate, a neon-lit cultural ritual, and a hidden flaw—presented in bullet points."
  ```
- Added support for the `Phi3ForCausalLM` model architecture (such as `microsoft/phi-4`). For example:

  ```sh
  max-pipelines generate \
    --model-path microsoft/phi-4 \
    --prompt "Write bubble sort in mojo"
  ```
- Added support for the `OlmoForCausalLM` model architecture (such as `allenai/OLMo-1B-0724-hf`). For example:

  ```sh
  max-pipelines generate \
    --model-path allenai/OLMo-1B-0724-hf \
    --prompt "Write bubble sort in mojo"
  ```
- Added support for the `GraniteForCausalLM` model architecture (such as `ibm-granite/granite-3.1-8b-instruct`). For example:

  ```sh
  max-pipelines generate \
    --model-path ibm-granite/granite-3.1-8b-instruct \
    --prompt "Write bubble sort in mojo"
  ```
- Added support for GPTQ quantization for models that run on the GPU. This is handled transparently when the model weights are specified. For example, this runs Llama 3.1 8B using int4-quantized GPTQ weights:

  ```sh
  max-pipelines generate \
    --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
    --prompt "Why is the sky blue?" \
    --max-batch-size 1 \
    --max-length 10000
  ```

  This reduces the total memory consumption of this model from ~16 GB to ~5 GB, allowing the model to fit in the RAM of smaller GPUs.
- Model weights are now downloaded in parallel.
- Added constraints on whitespace during Structured Output. This reduces token counts and improves model adherence.
- Added jump-ahead decoding during Structured Output. This auto-completes tokens when a single path forward is identified, improving single-completion times by up to ~20% for long prompts.
- In the event of an unhandled exception, we now use the standard Python traceback format instead of pretty-printed Rich tracebacks.
- You must now explicitly import `LLM` from `max.entrypoints.llm` rather than from `max.entrypoints` as before (see the sketch after this list).
- The `max.pipelines.dataprocessing.tokenizer` and `max.pipelines.dataprocessing.gguf_utils` modules have been removed.
- The previously deprecated `PipelineConfig.architecture` field and its corresponding `--architecture` CLI argument have been removed.
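For reference, here is a minimal offline-generation sketch using the new import path. The `PipelineConfig` argument and the `generate()` call are based on the documented `LLM` entrypoint, but treat the exact parameters as assumptions and check the API reference for your version:

```python
from max.entrypoints.llm import LLM  # previously: from max.entrypoints import LLM
from max.pipelines import PipelineConfig


def main():
    # Assumed arguments; point model_path at any supported model.
    config = PipelineConfig(model_path="modularai/Llama-3.1-8B-Instruct-GGUF")
    llm = LLM(config)
    responses = llm.generate(["Why is the sky blue?"], max_new_tokens=64)
    for response in responses:
        print(response)


if __name__ == "__main__":
    main()
```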
max-pipelines CLI
- The `--devices` CLI argument now supports a comma-separated list of GPU IDs prefixed with `gpu:`, like `--devices=gpu:0,1,2,3`. The previous `--devices=gpu-<N>` format is no longer supported.

  ```sh
  max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
    --quantization-encoding bfloat16 \
    --devices gpu:0,1,2,3 \
    --prompt="Design a self-sustaining colony on Neptune's moon Triton with a myth/science fusion name, three quantum tech breakthroughs, one ethical debate, a neon-lit cultural ritual, and a hidden flaw—presented in bullet points."
  ```
- Removed the `--huggingface-repo-id` `PipelineConfig` option and CLI argument in favor of `--model-path`.
- Consolidated `--model-path` and `--weight-path`. A valid `--weight-path` now overrides `--model-path`, which handles both local and remote (Hugging Face) cases. If the weights cannot be derived from `--weight-path`, we fall back to `--model-path`, which you must set explicitly.
- Added the `--huggingface-revision` option to allow selecting a non-default branch or a specific commit in a Hugging Face model repository.
MAX Engine
- The MAX graph compiler now has kernel caching. This is a significant improvement to our compilation pipeline. Here are some of the highlights:
  - Up to 28% faster compilation times when making iterative changes to models
  - Improved caching between different but similar models (up to 27% faster)
  - Lays the foundation for future caching optimizations

  What does this mean for you? Faster development cycles! When you're working on model pipelines and making changes to the graph, the graph compiler now intelligently reuses kernels that haven't changed, significantly reducing compilation times.

  The improvements are particularly noticeable during iterative development, with compilation times dropping from ~80s to ~57s in some cases when compiling Llama 3.1 8B for 4 GPUs. Even when compiling different models from the same family (such as Llama/Granite variants), you'll see significant speedups on subsequent compilations.
Driver APIs
- Added the `Accelerator.can_access(other: Device) -> bool` method to check whether one device can directly access the memory of another device (see the sketch after this list).
- Fixed a bug in `max.driver.tensor.load_max_tensor()` for the `bfloat16` dtype, which caused an error about the mmap size being too large.
- `max.driver.Tensor.item()` now works on any single-element tensor (previously restricted to rank-0 tensors).
- Added `Device.synchronize()`, which ensures all operations on the device complete before returning.
- Removed `MojoCallContextPtr` in favor of `DeviceContextPtr`. `MojoCallContextPtr` only contained a `DeviceContextPtr`, so this change directly exposes the `DeviceContextPtr`. Custom ops that used `MojoCallContextPtr` now take a `DeviceContextPtr` argument directly:

  ```mojo
  @staticmethod
  fn execute[
      type: DType, rank: Int
  ](
      output: OutputTensor[type=type, rank=rank],
      input: InputTensor[type=type, rank=rank],
      ctx: MojoCallContextPtr,
  ):
  ```

  becomes

  ```mojo
  @staticmethod
  fn execute[
      type: DType, rank: Int
  ](
      output: OutputTensor[type=type, rank=rank],
      input: InputTensor[type=type, rank=rank],
      ctx: DeviceContextPtr,
  ):
  ```
- You can now skip compiling a GPU kernel before enqueueing it and instead pass the function directly to `ctx.enqueue_function[func](...)`:

  ```mojo
  fn func():
      print("Hello from GPU")

  @register("custom_op")
  struct CustomOp:
      @staticmethod
      fn execute(ctx: DeviceContextPtr) raises:
          var dev_ctx = ctx.get_device_context()
          dev_ctx.enqueue_function[func](grid_dim=1, block_dim=1)
  ```

  However, this convenience incurs an overhead of around 50-500 nanoseconds per enqueue, so if you're launching the same function with the same parameters multiple times, you can still compile it once and pass the compiled function to `ctx.enqueue_function`:

  ```mojo
  var compiled_func = ctx.compile_function[func]()
  # Multiple kernel launches with the same function/parameters
  ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
  ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
  ```
- Changed `Accelerator` and `CPU` from factory methods that created `Device` objects in Python (which were accelerators and CPUs in the C++ implementation) to actual Python types. This change elevates the `Accelerator` and `CPU` type concepts to Python, making them types rather than methods, which allows type annotations in Python. For example, a list of accelerators used to be defined like this:

  ```python
  graph_devices: list[DeviceRef]
  ```

  Now it can be defined like this:

  ```python
  graph_devices: list[Accelerator]
  ```
- Elementwise operations (e.g. `__add__`) have been removed from `Tensor` (that is, `tensor_internal.Tensor`). This `Tensor` type is being phased out; please reduce usage in favor of `LayoutTensor`.
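As a quick illustration of the driver API changes above, here is a minimal sketch combining `can_access()`, `synchronize()`, and the relaxed `item()` behavior. The `Accelerator(id=...)` constructor and the `Tensor.from_numpy()` helper reflect our understanding of the `max.driver` API; treat the exact signatures as assumptions:

```python
import numpy as np

from max.driver import Accelerator, Tensor

# Assumed constructor: select GPUs by ID (requires a multi-GPU machine).
gpu0 = Accelerator(id=0)
gpu1 = Accelerator(id=1)

# New: check whether one device can directly access another's memory.
if gpu0.can_access(gpu1):
    print("gpu0 can directly access gpu1's memory")

# New: item() now works on any single-element tensor, not just rank-0.
t = Tensor.from_numpy(np.array([[3.14]], dtype=np.float32))  # rank 2, one element
print(t.item())

# New: block until all operations queued on the device have completed.
gpu0.synchronize()
```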
Graph APIs
- The `nn` package is now `max.nn`.
- Added `ops.chunk()` to support chunking tensors along an axis (see the sketch after this list).
- Added support for while loops with `ops.while_loop`.
- Added support for conditional execution with `ops.cond`.
- Added axis-reduction overloads for `ops.min` and `ops.max`. For example: `ops.min(tensor, axis=-1)`.
- The `gelu()` function now accepts an `approximate` keyword, which controls the `gelu` approximation; the accepted values are `none`, `tanh`, and `fast`.
- Removed the `roundeven()` operation from the Python API. The `round()` operation now has the same behavior as `roundeven()`, so there is no need for both to exist.
- Added helpers to create analogous tensors from buffer types and vice versa.
- Added `max.nn.Module`, a base class for writing layers and constructing networks of layers (e.g. using `max.nn.Sequential`). Currently, this class supports graph building by ensuring that all weight names are unique and systematically generated. It also supports managing the weight values with the `module.state_dict()` and `module.load_state_dict()` methods. More functionality and documentation will be added in future releases.
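To make the new ops above concrete, here is a minimal graph-building sketch exercising `ops.chunk`, the axis-reduction overload of `ops.min`, and the `gelu` `approximate` keyword. The `TensorType` shape and the `ops.chunk` argument order are assumptions; check the ops reference for the exact signatures:

```python
from max.dtype import DType
from max.graph import Graph, TensorType, ops

# Assumed input type: a float32 tensor with a symbolic batch dimension.
with Graph("demo", input_types=[TensorType(DType.float32, ("batch", 8))]) as graph:
    x = graph.inputs[0]

    # New: split a tensor into equal chunks along an axis
    # (assumed argument order: ops.chunk(x, chunks, axis)).
    first, second = ops.chunk(x, 2, axis=-1)

    # New: reduce along a single axis instead of over the whole tensor.
    row_min = ops.min(x, axis=-1)

    # New: choose a gelu approximation ("none", "tanh", or "fast").
    activated = ops.gelu(first, approximate="tanh")

    graph.output(activated, row_min, second)
```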
Custom ops
- Changes have been made to the way that custom ops are registered: rather than using the `num_dps_outputs` attribute on `@compiler.register` to specify the number of outputs, that number is now inferred from the signature of the custom operation. Inputs to the operation now use the `InputTensor` type and outputs from the operation use `OutputTensor`, instead of the previous `ManagedTensorSlice` for both. This eliminates the need for a manual `num_dps_outputs` attribute, and makes it safer to work with these inputs and outputs by preventing accidental writes to input tensors. The new interface looks something like the following:

  ```mojo
  @compiler.register("add_one_custom")
  struct AddOneCustom:
      @staticmethod
      fn execute[
          target: StringLiteral,
      ](
          out: OutputTensor,
          x: InputTensor[type = out.type, rank = out.rank],
          ctx: DeviceContextPtr,
      ) raises:
          @parameter
          @always_inline
          fn elementwise_add_one[
              width: Int
          ](idx: IndexList[x.rank]) -> SIMD[x.type, width]:
              return x.load[width](idx) + 1

          foreach[elementwise_add_one, target=target](out, ctx)
  ```
- The `foreach` function now `raises`, so errors can be handled within an elementwise calculation.
Hopper kernels
State-of-the-Art Kernels in Mojo for H100/H200 GPUs
- Hopper Architecture Matrix Multiplication Kernels: The implementation achieved performance comparable to NVIDIA's highly optimized cuBLAS library. These kernels take full advantage of the Tensor Cores in Hopper architecture GPUs to accelerate the fundamental matrix multiplication operations that underpin deep learning workloads.
- Multi-GPU AllReduce Implementation: The AllReduce operation is critical for distributed inference across multiple GPUs, as it efficiently aggregates partial results across devices. The Mojo implementation surpassed NVIDIA's NCCL library in performance benchmarks. This improvement reduces communication overhead during distributed inference.
- MAX Attention Kernel with Flash Attention 3: This implementation incorporates and extends the latest Flash Attention 3 algorithm, significantly accelerating the computation of attention mechanisms in transformer models. The MAX attention kernel optimizes memory access patterns and computational steps, reducing both the memory footprint and execution time of attention operations. This is particularly important for LLMs, where attention calculations represent a substantial portion of the computational workload.
GPU programming
- Added the Mojo `max.driver` API to enable dispatching GPU functions from Mojo. Check out the examples for GPU programming in Mojo, which use this new API.
Mojo
Mojo is a crucial component of the MAX stack that enables all of MAX's performance-oriented code across hardware. For all the updates to the Mojo language, standard library, and tools, see the Mojo changelog.
Documentation
New examples for writing custom ops:
- `fused_attention` demonstrates complex GPU programming using MAX abstractions for a practical use case in AI model development.
- `matrix_multiplication` includes a series of progressive optimizations for matrix multiplication on GPUs.
- `histogram` shows how to implement the histogram pattern as a custom op.
New examples for GPU programming in Mojo using the new MAX Driver API:
- These use a Mojo programming model that should look familiar to CUDA C programmers, showing how to define and dispatch GPU functions within a single Mojo file. The examples recreate the first three samples from the popular textbook "Programming Massively Parallel Processors," showing how basic concepts translate from CUDA into Mojo. There is also a Mandelbrot set calculation example that parallels a similar one in the existing custom ops examples.
New MAX containers are available. For more information on the base and full MAX containers, see Container contents.