v25.3 (2025-05-06)
- Highlights
- Documentation
- max CLI
- MAX models
- MAX Serve
- MAX Engine & Graph
- Kernels
- GPU programming
- Mojo language
✨ Highlights
- You can now install Modular APIs and tools with pip:

  ```sh
  pip install modular \
    --index-url https://download.pytorch.org/whl/cpu
  ```

  This installs the `max` CLI, the `max` Python library, the `mojo` CLI, and the Mojo libraries. However, the Mojo LSP and debugger are currently not included.

  We use the `--index-url` argument to ensure that `torch` installs its CPU dependencies only, avoiding a lot of unnecessary GPU packages. This is a temporary workaround until we can remove our dependency on `torch`.

- We open-sourced the MAX AI kernels and the rest of the Mojo standard library!

  The MAX AI kernels library is a new Mojo API for writing high-performance, portable programs across CPU and GPU, and it's also the source code for our CPU/GPU kernels. You can now see the Mojo code we use in MAX to power GenAI workloads on CPUs and GPUs.

  Just like the Mojo standard library, these kernels are open source under the Apache 2.0 License with LLVM exceptions. Plus, the rest of the Mojo standard library is also now open source on GitHub.

- Learn to program GPUs with Mojo GPU Puzzles!

  This brand-new site offers a hands-on guide to mastering GPU programming with Mojo. Starting from basic concepts, you'll learn step by step how to program GPUs by solving increasingly challenging puzzles.
Documentation
We've restructured the documentation to unify MAX and Mojo documentation under the Modular Platform. We believe this improves content discovery through simplified navigation and helps unify the platform story as a whole.
We've also added the following new docs:
- REST API reference: Although it's not a new API (our serving library has supported OpenAI APIs for the last few versions), this reference now shows precisely which endpoints and body parameters we support.

- Speculative decoding: An introduction to using speculative decoding to reduce latency for LLMs. This feature is still in development.

- Offline inference: An introduction to our Python API for running inference with an LLM locally (without sending requests to a serving endpoint).

- Introduction to layouts: A guide to working with dense multidimensional arrays on CPUs and GPUs, using new Mojo `layout` types that abstract away complex memory layout patterns.
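To give a rough sense of what such layout abstractions compute under the hood, here is a plain-Python sketch (illustrative only, not the Mojo `layout` API) of mapping a 2-D index to a linear memory offset for row-major and column-major layouts:

```python
def row_major_offset(row: int, col: int, n_cols: int) -> int:
    """Linear offset of element (row, col) in a row-major 2-D array."""
    return row * n_cols + col

def col_major_offset(row: int, col: int, n_rows: int) -> int:
    """Linear offset of element (row, col) in a column-major 2-D array."""
    return col * n_rows + row

# For a 3x4 matrix, element (1, 2) lands at different linear offsets
# depending on the layout:
print(row_major_offset(1, 2, n_cols=4))  # 6
print(col_major_offset(1, 2, n_rows=3))  # 7
```

Layout types generalize this idea to arbitrary ranks, strides, and tilings, so kernel code can index logically while the layout handles the physical offset arithmetic.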
max CLI
- Renamed the `max-pipelines` CLI tool to `max`. We recommend re-installing it as shown in the `max` CLI docs.

- Removed the previously deprecated `--use-gpu`, `--serialized_model_path`, `--save_to_serialized_model_path`, `--max_cache_batch_size`, and `--huggingface-repo-id` options.

- Moved `InputContext`, `TextContext`, and `TextAndVisionContext` from `max.pipelines` to `max.pipelines.context`.
MAX models
- Added `Llama4ForConditionalGeneration` support, featuring new MoE layers. Currently, it is limited to text inputs. Run the model by calling:

  ```sh
  max generate --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --devices 0,1,2,3
  ```

- Added support for running text generation with the Mistral 3 24B model. Run the model with:

  ```sh
  max generate --model-path mistralai/Mistral-Small-3.1-24B-Instruct-2503 --devices 0
  ```

- Fixed empty textual outputs for certain Mistral models (MAX issue 4193).

- Added support for loading a custom pipeline architecture by module. Passing `--custom-architectures=folder/path/to/import:my_module` loads architectures from the specified file. The architectures must be exposed via an `ARCHITECTURES` variable in the file. Once loaded, a model can be run using the new architectures. The flag can be specified multiple times to load additional modules.
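The general mechanism behind loading architectures from a module file can be sketched in plain Python (hypothetical names; the actual MAX loader's internals may differ):

```python
import importlib.util
import pathlib

def load_architectures(module_path: str) -> list:
    """Import a Python module from a file path and return its ARCHITECTURES list.

    Mirrors the general shape of --custom-architectures handling: the module
    is loaded from the given path, and it must expose an ARCHITECTURES variable.
    """
    path = pathlib.Path(module_path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # The module is required to expose the architectures it provides.
    return getattr(module, "ARCHITECTURES")
```

A module missing the `ARCHITECTURES` variable would raise an `AttributeError` here; the real CLI presumably reports a friendlier error.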
MAX Serve
- Moved from a radix trie to a hash-based prefix caching implementation, which has lower CPU overhead. This improves performance, particularly in workloads with high cache reuse rates.

- Added experimental support for offloading the KVCache to host memory via the `--enable-kvcache-swapping-to-host` and `--host-kvcache-swap-space-gb` flags. This allows for greater KVCache reuse through prefix caching in workloads where the reusable KVCache exceeds GPU VRAM.

- Fixed the `usage.prompt_tokens` field in the OpenAI API usage info response. Previously this field was always set to null; it now correctly contains the number of prompt tokens in the request.

- Switched from a Python multiprocessing queue to ZeroMQ, reducing networking-related latency between the frontend server process and the model worker process.

- Stray model workers on Linux now terminate more reliably when the parent process is killed.
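To illustrate the idea behind hash-based prefix caching (a conceptual sketch, not MAX's implementation): each fixed-size block of a token sequence is keyed by a hash covering the entire prefix up to and including that block, so requests that share a prefix map to the same cache keys with constant-time lookups instead of trie traversal. `BLOCK_SIZE` and the hash choice below are arbitrary illustrations:

```python
import hashlib

BLOCK_SIZE = 4

def prefix_block_keys(tokens: list) -> list:
    """Return one cache key per full block; each key hashes the whole prefix."""
    keys = []
    h = hashlib.sha256()
    n_full = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, n_full, BLOCK_SIZE):
        for t in tokens[i:i + BLOCK_SIZE]:
            h.update(t.to_bytes(4, "little"))
        keys.append(h.hexdigest())  # running digest covers tokens[0 : i + BLOCK_SIZE]
    return keys

# Two requests sharing an 8-token prefix reuse the first two cached blocks:
a = prefix_block_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
b = prefix_block_keys([1, 2, 3, 4, 5, 6, 7, 8, 99, 98, 97, 96])
print(a[:2] == b[:2])  # True: shared prefix blocks hit the same cache keys
print(a[2] == b[2])    # False: the third block diverges
```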
MAX Engine & Graph
Python API
- We now raise an error if there's a mismatch between the expected device of a weight on a graph and the device of the actual tensor data specified in `InferenceSession.load()`.

- Removed the `output_device` argument from `Model.execute()`.

- Removed the `copy_inputs_to_device` argument from `Model.execute()` to improve the predictability of the API. Now `execute()` raises a `TypeError` if arguments are passed whose devices don't match the model.

- Swapped the order of the `dtype` and `shape` fields of `driver.Tensor`. Previously, the arguments were ordered as `(shape, dtype)`; they are now `(dtype, shape)` to align with other tensor-like types.

- Replaced some instances of `Tensor.zeros` with `Tensor.__init__` when the engine did not depend on the tensor being zero-initialized. This elides an unnecessary memset for a minor performance improvement.

- Added a new experimental `Tensor.inplace_copy_from()`, which lets you copy the contents of one `Tensor` into another.

- Changed the default behavior of `Weight` to expect the initial allocation on the host. A transfer to the target device is then inserted, and this value is returned when weights generate an MLIR value. This is done due to the current conservative ownership model around external weights.

- Added the `irfft` op, which computes the inverse real fast Fourier transform (FFT).

- Added the `argmax` op, which returns the index of the maximum value in an array or sequence.

- Added the `GroupNorm` layer.

- Switched layer names so that `max.nn` layers implemented with the deprecated `Layer` class are marked as "V1", and layers implemented with the new `max.nn.Module` are the default. That is, `max.nn.LinearV2` is now `max.nn.Linear`, and the previous `max.nn.Linear` is now `max.nn.LinearV1`.

- `DeviceRef`s in types/layers are now generally expected to be explicit rather than implicit.
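For context on what an inverse real FFT computes, here is the equivalent NumPy round trip (NumPy is shown purely for illustration; the `irfft` op itself runs inside a MAX graph):

```python
import numpy as np

signal = np.array([0.0, 1.0, 2.0, 3.0])

# A real FFT keeps only the non-negative frequency bins of a real signal...
spectrum = np.fft.rfft(signal)  # shape (3,): n//2 + 1 complex bins

# ...and the inverse real FFT reconstructs the real signal from them.
# Passing n explicitly avoids ambiguity for even/odd original lengths.
recovered = np.fft.irfft(spectrum, n=len(signal))

print(np.allclose(recovered, signal))  # True
```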
Mojo API
- Removed some functionality from `tensor.Tensor`:

  - Serializing a `Tensor` to disk (`Tensor.tofile(path)` and `Tensor.save(path)`).
  - Reading the serialized data back from disk (`Tensor.load(path)` and `Tensor.fromfile(path)`).
  - The `rand` and `randn` methods. Use the ones in the Mojo standard library if you still need to construct a new `Tensor` with random elements based on a particular `TensorShape`.
- Deprecated the Mojo Driver, Graph, and Engine APIs.

  These APIs are not currently used internally. Instead, we build graphs using the Python APIs, and our engineering efforts have been focused on making that experience as robust and user-friendly as possible. As a result, the Mojo versions of these APIs have not kept pace with new features and language improvements. These APIs will be open-sourced for the community before being removed.
Custom ops API
- You can now pass Mojo source package paths as `Graph` custom extensions. The Mojo code is compiled automatically; there's no need to run `mojo package` manually as a prior step. Previously, only pre-compiled `.mojopkg` paths were accepted, requiring the Mojo code to be built before running a `Graph` with a custom op.

  Given a project structure like:

  ```
  project
  |-- main.py
  \-- kernels
      |-- __init__.mojo
      \-- my_custom_op.mojo
  ```

  you can construct a `Graph` in `main.py` using Mojo custom op kernels simply by writing:

  ```python
  g = Graph(
      ...,
      custom_extensions=[Path(__file__).parent / "kernels"],
  )
  ```

  A change to the Mojo source code defining a custom op is reflected immediately the next time the `Graph` is constructed.

- New `image_pipeline` example that demonstrates sequencing custom ops that modify an image, keeping the data on the GPU for each op before writing it back to the CPU and disk.
Kernels
- More compute overlap is now enabled for Hopper GPUs. This allows finer-grained scheduling of kernel operations by analyzing producer-consumer patterns within a compute kernel. As a result, there is more kernel compute overlap, especially for compute-heavy kernels with data-dependent execution paths.
GPU programming
- Reduced the CUDA driver requirement to version 12.4 and the NVIDIA driver requirement to version 550. Supporting these earlier driver versions makes MAX easier to deploy on AWS and GCP, since these are the default versions used by those cloud providers.

- Added support for programming NVIDIA Jetson Orin GPUs (`sm_87`).
Also see the Mojo changelog of GPU changes.
Mojo language
- We recently open-sourced the rest of the Mojo standard library, including the `algorithm`, `benchmark`, `buffer`, `compile`, `complex`, `gpu`, and `layout` packages. See it all on GitHub.

- We've also open-sourced all our MAX AI kernels. This new library includes `kv_cache`, `layout`, `linalg`, `nn`, `nvml`, and `quantization`.
For all the updates to the Mojo language, standard library, and tools, see the Mojo changelog.