v25.4 (2025-06-18)
✨ Highlights
- **AMD GPUs are officially supported!** You can now deploy MAX with acceleration on AMD MI300X and MI325X GPUs, using the same code and container that works on NVIDIA GPUs. For the first time, you can build portable, high-performance GenAI deployments that run on any platform without vendor lock-in or platform-specific optimizations. For more details, including benchmarks, see our Modular + AMD blog post.
- **Now accepting GPU kernel contributions.** Last month, we open-sourced the code for the CPU and GPU kernels that power the MAX framework, and now we're accepting contributions! For information about how to contribute and the kinds of kernels that interest us most, see the MAX AI kernels contributing guide.
- **Preview: Mojo interoperability from Python.** This release includes an early version of a new Python-to-Mojo interoperability API. You can now write just the performance-critical parts of your code in Mojo and call them from Python as if you were importing another Python library. Check out our docs to call Mojo from Python.
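  Here's a minimal sketch of the import pattern; the file name `mojo_module.mojo` and its exported `factorial` function are assumptions for illustration:

  ```python
  import sys

  import max.mojo.importer  # installs the import hook for .mojo files

  sys.path.insert(0, "")  # search the current directory for Mojo sources

  import mojo_module  # loads mojo_module.mojo as a Python extension

  print(mojo_module.factorial(10))  # call into Mojo like any Python function
  ```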
Documentation
We've redesigned builds.modular.com and docs.modular.com with a unified top navigation bar so you can more easily discover all the available docs and code resources.
New docs:
- GPU Puzzles: Several new puzzles, including a 1D convolution op, softmax op, attention op, embedding op, kernel fusion, a custom backward pass, GPU functional programming patterns, and warp fundamentals.
- Using AI coding assistants guide: Learn how to use large language models (LLMs) and coding assistants (such as Cursor and Claude Code) to accelerate your development with Modular.
- Build an MLP block as a graph module tutorial: Learn how to create reusable `Module` components in your MAX graphs.
- Write custom ops for PyTorch tutorial (Beta feature): Learn to write high-performance GPU kernels for your PyTorch models with Mojo.
- Profile MAX kernel performance: Learn how to set up Nsight Compute to profile your Mojo-based kernels on NVIDIA GPUs.
Major updates:
- Build custom ops for GPUs tutorial: Now includes how to write hardware-specific functions for CPUs and GPUs.
- Optimize a matrix multiply custom op tutorial: Migrated from a Recipe, with revisions to help you improve the performance of your GPU custom ops.
MAX models
- Added the OLMo 2 model architecture (`olmo2`).
- Added Google's Gemma 3 multimodal model architecture (`gemma3multimodal`).
- Added the Qwen 3 model architecture (`qwen3`).
- Added the InternVL3 model architecture (`internvl`). This is still a work in progress.
- GGUF-quantized Llamas (`q4_0`, `q4_k`, and `q6_k`) are now supported with the paged KVCache strategy.
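For example, here's a minimal offline-generation sketch with one of the quantized Llama checkpoints; the repo name and the `PipelineConfig`/`LLM` usage follow the existing Python pipelines API, but treat the details as assumptions rather than a verified recipe:

```python
from max.entrypoints.llm import LLM
from max.pipelines import PipelineConfig

# Hypothetical GGUF-quantized checkpoint; q4_0, q4_k, and q6_k encodings
# now run with the paged KVCache strategy.
config = PipelineConfig(model_path="modularai/Llama-3.1-8B-Instruct-GGUF")
llm = LLM(config)

responses = llm.generate(
    ["Explain KV cache paging in one sentence."], max_new_tokens=64
)
print(responses[0])
```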
MAX framework
Inference server
- In-flight batching no longer requires chunked prefill.
- Expanded token sampling logic, including `top_k`, `min_p`, `min_new_tokens`, and `temperature`.
- Extended sampling configuration to be per-request, so different requests can ask for different sampling hyperparameters (see the sketch after this list).
- Removed support for TorchScript and torch MLIR models.
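As a hedged sketch of per-request sampling through the OpenAI-compatible endpoint that `max serve` exposes (the model name, port, and the `extra_body` spelling of the non-standard `top_k`/`min_p` fields are assumptions):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Each request can carry its own sampling hyperparameters.
response = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[{"role": "user", "content": "Name three uses for a GPU."}],
    temperature=0.2,                          # standard parameter
    extra_body={"top_k": 40, "min_p": 0.05},  # assumed pass-through fields
)
print(response.choices[0].message.content)
```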
max CLI
- Added the `--use-subgraphs` flag to `max generate` to allow the use of subgraphs in the model.
- Added the `--port` option to specify the port number with the `max serve` command.
Python API
- Lots of new APIs in the `max.nn` package.
- Added the `max.mojo.importer` module to import Mojo code into Python. See the docs for calling Mojo from Python.
- Added `Graph.add_subgraph()` to allow for the addition of a subgraph to a graph.
- Added `Module.build_subgraph()` to allow for the creation of a subgraph for a layer that inherits from `Module`.
- Added the `call` op, which allows for the execution of a subgraph.
- Added the `fold` op for combining sliding blocks into a larger tensor.
- Added `KernelLibrary` as an argument type for the `Graph` constructor.
- Added `QuantizationConfig` to specify quantization parameters for ops such as `qmatmul()`.
- Added the `strict` argument to the `Module.load_state_dict()` method. When `strict=True` (the default), an error is raised if the `state_dict` contains unused keys. When `strict=False`, extra keys are ignored. This helps model developers identify missing implementations in their models.
- Added audio generator APIs for text-to-speech models (such as `AudioGenerator`, `PipelineAudioTokenizer`, `TTSContext`, and others). This is still a work in progress.
- The `ops.masked_scatter()` function now requires naming the `out_dim` explicitly because it is data-dependent. For example:

  ```python
  ops.masked_scatter(
      inputs_embeds, video_mask, video_embeds, out_dim="unmasked_inputs"
  )
  ```

- Deprecated the `CONTINUOUS` KVCache strategy (`KVCacheStrategy`). Please use the `PAGED` KVCache strategy instead.
- Removed the `Settings` argument from the `LLM` constructor. The server is now automatically configured in the background without consuming an HTTP port.
- Removed `Graph.unique_symbolic_dim()`.
- Removed `max_to_torch_type()` and `torch_to_max_type()` and replaced them with `DType.to_torch()` and `DType.from_torch()`, respectively. This aligns with the corresponding NumPy methods (see the sketch after this list).
- Removed the `stats_report` property and `reset_stats_report` method from `InferenceSession`. This functionality was primarily used for internal PyTorch debugging and is no longer needed.
- Removed the naive KVCache (`nn.kv_cache.naive_cache`).
- Removed `nn.attention` and `nn.naive_attention_with_rope`.
- Renamed `ops.select` to `ops.where`. This matches the name of the similar operation in torch and NumPy.
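For the dtype-conversion rename above, a quick migration sketch (assuming the existing `max.dtype.DType` import path; the specific dtypes are illustrative):

```python
import torch

from max.dtype import DType

# DType.to_torch() and DType.from_torch() replace the removed
# max_to_torch_type() and torch_to_max_type() helpers.
assert DType.float32.to_torch() == torch.float32
assert DType.from_torch(torch.bfloat16) == DType.bfloat16
```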
Mojo API
- `LayoutTensor` now has a `size` method to get the total number of elements.
- Following our previous deprecation of the Mojo `max.driver`, `max.graph`, and `max.engine` APIs, we've removed them from the package and API docs. As a result, we've also removed the Mojo `max.tensor` APIs (including `Tensor`, `TensorShape`, and `TensorSpec`). You can replace any use with `LayoutTensor`.
Custom ops
- Improved error messages when custom op parameters are provided with values that don't have the proper type.
- The `ops.custom()` function now requires a `device` argument to specify where the operation should execute. This avoids the need for custom ops to infer their execution device, which can be error-prone. (See the sketch at the end of this section.)
- Added the `max.torch` module with the `CustomOpLibrary` class for using custom Mojo kernels from PyTorch. For example, with a custom `grayscale` operation written in Mojo:

  ```mojo
  @register("grayscale")
  struct Grayscale:
      @staticmethod
      fn execute[
          # The kind of device this is running on: "cpu" or "gpu"
          target: StaticString,
      ](
          img_out: OutputTensor[dtype = DType.uint8, rank=2],
          img_in: InputTensor[dtype = DType.uint8, rank=3],
          ctx: DeviceContextPtr,
      ) raises:
          ...
  ```

  You can load it with PyTorch like so:

  ```python
  import torch

  from max.torch import CustomOpLibrary

  op_library = CustomOpLibrary("path/to/custom.mojopkg")

  @torch.compile
  def grayscale(pic):
      result = pic.new_empty(pic.shape[:-1])
      op_library.grayscale(result, pic)
      return result

  img = (torch.rand(64, 64, 3) * 255).to(torch.uint8)
  result = grayscale(img)
  ```

  See our tutorial to write custom ops for PyTorch, and our PyTorch custom operation examples, which range from a very basic "hello world" to the replacement of a layer in a full model.
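And returning to the `ops.custom()` change noted above, a sketch of the now-required `device` argument; the op name, shapes, and `DeviceRef`/`TensorType` details here are illustrative assumptions, not a verified snippet:

```python
from max.dtype import DType
from max.graph import DeviceRef, TensorType, ops

# `img` is assumed to be a symbolic tensor value already staged in a graph.
result = ops.custom(
    "grayscale",             # hypothetical custom op name
    device=DeviceRef.GPU(),  # explicitly states where the op executes
    values=[img],
    out_types=[TensorType(DType.uint8, ["h", "w"], device=DeviceRef.GPU())],
)
```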
GPU programming
- Full support for AMD CDNA3 datacenter GPUs is now available! Specifically, the MI300X and MI325X.
- Added initial support for programming on AMD RDNA3 consumer GPUs. Basic tuning parameters have been specified for the AMD Radeon 780M integrated GPU. (AMD RDNA3 support is for GPU programming only; AI models are still missing some GPU kernels for this architecture.) For details, see the GPU requirements.
- Now accepting CPU and GPU kernel contributions. See the MAX AI kernels contributing guide.
Mojo language
For all the updates to the Mojo language, standard library, and tools, see the Mojo changelog.