v25.4 (2025-06-18)

✨ Highlights

  • AMD GPUs are officially supported!

    You can now deploy MAX with acceleration on AMD MI300X and MI325X GPUs, using the same code and container that works on NVIDIA GPUs. For the first time, you can build portable, high-performance GenAI deployments that run on any platform without vendor lock-in or platform-specific optimizations.

    For more details, including benchmarks, see our Modular + AMD blog post.

  • Now accepting GPU kernel contributions

    Last month, we open-sourced the code for the CPU and GPU kernels that power the MAX framework, and now we're accepting contributions! For information about how to contribute and the sort of kernels most interesting to us, see the MAX AI kernels contributing guide.

  • Preview: Mojo interoperability from Python

    This release includes an early version of a new Python-to-Mojo interoperability API. You can now write just the performance-critical parts of your code in Mojo and call them from Python as if you were importing another Python library. Check out our docs to call Mojo from Python.

Documentation

We've redesigned builds.modular.com and docs.modular.com with a unified top navigation bar so you can more easily discover all the available docs and code resources.


MAX models

  • Added the OLMo 2 model architecture (olmo2).

    Try OLMo 2 now.

  • Added Google's Gemma 3 multimodal model architecture (gemma3multimodal).

    Try Gemma3 now.

  • Added the Qwen 3 model architecture (qwen3).

    Try Qwen3 now.

  • Added the InternVL3 model architecture (internvl). This is still a work in progress.

  • GGUF-quantized Llamas (q4_0, q4_k, and q6_k) are now supported with the paged KVCache strategy.
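
    For intuition, here is a simplified NumPy sketch of how a single q4_0 block decodes: one scale plus 16 packed bytes holding thirty-two 4-bit quants, dequantized as scale * (q - 8). This is illustrative only; MAX's kernels handle the real formats internally, and the exact nibble layout shown is an assumption about GGUF's q4_0 packing.

```python
import numpy as np

def dequantize_q4_0(scale: float, packed: np.ndarray) -> np.ndarray:
    """Decode one q4_0 block: 16 packed bytes -> 32 weights.

    Illustrative layout: each byte holds two 4-bit quants (low nibbles
    are elements 0..15, high nibbles elements 16..31), and each weight
    is scale * (q - 8).
    """
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    return scale * np.concatenate([lo, hi]).astype(np.float32)
```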

MAX framework

Inference server

  • Inflight batching no longer requires chunked prefill.

  • Expanded the token sampling logic, including top_k, min_p, min_new_tokens, and temperature.

  • Made the sampling configuration per-request, so different requests can use different sampling hyperparameters.

  • Removed support for TorchScript and torch MLIR models.
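
For example, with per-request sampling, two requests sent to the same server can each carry their own sampling settings. The sketch below just builds two OpenAI-style request payloads with the standard library; the field names mirror the options listed above, while the model name and payload shape are illustrative assumptions.

```python
import json

# Two completion requests for the same served model, each carrying its
# own sampling hyperparameters. The model name is illustrative.
request_a = {
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "prompt": "Write a haiku about GPUs.",
    "temperature": 0.9,
    "top_k": 40,
}
request_b = {
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "prompt": "List three prime numbers.",
    "temperature": 0.0,
    "min_new_tokens": 8,
}
payloads = [json.dumps(r) for r in (request_a, request_b)]
```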

max CLI

  • Added the --use-subgraphs flag to max generate to allow for the use of subgraphs in the model.

  • Added the --port option to specify the port number with the max serve command.

Python API

  • Lots of new APIs in the max.nn package.

  • Added max.mojo.importer module to import Mojo code into Python. See the docs for calling Mojo from Python.

  • Added Graph.add_subgraph() to add a subgraph to a graph.

  • Added Module.build_subgraph() to create a subgraph from a layer that inherits from Module.

  • Added the call op, which executes a subgraph.

  • Added the fold op for combining sliding blocks into a larger tensor.
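
    A fold operation is the inverse of a sliding-window extraction (in the spirit of torch.nn.functional.fold): overlapping blocks are summed back into the output tensor. A minimal NumPy sketch of the idea, not the actual op implementation:

```python
import numpy as np

def fold_2d(blocks, out_shape, kernel, stride):
    """Combine sliding blocks back into a 2D array, summing overlapping
    contributions. `blocks` has shape (num_blocks, kh * kw), laid out in
    row-major sliding-window order."""
    kh, kw = kernel
    H, W = out_shape
    out = np.zeros((H, W), dtype=blocks.dtype)
    idx = 0
    for i in range(0, H - kh + 1, stride):
        for j in range(0, W - kw + 1, stride):
            out[i:i + kh, j:j + kw] += blocks[idx].reshape(kh, kw)
            idx += 1
    return out
```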

  • Added KernelLibrary as an argument type for the Graph constructor.

  • Added QuantizationConfig to specify quantization parameters for ops such as qmatmul().

  • Added the strict argument to the Module.load_state_dict() method. When strict=True (default), an error is raised if the state_dict contains unused keys. When strict=False, extra keys are ignored. This helps model developers identify missing implementations in their models.
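
    A sketch of the strict/non-strict semantics in plain Python (not the actual max.nn implementation):

```python
def load_state_dict(model_keys, state_dict, strict=True):
    """Illustrate strict loading semantics: with strict=True, keys in
    state_dict that the model does not use raise an error; with
    strict=False, they are silently ignored."""
    unused = set(state_dict) - set(model_keys)
    if strict and unused:
        raise ValueError(f"state_dict contains unused keys: {sorted(unused)}")
    return {k: state_dict[k] for k in model_keys if k in state_dict}
```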

  • Added audio generator APIs for text-to-speech models (such as AudioGenerator, PipelineAudioTokenizer, TTSContext, and others). This is still a work in progress.

  • The ops.masked_scatter() function now requires naming the out_dim explicitly as it is data-dependent. For example:

    ops.masked_scatter(
        inputs_embeds, video_mask, video_embeds, out_dim="unmasked_inputs"
    )
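
    The output dimension is data-dependent because the number of True entries in the mask is only known at runtime. A NumPy sketch of masked-scatter semantics, assuming torch-like behavior where masked positions are filled in order from the source tensor:

```python
import numpy as np

def masked_scatter(inputs, mask, source):
    """Copy `inputs`, then fill positions where `mask` is True with
    consecutive elements of `source`, in order."""
    out = inputs.copy()
    out[mask] = source[: mask.sum()]
    return out
```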
  • Deprecated the CONTINUOUS KVCache strategy (KVCacheStrategy). Use the PAGED strategy instead.

  • Removed the Settings argument from LLM constructor. The server is now automatically configured in the background without consuming an HTTP port.

  • Removed Graph.unique_symbolic_dim().

  • Removed max_to_torch_type() and torch_to_max_type() and replaced them with DType.to_torch() and DType.from_torch(), respectively. This aligns with the corresponding NumPy methods.

  • Removed stats_report property and reset_stats_report method from InferenceSession. This functionality was primarily used for internal PyTorch debugging and is no longer needed.

  • Removed the naive KVCache (nn.kv_cache.naive_cache).

  • Removed nn.attention and nn.naive_attention_with_rope.

  • Renamed ops.select to ops.where. This matches the name of the similar operation in torch and numpy.
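
    The semantics match numpy.where and torch.where: elementwise, pick from the first operand where the condition is true, otherwise from the second.

```python
import numpy as np

# Same selection semantics as the renamed ops.where.
cond = np.array([True, False, True])
x = np.array([1, 2, 3])
y = np.array([10, 20, 30])
result = np.where(cond, x, y)  # → array([ 1, 20,  3])
```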

Mojo API

  • LayoutTensor now has a size method to get the total number of elements.

  • Following our previous deprecation of the Mojo max.driver, max.graph and max.engine APIs, we've removed them from the package and API docs.

    As a result, we've also removed Mojo max.tensor APIs (including Tensor, TensorShape, and TensorSpec). You can replace any use with LayoutTensor.

Custom ops

  • Improved error messages when custom op parameters are provided with values that don't have the proper type.

  • The ops.custom() function now requires a device argument to specify where the operation should execute. This avoids the need for custom ops to infer their execution device, which can be error-prone.

  • Added the max.torch module with the CustomOpLibrary class for using custom Mojo kernels from PyTorch. For example, with a custom grayscale operation written in Mojo:

    @register("grayscale")
    struct Grayscale:
        @staticmethod
        fn execute[
            # The kind of device this is running on: "cpu" or "gpu"
            target: StaticString,
        ](
            img_out: OutputTensor[dtype = DType.uint8, rank=2],
            img_in: InputTensor[dtype = DType.uint8, rank=3],
            ctx: DeviceContextPtr,
        ) raises:
            ...

    You can load it with PyTorch like so:

    import torch
    from max.torch import CustomOpLibrary
    
    op_library = CustomOpLibrary("path/to/custom.mojopkg")
    
    @torch.compile
    def grayscale(pic):
        result = pic.new_empty(pic.shape[:-1])
        op_library.grayscale(result, pic)
        return result
    
    img = (torch.rand(64, 64, 3) * 255).to(torch.uint8)
    result = grayscale(img)

    See our tutorial to write custom ops for PyTorch, and our PyTorch custom operation examples, which range from a very basic "hello world" to the replacement of a layer in a full model.

GPU programming

  • Full support for AMD CDNA3 datacenter GPUs (MI300X and MI325X) is now available!

  • Added initial support for programming on AMD RDNA3 consumer GPUs. Basic tuning parameters have been specified for the AMD Radeon 780M integrated GPU. (AMD RDNA3 support is for GPU programming only; AI models are still missing some GPU kernels for this architecture.) For details, see the GPU requirements.

  • Now accepting CPU and GPU kernel contributions. See the MAX AI kernels contributing guide.

Mojo language

For all the updates to the Mojo language, standard library, and tools, see the Mojo changelog.
