
v25.3 (2025-05-06)

✨ Highlights

  • You can now install Modular APIs and tools with pip:

    pip install modular \
      --extra-index-url https://download.pytorch.org/whl/cpu

    This installs the max CLI, max Python library, mojo CLI, and Mojo libraries. However, the Mojo LSP and debugger are currently not included.

    We use the --extra-index-url argument to ensure that torch installs its CPU dependencies only, thus avoiding a lot of unnecessary GPU packages. This is a temporary workaround until we can remove our dependency on torch.

  • We open-sourced the MAX AI kernels and the rest of the Mojo standard library!

    The MAX AI kernels library is a new Mojo API for writing high-performance and portable programs across CPU and GPU, but it's also the source code for our CPU/GPU kernels. You can now see the Mojo code we use in MAX to power GenAI workloads on CPUs and GPUs.

    Just like the Mojo standard library, these kernels are open source under the Apache 2.0 License with LLVM exceptions. Plus, the rest of the Mojo standard library is also now open source on GitHub.

  • Learn to program GPUs with Mojo GPU Puzzles!

    This is a brand new site that offers a hands-on guide to mastering GPU programming with Mojo. Starting from basic concepts, you'll learn step-by-step how to program for GPUs by solving increasingly challenging puzzles.

Documentation

We've restructured the documentation to unify the MAX and Mojo docs under the Modular Platform. We believe this simplifies navigation, improves content discovery, and helps tell the platform story as a whole.

We've also added the following new docs:

  • REST API reference: Although it's not a new API (our serving library has supported OpenAI APIs for the last few versions), this now shows precisely which endpoints and body parameters we support.

  • Speculative decoding: An introduction to using speculative decoding to reduce latency for LLMs. This feature is still in development.

  • Offline inference: An introduction to our Python API for running inference with an LLM locally (without sending requests to a serving endpoint).

  • Introduction to layouts: A guide to working with dense multidimensional arrays on CPUs and GPUs, using new Mojo layout types that abstract away complex memory layout patterns.

max CLI

  • Renamed the max-pipelines CLI tool to max. We recommend re-installing it as shown in the max CLI docs.

  • Removed the previously deprecated --use-gpu, --serialized_model_path, --save_to_serialized_model_path, --max_cache_batch_size, and --huggingface-repo-id options.

  • Moved InputContext, TextContext, and TextAndVisionContext from max.pipelines to max.pipelines.context.

MAX models

  • Added Llama4ForConditionalGeneration support, featuring new MoE layers. Currently, it is limited to text inputs. Run the model by calling:

    max generate --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --devices 0,1,2,3

  • Added support for running text generation using the Mistral 3 24B model. Run the model with:

    max generate --model-path mistralai/Mistral-Small-3.1-24B-Instruct-2503 --devices 0

  • Fixed empty textual outputs for certain Mistral models (MAX issue 4193).

  • Added support for loading a custom pipeline architecture from a module. Passing --custom-architectures=folder/path/to/import:my_module loads architectures from that module, which must expose them via an ARCHITECTURES variable. Once loaded, a model can be run using the new architectures. The flag can be specified multiple times to load more modules; see the sketch below.
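
    For example, a module loaded this way might look like the following minimal sketch, where MyCustomArchitecture and its import path are hypothetical; the only requirement described above is that the module exposes an ARCHITECTURES variable:

    # my_module.py -- loaded via --custom-architectures=folder/path/to/import:my_module
    # MyCustomArchitecture is a hypothetical architecture definition; the pipeline
    # picks up whatever the module exposes through ARCHITECTURES.
    from my_package.architectures import MyCustomArchitecture  # hypothetical import

    ARCHITECTURES = [MyCustomArchitecture]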

MAX Serve

  • Moved from a radix-trie to a hash-based prefix caching implementation, which has lower CPU overhead. This improves performance, particularly in workloads with high cache reuse rates.

  • Added experimental support for offloading KVCache to host memory via the --enable-kvcache-swapping-to-host and --host-kvcache-swap-space-gb flags. This allows for superior KVCache reuse through prefix caching in workloads where the reusable KVCache amount exceeds GPU VRAM.
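
    For example, a serving command that opts into host swapping might look like this (the model path and swap-space size below are placeholders for your deployment):

    max serve --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
      --enable-kvcache-swapping-to-host \
      --host-kvcache-swap-space-gb 50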

  • Fixed the usage.prompt_tokens field in the OpenAI API usage info response. Previously this field was always set to null; now it correctly contains the number of prompt tokens in the request.
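
    For instance, querying a local MAX endpoint with the OpenAI Python client (the endpoint URL and model name below are placeholders for your deployment):

    from openai import OpenAI

    # Point the standard OpenAI client at a locally served MAX endpoint.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        messages=[{"role": "user", "content": "Hello!"}],
    )

    # As of this release, usage.prompt_tokens reports the actual prompt token
    # count instead of null.
    print(response.usage.prompt_tokens)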

  • Switched from Python's multiprocessing Queue to ZeroMQ, reducing networking-related latency between the frontend server process and the model worker process.

  • Stray model workers on Linux now terminate more reliably when the parent process is killed.

MAX Engine & Graph

Python API

  • We now raise an error if there's a mismatch between the expected device of a weight on a graph and the device of the actual tensor data specified in InferenceSession.load().

  • Removed output_device argument from Model.execute().

  • Removed the copy_inputs_to_device argument from Model.execute() to improve the predictability of the API. Now execute() raises a TypeError if arguments are passed whose devices don't match the model.

  • Swapped the order of the dtype and shape fields of driver.Tensor. Previously, the arguments were ordered as (shape, dtype); they are now (dtype, shape), in line with other tensor-like types.

  • Replaced some instances of Tensor.zeros with Tensor.__init__ when the engine did not depend on the tensor being zero initialized. This elides the unnecessary memset to provide a minor performance improvement.

  • Added a new experimental Tensor.inplace_copy_from(). This allows users to copy the contents of one Tensor into another.
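
    Here is a minimal sketch combining the new (dtype, shape) argument order with the experimental copy (the copy direction, destination.inplace_copy_from(source), is an assumption for illustration):

    from max.driver import Tensor
    from max.dtype import DType

    # Arguments are now ordered (dtype, shape), matching other tensor-like types.
    src = Tensor(DType.float32, (2, 3))
    dst = Tensor(DType.float32, (2, 3))

    # Experimental: copy the contents of src into dst in place.
    dst.inplace_copy_from(src)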

  • Changed the default behavior of Weight to expect the initial allocation on the host. A transfer to the target device is then inserted, and this value is returned when weights generate an MLIR value. This is due to the current conservative ownership model around external weights.

  • Added the irfft op, which computes the inverse real fast Fourier transform (IRFFT).

  • Added the argmax op, which returns the index of the maximum value in an array or sequence.
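
    As a rough illustration of the new argmax op inside a graph (the exact TensorType, DeviceRef, and axis arguments shown here are assumptions):

    from max.dtype import DType
    from max.graph import DeviceRef, Graph, TensorType, ops

    # Build a tiny graph whose output is the index of the maximum value
    # along the last axis of the input.
    input_type = TensorType(DType.float32, ["batch", 10], device=DeviceRef.CPU())

    with Graph("argmax_example", input_types=[input_type]) as graph:
        (x,) = graph.inputs
        graph.output(ops.argmax(x, axis=-1))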

  • Added the GroupNorm layer.

  • Switched layer names so that max.nn layers that are implemented with the deprecated Layer class are marked as "V1", and layers that are implemented with the new max.nn.Module are the default. That is, max.nn.LinearV2 is now max.nn.Linear, and the previous max.nn.Linear is now max.nn.LinearV1.
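
    A quick way to see the rename in action (this snippet only inspects the classes rather than constructing layers):

    from max import nn

    # The Module-based implementations are now the default names, while the
    # deprecated Layer-based versions carry the V1 suffix.
    print(nn.Linear)    # formerly nn.LinearV2
    print(nn.LinearV1)  # formerly nn.Linear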

  • DeviceRefs in types and layers are now generally expected to be explicit rather than implicit.

Mojo API

  • Removed some functionality from tensor.Tensor:

    • Serializing Tensor to disk (Tensor.tofile(path) and Tensor.save(path)).
    • Reading the serialized data back from disk (Tensor.load(path) and Tensor.fromfile(path)).
    • The rand and randn methods have been removed. Use the equivalents in the Mojo standard library if you still need to construct a new Tensor with random elements based on a particular TensorShape.

  • Deprecated the Mojo Driver, Graph, and Engine APIs.

    These APIs are not currently used internally. Instead, we build graphs using the Python APIs, and our engineering efforts have been focused on making that experience as robust and user-friendly as possible. As a result, the Mojo versions of these APIs have not kept pace with new features and language improvements. These APIs will be open sourced for the community before being removed.

Custom ops API

  • You can now pass Mojo source package paths as Graph custom extensions. The Mojo code is compiled automatically, so there's no need to run mojo package manually as a prior step. Previously, only pre-compiled .mojopkg paths were accepted, requiring the Mojo code to be built before running a Graph with a custom op.

    Given a project structure like:

    project
    |-- main.py
    \-- kernels
        |-- __init__.mojo
        \-- my_custom_op.mojo

    You can construct a Graph in main.py that uses Mojo custom op kernels simply by passing the directory:

    g = Graph(
        ...,
        custom_extensions=[Path(__file__).parent / "kernels"],
    )

    A change to your Mojo source code defining a custom op will be reflected immediately the next time the Graph is constructed.

  • Added a new image_pipeline example that demonstrates sequencing custom ops that modify an image, keeping the data on the GPU between ops before writing it back to the CPU and to disk.

Kernels

  • More compute overlap is now enabled for Hopper GPUs. This allows finer-grained scheduling of kernel operations by analyzing producer-consumer patterns within a compute kernel. As a result, there is more kernel compute overlap, especially for compute-heavy kernels with data-dependent execution paths.

GPU programming

  • Reduced the CUDA driver requirement to version 12.4 and the NVIDIA driver requirement to version 550. Supporting these earlier driver versions allows MAX to be more easily deployed on AWS and GCP, since these are the default versions used by those cloud providers.

  • Added support for programming NVIDIA Jetson Orin GPUs (sm_87).

Also see the Mojo changelog for GPU changes.

Mojo language

  • We recently open-sourced the rest of the Mojo standard library, including the algorithm, benchmark, buffer, compile, complex, gpu, and layout packages. See it all on GitHub.

  • We've also open-sourced all our MAX AI kernels. This new library includes kv_cache, layout, linalg, nn, nvml, and quantization.

For all the updates to the Mojo language, standard library, and tools, see the Mojo changelog.
