
v25.2 (2025-03-25)

✨ Highlights

  • Support for NVIDIA Hopper GPUs

    MAX has been optimized to run on Hopper GPUs. For more information on MAX and NVIDIA's hardware, see the MAX container documentation.

  • Multi-GPU support

    MAX uses tensor parallelism to distribute work across multiple GPUs so you can run LLMs like Llama-3.3-70B-Instruct, even with long context windows.

  • Expanded library of MAX models

    We're rapidly growing our library of base model architectures that MAX can accelerate with MAX Serve (including Phi3ForCausalLM, OlmoForCausalLM, and GraniteForCausalLM). We also now support GPTQ for the Llama models. For more information, check out our MAX model repository.

  • Advanced E2E optimizations for long context windows

    In-flight batching, chunked prefill, and copy-on-write optimize execution for prefix-heavy and long-context-window scenarios.

  • GPU programming with Mojo

    Lots of new APIs are now available to enable both low-level GPU programming and abstracted programming patterns that simplify the code required to write GPU kernels for your AI models.

MAX Serve

  • Extended MAX Serve batch scheduling to account for the prefix cache. The scheduler can now create larger batches when many prompt tokens are already cached, improving throughput up to 10% in some benchmarks.

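To illustrate the idea (a simplified sketch, not MAX's actual scheduler), a batch scheduler that budgets by uncached prompt tokens can admit more requests when prefixes are already cached:

```python
# Sketch of prefix-cache-aware batching (illustrative only): budget
# batches by *uncached* prompt tokens, so requests whose prefixes are
# already in the KV cache cost less against the token budget.

def schedule_batch(requests, cached_prefix_lens, token_budget):
    """requests: list of (request_id, prompt_len) pairs.
    cached_prefix_lens: request_id -> prompt tokens already cached."""
    batch, used = [], 0
    for req_id, prompt_len in requests:
        uncached = prompt_len - cached_prefix_lens.get(req_id, 0)
        if used + uncached <= token_budget:
            batch.append(req_id)
            used += uncached
    return batch
```

With an empty cache, a 150-token budget admits only one 100-token prompt; with most of those prompts cached, all three fit in one batch.
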
  • Added support for in-flight batching, allowing token generation requests to be scheduled alongside context encoding requests to reduce inter-token latency. This behavior can be controlled with the --enable-in-flight-batch CLI argument.

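A minimal sketch of the scheduling idea (hypothetical, not MAX's implementation): decode requests cost one token each, so they can share a batch's token budget with prefill (context encoding) requests:

```python
# Toy in-flight batching: mix single-token decode steps with prefill
# requests in one batch, under a shared token budget (illustrative only).

def build_batch(decode_reqs, prefill_reqs, token_budget):
    # Decode requests cost one token each; schedule them first so
    # in-flight generations keep making progress.
    batch = [("decode", r) for r in decode_reqs[:token_budget]]
    remaining = token_budget - len(batch)
    for req_id, prompt_len in prefill_reqs:
        if prompt_len <= remaining:
            batch.append(("prefill", req_id))
            remaining -= prompt_len
    return batch
```
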
  • Added support for copy-on-write on KV blocks when using PagedAttention with Prefix Caching. This improves the prefix cache hit rate and prefill performance in some scenarios.

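Conceptually, copy-on-write keeps shared KV blocks read-only and forks a private copy only when a sequence writes to a block that other sequences still reference. A toy refcounting sketch (illustrative only, not MAX's code):

```python
class BlockPool:
    """Toy copy-on-write KV block table (illustrative only)."""

    def __init__(self):
        self.refcount = {}   # block_id -> number of sequences sharing it
        self.next_id = 0

    def alloc(self):
        bid = self.next_id
        self.next_id += 1
        self.refcount[bid] = 1
        return bid

    def share(self, bid):
        # Prefix-cache hit: a new sequence reuses an existing block.
        self.refcount[bid] += 1
        return bid

    def write(self, bid):
        # Copy-on-write: writing to a shared block forks a private copy;
        # an exclusively owned block is written in place.
        if self.refcount[bid] == 1:
            return bid
        self.refcount[bid] -= 1
        return self.alloc()
```
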
  • MAX Serve now supports transformers v4.49.0, with a patch to avoid graph breaks when using torch.compile() on Llama models.

  • Added support for recording HTTP traffic out to a file for diagnostics or later replay.

MAX models

  • Added support for executing LlamaForCausalLM architecture models on multiple GPUs. The model uses tensor parallelism automatically when passing multiple device IDs to the --devices CLI argument. Try running meta-llama/Llama-3.3-70B-Instruct on 4 GPUs with the following example:

    max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
      --quantization-encoding bfloat16 \
      --devices gpu:0,1,2,3 \
      --prompt="Design a self-sustaining colony on Neptune's moon Triton with a myth/science fusion name, three quantum tech breakthroughs, one ethical debate, a neon-lit cultural ritual, and a hidden flaw—presented in bullet points."
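
To illustrate what tensor parallelism does (a toy plain-Python sketch, not MAX's kernels): each device holds a column shard of a linear layer's weights and computes its slice of the output, and concatenating the shards reproduces the full result:

```python
# Toy column-wise tensor parallelism for a linear layer (illustrative).

def shard_columns(weight_cols, num_devices):
    # Split the weight matrix's columns evenly across devices.
    per = len(weight_cols) // num_devices
    return [weight_cols[i * per:(i + 1) * per] for i in range(num_devices)]

def tp_linear(x, weight_cols, num_devices):
    # Each "device" computes the output features for its column shard;
    # concatenating the shards gives the same result as one device.
    out = []
    for shard in shard_columns(weight_cols, num_devices):
        out.extend(sum(xi * wi for xi, wi in zip(x, col)) for col in shard)
    return out
```
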
  • Added support for the Phi3ForCausalLM model architecture (such as microsoft/phi-4). For example:

    max-pipelines generate \
      --model-path microsoft/phi-4 \
      --prompt "Write bubble sort in mojo"
  • Added support for the OlmoForCausalLM model architecture (such as allenai/OLMo-1B-0724-hf). For example:

    max-pipelines generate \
      --model-path allenai/OLMo-1B-0724-hf \
      --prompt "Write bubble sort in mojo"
  • Added support for the GraniteForCausalLM model architecture (such as ibm-granite/granite-3.1-8b-instruct). For example:

    max-pipelines generate \
      --model-path ibm-granite/granite-3.1-8b-instruct \
      --prompt "Write bubble sort in mojo"

  • We now support GPTQ quantization for models that run on the GPU. This is handled transparently when the model weights are specified. For example, this runs Llama 3.1 8B using int4-quantized GPTQ weights:

    max-pipelines generate \
      --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
      --prompt "Why is the sky blue?" \
      --max-batch-size 1 \
      --max-length 10000

    This reduces the total memory consumption of this model from ~16 GB to ~5 GB, allowing the model to fit in the RAM of smaller GPUs.

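The ~16 GB to ~5 GB figure is roughly what the arithmetic predicts: 8B parameters at 2 bytes each (bfloat16) versus 4 bits each, plus GPTQ's per-group scales and zero-points:

```python
# Back-of-envelope check of the ~16 GB -> ~5 GB figure for an 8B model.
params = 8e9
bf16_gb = params * 2 / 1e9    # bfloat16: 2 bytes per parameter
int4_gb = params * 0.5 / 1e9  # int4: 4 bits per parameter, packed
# GPTQ additionally stores per-group scales and zero-points, and some
# tensors may stay in higher precision, hence ~5 GB in practice.
```
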
  • Model weights are now downloaded in parallel.

  • Added constraints on whitespace during Structured Output. This reduces token counts and improves model adherence.

  • Added jump ahead decoding during Structured Output. This auto-completes tokens when a singular path forward is identified, improving single completion times by up to ~20% for long prompts.

  • In the event of an unhandled exception, we now use the standard Python traceback format instead of using pretty-printed Rich tracebacks.

  • You must now explicitly import LLM from max.entrypoints.llm rather than from max.entrypoints as before.

  • The max.pipelines.dataprocessing.tokenizer and max.pipelines.dataprocessing.gguf_utils modules have been removed.

  • The previously deprecated PipelineConfig.architecture field and its corresponding --architecture CLI argument have been removed.

max-pipelines CLI

  • The --devices CLI argument now supports a comma-separated list of GPU IDs prefixed with gpu: like --devices=gpu:0,1,2,3. We no longer support the previous --devices=gpu-<N> format.

    max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
      --quantization-encoding bfloat16 \
      --devices gpu:0,1,2,3 \
      --prompt="Design a self-sustaining colony on Neptune's moon Triton with a myth/science fusion name, three quantum tech breakthroughs, one ethical debate, a neon-lit cultural ritual, and a hidden flaw—presented in bullet points."
  • Removed --huggingface-repo-id PipelineConfig option and CLI argument in favor of --model-path.

  • We consolidated --model-path and --weight-path. A valid --weight-path value now overrides --model-path, which handles both local and remote (Hugging Face) cases. If the weights cannot be derived from --weight-path, we fall back to --model-path, which you must set explicitly.

  • Added the --huggingface-revision option to allow selecting a non-default branch or a specific commit in a Hugging Face model repository.

MAX Engine

  • The MAX graph compiler now has kernel caching. This is a significant improvement to our compilation pipeline. Here are some of the highlights:

    • Up to 28% faster compilation times when making iterative changes to models

    • Improved caching between different but similar models (up to 27% faster)

    • Lays the foundation for future caching optimizations

What does this mean for you? Faster development cycles! When you're working on model pipelines and making changes to the graph, the graph compiler will now intelligently reuse kernels that haven't changed, significantly reducing compilation times.

The improvements are particularly noticeable during iterative development, with compilation times dropping from ~80s to ~57s in some cases when compiling Llama 3.1 8B for 4 GPUs. Even when compiling different models from the same family (like Llama/Granite variants), you'll see significant speedups on subsequent compilations.
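
The mechanism can be pictured as a content-addressed cache: hash each kernel's IR and parameters, and recompile only on a miss. A toy sketch (purely illustrative; the real cache lives inside the graph compiler):

```python
import hashlib

class KernelCache:
    """Toy content-addressed kernel cache (illustrative only)."""

    def __init__(self):
        self.cache = {}
        self.compilations = 0

    def get(self, kernel_ir: str, params: tuple):
        # Key on the kernel body plus its compile-time parameters.
        key = hashlib.sha256(repr((kernel_ir, params)).encode()).hexdigest()
        if key not in self.cache:
            self.compilations += 1          # cache miss: compile
            self.cache[key] = f"binary-for-{key[:8]}"
        return self.cache[key]
```

Unchanged kernels hit the cache across recompiles; editing the IR or changing parameters produces a new key and a fresh compile.
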

Driver APIs

  • Added the Accelerator.can_access(other: Device) -> bool method to check whether one device can directly access the memory of another device.

  • Fixed a bug in max.driver.tensor.load_max_tensor() for bfloat16 dtype, which would cause an error about mmap size being too large.

  • max.driver.Tensor.item() now works on any single-element tensor (previously restricted to rank-0 tensors).

  • Added Device.synchronize(), which ensures all operations on the device complete before returning.

  • Removed MojoCallContextPtr in favor of DeviceContextPtr. MojoCallContextPtr only contained a DeviceContextPtr, so this change directly exposes the DeviceContextPtr. Custom ops using MojoCallContextPtr now directly take a DeviceContextPtr argument:

        @staticmethod
        fn execute[
            type: DType, rank: Int
        ](
            output: OutputTensor[type=type, rank=rank],
            input: InputTensor[type=type, rank=rank],
            ctx: MojoCallContextPtr,
        ):

    becomes

        @staticmethod
        fn execute[
            type: DType, rank: Int
        ](
            output: OutputTensor[type=type, rank=rank],
            input: InputTensor[type=type, rank=rank],
            ctx: DeviceContextPtr,
        ):
  • You can now skip compiling a GPU kernel before enqueueing it by passing the function directly to ctx.enqueue_function[func](...):

    fn func():
        print("Hello from GPU")
    
    @register("custom_op")
    struct CustomOp:
    
        @staticmethod
        fn execute(ctx: DeviceContextPtr) raises:
            var dev_ctx = ctx.get_device_context()
            dev_ctx.enqueue_function[func](grid_dim=1, block_dim=1)

    However, enqueueing a function this way incurs around 50-500 nanoseconds of overhead per enqueue. If you're reusing the same function and parameters multiple times, you can still compile the function once and pass the compiled result to ctx.enqueue_function:

    var compiled_func = ctx.compile_function[func]()
    # Multiple kernel launches with the same function/parameters
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
  • Changed Accelerator and CPU from factory methods that created Device objects in Python (backed by accelerator and CPU devices in the C++ implementation) into actual Python types.

    This allows type annotations in Python. For example, a list of accelerators used to be defined like this:

    graph_devices: list[DeviceRef]

    Now it can be defined like this:

    graph_devices: list[Accelerator]
  • Elementwise operations (e.g. __add__) have been removed from Tensor (that is, tensor_internal.Tensor). This Tensor type is being phased out; prefer LayoutTensor instead.

Graph APIs

  • The nn package is now max.nn.

  • Added ops.chunk to support chunking tensors along an axis.

  • Added support for while loops with ops.while_loop.

  • Added support for conditional execution with ops.cond.

  • Added axis reduction overloads for ops.min and ops.max. For example: ops.min(tensor, axis=-1).

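The axis semantics follow the usual reduction convention; in plain Python terms, axis=-1 reduces away the last dimension, so for a 2-D tensor you get the row-wise minimum:

```python
def min_last_axis(tensor_2d):
    # Equivalent in spirit to ops.min(x, axis=-1) on a 2-D tensor:
    # reduce the last dimension, keeping one minimum per row.
    return [min(row) for row in tensor_2d]
```
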
  • The gelu() function now accepts an approximate keyword argument, which selects the approximation: none, tanh, or fast.

  • Removed the roundeven() operation from the Python API. The round() operation now has the same behavior as roundeven(), so there is no need for both to exist.

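Round-half-to-even resolves ties toward the nearest even integer; Python's built-in round() follows the same rule, so it serves as a quick reference for the behavior:

```python
# Round-half-to-even ("banker's rounding"): ties go to the nearest even
# integer, so 0.5 and 2.5 both round toward the even neighbor.
results = [round(x) for x in (0.5, 1.5, 2.5, -0.5)]
```
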
  • Added helpers to create analogous tensors from buffer types and vice versa.

  • Added max.nn.Module, a base class for writing layers and constructing networks of layers (e.g. using max.nn.Sequential). Currently, this class supports graph building by ensuring that all weight names are unique and systematically generated. This class also supports managing the weight values with the module.state_dict() and module.load_state_dict() methods. More functionality and documentation will be added in future releases.
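
The naming and state-dict pattern can be sketched in a few lines of plain Python (a conceptual toy, not max.nn.Module itself): weight names are generated from the module hierarchy, so they stay unique and systematic:

```python
class Module:
    """Toy layer base class: hierarchical weight naming + state dicts."""

    def __init__(self):
        self.weights = {}    # local weight name -> value
        self.children = {}   # child name -> Module

    def state_dict(self, prefix=""):
        # Qualify each weight name by its path through the hierarchy.
        out = {prefix + name: v for name, v in self.weights.items()}
        for cname, child in self.children.items():
            out.update(child.state_dict(prefix + cname + "."))
        return out

    def load_state_dict(self, state, prefix=""):
        for name in self.weights:
            self.weights[name] = state[prefix + name]
        for cname, child in self.children.items():
            child.load_state_dict(state, prefix + cname + ".")
```
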

Custom ops

  • Changes have been made to the way that custom ops are registered: rather than using the num_dps_outputs attribute on @compiler.register to specify the number of outputs, that number is now inferred from the signature of the custom operation. Inputs to the operation now use the InputTensor type and outputs from the operation use OutputTensor, instead of the previous ManagedTensorSlice for both. This eliminates the need for a manual num_dps_outputs attribute, and makes it safer to work with these inputs and outputs by preventing accidental writes to input tensors. The new interface looks something like the following:

    @compiler.register("add_one_custom")
    struct AddOneCustom:
        @staticmethod
        fn execute[
            target: StringLiteral,
        ](
            out: OutputTensor,
            x: InputTensor[type = out.type, rank = out.rank],
            ctx: DeviceContextPtr,
        ) raises:
            @parameter
            @always_inline
            fn elementwise_add_one[
                width: Int
            ](idx: IndexList[x.rank]) -> SIMD[x.type, width]:
                return x.load[width](idx) + 1
    
            foreach[elementwise_add_one, target=target](out, ctx)
  • The foreach function is now marked as raising, so errors from within an elementwise calculation can propagate.

Hopper kernels

State-of-the-Art Kernels in Mojo for H100/H200 GPUs

  • Hopper Architecture Matrix Multiplication Kernels: The implementation achieved performance comparable to NVIDIA's highly optimized cuBLAS library. These kernels take full advantage of the Tensor Cores in Hopper architecture GPUs to accelerate the fundamental matrix multiplication operations that underpin deep learning workloads.

  • Multi-GPU AllReduce Implementation: The AllReduce operation is critical for distributed inference across multiple GPUs, as it efficiently aggregates partial results across devices. The Mojo implementation surpassed NVIDIA's NCCL library in performance benchmarks. This improvement reduces communication overhead during distributed inference.

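Functionally, AllReduce leaves the elementwise sum of every device's partial tensor on all devices. A minimal pure-Python reference (nothing like the optimized Mojo kernel, but it states the same contract):

```python
def all_reduce(per_device):
    # Each entry of per_device is one device's partial tensor; after
    # all-reduce, every device holds the elementwise sum of them all.
    total = [sum(vals) for vals in zip(*per_device)]
    return [list(total) for _ in per_device]
```
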
  • MAX Attention Kernel with Flash Attention 3: This implementation incorporates the latest Flash Attention 3 algorithm and extends it, which significantly accelerates the computation of attention mechanisms in transformer models. The MAX attention kernel optimizes memory access patterns and computational steps, reducing both the memory footprint and execution time of attention operations. This is particularly important for LLMs where attention calculations represent a substantial portion of the computational workload.

GPU programming

  • Added the Mojo max.driver API to enable dispatching GPU functions from Mojo.

Check out examples for GPU programming in Mojo, which use this new API.

Mojo

Mojo is a crucial component of the MAX stack that enables all of MAX's performance-oriented code across hardware. For all the updates to the Mojo language, standard library, and tools, see the Mojo changelog.

Documentation

New examples for writing custom ops:

  • fused_attention demonstrates complex GPU programming using MAX abstractions for a practical use in AI model development.

  • matrix_multiplication includes a series of progressive optimizations for matrix multiplications on GPUs.

  • histogram shows how to implement the histogram pattern as a custom op.

  • New examples for GPU programming in Mojo using the new MAX Driver API

    • These use a Mojo programming model that should look familiar to CUDA C programmers, showing how to define and dispatch GPU functions within a single Mojo file. The examples recreate the first three samples from the popular textbook "Programming Massively Parallel Processors", showing how basic concepts translate from CUDA into Mojo. Additionally, there's a Mandelbrot set calculation example that parallels a similar one in the existing custom ops examples.
  • New MAX containers available. For more information on the base and full MAX containers, see Container contents.
