
v25.5 (2025-08-05)

Highlights

  • OpenAI-compatible batch API: The /v1/batches API is now available with Mammoth.

    We recently announced a partnership with SF Compute to make this API available through their dynamic GPU pricing marketplace. Their Large Scale Inference Batch API is a superset of Mammoth's /v1/batches API, which is why the two look somewhat different.
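To use an OpenAI-compatible batch API, you first build a JSONL file where each line is one request. A minimal sketch of that file format, following the OpenAI Batch API convention (the model name here is a placeholder, and endpoint specifics may differ in Mammoth):

```python
import json

def build_batch_line(custom_id, model, messages):
    # One JSONL line per request: a custom_id for matching results,
    # the HTTP method, the target endpoint, and the request body.
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {"model": model, "messages": messages},
    })

lines = [
    build_batch_line(f"req-{i}", "my-model", [{"role": "user", "content": q}])
    for i, q in enumerate(["What is MAX?", "What is Mojo?"])
]
jsonl = "\n".join(lines)  # upload this file, then create a batch via /v1/batches
```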

  • New mojo Conda package: For Mojo-specific projects that run on CPUs and GPUs, you can now install the bare essentials with the mojo Conda package, which is less than 900 MB on disk. For example, this now works:

    pixi add mojo

    The mojo Python package is not available for pip/uv yet.

    For a complete model-development and serving toolkit, you should still install the modular package (which includes mojo as a dependency).

  • Open-source graph APIs: We've added the max.graph Python APIs to our GitHub repo. We've made great strides in recent months to simplify these APIs, which help you build high-performance models you can serve with MAX.

Documentation

MAX models

MAX framework

  • Removed all torch package dependencies.

    • Reduces the total installation size of modular (including dependencies) from 2.2 GB for CPUs and 6.5 GB for GPUs down to 1.5 GB for all Python packages. Conda packages pull in additional system dependencies, so sizes vary, but one representative configuration drops from 9.8 GB to 2.0 GB.

    • pip install no longer requires the --extra-index-url https://download.pytorch.org/whl/cpu option (which was to avoid installing the GPU version of torch that has a lot of CUDA dependencies).

    • uv pip install no longer requires the --index-strategy unsafe-best-match option (which was to avoid package resolution issues with the above --extra-index-url option).

  • Removed the HuggingFace fallback for model pipelines not natively supported in MAX (PipelineEngine.HUGGINGFACE), because it was almost never used and it created significant tech debt.

Inference server

  • Added the /health endpoint for service readiness checks, used by tools like lm-eval to determine when the service is ready to accept requests.
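A client can poll an endpoint like /health before sending traffic. Here is a minimal, generic readiness-polling helper (the injectable `probe` callable is an assumption for illustration; in practice it would issue an HTTP GET to the server's /health endpoint):

```python
import time

def wait_until_ready(probe, timeout=60.0, interval=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    # Repeatedly call probe() until it returns True or the timeout
    # elapses. probe() would typically GET the server's /health
    # endpoint and return True on a 200 response.
    deadline = clock() + timeout
    while clock() < deadline:
        if probe():
            return True
        sleep(interval)
    return False
```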

  • Prefix caching now uses a Mojo token hashing operation. Previously we used Python's built-in hash() function, which incurred noticeable CPU overhead and reduced GPU utilization. In this release, we migrated token hashing to an accelerated Mojo implementation.
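To see why prefix caching needs token hashing at all, here is a simplified sketch of block-wise chained hashing: two requests that share a prompt prefix produce identical leading block hashes, so their KV-cache blocks can be reused. Python's hash() stands in for the accelerated Mojo operation, and the block size is an arbitrary illustration:

```python
def prefix_block_hashes(tokens, block_size=16):
    # Chain a hash over fixed-size token blocks. Each block's hash
    # incorporates its parent's hash, so a hash match at block i
    # implies the entire prefix up to block i matches.
    hashes = []
    parent = 0
    full = len(tokens) - len(tokens) % block_size
    for start in range(0, full, block_size):
        block = tuple(tokens[start:start + block_size])
        parent = hash((parent, block))
        hashes.append(parent)
    return hashes
```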

  • Re-implemented the OpenAI API's logprobs and echo request parameters to eliminate an expensive device transfer. The --enable-echo flag, which previously incurred a significant performance penalty, is now 9-12x faster.

  • Added support for file:// URIs in image inputs for multimodal models. Local file access is controlled via the MAX_SERVE_ALLOWED_IMAGE_ROOTS environment variable, which specifies a list of allowed root directories. Files are read asynchronously using aiofiles for better performance under high load.
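The security-relevant part of file:// support is confining reads to the allowed roots. A minimal sketch of such a containment check, mirroring the kind of policy MAX_SERVE_ALLOWED_IMAGE_ROOTS enables (the server's actual logic may differ; the function name is hypothetical):

```python
from pathlib import Path
from urllib.parse import unquote, urlparse

def is_allowed_image_uri(uri, allowed_roots):
    # Accept only file:// URIs whose resolved path sits inside one of
    # the allowed root directories. Resolving the path first defeats
    # ".." traversal attempts.
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        return False
    path = Path(unquote(parsed.path)).resolve()
    for root in allowed_roots:
        root = Path(root).resolve()
        if path == root or root in path.parents:
            return True
    return False
```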

  • Improved function calling (tool use) to more reliably extract JSON tool calling responses for Llama models in an OpenAI-compatible format.
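The core extraction problem is recovering a JSON tool call from free-form model text. A simplified sketch of one way to do it (the server's parser handles many more edge cases than this):

```python
import json

def extract_tool_call(text):
    # Scan the model output for the first position where a valid JSON
    # object can be decoded, and return it parsed; None if no JSON
    # object is found.
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch == "{":
            try:
                obj, _ = decoder.raw_decode(text, i)
                return obj
            except json.JSONDecodeError:
                continue
    return None
```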

  • Switched from XGrammar to llguidance for generating structured output (constrained decoding).

max CLI

  • Added --vision-config-overrides CLI option to override vision model configuration parameters. For example, to decrease InternVL's maximum dynamic patches from 12 to 6:

    max serve --model-path OpenGVLab/InternVL3-38B-Instruct \
      --vision-config-overrides '{"max_dynamic_patch": 6}'
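Conceptually, the overrides JSON is merged into the model's vision config with leaf values from the overrides winning. A sketch of those merge semantics on plain dicts (this is an illustration, not the CLI's implementation):

```python
def apply_config_overrides(config, overrides):
    # Recursively merge `overrides` into `config`, returning a new
    # dict. Nested dicts merge key by key; any other override value
    # replaces the base value outright.
    merged = dict(config)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_config_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged
```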
  • Removed the --ignore-eos CLI argument. The full set of OpenAI chat and completion sampling parameters is now supported in HTTP requests, so ignore_eos can be set directly in the request payload instead.
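For example, a completion request body carrying ignore_eos might look like this (the model name is a placeholder, and the exact parameter spelling follows common OpenAI-compatible server conventions):

```python
import json

payload = {
    "model": "my-model",   # placeholder model name
    "prompt": "Hello",
    "max_tokens": 32,
    "ignore_eos": True,    # replaces the removed --ignore-eos CLI flag
}
body = json.dumps(payload)  # POST this to the server's /v1/completions
```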

Python API

  • Added the max.interfaces module. This module is intended to be a relatively import-free home for all shared interfaces across the MAX stack. Over time, we will move common interfaces into it. So far, we've moved the following from max.pipelines.core:

    • Moved TextGenerationStatus, TextResponse, TextGenerationResponse, InputContext, and PipelineTask into max.interfaces.

    • Moved all TokenGeneratorRequest-prefixed objects into max.interfaces and renamed with the TextGenerationRequest prefix.

    • Renamed TextGenerationStatus to GenerationStatus.

    • Consolidated TextResponse and TextGenerationResponse into TextGenerationOutput.

    • Renamed EmbeddingsResponse to EmbeddingsOutput.

  • Added ops.scatter_nd operation for scattering updates into a tensor at specified indices.
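The semantics of an N-D scatter are easy to state as a reference implementation: for each index tuple, write the matching update into a copy of the input. A sketch on nested Python lists (ops.scatter_nd itself operates on graph tensors):

```python
import copy

def scatter_nd(tensor, indices, updates):
    # Copy the input, then for each index tuple write the
    # corresponding update value at that position. Index tuples
    # address elements through nested lists, one axis per entry.
    out = copy.deepcopy(tensor)
    for idx, value in zip(indices, updates):
        target = out
        for axis in idx[:-1]:
            target = target[axis]
        target[idx[-1]] = value
    return out
```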

  • Added ops.avg_pool2d and ops.max_pool2d.
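For reference, here is what 2-D max pooling computes, sketched on a list-of-lists feature map with no padding and stride defaulting to the kernel size (ops.max_pool2d runs on graph tensors and supports more options):

```python
def max_pool2d(x, kernel, stride=None):
    # Slide a kh-by-kw window over the 2-D input and take the max of
    # each window. Windows that would run off the edge are dropped.
    kh, kw = kernel
    sh, sw = stride or kernel
    h, w = len(x), len(x[0])
    return [
        [
            max(x[i + di][j + dj] for di in range(kh) for dj in range(kw))
            for j in range(0, w - kw + 1, sw)
        ]
        for i in range(0, h - kh + 1, sh)
    ]
```

ops.avg_pool2d is analogous, averaging each window instead of taking its maximum.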

  • Added the max.torch.graph_op interface to make it simple to embed larger MAX computations and models inside PyTorch. These can use max.nn modules internally and may be used within torch.nn modules, letting you tap MAX subcomponents for access to our high-performance graph compiler and Mojo kernel library.

    import torch
    import numpy as np
    import max
    from max.dtype import DType
    from max.graph import ops
    
    @max.torch.graph_op
    def max_grayscale(pic: max.graph.TensorValue):
        scaled = pic.cast(DType.float32) * np.array([0.21, 0.71, 0.07])
        grayscaled = ops.sum(scaled, axis=-1).cast(pic.dtype)
        # MAX reductions keep the reduced dimension, so squeeze it out
        return ops.squeeze(grayscaled, axis=-1)
    
    @torch.compile
    def grayscale(pic: torch.Tensor):
        output = pic.new_empty(pic.shape[:-1])  # Remove color channel dimension
        max_grayscale(output, pic)  # Call as destination-passing style
        return output
    
    device = "cuda" if torch.cuda.is_available() else "cpu"
    img = (torch.rand(64, 64, 3, device=device) * 255).to(torch.uint8)
    result = grayscale(img)
  • Moved AlgebraicDim, Dim, StaticDim, and SymbolicDim out of max.type and into max.graph.dim. You can still import them directly from max.graph.

  • Moved Shape out of max.type and into max.graph.shape. You can still import it directly from max.graph.

  • Removed the ability to pass Python objects into models and have them returned as Mojo PythonObject types in the kernels.

  • Removed RandomWeights.

  • Removed Model.execute_legacy(). Instead use the standard execute() or __call__() methods.

  • Removed TorchScript-related helper functions and APIs, including support for .pt TorchScript files in custom extensions.

Mojo language

For all the updates to the Mojo language, standard library, and tools, including all GPU programming changes, see the Mojo changelog.
