v25.5 (2025-08-05)
Highlights
- OpenAI-compatible batch API: The `/v1/batches` API is now available with Mammoth. We recently announced a partnership with SF Compute to make this API available through their dynamic GPU pricing marketplace. Their Large Scale Inference Batch API looks different from the `/v1/batches` API in Mammoth because it's a superset.
- New `mojo` Conda package: For Mojo-specific projects that run on CPUs and GPUs, you can now install the bare essentials with the `mojo` Conda package, which is less than 900 MB on disk. For example, this now works:

  ```sh
  pixi add mojo
  ```

  The `mojo` Python package is not available for pip/uv yet. For a complete model-development and serving toolkit, you should still install the `modular` package (which includes `mojo` as a dependency).
- Open-source graph APIs: We've added the `max.graph` Python APIs to our GitHub repo. We've made great strides in recent months to simplify these APIs that help you build high-performance models you can serve with MAX.
Documentation
- New Serve custom model architectures tutorial, with example code on GitHub.
- New guide for using LoRA adapters with MAX.
- Updated the Deploy Llama 3 on GPU tutorial with instructions for using AMD MI300X (on Azure).
- Added Pixi basics, which is where we redirect all the now-removed Magic docs (see our announcement about migrating Magic to Pixi).
MAX models
- Added support for the Idefics3 model.
MAX framework
- Removed all `torch` package dependencies.
  - Reduces the total installation size of `modular` (including dependencies) from 2.2 GB for CPUs and 6.5 GB for GPUs down to 1.5 GB for all Python packages. Conda packages pull in additional system dependencies, so sizes vary, but one example drops from 9.8 GB to 2.0 GB.
  - `pip install` no longer requires the `--extra-index-url https://download.pytorch.org/whl/cpu` option (which was needed to avoid installing the GPU version of `torch` and its many CUDA dependencies).
  - `uv pip install` no longer requires the `--index-strategy unsafe-best-match` option (which worked around package resolution issues caused by the `--extra-index-url` option above).
- Removed the HuggingFace fallback for model pipelines not natively supported in MAX (`PipelineEngine.HUGGINGFACE`), because it was almost never used and created significant tech debt.
Inference server
- Added the `/health` endpoint for service readiness checks, used by tools like lm-eval to determine when the service is ready to accept requests.
- Prefix caching now uses a Mojo token-hashing operation. Previously we used Python's built-in `hash()` function, which resulted in noticeable CPU overhead and reduced GPU utilization. In this release, we migrated the token hashing operation to an accelerated Mojo implementation.
- Re-implemented the OpenAI API's `logprobs` and `echo` request parameters to eliminate an expensive device transfer. The `--enable-echo` flag, which previously incurred a significant performance penalty, is now 9-12x faster.
- Added support for `file://` URIs in image inputs for multimodal models. Local file access is controlled via the `MAX_SERVE_ALLOWED_IMAGE_ROOTS` environment variable, which specifies a list of allowed root directories. Files are read asynchronously using aiofiles for better performance under high load. (See the request sketch after this list.)
- Improved function calling (tool use) to more reliably extract JSON tool calling responses for Llama models in an OpenAI-compatible format.
- Switched from XGrammar to llguidance for generating structured output (constrained decoding).
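As a minimal sketch of the `file://` image input mentioned above, the following sends a local image to a multimodal model through the OpenAI-compatible chat API. The server address, model name, and file paths are illustrative assumptions; the image must live under a directory listed in `MAX_SERVE_ALLOWED_IMAGE_ROOTS`:

```python
# Hedged sketch: pass a local image to a multimodal model via a file:// URI.
# Assumes a MAX server on localhost:8000 started with an allowed image root,
# e.g. MAX_SERVE_ALLOWED_IMAGE_ROOTS=/data/images (paths and model are examples).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL3-38B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                # The file must be under an allowed root directory.
                {
                    "type": "image_url",
                    "image_url": {"url": "file:///data/images/example.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```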
max CLI
- Added the `--vision-config-overrides` CLI option to override vision model configuration parameters. For example, to decrease InternVL's maximum dynamic patches from 12 to 6:

  ```sh
  max serve --model-path OpenGVLab/InternVL3-38B-Instruct \
    --vision-config-overrides '{"max_dynamic_patch": 6}'
  ```
- Removed the `--ignore-eos` CLI argument. The full set of OpenAI chat and completion sampling parameters is now supported in HTTP requests, so the parameter can simply be set in the request payload (see the sketch after this list).
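For example, here's a hedged sketch of setting this behavior per request through the OpenAI client's `extra_body` escape hatch. The `ignore_eos` field name is an assumption inferred from the removed CLI flag, and the server address and model are illustrative:

```python
# Hedged sketch: per-request replacement for the removed --ignore-eos flag.
# Assumes the server accepts an "ignore_eos" field in the request payload
# (name inferred from the old CLI flag) and runs on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Write a haiku about GPUs.",
    max_tokens=64,
    # Non-standard sampling parameters pass through extra_body.
    extra_body={"ignore_eos": True},
)
print(completion.choices[0].text)
```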
Python API
- Added the `max.interfaces` module. This module is intended to be a relatively import-free home for all shared interfaces across the MAX stack, and we will gradually move common interfaces into it. So far, we've moved the following from `max.pipelines.core` (see the migration sketch after this list):
  - Moved `TextGenerationStatus`, `TextResponse`, `TextGenerationResponse`, `InputContext`, and `PipelineTask` into `max.interfaces`.
  - Moved all `TokenGeneratorRequest`-prefixed objects into `max.interfaces` and renamed them with the `TextGenerationRequest` prefix.
  - Renamed `TextGenerationStatus` to `GenerationStatus`.
  - Consolidated `TextResponse` and `TextGenerationResponse` into `TextGenerationOutput`.
  - Renamed `EmbeddingsResponse` to `EmbeddingsOutput`.
- Added the `ops.scatter_nd` operation for scattering updates into a tensor at specified indices.
- Added `ops.avg_pool2d` and `ops.max_pool2d`.
- Added the `max.torch.graph_op` interface to make it simple to embed larger MAX computations and models inside PyTorch. These can use `max.nn` modules internally and may be used within `torch.nn` modules, allowing the use of MAX subcomponents for access to our high-performance graph compiler and Mojo kernel library:

  ```python
  import numpy as np
  import torch

  import max
  from max.dtype import DType
  from max.graph import ops

  @max.torch.graph_op
  def max_grayscale(pic: max.graph.TensorValue):
      # Weighted sum of the RGB channels (luma coefficients).
      scaled = pic.cast(DType.float32) * np.array([0.21, 0.71, 0.07])
      grayscaled = ops.sum(scaled, axis=-1).cast(pic.dtype)
      # MAX reductions don't remove the dimension, so squeeze it out.
      return ops.squeeze(grayscaled, axis=-1)

  @torch.compile
  def grayscale(pic: torch.Tensor):
      output = pic.new_empty(pic.shape[:-1])  # Remove color channel dimension
      max_grayscale(output, pic)  # Call in destination-passing style
      return output

  device = "cuda" if torch.cuda.is_available() else "cpu"
  img = (torch.rand(64, 64, 3, device=device) * 255).to(torch.uint8)
  result = grayscale(img)
  ```
- Moved `AlgebraicDim`, `Dim`, `StaticDim`, and `SymbolicDim` out of `max.type` and into `max.graph.dim`. You can still import them directly from `max.graph`.
- Moved `Shape` out of `max.type` and into `max.graph.shape`. You can still import it directly from `max.graph`.
- Removed the ability to pass Python objects into models and have them returned as Mojo `PythonObject` types in the kernels.
- Removed `RandomWeights`.
- Removed `Model.execute_legacy()`. Instead, use the standard `execute()` or `__call__()` methods.
- Removed TorchScript-related helper functions and APIs, including support for `.pt` TorchScript files in custom extensions.
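To go with the `max.interfaces` changes above, here's a hedged sketch of updating imports. The old paths in the comments reflect earlier releases, and the new names follow the renames listed in this section:

```python
# Hedged migration sketch for the max.interfaces moves listed above.
# Before (pre-25.5):
# from max.pipelines.core import (
#     TextGenerationStatus,
#     TextResponse,
#     TextGenerationResponse,
#     EmbeddingsResponse,
# )

# After (25.5):
from max.interfaces import (
    EmbeddingsOutput,      # formerly EmbeddingsResponse
    GenerationStatus,      # formerly TextGenerationStatus
    InputContext,
    PipelineTask,
    TextGenerationOutput,  # consolidates TextResponse and TextGenerationResponse
)
```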
Mojo language
For all the updates to the Mojo language, standard library, and tools, including all GPU programming changes, see the Mojo changelog.