v25.7 (2025-11-20)
Highlights
- The MAX Python API is now fully open-sourced on GitHub! As we expand our model repository, we're making significant progress on these APIs to simplify the effort to build production-ready GenAI models in Python. Some APIs are still experimental, but you can already build an LLM with them today.
Documentation
- New online book on building an LLM from scratch with MAX, using our experimental model APIs. It's a guided lesson in building GPT-2 with our Python API, explaining each component of the transformer model along the way. Like the Python APIs, the book is a work in progress; please report any issues on GitHub.
- All the planned parts of GPU Puzzles are now complete! Support for Apple silicon GPUs is also making steady progress.
- Tutorials on docs.modular.com are now integrated into the Guides section, indicated with a book icon in the left navigation.
- The `max` CLI docs are now generated from the CLI source.
MAX models
- Gemma3 now supports logprobs.
MAX framework
- Added support for bfloat16 models running on GPUs with ARM-based CPU hosts, such as Grace Hopper (GH200) and Grace Blackwell (GB200).
- Updated minimum NVIDIA GPU driver requirement to 580.
max CLI
- `max benchmark` can now run LoRA benchmarking for supported models and target modules.
- `max benchmark --collect-gpu-stats` can now collect AMD GPU statistics.
- `max serve --do-penalties` was renamed to `--enable-penalties` and is enabled by default. To disable penalties, specify `--no-enable-penalties`.
Python API
- Added support for Python 3.14.
- Removed support for Python 3.9.
- All MAX Python API modules are now open-sourced. In addition to those previously released, we've added `driver`, `dtype`, `engine`, `experimental`, `interfaces`, `kv_cache`, `mlir`, `nn`, `profiler`, `support`, `torch`, and more in our GitHub repo.
- Added the `max.profiler` module with the `Tracer` class to create and manage profiling spans based on runtime conditions, and the `@traced()` decorator to profile a whole function.
- Added `max.diagnostics.gpu` APIs to expose common GPU statistics, such as those reported by `nvidia-smi` or `rocm-smi`.
- Added the `max.kv_cache` package, which provides APIs to manage key-value caches used in transformer models. Not to be confused with the existing `max.nn.kv_cache` package, which includes kernels for KV caching.
- Removed the `KVCacheManager` class and merged it into the single `PagedKVCacheManager` implementation. During the merge, `prefetch()` was renamed to `maybe_reserve()`.
- Added `NullKVCacheManager` for compile-only mode, which avoids GPU memory allocation when compiling models without a physical GPU present.
- Added the `ResetPrefixCacheBackend` and `ResetPrefixCacheFrontend` classes for coordinating prefix cache resets between frontend and backend components.
- Added more APIs for text-to-speech (TTS) models, such as `AudioGenerationInputs` and `AudioGenerationOutput`.
- Changed the `LoRAConfig.max_num_loras` default to `1` (was `100`).
- The new `RequestID` class replaces the previous type alias to provide better type safety and consistency across the API.
- Removed `InputContext` and replaced it with the output-modality-specific `TextGenerationContext` and `EmbeddingsContext`.
- Added `ImageMetadata` and `VLMTextGenerationContext`.
- Added `max.nn.comm` with `Allreduce` and `Signals` for peer-to-peer communication in allreduce.
- `ops.gather()` no longer has a default `axis`; it must be specified explicitly (better matching PyTorch and NumPy).
- `Graph.add_subgraph()` now takes a `devices` argument, which allows subgraphs to take advantage of device-aware work scheduling.
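The `Tracer` and `@traced()` entries above describe span-based profiling. As a rough illustration of that pattern, here's a minimal pure-Python sketch; every name below is a hypothetical stand-in, not the `max.profiler` API:

```python
import time
from contextlib import contextmanager
from functools import wraps

# Collected (name, duration) spans -- stand-in for a real profiler backend.
_spans = []

@contextmanager
def span(name):
    """Record a named profiling span around a block of code."""
    start = time.perf_counter()
    try:
        yield
    finally:
        _spans.append((name, time.perf_counter() - start))

def traced(fn):
    """Decorator that wraps a whole function call in a span."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        with span(fn.__qualname__):
            return fn(*args, **kwargs)
    return wrapper

@traced
def tokenize(text):
    return text.split()

tokenize("hello world")        # recorded as the "tokenize" span
with span("decode"):           # manual span around an arbitrary block
    time.sleep(0.001)

print([name for name, _ in _spans])
```

The real API lets you open and close spans conditionally at runtime; the context-manager form above is one common way to express that shape in Python.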
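For readers new to KV caching: a key-value cache stores the attention keys and values already computed for each sequence, so decoding a new token only computes attention inputs for that token. A toy sketch of the idea (made-up names, not the `max.kv_cache` API):

```python
class SimpleKVCache:
    """Toy per-sequence key-value cache: claim, append, fetch, release."""

    def __init__(self):
        self._store = {}  # seq_id -> list of (key, value) pairs, one per token

    def claim(self, seq_id):
        self._store[seq_id] = []

    def append(self, seq_id, key, value):
        self._store[seq_id].append((key, value))

    def fetch(self, seq_id):
        return self._store[seq_id]

    def release(self, seq_id):
        del self._store[seq_id]

cache = SimpleKVCache()
cache.claim("req-0")
cache.append("req-0", key=[0.1], value=[0.2])
cache.append("req-0", key=[0.3], value=[0.4])
print(len(cache.fetch("req-0")))  # 2 cached token entries
```

A production manager like `PagedKVCacheManager` additionally handles paged GPU memory, reservation (`maybe_reserve()`), and prefix reuse, none of which this sketch attempts.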
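The explicit-`axis` requirement for `ops.gather()` follows NumPy and PyTorch semantics, where the axis determines whether you pick whole rows or elements within rows. A small pure-Python sketch of gathering along each axis of a 2-D input (illustrative only, not the MAX signature):

```python
def gather2d(x, indices, axis):
    """Gather from a 2-D nested list along an explicit axis."""
    if axis == 0:    # pick whole rows by index
        return [x[i] for i in indices]
    elif axis == 1:  # pick columns within each row
        return [[row[i] for i in indices] for row in x]
    raise ValueError("axis must be 0 or 1 for this sketch")

x = [[1, 2, 3],
     [4, 5, 6]]

print(gather2d(x, [1, 0], axis=0))  # [[4, 5, 6], [1, 2, 3]]
print(gather2d(x, [2, 0], axis=1))  # [[3, 1], [6, 4]]
```

Requiring the caller to spell out `axis` removes the ambiguity of a silent default, which is exactly why the MAX change matches the PyTorch/NumPy convention.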
Mojo API
- Renamed the `tensor_internal` package to `tensor` and removed the previous `tensor` stub. The API behaves the same, but the Mojo `tensor` docs have moved.
Mojo language
For all the updates to the Mojo language, standard library, and tools,
including all GPU programming and Layout/LayoutTensor changes, see the Mojo
changelog.