
v25.7 (2025-11-20)

Highlights

Documentation

  • New online book to build an LLM from scratch with MAX, using our experimental model APIs. This is a guided lesson on building GPT-2 with our Python API, explaining each component of the transformer model along the way. Like the Python APIs, the book is a work in progress—please report any issues on GitHub.

  • All the planned parts of GPU Puzzles are now complete! Support for Apple silicon GPUs is also making steady progress.

  • Tutorials on docs.modular.com are now integrated into the Guides section, indicated with a book icon in the left navigation.

  • The max CLI docs are now generated from the CLI source.

MAX models

  • Gemma3 now supports logprobs.

MAX framework

  • Added support for bfloat16 models running on GPUs with ARM-based CPU hosts, such as Grace Hopper (GH200) and Grace Blackwell (GB200).
  • Updated minimum NVIDIA GPU driver requirement to 580.

max CLI

  • max benchmark can now run LoRA benchmarking for supported models and target modules.

  • max benchmark --collect-gpu-stats can now collect AMD GPU statistics.

  • max serve --do-penalties was renamed to --enable-penalties and is now enabled by default. To disable penalties, specify --no-enable-penalties.

Python API

  • Added support for Python 3.14.

  • Removed support for Python 3.9.

  • All MAX Python API modules are now open-sourced. In addition to those previously released, we've added driver, dtype, engine, experimental, interfaces, kv_cache, mlir, nn, profiler, support, torch, and more in our GitHub repo.

  • Added the max.profiler module with the Tracer class to create and manage profiling spans based on runtime conditions, and the `@traced()` decorator to profile a whole function.
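
The span-plus-decorator pattern behind this can be sketched with the standard library. The Tracer and traced names below only mirror the new module; the toy implementation (recording names and durations into a list instead of feeding a real profiler backend) is an illustration, not the max.profiler API:

```python
import functools
import time

# Toy stand-in for a profiling tracer: real spans would feed a profiler
# backend, while this sketch just records (name, duration) tuples.
class Tracer:
    def __init__(self):
        self.spans = []
        self._stack = []

    def push(self, name):
        # Open a span: remember its name and start time.
        self._stack.append((name, time.perf_counter()))

    def pop(self):
        # Close the innermost open span and record its duration.
        name, start = self._stack.pop()
        self.spans.append((name, time.perf_counter() - start))

def traced(tracer):
    """Decorator form: profile an entire function as one span."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            tracer.push(fn.__name__)
            try:
                return fn(*args, **kwargs)
            finally:
                tracer.pop()
        return inner
    return wrap

tracer = Tracer()

@traced(tracer)
def decode_step():
    time.sleep(0.001)

decode_step()
print([name for name, _ in tracer.spans])  # ['decode_step']
```

Because the span is closed in a finally block, the decorator records a duration even when the wrapped function raises.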

  • Added max.diagnostics.gpu APIs that expose common GPU statistics, such as those reported by nvidia-smi or rocm-smi.
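
For context on the kind of statistics involved, here is a small sketch that parses the CSV output of nvidia-smi's query mode into per-GPU dicts. The parse_gpu_stats helper and the field choices are illustrative assumptions, not part of the max.diagnostics.gpu API:

```python
import csv
import io

# Example output in the shape produced by:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits
sample = "0, 87, 61250, 81920\n1, 12, 1024, 81920\n"

def parse_gpu_stats(text):
    """Turn nvidia-smi-style CSV rows into dicts of per-GPU statistics."""
    stats = []
    for row in csv.reader(io.StringIO(text)):
        idx, util, used, total = (field.strip() for field in row)
        stats.append({
            "index": int(idx),
            "utilization_pct": int(util),
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
        })
    return stats

print(parse_gpu_stats(sample)[0]["utilization_pct"])  # 87
```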

  • Added the max.kv_cache package, which provides APIs to manage key-value caches used in transformer models. Not to be confused with the existing max.nn.kv_cache package that includes kernels for KV caching.

  • Removed the KVCacheManager class, folding it into the single PagedKVCacheManager implementation. As part of the merge, prefetch() was renamed to maybe_reserve().

  • Added NullKVCacheManager for compile-only mode, which avoids GPU memory allocation when compiling models without a physical GPU present.
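
A compile-only manager like this is an instance of the null-object pattern: it satisfies the same interface as the real manager but allocates nothing. The class and method names below are hypothetical stand-ins, not the actual MAX classes:

```python
# Hypothetical sketch of the null-object pattern behind a compile-only
# KV cache manager. PagedCache stands in for a real paged manager.
class PagedCache:
    def __init__(self, num_pages):
        # In a real manager this would reserve device memory.
        self.pages = [bytearray(16) for _ in range(num_pages)]

    def maybe_reserve(self, n):
        return len(self.pages) >= n

class NullCache:
    """Drop-in replacement when no physical GPU is present."""
    def maybe_reserve(self, n):
        # Pretend reservation always succeeds; nothing is allocated.
        return True

def compile_model(cache):
    # Compilation only exercises the interface, not real memory,
    # so either cache works here.
    return cache.maybe_reserve(4)

assert compile_model(NullCache())
assert compile_model(PagedCache(8))
```

The payoff is that code paths like compile_model never need to branch on whether a GPU exists; they just call the shared interface.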

  • Added ResetPrefixCacheBackend and ResetPrefixCacheFrontend classes for coordinating prefix cache resets between frontend and backend components.

  • Added more APIs for text-to-speech (TTS) models, such as AudioGenerationInputs and AudioGenerationOutput.

  • Changed LoRAConfig.max_num_loras default to 1 (was 100).

  • The new RequestID class replaces the previous type alias, providing better type safety and consistency across the API.
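
The motivation can be shown with a small stdlib sketch (the methods below are illustrative, not the real RequestID definition): a bare alias like RequestID = str lets any string through, while a distinct class catches accidental swaps at runtime and in type checkers:

```python
# With a bare alias, any string passes for a request ID:
RequestIDAlias = str  # type checkers treat this as plain str

# A distinct wrapper type makes accidental swaps detectable:
class RequestID:
    def __init__(self, value: str):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, RequestID) and self.value == other.value

    def __hash__(self):
        return hash(("RequestID", self.value))

def cancel(request_id: RequestID) -> str:
    if not isinstance(request_id, RequestID):
        raise TypeError("expected a RequestID, got " + type(request_id).__name__)
    return request_id.value

print(cancel(RequestID("req-42")))  # req-42
```

Passing a raw string such as a session token to cancel() now fails loudly instead of silently cancelling the wrong thing.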

  • Removed InputContext and replaced it with the output-modality-specific TextGenerationContext and EmbeddingsContext.

  • Added ImageMetadata and VLMTextGenerationContext.

  • Added max.nn.comm with Allreduce and Signals for peer-to-peer communication in allreduce.

  • ops.gather() no longer has a default axis; the axis must be specified explicitly (better matching PyTorch and NumPy).
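
The NumPy analogue shows why an explicit axis matters: the same indices produce entirely different results depending on which axis they select along (this uses numpy's np.take for illustration, not ops.gather itself):

```python
import numpy as np

x = np.array([[10, 20, 30],
              [40, 50, 60]])
idx = np.array([0, 2])

# Gathering along axis 1 selects columns within each row...
print(np.take(x, idx, axis=1).tolist())  # [[10, 30], [40, 60]]
# ...while axis 0 selects whole rows, so the choice of axis
# changes both the values and the result shape:
print(np.take(x, np.array([1]), axis=0).tolist())  # [[40, 50, 60]]
```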

  • Graph.add_subgraph() has been updated to take a devices argument. This allows subgraphs to take advantage of device-aware work scheduling.

Mojo API

  • Renamed the tensor_internal package to tensor and removed the previous tensor stub. The API behaves the same, but the Mojo tensor docs have moved.

Mojo language

For all the updates to the Mojo language, standard library, and tools, including all GPU programming and Layout/LayoutTensor changes, see the Mojo changelog.
