MAX v26.3

Highlights

  • MAX now supports video generation with Wan 2.1 / 2.2 diffusion models, including image-to-video and video-to-video pipelines.

  • New API for multi-GPU model execution from Python: the max.experimental.sharding module lets a single Module.compile() call distribute a model across a DeviceMesh using Replicated, Sharded, and Partial placement primitives. Gemma 3 ModuleV3 is the first multi-GPU model on this path.

  • The MAX NVFP4 grouped matmul kernel now outperforms FlashInfer on B200 across all tested decoding and prefill shapes for Kimi K2.5.

Documentation

MAX models

  • The residual_threshold parameter for FLUX first-block cache (FBCache) is now a per-request runtime parameter on ImageProviderOptions, allowing it to be tuned without recompiling the model graph.

  • Added the Mamba state space model architecture.

  • Added the Step-3.5-Flash architecture.

  • Added the Qwen-Image and Qwen-Image-Edit text-to-image architectures.

  • Added the Z-Image and Z-Image-Turbo text-to-image architectures.

  • MiniMax-M2 and MiniMax-M2.7:

    • Added MiniMax-M2 and MiniMax-M2.7 architecture support, including FP8 weights, the lightning-attention hybrid backbone, and 4×H100 multi-GPU serving.
    • Enabled DP+EP execution paths for MiniMax MoE layers, with automatic overlap scheduling and device-graph capture.
    • Added per-rank token-limit checks and reduced input-offset device round trips on the MiniMax decode path.
  • Gemma 4 and Gemma 3 ModuleV3:

    • Added the Gemma 4 architecture (ModuleV2), including multimodal vision support.
    • Added the Gemma 3 ModuleV3 implementation with multi-GPU support via the DTensor / DistributedTensorType compile path.
    • Fixed token-offset and prompt-image alignment regressions in Gemma 4 multimodal prefill, plus assorted Gemma 3 ModuleV3 performance fixes.
  • Qwen3 and Qwen3-VL:

    • Added Qwen3 and Qwen3-VL architecture support, including the MoE variant and multimodal vision input.
  • Wan video diffusion:

    • Fixed Wan 2.1 / 2.2 video diffusion pipelines silently running without classifier-free guidance. The tokenizer gated negative-prompt tokenization on true_cfg_scale > 1.0 (default 1.0), so negative tokens were never produced and the executor fell back to unguided generation even when guidance_scale > 1.0 and a negative prompt were supplied. Wan now enables classical CFG whenever guidance_scale > 1.0 and defaults an absent negative prompt to the empty string, matching the diffusers baseline.
    • Added the UniPC multistep scheduler for Wan diffusion.
    • Added Wan image-to-video and video-to-video pipeline variants, plus additional generation kwargs and prompt-handling fixes.
  • FLUX.2:

    • Added TaylorSeer denoising cache support to the FLUX.2 Klein pipeline, enabling significant speedups for image-to-image generation by skipping redundant transformer passes during the denoising loop.
    • Added TeaCache support to DiffusionPipeline as a peer of TaylorSeer.
    • Added FLUX.2 ModuleV2 pipeline, FLUX.2 Klein support, NVFP4 quantization, aspect-ratio preserving image preprocessing, and BFL checkpoint weight fixes.
  • Kimi K2.5 vision:

    • Improved Kimi K2.5 multimodal support, including vision encoder fixes and tokenizer parity with the upstream model.
  • DeepSeek V3 and Kimi K2.5 distributed execution:

    • Improved tensor-parallel and expert-parallel execution paths for DeepSeek V3 and Kimi K2.5, including subgraph deduplication, MoE dispatch tuning, and reduced compile-time overhead.

MAX framework

Inference server

  • Added periodic "still building/compiling" log messages during model compilation so that long operations produce visible signs of progress.

  • Consolidated KV connector CLI flags (--host-kvcache-swap-space-gb, --disk-offload-dir, --disk-offload-max-gb, --disk-offload-direct-io, --lmcache-config-file) into the --kv-connector-config JSON dict.

  • Removed the --allow-safetensors-weights-fp32-bf16-bidirectional-cast CLI flag. Float32 <-> bfloat16 safetensors weight casting is now unconditionally enabled.

  • Added --model-override CLI flag for per-component ModelManifest overrides (e.g. --model-override transformer.quantization_encoding=float4_e2m1fnx2), enabling mixed quantization in diffusion pipelines.

  • Removed jump forward decoding (compute_ff_tokens) from structured output. The bitmask constraint alone ensures valid structured output, matching the approach used by vLLM and SGLang.

  • Added json_object response-format support to MAX Serve structured output via /v1/chat/completions.
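
    For example, with any OpenAI-compatible client pointed at a local MAX Serve endpoint (the model name below is a placeholder):

```python
from openai import OpenAI

# Point any OpenAI-compatible client at a local MAX Serve instance.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="my-model",  # placeholder; use the model your server is running
    messages=[{"role": "user", "content": "Describe a cat as a JSON object."}],
    response_format={"type": "json_object"},  # constrain output to valid JSON
)
print(resp.choices[0].message.content)
```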

  • Improved error handling for image request failures in MAX Serve.

  • Added multi-step and overlap-scheduler support for structured output in the TextGenerationPipeline. Extended tokenizer support to include TikToken-based tokenizers, enabling structured output with Kimi K2.5.

  • Improved cached-token reporting, fixed cache hit/miss metrics to emit only on context-encoding batches, moved a subset of telemetry from detailed to basic, and added per-draft-position acceptance-rate logging for speculative decoding.

  • Tightened the MODULAR_MAX_SERVE_* environment-variable prefix; unprefixed overrides previously honored by max-serve no longer apply.

  • Added min_p and top_k sampling controls and additional chat-completion kwargs to the OpenAI-compatible routes.

  • Unified EAGLE speculative decoding:

    • Added unified EAGLE pipelines for Llama 3, DeepSeek V3 + MTP, and Kimi K2.5, sharing a single PipelineModel.
    • Added support for --num-speculative-tokens > 1 across the unified EAGLE Llama, DeepSeek+MTP, and Kimi+EAGLE paths.
    • Added overlap-scheduler support for unified EAGLE, including multi-GPU DP setups (e.g. DP Kimi).
    • Enabled CUDA graphs for EAGLE and MTP.
  • Distributed KV transfer (dKV):

    • Added the DKVConnector with NIXL transfer support for the distributed KV cache.
    • Unified KV connector configuration under --kv-connector-config.
    • Added EFA compatibility, disconnect support, parent-hash eviction, and per-connector metrics for the dKV transfer engine.
    • Added a configurable decode-stall watchdog for 1P1D deployments.
    • Added disk-location support to the Python dKV client.
  • Heterogeneous serving and overlap scheduling:

    • Added two-phase prefill execution under the overlap scheduler for the distributed-inference (DI) prefill role.
    • Auto-enabled overlap scheduling for DI pipeline roles and disabled auto device-graph capture for prefill-only workers.
    • Added support for heterogeneous TP prefill / DP decode in MLA KV transfer (e.g. tp4 prefill into a DP decode pool).

max CLI

  • Added sweep benchmarking capabilities to max benchmark: iterate over multiple concurrency and request-rate combinations, flush the prefix cache between runs, and collect per-run structured JSON results.
  • Standardized the --model flag across max serve, max generate, max encode, and max warm-cache.
  • Improved max serve CLI flag descriptions.

Python API

  • Added Model.release_captured_graph(), which drops a previously captured device graph identified by graph key (or per-device keys) and frees its device-side working memory once any in-flight replay completes. Releasing a key that was never captured is a no-op. Callers remain responsible for dropping any output Buffer handles returned by the corresponding Model.capture() call.
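
    A minimal sketch of the lifecycle; the capture() call shape and the "decode" key below are illustrative assumptions, not the verified API:

```python
# Given a loaded max.engine Model as `model`.
# capture() arguments and the "decode" key are illustrative assumptions.
outputs = model.capture("decode", decode_inputs)  # returns output Buffer handles

# ... replay the captured graph during serving ...

model.release_captured_graph("decode")  # frees device-side working memory once
                                        # any in-flight replay completes; a
                                        # never-captured key is a no-op
del outputs  # callers still own the Buffer handles returned by capture()
```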

  • Added ops.roi_align (with F.roi_align functional wrapper) for ROI Align pooling over NHWC inputs, with configurable spatial scale, sampling ratio, alignment mode, and AVG/MAX pooling. Includes a matching MO eager handler.
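
    A sketch of the call shape; the parameter names below follow the common ROI Align signature and are assumptions rather than the verified MAX API:

```python
from max.experimental import functional as F

# `features` is an NHWC feature map and `rois` a (num_rois, 4) box tensor,
# both assumed to already exist. Parameter names below are assumptions.
pooled = F.roi_align(
    features,
    rois,
    output_size=(7, 7),
    spatial_scale=1 / 16,  # feature-map stride relative to the input image
    sampling_ratio=2,      # bilinear sample points per pooled bin
    aligned=True,          # half-pixel alignment mode
    mode="avg",            # average pooling; "max" selects MAX pooling
)
```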

  • Added MO eager handlers for ConstantExternalOp, ConstantScalarOp, ReduceRmsNormOp, and ReduceGroupNormOp, so graphs with external weights, scalar constants, RMS norm, or group norm run eagerly without falling back to compilation.

  • Fixed tensor slicing with negative integer indices (e.g. hidden[:, -1]), which previously raised a RuntimeError at compile time.

  • Fixed ops.reshape / TensorValue.reshape rejecting valid -1 reshapes on tensors whose leading dim is a symbolic sum-of-products (e.g. [(batch_size * num_steps) + total_seq_len, 1536] reshaped to [-1, n_heads, head_dim] with n_heads * head_dim == 1536). The inferred dim now simplifies without requiring a rebind.

  • Setting MODULAR_MAX_DEBUG_UNINITIALIZED_READ_CHECK=true (or the max-debug.uninitialized-read-check config key, or InferenceSession.debug.uninitialized_read_check = True) enables detection of uninitialized memory reads in Mojo kernels. InferenceSession automatically enables the debug allocator poison and compiles kernels with load-time poison checks for all float types. When a load matches a poison pattern, the process aborts with a descriptive message.
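
    A minimal sketch of the two Python-visible switches:

```python
import os

# Option 1: process-wide environment variable, set before session creation.
os.environ["MODULAR_MAX_DEBUG_UNINITIALIZED_READ_CHECK"] = "true"

# Option 2: per-session toggle, using the attribute path named above.
from max.engine import InferenceSession

session = InferenceSession()
session.debug.uninitialized_read_check = True
```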

  • Added support for the bfloat16 data type on ARM CPU devices in MAX graphs. Previously, session.load() raised a ValueError when a graph contained bf16 tensors targeting an ARM CPU.

  • Added DevicePlacementPolicy (Ignore, Warn, Error) to Graph to control behavior when CPU-only ops (ops.scatter, ops.cumsum, ops.nonzero, ops.tile) receive GPU tensors. The default (Warn) emits a UserWarning and falls back to CPU; Error raises ValueError instead. ops.cond and ops.while_loop always raise ValueError for GPU predicates.
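
    A sketch of opting into the strict policy; the import path and the Graph constructor keyword are assumptions (the note specifies only the policy values and their behavior):

```python
from max.graph import DevicePlacementPolicy, Graph

# device_placement_policy as a constructor keyword is an assumption.
with Graph(
    "example",
    device_placement_policy=DevicePlacementPolicy.Error,
) as graph:
    # CPU-only ops (ops.scatter, ops.cumsum, ops.nonzero, ops.tile) handed
    # GPU tensors now raise ValueError instead of warning and falling back.
    ...
```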

  • Fixed slow axis=None reductions (mean, sum, prod, max, min) in max.experimental.functional. The previous implementation flattened the tensor before reducing, serializing the work onto a single GPU block. Reductions now iterate axis-by-axis to preserve parallelism.

  • Renamed the public quantization APIs from Float8* to Quant* (including Float8Config → QuantConfig, parse_float8_config() → parse_quant_config(), and the quant modules in max.nn and max.pipelines.lib), reflecting that the config now covers FP8, NVFP4, and MXFP4 quantization.

  • max.diagnostics.gpu.BackgroundRecorder's sampling interval can now be configured.

  • Introduced CPUMetrics alongside the existing GPU diagnostics and open-sourced it under max.diagnostics.

  • Added Model.kernel_summaries for inspecting compiled kernels through the Python API.

  • Added a unified DebugConfig Python class (with nanobind bindings) and exposed DebugConfig and GraphDebugConfig in max.engine and max.graph.

  • Added a graph API for initializing and registering the runtime context (M::Context) from Python.

  • Improved max.experimental.functional.custom: compiled custom-op kernels are now cached, and eager-mode F.custom no longer recompiles on every call.

  • Fixed Module.compile() when unrealized tensors are used as weights.

  • Added the InputModality enum for specifying model input types and threaded it through the multimodal pipeline architectures.

  • Updated Tensor.to() and Module.to() to accept distributed device targets, including DeviceMapping and DeviceMesh.

  • max.experimental.Tensor is now distribution-aware: it carries a tuple of per-shard storages (driver.Buffers when realized, or TensorValue / BufferValue graph values when unrealized), paired with a DeviceMapping that maps those local shards onto the DeviceMesh.

  • Reworked max.experimental.functional from a single functional.py into a functional/ package: a distribution- and mesh-aware dispatch layer on top of the graph-compiler Python API, split cleanly into three op categories: creation_ops (tensor factories), spmd_ops (rule-based per-op SPMD dispatch), and collective_ops (allreduce_sum, allgather, reduce_scatter, etc.). Collectives are now applied per device-group along a chosen mesh axis so they dispatch correctly on multi-dimensional meshes, and a new transfer_to convenience op moves tensors between DeviceMappings.

  • Added max.experimental.sharding with the core types for distributed tensors (DeviceMesh; DeviceMapping with PlacementMapping and NamedMapping; placement primitives Replicated / Sharded / Partial; DistributedTensorType / DistributedBufferType; TensorLayout), plus a sharding.rules submodule of pure mapping-propagation rules (elementwise, matmul, reduction, shape, conv, pooling) that, for each op, either error out or reshard inputs to the proposed DeviceMappings and derive the resulting output DeviceMapping.

  • max.experimental.nn.Module.compile() now accepts DistributedTensorType symbolic inputs (not just TensorType), so distributed models can be built via the graph-compilation path in addition to running eagerly; gemma3_modulev3 is the first multi-GPU model wired up. DTensor support in MAX is still ongoing work and these APIs may evolve.
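
    A sketch of how these pieces fit together; every signature below is an assumption (the notes name the types, not their constructors), and as stated these APIs may evolve:

```python
from max.experimental.sharding import (
    DeviceMesh,
    DistributedTensorType,
    Sharded,
)

# Hypothetical shapes throughout, for a model already built as a
# max.experimental.nn.Module named `model`.
mesh = DeviceMesh([0, 1, 2, 3])    # 1-D mesh over four GPUs (assumed ctor)
x_type = DistributedTensorType(    # symbolic batch-sharded input (assumed ctor)
    dtype="bf16",
    shape=("batch", 4096),
    mesh=mesh,
    placement=Sharded(axis=0),     # weights could use Replicated / Partial
)
compiled = model.compile(x_type)   # one compile() call plans multi-GPU execution
```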

  • Added new graph ops (with matching max.experimental.functional wrappers): scatter_max, scatter_min, scatter_mul, scatter_nd_max, scatter_nd_min, scatter_nd_mul, non_maximum_suppression, resize_linear, resize_nearest, and resize_bicubic. The existing max.graph.ops.resize now delegates to these for BILINEAR, NEAREST, and BICUBIC interpolation modes. max.graph.ops.pad (and the functional wrapper) also accepts mode='reflect' and mode='edge' in addition to mode='constant'.
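
    For example, the new pad modes (the padding-spec format here is an assumption):

```python
from max.experimental import functional as F

# Given an existing tensor `x`: mode="reflect" and mode="edge" now work
# alongside mode="constant".
y = F.pad(x, [(0, 0), (2, 2)], mode="reflect")
```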

  • Expanded experimental eager-interpreter coverage so significantly more graphs run end-to-end without falling back to compilation. Added handlers for gather, gather_nd, argmax, argmin, split, scatter, scatter_nd, scatter_nd_add, scatter_add, scatter_max, scatter_min, scatter_mul, scatter_nd_max, scatter_nd_min, scatter_nd_mul, tile, band_part, top_k, bottom_k, nonzero, non_maximum_suppression, pad (constant on CPU/GPU; reflect and edge on CPU), conv2d, conv2d_transpose, max_pool2d, avg_pool2d (floor and ceil mode), resize_linear, resize_nearest, resize_bicubic, mo.mutable.store, mo.mutable.store.slice, and the distributed collectives distributed.allreduce.sum, distributed.allgather, distributed.scatter, distributed.broadcast, and distributed.reducescatter.sum. Most run on both CPU and GPU; CPU-only handlers are noted as such.

    • Rewrote the eager-interpreter mo.mutable.store.slice handler to write slices via a device-side Mojo kernel instead of a host numpy round-trip. GPU buffers no longer round-trip D→H→D on every call, and bfloat16 and float8_* dtypes are now supported (float4_e2m1fn remains unsupported).

  • Added defensive eager-interpreter handlers for mo.shape.from_tensor, mo.index.to_tensor, mo.buffer.create, mo.buffer.transfer, and mo.gather_sum so eager runs no longer crash if these internal ops survive canonicalization.

  • Improved experimental eager-interpreter performance by enabling multi-threaded CPU execution and removing unnecessary GPU device synchronization between op dispatches.

  • Added max.nn.StackedLinear for QKV-style stacked projections, with a fused (stacked=True) and an unfused (stacked=False) layout. Unfused mode opts into a new Module._omit_module_attr_name flag, which drops the wrapper's own attribute name from descendant weight FQNs, so self.qkv_proj = StackedLinear(names=["q_proj", "k_proj", "v_proj"], stacked=False) exposes weights at self_attn.q_proj.weight rather than self_attn.qkv_proj.q_proj.weight. This lets HuggingFace checkpoint names flow into models without per-architecture remapping in their weight_adapters.py; see the sketch below.
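
    A sketch of the unfused layout; constructor arguments beyond names and stacked (e.g. feature dimensions) are omitted here as assumptions:

```python
from max.nn import StackedLinear

# stacked=False drops the wrapper attribute name from descendant weight FQNs,
# so a parent module's `self.qkv_proj` exposes q_proj.weight / k_proj.weight /
# v_proj.weight directly, matching HuggingFace checkpoint names.
qkv_proj = StackedLinear(names=["q_proj", "k_proj", "v_proj"], stacked=False)
```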

  • Module.compile() now accepts a custom_extensions parameter for loading custom Mojo kernel libraries at graph construction time, fixing validation failures for kernels with struct-level parameters.

  • Fixed torch.compile(fullgraph=True) failing with an "Unsupported context manager" error when accessing CustomOpLibrary ops inside the compiled function. Ops are now eagerly compiled during library initialization.

  • Runtime and device graph performance:

    • Reduced device-graph launch overhead for single-graph models.
    • Parallelized device-graph instantiation and moved instantiation off the main execution threads.
    • Added parallel device-graph launches and a task-ID hint on AsyncRT algorithms.
    • Added a GPU health check during DeviceContext initialization.
    • Added NaN/Inf detection at compiled-region boundaries.
    • Improved Metal driver support with custom statuses and Metal log capture for Apple GPU print output.
    • Made CPUDeviceContext asynchronous and added enqueue_cpu_function / enqueue_cpu_range helpers for CPU kernel execution.
    • Auto-enabled device-graph capture for DeepSeek V3, Kimi, and Kimi K2.5 serving paths.

Custom ops

  • Added host-function and in-place memcpy custom ops, including mo.launch_host_func, mo.inplace_memcpy, an enqueueHostFunc Mojo binding on DeviceStream, and a cuLaunchHostFunc binding for the CUDA device stream.

MAX kernels

  • Added GPU kernel examples from the Programming Massively Parallel Processors (PMPP) textbook covering reductions, scans, histograms, sorting, sparse matrix operations, graph algorithms, convolutions, FlashAttention, and more.

  • Improved NVFP4 grouped matmul kernel performance, now outperforming FlashInfer across all tested decoding and prefill shapes for Kimi K2.5 on B200.

  • Optimized GPU layer_norm kernels with SIMD reductions, gamma/beta prefetch, and a simd_width*2 warp tiling dispatch path.

  • Optimized GPU pad_constant kernel with SIMD vectorization (simd_width=4) and added a kbench benchmark suite (bench_pad).

  • Improved GPU topk and argsort kernel performance by nearly 2x.

  • Optimized GPU concat with a flat-indexing kernel that avoids multi-dimensional index decomposition, using 128-bit vectorized loads with automatic fallback for unaligned shapes.

  • Optimized GPU topk stage-1 kernel with a per-thread register heap that caches the top-8 elements during a single scan pass, eliminating redundant global memory re-reads for the first 8 extraction iterations.

  • Moved partial_simd_load and partial_simd_store from buffer.buffer to linalg.utils and removed the buffer package. Update imports from from buffer.buffer import ... to from linalg.utils import ....

  • Blackwell (SM100) GPU performance:

    • Enabled the Mojo SM100 GEMM by default.
    • Added MXFP4 and MXFP8 block-scaled matmul on SM100, plus a KIND_MXF4 execution path.
    • Added a general grouped block-scaled matmul dispatch and MXFP4 support for the grouped path.
    • Enabled PDL for SM100 grouped NVFP4 / MXFP4 / MXFP8 GMM.
    • Improved the SM100 GEMV dispatcher and added GEMV split-K for GEMMs with small M and N.
    • Increased the SM100 GEMM C-tile N dispatch up to 64.
  • AMD GPU performance:

    • Added B300 support, including device-agnostic default block counts for allreduce and allgather.
    • Added a CDNA4 block-scaled MFMA wrapper.
    • Added MI355X TileTensor MHA (about +13% prefill at depth 128) and, more broadly, TileTensor-based AMD attention kernels.
    • Always enabled the gfx950 MHA prefill kernel and modernized AMD MHA/MLA decode with 16x16 MMA and FP8.
    • Added depth-512 paths for AMD RDNA GPUs and a 2-D convolution kernel for RDNA 3+ GPUs.
    • Added MXFP4 matmul and grouped matmul support on AMD.
  • Attention and state-space kernels:

    • Added sparse MLA decode (with qbf16 / FP8 KV variants) for SM100.
    • Added speculative-decoding sequence-length folding with numhead for the TP MLA decode dispatch.
    • Added gated delta-rule recurrence kernels for hybrid-attention models.
  • Expert-parallel (EP) kernels:

    • Added multi-device MO ops for EP dispatch and combine.
    • Added a grouped dynamic NVFP4 quantization kernel for MoE.
    • Added MXFP4 support to ep.dispatch and the mo.distributed.ep.dispatch.mxfp4 op.
    • Added a skip_a2a mode to EP dispatch and combine.
    • Fixed AMD GPU atomics in EP dispatch.
  • Collective communication kernels:

    • Unified the multimem and standard code paths in ReduceScatter.
    • Enabled PDL for allgather and updated ReduceScatter to use with_PDL().
    • Launched allgather kernels in parallel and set the allgather block count via a tuning table.
    • Added support for non-multiples of SIMD width in allreduce.
  • Fused transformer kernels:

    • Added a fused rope_split_store kernel and wired it into AttentionWithRope.
    • Added a fused RMSNorm + RoPE GPU kernel and a graph-compiler fusion pattern for mo.reduce.rms_norm.RoPE.
    • Added a GEMV + partial RMSNorm fusion path.

Breaking changes

  • Removed individual KV connector CLI flags (--host-kvcache-swap-space-gb, --disk-offload-dir, --disk-offload-max-gb, --disk-offload-direct-io, --lmcache-config-file). Use --kv-connector-config with a JSON dict instead.

  • max/python/max/benchmark/benchmark_throughput.py has been deprecated and will be removed in a future MAX release.

  • Removed Dim and DimList types from buffer.dimlist. Custom kernel code using these types should migrate to IntTuple and TileLayout from the layout package.

  • Removed PreTrainedPipelineTokenizer. Use the standard pipeline tokenizer resolution path instead.

  • Moved DenoisingCacheConfig from PipelineConfig to PipelineRuntimeConfig. Update call sites that constructed PipelineConfig(denoising_cache_config=...) to set the field on PipelineRuntimeConfig instead.

  • Replaced FluxPipelineOutput and Flux2PipelineOutput with a unified DiffusionPipelineOutput. Code that imports the old output types must switch to DiffusionPipelineOutput.

  • PipelineConfig now expects a models=ModelManifest(...) configuration for multi-component pipelines (transformer, VAE, text encoders, etc.). Pipelines that previously passed individual model paths or configs at the top level must migrate to a ModelManifest.

  • max-serve now requires the MODULAR_MAX_SERVE_* prefix for environment overrides. Unprefixed environment variables previously honored by max-serve no longer apply.

Fixed

  • Fixed MAX tools aborting at startup with std::filesystem::filesystem_error when $HOME is not traversable by the running UID (common in containerized CI where the image's build-time UID differs from the runtime UID). The config search now treats permission errors as "not found" and falls through to the next candidate. (Issue #6412)

  • Fixed enqueue_fill() taking O(N) HIP API calls for float64 buffers on AMD GPUs when the high and low 32-bit halves of the fill value differ (e.g., 2.0), reducing the call count to O(log N). (Issue #6417)

  • Fixed integer indexing into a graph tensor (e.g. x[0] on a (2, 3) tensor) failing graph compilation with 'mo.static.reshape' op input and output elements do not match. A reshape-through-slice optimization pattern was incorrectly rewriting the slice + squeeze pattern produced by integer indexing, generating a reshape whose element count did not match the input. (Issue #6440)

Mojo language

For all the updates to the Mojo language, standard library, and tools, see the Mojo release notes.
