MAX v26.3

Highlights

  • MAX now supports video generation with Wan 2.1 / 2.2 diffusion models, including image-to-video and video-to-video pipelines.

  • New API for multi-GPU model execution from Python: the max.experimental.sharding module lets a single Module.compile() call distribute a model across a DeviceMesh using Replicated, Sharded, and Partial placement primitives. Gemma 3 ModuleV3 is the first multi-GPU model on this path.

  • The MAX NVFP4 grouped matmul kernel now outperforms FlashInfer on B200 across all tested decoding and prefill shapes for Kimi K2.5.

Documentation

MAX models

  • The residual_threshold parameter for FLUX first-block cache (FBCache) is now a per-request runtime parameter on ImageProviderOptions, allowing it to be tuned without recompiling the model graph.

  • Added the Mamba state space model architecture.

  • Added the Step-3.5-Flash architecture.

  • Added the Qwen-Image and Qwen-Image-Edit text-to-image architectures.

  • Added the Z-Image and Z-Image-Turbo text-to-image architectures.

  • MiniMax-M2 and MiniMax-M2.7:

    • Added MiniMax-M2 and MiniMax-M2.7 architecture support, including FP8 weights, the lightning-attention hybrid backbone, and 4×H100 multi-GPU serving.
    • Enabled DP+EP execution paths for MiniMax MoE layers, with automatic overlap scheduling and device-graph capture.
    • Added per-rank token-limit checks and reduced input-offset device round trips on the MiniMax decode path.
  • Gemma 4 and Gemma 3 ModuleV3:

    • Added the Gemma 4 architecture (ModuleV2), including multimodal vision support.
    • Added the Gemma 3 ModuleV3 implementation with multi-GPU support via the DTensor / DistributedTensorType compile path.
    • Fixed token-offset and prompt-image alignment regressions in Gemma 4 multimodal prefill, plus assorted Gemma 3 ModuleV3 performance fixes.
  • Qwen3 and Qwen3-VL:

    • Added Qwen3 and Qwen3-VL architecture support, including the MoE variant and multimodal vision input.
  • Wan video diffusion:

    • Fixed Wan 2.1 / 2.2 video diffusion pipelines silently running without classifier-free guidance. The tokenizer gated negative-prompt tokenization on true_cfg_scale > 1.0 (default 1.0), so negative tokens were never produced and the executor fell back to unguided generation even when guidance_scale > 1.0 and a negative prompt were supplied. Wan now enables classical CFG whenever guidance_scale > 1.0 and defaults an absent negative prompt to the empty string, matching the diffusers baseline.
    • Added the UniPC multistep scheduler for Wan diffusion.
    • Added Wan image-to-video and video-to-video pipeline variants, plus additional generation kwargs and prompt-handling fixes.
  • FLUX.2:

    • Added TaylorSeer denoising cache support to the FLUX.2 Klein pipeline, enabling significant speedups for image-to-image generation by skipping redundant transformer passes during the denoising loop.
    • Added TeaCache support to DiffusionPipeline as a peer of TaylorSeer.
    • Added FLUX.2 ModuleV2 pipeline, FLUX.2 Klein support, NVFP4 quantization, aspect-ratio preserving image preprocessing, and BFL checkpoint weight fixes.
  • Kimi K2.5 vision:

    • Improved Kimi K2.5 multimodal support, including vision encoder fixes and tokenizer parity with the upstream model.
  • DeepSeek V3 and Kimi K2.5 distributed execution:

    • Improved tensor-parallel and expert-parallel execution paths for DeepSeek V3 and Kimi K2.5, including subgraph deduplication, MoE dispatch tuning, and reduced compile-time overhead.

MAX framework

Inference server

  • Added periodic "still building/compiling" log messages during model compilation so that long operations produce visible signs of progress.

  • Consolidated KV connector CLI flags (--host-kvcache-swap-space-gb, --disk-offload-dir, --disk-offload-max-gb, --disk-offload-direct-io, --lmcache-config-file) into the --kv-connector-config JSON dict.

  • Removed the --allow-safetensors-weights-fp32-bf16-bidirectional-cast CLI flag. Float32 <-> bfloat16 safetensors weight casting is now unconditionally enabled.

  • Added --model-override CLI flag for per-component ModelManifest overrides (e.g. --model-override transformer.quantization_encoding=float4_e2m1fnx2), enabling mixed quantization in diffusion pipelines.

  • Removed jump forward decoding (compute_ff_tokens) from structured output. The bitmask constraint alone ensures valid structured output, matching the approach used by vLLM and SGLang.

  • Added json_object response-format support to MAX Serve structured output via /v1/chat/completions.
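
    For example, with any OpenAI-compatible client pointed at a local MAX Serve endpoint (the model name below is a placeholder):

```python
from openai import OpenAI

# Point any OpenAI-compatible client at a local MAX Serve instance.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="my-model",  # placeholder; use the model your server is running
    messages=[{"role": "user", "content": "Describe a cat as a JSON object."}],
    response_format={"type": "json_object"},  # constrain output to valid JSON
)
print(resp.choices[0].message.content)
```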

  • Improved error handling for image request failures in MAX Serve.

  • Added multi-step and overlap-scheduler support for structured output in the TextGenerationPipeline. Extended tokenizer support to include TikToken-based tokenizers, enabling structured output with Kimi K2.5.

  • Improved cached-token reporting, fixed cache hit/miss metrics to emit only on context-encoding batches, moved a subset of telemetry from detailed to basic, and added per-draft-position acceptance-rate logging for speculative decoding.

  • Tightened the MODULAR_MAX_SERVE_* environment-variable prefix; unprefixed overrides previously honored by max-serve no longer apply.

  • Added min_p and top_k sampling controls and additional chat-completion kwargs to the OpenAI-compatible routes.

  • Unified EAGLE speculative decoding:

    • Added unified EAGLE pipelines for Llama 3, DeepSeek V3 + MTP, and Kimi K2.5, sharing a single PipelineModel.
    • Added support for --num-speculative-tokens > 1 across the unified EAGLE Llama, DeepSeek+MTP, and Kimi+EAGLE paths.
    • Added overlap-scheduler support for unified EAGLE, including multi-GPU DP setups (e.g. DP Kimi).
    • Enabled CUDA graphs for EAGLE and MTP.
  • Distributed KV transfer (dKV):

    • Added the DKVConnector with NIXL transfer support for the distributed KV cache.
    • Unified KV connector configuration under --kv-connector-config.
    • Added EFA compatibility, disconnect support, parent-hash eviction, and per-connector metrics for the dKV transfer engine.
    • Added a configurable decode-stall watchdog for 1P1D deployments.
    • Added disk-location support to the Python dKV client.
  • Heterogeneous serving and overlap scheduling:

    • Added two-phase prefill execution under the overlap scheduler for the distributed-inference (DI) prefill role.
    • Auto-enabled overlap scheduling for DI pipeline roles and disabled auto device-graph capture for prefill-only workers.
    • Added support for heterogeneous TP prefill / DP decode in MLA KV transfer (e.g. tp4 prefill into a DP decode pool).

max CLI

  • Added sweep benchmarking capabilities to max benchmark: iterate over multiple concurrency and request-rate combinations, flush the prefix cache between runs, and collect per-run structured JSON results.
  • Standardized the --model flag across max serve, max generate, max encode, and max warm-cache.
  • Improved max serve CLI flag descriptions.

Python API

  • Added Model.release_captured_graph(), which drops a previously captured device graph identified by graph key (or per-device keys) and frees its device-side working memory once any in-flight replay completes. Releasing a key that was never captured is a no-op. Callers remain responsible for dropping any output Buffer handles returned by the corresponding Model.capture() call.
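
    A minimal sketch of the lifecycle; the capture() call shape and the "decode" key below are illustrative assumptions, not the verified API:

```python
# Given a loaded max.engine Model as `model`.
# capture() arguments and the "decode" key are illustrative assumptions.
outputs = model.capture("decode", decode_inputs)  # returns output Buffer handles

# ... replay the captured graph during serving ...

model.release_captured_graph("decode")  # frees device-side working memory once
                                        # any in-flight replay completes; a
                                        # never-captured key is a no-op
del outputs  # callers still own the Buffer handles returned by capture()
```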

  • Added ops.roi_align (with F.roi_align functional wrapper) for ROI Align pooling over NHWC inputs, with configurable spatial scale, sampling ratio, alignment mode, and AVG/MAX pooling. Includes a matching MO eager handler.
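
    A sketch of the call shape; the parameter names below follow the common ROI Align signature and are assumptions rather than the verified MAX API:

```python
from max.experimental import functional as F

# `features` is an NHWC feature map and `rois` a (num_rois, 4) box tensor,
# both assumed to already exist. Parameter names below are assumptions.
pooled = F.roi_align(
    features,
    rois,
    output_size=(7, 7),
    spatial_scale=1 / 16,  # feature-map stride relative to the input image
    sampling_ratio=2,      # bilinear sample points per pooled bin
    aligned=True,          # half-pixel alignment mode
    mode="avg",            # average pooling; "max" selects MAX pooling
)
```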

  • Added MO eager handlers for ConstantExternalOp, ConstantScalarOp, ReduceRmsNormOp, and ReduceGroupNormOp, so graphs with external weights, scalar constants, RMS norm, or group norm run eagerly without falling back to compilation.

  • Fixed tensor slicing with negative integer indices (e.g. hidden[:, -1]), which previously raised a RuntimeError at compile time.

  • Fixed ops.reshape / TensorValue.reshape rejecting valid -1 reshapes on tensors whose leading dim is a symbolic sum-of-products (e.g. [(batch_size * num_steps) + total_seq_len, 1536] reshaped to [-1, n_heads, head_dim] with n_heads * head_dim == 1536). The inferred dim now simplifies without requiring a rebind.

  • Setting MODULAR_MAX_DEBUG_UNINITIALIZED_READ_CHECK=true (or the max-debug.uninitialized-read-check config key, or InferenceSession.debug.uninitialized_read_check = True) enables detection of uninitialized memory reads in Mojo kernels. InferenceSession automatically enables the debug allocator poison and compiles kernels with load-time poison checks for all float types. When a load matches a poison pattern, the process aborts with a descriptive message.
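
    A minimal sketch of the two Python-visible switches:

```python
import os

# Option 1: process-wide environment variable, set before session creation.
os.environ["MODULAR_MAX_DEBUG_UNINITIALIZED_READ_CHECK"] = "true"

# Option 2: per-session toggle, using the attribute path named above.
from max.engine import InferenceSession

session = InferenceSession()
session.debug.uninitialized_read_check = True
```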

  • Added support for the bfloat16 data type on ARM CPU devices in MAX graphs. Previously, session.load() raised a ValueError when a graph contained bf16 tensors targeting an ARM CPU.

  • Added DevicePlacementPolicy (Ignore, Warn, Error) to Graph to control behavior when CPU-only ops (ops.scatter, ops.cumsum, ops.nonzero, ops.tile) receive GPU tensors. The default (Warn) emits a UserWarning and falls back to CPU; Error raises ValueError instead. ops.cond and ops.while_loop always raise ValueError for GPU predicates.
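
    A sketch of opting into the strict policy; the import path and the Graph constructor keyword are assumptions (the note specifies only the policy values and their behavior):

```python
from max.graph import DevicePlacementPolicy, Graph

# device_placement_policy as a constructor keyword is an assumption.
with Graph(
    "example",
    device_placement_policy=DevicePlacementPolicy.Error,
) as graph:
    # CPU-only ops (ops.scatter, ops.cumsum, ops.nonzero, ops.tile) handed
    # GPU tensors now raise ValueError instead of warning and falling back.
    ...
```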

  • Fixed slow axis=None reductions (mean, sum, prod, max, min) in max.experimental.functional. The previous implementation flattened the tensor before reducing, serializing the work onto a single GPU block. Reductions now iterate axis-by-axis to preserve parallelism.

  • Renamed the public quantization APIs from Float8* to Quant* (including Float8Config → QuantConfig, parse_float8_config() → parse_quant_config(), and the quant modules in max.nn and max.pipelines.lib), reflecting that the config now covers FP8, NVFP4, and MXFP4 quantization.

  • max.diagnostics.gpu.BackgroundRecorder's sampling interval can now be configured.

  • Introduced CPUMetrics alongside the existing GPU diagnostics and open-sourced it under max.diagnostics.

  • Added Model.kernel_summaries for inspecting compiled kernels through the Python API.

  • Added a unified DebugConfig Python class (with nanobind bindings) and exposed DebugConfig and GraphDebugConfig in max.engine and max.graph.

  • Added a graph API for initializing and registering the runtime context (M::Context) from Python.

  • Improved max.experimental.functional.custom: compiled custom-op kernels are now cached, and eager-mode F.custom no longer recompiles on every call.

  • Fixed Module.compile() when unrealized tensors are used as weights.

  • Added the InputModality enum for specifying model input types and threaded it through the multimodal pipeline architectures.

  • Updated Tensor.to() and Module.to() to accept distributed device targets, including DeviceMapping and DeviceMesh.

  • max.experimental.Tensor is now distribution-aware: it carries a tuple of per-shard storages (driver.Buffers when realized, or TensorValue / BufferValue graph values when unrealized), paired with a DeviceMapping that maps those local shards onto the DeviceMesh.

  • Reworked max.experimental.functional from a single functional.py into a functional/ package: a distribution- and mesh-aware dispatch layer on top of the graph-compiler Python API, split cleanly into three op categories: creation_ops (tensor factories), spmd_ops (rule-based per-op SPMD dispatch), and collective_ops (allreduce_sum, allgather, reduce_scatter, etc.). Collectives are now applied per device-group along a chosen mesh axis so they dispatch correctly on multi-dimensional meshes, and a new transfer_to convenience op moves tensors between DeviceMappings.

  • Added max.experimental.sharding with the core types for distributed tensors (DeviceMesh; DeviceMapping with PlacementMapping and NamedMapping; placement primitives Replicated / Sharded / Partial; DistributedTensorType / DistributedBufferType; TensorLayout), plus a sharding.rules submodule of pure mapping-propagation rules (elementwise, matmul, reduction, shape, conv, pooling) that, for each op, either error out or reshard inputs to the proposed DeviceMappings and derive the resulting output DeviceMapping.

  • max.experimental.nn.Module.compile() now accepts DistributedTensorType symbolic inputs (not just TensorType), so distributed models can be built via the graph-compilation path in addition to running eagerly; gemma3_modulev3 is the first multi-GPU model wired up. DTensor support in MAX is still ongoing work and these APIs may evolve.
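
    A sketch of how these pieces fit together; every signature below is an assumption (the notes name the types, not their constructors), and as stated these APIs may evolve:

```python
from max.experimental.sharding import (
    DeviceMesh,
    DistributedTensorType,
    Sharded,
)

# Hypothetical shapes throughout, for a model already built as a
# max.experimental.nn.Module named `model`.
mesh = DeviceMesh([0, 1, 2, 3])    # 1-D mesh over four GPUs (assumed ctor)
x_type = DistributedTensorType(    # symbolic batch-sharded input (assumed ctor)
    dtype="bf16",
    shape=("batch", 4096),
    mesh=mesh,
    placement=Sharded(axis=0),     # weights could use Replicated / Partial
)
compiled = model.compile(x_type)   # one compile() call plans multi-GPU execution
```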

  • Added new graph ops (with matching max.experimental.functional wrappers): scatter_max, scatter_min, scatter_mul, scatter_nd_max, scatter_nd_min, scatter_nd_mul, non_maximum_suppression, resize_linear, resize_nearest, and resize_bicubic. The existing max.graph.ops.resize now delegates to these for BILINEAR, NEAREST, and BICUBIC interpolation modes. max.graph.ops.pad (and the functional wrapper) also accepts mode='reflect' and mode='edge' in addition to mode='constant'.
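
    For example, the new pad modes (the padding-spec format here is an assumption):

```python
from max.experimental import functional as F

# Given an existing tensor `x`: mode="reflect" and mode="edge" now work
# alongside mode="constant".
y = F.pad(x, [(0, 0), (2, 2)], mode="reflect")
```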

  • Expanded experimental eager-interpreter coverage so significantly more graphs run end-to-end without falling back to compilation. Added handlers for gather, gather_nd, argmax, argmin, split, scatter, scatter_nd, scatter_nd_add, scatter_add, scatter_max, scatter_min, scatter_mul, scatter_nd_max, scatter_nd_min, scatter_nd_mul, tile, band_part, top_k, bottom_k, nonzero, non_maximum_suppression, pad (constant on CPU/GPU; reflect and edge on CPU), conv2d, conv2d_transpose, max_pool2d, avg_pool2d (floor and ceil mode), resize_linear, resize_nearest, resize_bicubic, mo.mutable.store, mo.mutable.store.slice, and the distributed collectives distributed.allreduce.sum, distributed.allgather, distributed.scatter, distributed.broadcast, and distributed.reducescatter.sum. Most run on both CPU and GPU; CPU-only handlers are noted as such.

    • Rewrote the eager-interpreter mo.mutable.store.slice handler to write slices via a device-side Mojo kernel instead of a host numpy round-trip. GPU buffers no longer round-trip D→H→D on every call, and bfloat16 and float8_* dtypes are now supported (float4_e2m1fn remains unsupported).

  • Added defensive eager-interpreter handlers for mo.shape.from_tensor, mo.index.to_tensor, mo.buffer.create, mo.buffer.transfer, and mo.gather_sum so eager runs no longer crash if these internal ops survive canonicalization.

  • Improved experimental eager-interpreter performance by enabling multi-threaded CPU execution and removing unnecessary GPU device synchronization between op dispatches.

  • Added max.nn.StackedLinear for QKV-style stacked projections, with a fused (stacked=True) and an unfused (stacked=False) layout. Unfused mode opts into a new Module._omit_module_attr_name flag, which drops the wrapper's own attribute name from descendant weight FQNs, so self.qkv_proj = StackedLinear(names=["q_proj", "k_proj", "v_proj"], stacked=False) exposes weights at self_attn.q_proj.weight rather than self_attn.qkv_proj.q_proj.weight. This lets HuggingFace checkpoint names flow into models without per-architecture remapping in their weight_adapters.py; see the sketch below.
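
    A sketch of the unfused layout; constructor arguments beyond names and stacked (e.g. feature dimensions) are omitted here as assumptions:

```python
from max.nn import StackedLinear

# stacked=False drops the wrapper attribute name from descendant weight FQNs,
# so a parent module's `self.qkv_proj` exposes q_proj.weight / k_proj.weight /
# v_proj.weight directly, matching HuggingFace checkpoint names.
qkv_proj = StackedLinear(names=["q_proj", "k_proj", "v_proj"], stacked=False)
```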

  • Module.compile() now accepts a custom_extensions parameter for loading custom Mojo kernel libraries at graph construction time, fixing validation failures for kernels with struct-level parameters.

  • Fixed torch.compile(fullgraph=True) failing with an "Unsupported context manager" error when accessing CustomOpLibrary ops inside the compiled function. Ops are now eagerly compiled during library initialization.

  • Runtime and device graph performance:

    • Reduced device-graph launch overhead for single-graph models.
    • Parallelized device-graph instantiation and moved instantiation off the main execution threads.
    • Added parallel device-graph launches and a task-ID hint on AsyncRT algorithms.
    • Added a GPU health check during DeviceContext initialization.
    • Added NaN/Inf detection at compiled-region boundaries.
    • Improved Metal driver support with custom statuses and Metal log capture for Apple GPU print output.
    • Made CPUDeviceContext asynchronous and added enqueue_cpu_function / enqueue_cpu_range helpers for CPU kernel execution.
    • Auto-enabled device-graph capture for DeepSeek V3, Kimi, and Kimi K2.5 serving paths.

Custom ops

  • Added host-function and in-place memcpy custom ops, including mo.launch_host_func, mo.inplace_memcpy, an enqueueHostFunc Mojo binding on DeviceStream, and a cuLaunchHostFunc binding for the CUDA device stream.

MAX kernels

  • Added GPU kernel examples from the Programming Massively Parallel Processors (PMPP) textbook covering reductions, scans, histograms, sorting, sparse matrix operations, graph algorithms, convolutions, FlashAttention, and more.

  • Improved NVFP4 grouped matmul kernel performance, now outperforming FlashInfer across all tested decoding and prefill shapes for Kimi K2.5 on B200.

  • Optimized GPU layer_norm kernels with SIMD reductions, gamma/beta prefetch, and a simd_width*2 warp tiling dispatch path.

  • Optimized GPU pad_constant kernel with SIMD vectorization (simd_width=4) and added a kbench benchmark suite (bench_pad).

  • Improved GPU topk and argsort kernel performance by nearly 2x.

  • Optimized GPU concat with a flat-indexing kernel that avoids multi-dimensional index decomposition, using 128-bit vectorized loads with automatic fallback for unaligned shapes.

  • Optimized GPU topk stage-1 kernel with a per-thread register heap that caches the top-8 elements during a single scan pass, eliminating redundant global memory re-reads for the first 8 extraction iterations.

  • Moved partial_simd_load and partial_simd_store from buffer.buffer to linalg.utils and removed the buffer package. Update imports from from buffer.buffer import ... to from linalg.utils import ....

  • Blackwell (SM100) GPU performance:

    • Enabled the Mojo SM100 GEMM by default.
    • Added MXFP4 and MXFP8 block-scaled matmul on SM100, plus a KIND_MXF4 execution path.
    • Added a general grouped block-scaled matmul dispatch and MXFP4 support for the grouped path.
    • Enabled PDL for SM100 grouped NVFP4 / MXFP4 / MXFP8 GMM.
    • Improved the SM100 GEMV dispatcher and added GEMV split-K for GEMMs with small M and N.
    • Increased the SM100 GEMM C-tile N dispatch up to 64.
  • AMD GPU performance:

    • Added B300 support, including device-agnostic default block counts for allreduce and allgather.
    • Added a CDNA4 block-scaled MFMA wrapper.
    • Added MI355X TileTensor MHA (about +13% prefill at depth 128) and, more broadly, TileTensor-based AMD attention kernels.
    • Always enabled the gfx950 MHA prefill kernel and modernized AMD MHA/MLA decode with 16x16 MMA and FP8.
    • Added depth-512 paths for AMD RDNA GPUs and a 2-D convolution kernel for RDNA 3+ GPUs.
    • Added MXFP4 matmul and grouped matmul support on AMD.
  • Attention and state-space kernels:

    • Added sparse MLA decode (with qbf16 / FP8 KV variants) for SM100.
    • Added speculative-decoding sequence-length folding with numhead for the TP MLA decode dispatch.
    • Added gated delta-rule recurrence kernels for hybrid-attention models.
  • Expert-parallel (EP) kernels:

    • Added multi-device MO ops for EP dispatch and combine.
    • Added a grouped dynamic NVFP4 quantization kernel for MoE.
    • Added MXFP4 support to ep.dispatch and the mo.distributed.ep.dispatch.mxfp4 op.
    • Added a skip_a2a mode to EP dispatch and combine.
    • Fixed AMD GPU atomics in EP dispatch.
  • Collective communication kernels:

    • Unified the multimem and standard code paths in ReduceScatter.
    • Enabled PDL for allgather and updated ReduceScatter to use with_PDL().
    • Launched allgather kernels in parallel and set the allgather block count via a tuning table.
    • Added support for non-multiples of SIMD width in allreduce.
  • Fused transformer kernels:

    • Added a fused rope_split_store kernel and wired it into AttentionWithRope.
    • Added a fused RMSNorm + RoPE GPU kernel and a graph-compiler fusion pattern for mo.reduce.rms_norm.RoPE.
    • Added a GEMV + partial RMSNorm fusion path.

Breaking changes

  • Removed individual KV connector CLI flags (--host-kvcache-swap-space-gb, --disk-offload-dir, --disk-offload-max-gb, --disk-offload-direct-io, --lmcache-config-file). Use --kv-connector-config with a JSON dict instead.

  • max/python/max/benchmark/benchmark_throughput.py has been deprecated and will be removed in a future MAX release.

  • Removed Dim and DimList types from buffer.dimlist. Custom kernel code using these types should migrate to IntTuple and TileLayout from the layout package.

  • Removed PreTrainedPipelineTokenizer. Use the standard pipeline tokenizer resolution path instead.

  • Moved DenoisingCacheConfig from PipelineConfig to PipelineRuntimeConfig. Update call sites that constructed PipelineConfig(denoising_cache_config=...) to set the field on PipelineRuntimeConfig instead.

  • Replaced FluxPipelineOutput and Flux2PipelineOutput with a unified DiffusionPipelineOutput. Code that imports the old output types must switch to DiffusionPipelineOutput.

  • PipelineConfig now expects a models=ModelManifest(...) configuration for multi-component pipelines (transformer, VAE, text encoders, etc.). Pipelines that previously passed individual model paths or configs at the top level must migrate to a ModelManifest.

  • max-serve now requires the MODULAR_MAX_SERVE_* prefix for environment overrides. Unprefixed environment variables previously honored by max-serve no longer apply.

Fixed

  • Fixed MAX tools aborting at startup with std::filesystem::filesystem_error when $HOME is not traversable by the running UID (common in containerized CI where the image's build-time UID differs from the runtime UID). The config search now treats permission errors as "not found" and falls through to the next candidate. (Issue #6412)

  • Fixed enqueue_fill() taking O(N) HIP API calls for float64 buffers on AMD GPUs when the high and low 32-bit halves of the fill value differ (e.g., 2.0), reducing the call count to O(log N). (Issue #6417)

  • Fixed integer indexing into a graph tensor (e.g. x[0] on a (2, 3) tensor) failing graph compilation with 'mo.static.reshape' op input and output elements do not match. A reshape-through-slice optimization pattern was incorrectly rewriting the slice + squeeze pattern produced by integer indexing, generating a reshape whose element count did not match the input. (Issue #6440)

Mojo language

For all the updates to the Mojo language, standard library, and tools, see the Mojo release notes.
