MAX v26.3
Highlights
- MAX now supports video generation with Wan 2.1 / 2.2 diffusion models, including image-to-video and video-to-video pipelines.
- New API for multi-GPU model execution from Python: the `max.experimental.sharding` module lets a single `Module.compile()` call distribute a model across a `DeviceMesh` using `Replicated`, `Sharded`, and `Partial` placement primitives. Gemma 3 ModuleV3 is the first multi-GPU model on this path.
- The MAX NVFP4 grouped matmul kernel now outperforms FlashInfer on B200 across all tested decoding and prefill shapes for Kimi K2.5.
Documentation
- Restructured the MAX LLM book around how to deploy a custom model with `max serve`.
- Added new model developer guides covering broadcasting, indexing, and the model bring-up workflow.
- Added a graph overview and a new graph and modules guide.
- Added model debugging guides for accuracy, errors, GPU, and tracing.
- Updated the speculative decoding guide.
- Updated the guide to serve custom models.
- Added API docs for `max.pipelines.architectures`.
- Redesigned the REST API reference, now built with Scalar.
MAX models
- The `residual_threshold` parameter for FLUX first-block cache (FBCache) is now a per-request runtime parameter on `ImageProviderOptions`, allowing it to be tuned without recompiling the model graph.
- Added the Mamba state space model architecture.
- Added the Step-3.5-Flash architecture.
- Added the Qwen-Image and Qwen-Image-Edit text-to-image architectures.
- Added the Z-Image and Z-Image-Turbo text-to-image architectures.
- MiniMax-M2 and MiniMax-M2.7:
  - Added MiniMax-M2 and MiniMax-M2.7 architecture support, including FP8 weights, the lightning-attention hybrid backbone, and 4×H100 multi-GPU serving.
  - Enabled DP+EP execution paths for MiniMax MoE layers, with automatic overlap scheduling and device-graph capture.
  - Added per-rank token-limit checks and reduced input-offset device round trips on the MiniMax decode path.
- Gemma 4 and Gemma 3 ModuleV3:
  - Added the Gemma 4 architecture (ModuleV2), including multimodal vision support.
  - Added the Gemma 3 ModuleV3 implementation with multi-GPU support via the DTensor / `DistributedTensorType` compile path.
  - Fixed token-offset and prompt-image alignment regressions in Gemma 4 multimodal prefill, plus assorted Gemma 3 ModuleV3 performance fixes.
- Qwen3 and Qwen3-VL:
  - Added Qwen3 and Qwen3-VL architecture support, including the MoE variant and multimodal vision input.
- Wan video diffusion:
  - Fixed Wan 2.1 / 2.2 video diffusion pipelines silently running without classifier-free guidance. The tokenizer gated negative-prompt tokenization on `true_cfg_scale > 1.0` (default `1.0`), so negative tokens were never produced and the executor fell back to unguided generation even when `guidance_scale > 1.0` and a negative prompt were supplied. Wan now enables classical CFG whenever `guidance_scale > 1.0` and defaults an absent negative prompt to the empty string, matching the diffusers baseline.
  - Added the UniPC multistep scheduler for Wan diffusion.
  - Added Wan image-to-video and video-to-video pipeline variants, plus additional generation kwargs and prompt-handling fixes.
- FLUX.2:
  - Added TaylorSeer denoising cache support to the FLUX.2 Klein pipeline, enabling significant speedups for image-to-image generation by skipping redundant transformer passes during the denoising loop.
  - Added TeaCache support to `DiffusionPipeline` as a peer of TaylorSeer.
  - Added the FLUX.2 ModuleV2 pipeline, FLUX.2 Klein support, NVFP4 quantization, aspect-ratio preserving image preprocessing, and BFL checkpoint weight fixes.
- Kimi K2.5 vision:
  - Improved Kimi K2.5 multimodal support, including vision encoder fixes and tokenizer parity with the upstream model.
- DeepSeek V3 and Kimi K2.5 distributed execution:
  - Improved tensor-parallel and expert-parallel execution paths for DeepSeek V3 and Kimi K2.5, including subgraph deduplication, MoE dispatch tuning, and reduced compile-time overhead.
MAX framework
Inference server
- Added periodic "still building/compiling" log messages during model compilation so that long operations produce visible signs of progress.
- Consolidated KV connector CLI flags (`--host-kvcache-swap-space-gb`, `--disk-offload-dir`, `--disk-offload-max-gb`, `--disk-offload-direct-io`, `--lmcache-config-file`) into the `--kv-connector-config` JSON dict.
- Removed the `--allow-safetensors-weights-fp32-bf16-bidirectional-cast` CLI flag. Float32 <-> bfloat16 safetensors weight casting is now unconditionally enabled.
- Added the `--model-override` CLI flag for per-component `ModelManifest` overrides (e.g. `--model-override transformer.quantization_encoding=float4_e2m1fnx2`), enabling mixed quantization in diffusion pipelines.
- Removed jump forward decoding (`compute_ff_tokens`) from structured output. The bitmask constraint alone ensures valid structured output, matching the approach used by vLLM and SGLang.
- Added `json_object` response-format support to MAX Serve structured output via `/v1/chat/completions`.
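  For example, using any OpenAI-compatible client against a MAX Serve endpoint (a minimal sketch; the base URL, API key, and model name are placeholders, not values from this release):

  ```python
  from openai import OpenAI

  # Point an OpenAI-compatible client at a local MAX Serve deployment.
  client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

  response = client.chat.completions.create(
      model="my-served-model",  # placeholder model name
      messages=[{"role": "user", "content": "Describe this release as a JSON object."}],
      response_format={"type": "json_object"},  # constrain output to valid JSON
  )
  print(response.choices[0].message.content)
  ```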
- Improved error handling for image request failures in MAX Serve.
- Added multi-step and overlap-scheduler support for structured output in the `TextGenerationPipeline`. Extended tokenizer support to include TikToken-based tokenizers, enabling structured output with Kimi K2.5.
- Improved cached-token reporting, fixed cache hit/miss metrics to emit only on context-encoding batches, moved a subset of telemetry from detailed to basic, and added per-draft-position acceptance-rate logging for speculative decoding.
- Tightened the `MODULAR_MAX_SERVE_*` environment-variable prefix; unprefixed overrides previously honored by `max-serve` no longer apply.
- Added `min_p` and `top_k` sampling controls and additional chat-completion kwargs to the OpenAI-compatible routes.
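  For example, with an OpenAI-compatible client these can be passed through `extra_body` (a minimal sketch; the endpoint, model name, and values are illustrative):

  ```python
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

  response = client.chat.completions.create(
      model="my-served-model",  # placeholder model name
      messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
      extra_body={"min_p": 0.05, "top_k": 40},  # sampling controls added in this release
  )
  ```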
- Unified EAGLE speculative decoding:
  - Added unified EAGLE pipelines for Llama 3, DeepSeek V3 + MTP, and Kimi K2.5, sharing a single `PipelineModel`.
  - Added support for `--num-speculative-tokens > 1` across the unified EAGLE Llama, DeepSeek+MTP, and Kimi+EAGLE paths.
  - Added overlap-scheduler support for unified EAGLE, including multi-GPU DP setups (e.g. DP Kimi).
  - Enabled CUDA graphs for EAGLE and MTP.
- Distributed KV transfer (dKV):
  - Added the `DKVConnector` with NIXL transfer support for the distributed KV cache.
  - Unified KV connector configuration under `--kv-connector-config`.
  - Added EFA compatibility, disconnect support, parent-hash eviction, and per-connector metrics for the dKV transfer engine.
  - Added a configurable decode-stall watchdog for 1P1D deployments.
  - Added disk-location support to the Python dKV client.
- Heterogeneous serving and overlap scheduling:
  - Added two-phase prefill execution under the overlap scheduler for the distributed-inference (DI) prefill role.
  - Auto-enabled overlap scheduling for DI pipeline roles and disabled auto device-graph capture for prefill-only workers.
  - Added support for heterogeneous TP prefill / DP decode in MLA KV transfer (e.g. `tp4` prefill into a DP decode pool).
max CLI
- Added sweep benchmarking capabilities to `max benchmark`: iterate over multiple concurrency and request-rate combinations, flush the prefix cache between runs, and collect per-run structured JSON results.
- Standardized the `--model` flag across `max serve`, `max generate`, `max encode`, and `max warm-cache`.
- Improved `max serve` CLI flag descriptions.
Python API
- Added `Model.release_captured_graph()`, which drops a previously captured device graph identified by graph key (or per-device keys) and frees its device-side working memory once any in-flight replay completes. Releasing a key that was never captured is a no-op. Callers remain responsible for dropping any output `Buffer` handles returned by the corresponding `Model.capture()` call.
- Added `ops.roi_align` (with an `F.roi_align` functional wrapper) for ROI Align pooling over NHWC inputs, with configurable spatial scale, sampling ratio, alignment mode, and AVG/MAX pooling. Includes a matching MO eager handler.
- Added MO eager handlers for `ConstantExternalOp`, `ConstantScalarOp`, `ReduceRmsNormOp`, and `ReduceGroupNormOp`, so graphs with external weights, scalar constants, RMS norm, or group norm run eagerly without falling back to compilation.
- Fixed tensor slicing with negative integer indices (e.g. `hidden[:, -1]`), which previously raised a `RuntimeError` at compile time.
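  A minimal sketch of the pattern that now builds cleanly (shape names, dtype, and device are illustrative):

  ```python
  from max.dtype import DType
  from max.graph import DeviceRef, Graph, TensorType

  input_type = TensorType(
      DType.float32, ("batch", "seq", 1536), device=DeviceRef.CPU()
  )
  with Graph("last_token", input_types=[input_type]) as graph:
      hidden = graph.inputs[0].tensor
      last = hidden[:, -1]  # previously raised RuntimeError at compile time
      graph.output(last)
  ```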
- Fixed `ops.reshape` / `TensorValue.reshape` rejecting valid `-1` reshapes on tensors whose leading dim is a symbolic sum-of-products (e.g. `[(batch_size * num_steps) + total_seq_len, 1536]` reshaped to `[-1, n_heads, head_dim]` with `n_heads * head_dim == 1536`). The inferred dim now simplifies without requiring a `rebind`.
- Setting `MODULAR_MAX_DEBUG_UNINITIALIZED_READ_CHECK=true` (or the `max-debug.uninitialized-read-check` config key, or `InferenceSession.debug.uninitialized_read_check = True`) enables detection of uninitialized memory reads in Mojo kernels. `InferenceSession` automatically enables the debug allocator poison and compiles kernels with load-time poison checks for all float types. When a load matches a poison pattern, the process aborts with a descriptive message.
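  For example, enabling the check through the environment variable before creating a session (a minimal sketch):

  ```python
  import os

  # Equivalent to the `max-debug.uninitialized-read-check` config key or the
  # `InferenceSession.debug.uninitialized_read_check` attribute described above.
  os.environ["MODULAR_MAX_DEBUG_UNINITIALIZED_READ_CHECK"] = "true"

  from max.engine import InferenceSession

  # Sessions created after the variable is set compile kernels with load-time
  # poison checks; a load that matches a poison pattern aborts with a message.
  session = InferenceSession()
  ```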
- Added support for the `bfloat16` data type on ARM CPU devices in MAX graphs. Previously, `session.load()` raised a `ValueError` when a graph contained bf16 tensors targeting an ARM CPU.
- Added `DevicePlacementPolicy` (`Ignore`, `Warn`, `Error`) to `Graph` to control behavior when CPU-only ops (`ops.scatter`, `ops.cumsum`, `ops.nonzero`, `ops.tile`) receive GPU tensors. The default (`Warn`) emits a `UserWarning` and falls back to CPU; `Error` raises a `ValueError` instead. `ops.cond` and `ops.while_loop` always raise `ValueError` for GPU predicates.
- Fixed slow `axis=None` reductions (`mean`, `sum`, `prod`, `max`, `min`) in `max.experimental.functional`. The previous implementation flattened the tensor before reducing, serializing the work onto a single GPU block. Reductions now iterate axis-by-axis to preserve parallelism.
- Renamed the public quantization APIs from `Float8*` to `Quant*` (including `Float8Config` → `QuantConfig`, `parse_float8_config()` → `parse_quant_config()`, and the `quant` modules in `max.nn` and `max.pipelines.lib`), reflecting that the config now covers FP8, NVFP4, and MXFP4 quantization.
- The sampling interval of `max.diagnostics.gpu.BackgroundRecorder` can now be configured.
- Introduced `CPUMetrics` alongside the existing GPU diagnostics and open-sourced it under `max.diagnostics`.
- Added `Model.kernel_summaries` for inspecting compiled kernels through the Python API.
- Added a unified `DebugConfig` Python class (with nanobind bindings) and exposed `DebugConfig` and `GraphDebugConfig` in `max.engine` and `max.graph`.
- Added a graph API for initializing and registering the runtime context (`M::Context`) from Python.
- Improved `max.experimental.functional.custom`: compiled custom-op kernels are now cached, and eager-mode `F.custom` no longer recompiles on every call.
- Fixed `Module.compile()` when unrealized tensors are used as weights.
- Added the `InputModality` enum for specifying model input types and threaded it through the multimodal pipeline architectures.
- Updated `Tensor.to()` and `Module.to()` to accept distributed device targets, including `DeviceMapping` and `DeviceMesh`.
- `max.experimental.Tensor` is now distribution-aware: it carries a tuple of per-shard storages, either `driver.Buffer`s (realized) or graph values (`TensorValue` / `BufferValue`, unrealized), paired with a `DeviceMapping` that maps those local shards onto the `DeviceMesh`.
- Reworked `max.experimental.functional` from a single `functional.py` into a `functional/` package, a new distribution- and mesh-aware dispatch layer on top of the graph-compiler Python API, split cleanly into three op categories: `creation_ops` (tensor factories), `spmd_ops` (rule-based per-op SPMD dispatch), and `collective_ops` (`allreduce_sum`, `allgather`, `reduce_scatter`, etc., now applied per device group along a chosen mesh axis so they dispatch correctly on multi-dimensional meshes, plus a `transfer_to` convenience op between `DeviceMapping`s).
- Added `max.experimental.sharding` with the core types for distributed tensors (`DeviceMesh`; `DeviceMapping` with `PlacementMapping` and `NamedMapping`; placement primitives `Replicated` / `Sharded` / `Partial`; `DistributedTensorType` / `DistributedBufferType`; `TensorLayout`), plus a `sharding.rules` submodule of pure mapping-propagation rules (elementwise, matmul, reduction, shape, conv, pooling) that, for each op, either error out or reshard inputs to the proposed `DeviceMapping`s and derive the resulting output `DeviceMapping`.
- `max.experimental.nn.Module.compile()` now accepts `DistributedTensorType` symbolic inputs (not just `TensorType`), so distributed models can be built via the graph-compilation path in addition to running eagerly; `gemma3_modulev3` is the first multi-GPU model wired up. DTensor support in MAX is still ongoing work and these APIs may evolve.
- Added new graph ops (with matching `max.experimental.functional` wrappers): `scatter_max`, `scatter_min`, `scatter_mul`, `scatter_nd_max`, `scatter_nd_min`, `scatter_nd_mul`, `non_maximum_suppression`, `resize_linear`, `resize_nearest`, and `resize_bicubic`. The existing `max.graph.ops.resize` now delegates to these for `BILINEAR`, `NEAREST`, and `BICUBIC` interpolation modes. `max.graph.ops.pad` (and the functional wrapper) also accepts `mode='reflect'` and `mode='edge'` in addition to `mode='constant'`.
- Expanded experimental eager-interpreter coverage so significantly more graphs run end-to-end without falling back to compilation. Added handlers for `gather`, `gather_nd`, `argmax`, `argmin`, `split`, `scatter`, `scatter_nd`, `scatter_nd_add`, `scatter_add`, `scatter_max`, `scatter_min`, `scatter_mul`, `scatter_nd_max`, `scatter_nd_min`, `scatter_nd_mul`, `tile`, `band_part`, `top_k`, `bottom_k`, `nonzero`, `non_maximum_suppression`, `pad` (constant on CPU/GPU; reflect and edge on CPU), `conv2d`, `conv2d_transpose`, `max_pool2d`, `avg_pool2d` (floor and ceil mode), `resize_linear`, `resize_nearest`, `resize_bicubic`, `mo.mutable.store`, `mo.mutable.store.slice`, and the distributed collectives `distributed.allreduce.sum`, `distributed.allgather`, `distributed.scatter`, `distributed.broadcast`, and `distributed.reducescatter.sum`. Most run on both CPU and GPU; CPU-only handlers are noted as such.
- Rewrote the eager-interpreter `mo.mutable.store.slice` handler to write slices via a device-side Mojo kernel instead of a host numpy round-trip. GPU buffers no longer round-trip D→H→D on every call, and `bfloat16` and `float8_*` dtypes are now supported (`float4_e2m1fn` remains unsupported).
- Added defensive eager-interpreter handlers for `mo.shape.from_tensor`, `mo.index.to_tensor`, `mo.buffer.create`, `mo.buffer.transfer`, and `mo.gather_sum` so eager runs no longer crash if these internal ops survive canonicalization.
- Improved experimental eager-interpreter performance by enabling multi-threaded CPU execution and removing unnecessary GPU device synchronization between op dispatches.
- Added `max.nn.StackedLinear` for QKV-style stacked projections, with a fused (`stacked=True`) and an unfused (`stacked=False`) layout. Unfused mode opts into a new `Module._omit_module_attr_name` flag, which drops the wrapper's own attribute name from descendant weight FQNs, so `self.qkv_proj = StackedLinear(names=["q_proj", "k_proj", "v_proj"], stacked=False)` exposes weights at `self_attn.q_proj.weight` rather than `self_attn.qkv_proj.q_proj.weight`. This lets HuggingFace checkpoint names flow into models without per-architecture remapping in their `weight_adapters.py`.
- `Module.compile()` now accepts a `custom_extensions` parameter for loading custom Mojo kernel libraries at graph construction time, fixing validation failures for kernels with struct-level parameters.
- Fixed `torch.compile(fullgraph=True)` failing with an "Unsupported context manager" error when accessing `CustomOpLibrary` ops inside the compiled function. Ops are now eagerly compiled during library initialization.
- Runtime and device graph performance:
  - Reduced device-graph launch overhead for single-graph models.
  - Parallelized device-graph instantiation and moved instantiation off the main execution threads.
  - Added parallel device-graph launches and a task-ID hint on AsyncRT algorithms.
  - Added a GPU health check during `DeviceContext` initialization.
  - Added NaN/Inf detection at compiled-region boundaries.
  - Improved Metal driver support with custom statuses and Metal log capture for Apple GPU print output.
  - Made `CPUDeviceContext` asynchronous and added `enqueue_cpu_function` / `enqueue_cpu_range` helpers for CPU kernel execution.
  - Auto-enabled device-graph capture for DeepSeek V3, Kimi, and Kimi K2.5 serving paths.
Custom ops
- Added host-function and in-place memcpy custom ops, including `mo.launch_host_func`, `mo.inplace_memcpy`, an `enqueueHostFunc` Mojo binding on `DeviceStream`, and a `cuLaunchHostFunc` binding for the CUDA device stream.
MAX kernels
- Added GPU kernel examples from the Programming Massively Parallel Processors (PMPP) textbook covering reductions, scans, histograms, sorting, sparse matrix operations, graph algorithms, convolutions, FlashAttention, and more.
- Improved NVFP4 grouped matmul kernel performance, now outperforming FlashInfer across all tested decoding and prefill shapes for Kimi K2.5 on B200.
- Optimized GPU `layer_norm` kernels with SIMD reductions, gamma/beta prefetch, and a `simd_width*2` warp tiling dispatch path.
- Optimized the GPU `pad_constant` kernel with SIMD vectorization (`simd_width=4`) and added a kbench benchmark suite (`bench_pad`).
- Improved GPU `topk` and `argsort` kernel performance by nearly 2x.
- Optimized GPU `concat` with a flat-indexing kernel that avoids multi-dimensional index decomposition, using 128-bit vectorized loads with automatic fallback for unaligned shapes.
- Optimized the GPU `topk` stage-1 kernel with a per-thread register heap that caches the top-8 elements during a single scan pass, eliminating redundant global memory re-reads for the first 8 extraction iterations.
- Moved `partial_simd_load` and `partial_simd_store` from `buffer.buffer` to `linalg.utils` and removed the `buffer` package. Update imports from `from buffer.buffer import ...` to `from linalg.utils import ...`.
- Blackwell (SM100) GPU performance:
  - Enabled the Mojo SM100 GEMM by default.
  - Added MXFP4 and MXFP8 block-scaled matmul on SM100, plus a `KIND_MXF4` execution path.
  - Added a general grouped block-scaled matmul dispatch and MXFP4 support for the grouped path.
  - Enabled PDL for SM100 grouped NVFP4 / MXFP4 / MXFP8 GMM.
  - Improved the SM100 GEMV dispatcher and added GEMV split-K for GEMMs with small `M` and `N`.
  - Increased the SM100 GEMM C-tile `N` dispatch up to 64.
- AMD GPU performance:
  - Added B300 support, including device-agnostic default block counts for allreduce and allgather.
  - Added a CDNA4 block-scaled MFMA wrapper.
  - Added MI355X TileTensor MHA (about +13% prefill at depth 128) and TileTensor-based AMD attention kernels generally.
  - Always enabled the gfx950 MHA prefill kernel and modernized AMD MHA/MLA decode with 16x16 MMA and FP8.
  - Added depth-512 paths for AMD RDNA GPUs and a 2-D convolution kernel for RDNA 3+ GPUs.
  - Added MXFP4 matmul and grouped matmul support on AMD.
- Attention and state-space kernels:
  - Added sparse MLA decode (with qbf16 / FP8 KV variants) for SM100.
  - Added speculative-decoding sequence-length folding with `numhead` for the TP MLA decode dispatch.
  - Added gated delta-rule recurrence kernels for hybrid-attention models.
- Expert-parallel (EP) kernels:
  - Added multi-device MO ops for EP dispatch and combine.
  - Added a grouped dynamic NVFP4 quantization kernel for MoE.
  - Added MXFP4 support to `ep.dispatch` and the `mo.distributed.ep.dispatch.mxfp4` op.
  - Added a `skip_a2a` mode to EP dispatch and combine.
  - Fixed AMD GPU atomics in EP dispatch.
- Collective communication kernels:
  - Unified the multimem and standard code paths in `ReduceScatter`.
  - Enabled PDL for allgather and updated `ReduceScatter` to use `with_PDL()`.
  - Launched allgather kernels in parallel and set the allgather block count via a tuning table.
  - Added support for non-multiples of SIMD width in allreduce.
- Fused transformer kernels:
  - Added a fused `rope_split_store` kernel and wired it into `AttentionWithRope`.
  - Added a fused RMSNorm + RoPE GPU kernel and a graph-compiler fusion pattern for `mo.reduce.rms_norm.RoPE`.
  - Added a GEMV + partial RMSNorm fusion path.
Breaking changes
- Removed individual KV connector CLI flags (`--host-kvcache-swap-space-gb`, `--disk-offload-dir`, `--disk-offload-max-gb`, `--disk-offload-direct-io`, `--lmcache-config-file`). Use `--kv-connector-config` with a JSON dict instead.
- `max/python/max/benchmark/benchmark_throughput.py` has been deprecated and will be removed in a future MAX release.
- Removed the `Dim` and `DimList` types from `buffer.dimlist`. Custom kernel code using these types should migrate to `IntTuple` and `TileLayout` from the `layout` package.
- Removed `PreTrainedPipelineTokenizer`. Use the standard pipeline tokenizer resolution path instead.
- Moved `DenoisingCacheConfig` from `PipelineConfig` to `PipelineRuntimeConfig`. Update call sites that constructed `PipelineConfig(denoising_cache_config=...)` to set the field on `PipelineRuntimeConfig` instead.
- Replaced `FluxPipelineOutput` and `Flux2PipelineOutput` with a unified `DiffusionPipelineOutput`. Code that imports the old output types must switch to `DiffusionPipelineOutput`.
- `PipelineConfig` now expects a `models=ModelManifest(...)` configuration for multi-component pipelines (transformer, VAE, text encoders, etc.). Pipelines that previously passed individual model paths or configs at the top level must migrate to a `ModelManifest`.
- `max-serve` now requires the `MODULAR_MAX_SERVE_*` prefix for environment overrides. Unprefixed environment variables previously honored by `max-serve` no longer apply.
Fixed
- Fixed MAX tools aborting at startup with `std::filesystem::filesystem_error` when `$HOME` is not traversable by the running UID (common in containerized CI where the image's build-time UID differs from the runtime UID). The config search now treats permission errors as "not found" and falls through to the next candidate. (Issue #6412)
- Fixed `enqueue_fill()` taking O(N) HIP API calls for `float64` buffers on AMD GPUs when the high and low 32-bit halves of the fill value differ (e.g., `2.0`), reducing the call count to O(log N). (Issue #6417)
- Fixed integer indexing into a graph tensor (e.g. `x[0]` on a `(2, 3)` tensor) failing graph compilation with `'mo.static.reshape' op input and output elements do not match`. A reshape-through-slice optimization pattern was incorrectly rewriting the slice + squeeze pattern produced by integer indexing, generating a reshape whose element count did not match the input. (Issue #6440)
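  A minimal sketch of the indexing pattern that now compiles (dtype and device are illustrative):

  ```python
  from max.dtype import DType
  from max.graph import DeviceRef, Graph, TensorType

  with Graph(
      "first_row",
      input_types=[TensorType(DType.float32, (2, 3), device=DeviceRef.CPU())],
  ) as graph:
      x = graph.inputs[0].tensor
      graph.output(x[0])  # the slice + squeeze produced by integer indexing now lowers correctly
  ```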
Mojo language
For all the updates to the Mojo language, standard library, and tools, see the Mojo release notes.