Nightly: v26.3
This version is still a work in progress.
Highlights
Documentation
MAX models
- The `residual_threshold` parameter for FLUX first-block cache (FBCache) is now a per-request runtime parameter on `ImageProviderOptions`, allowing it to be tuned without recompiling the model graph.
- Added TaylorSeer denoising cache support to the FLUX.2 Klein pipeline, enabling significant speedups for image-to-image generation by skipping redundant transformer passes during the denoising loop.
- Added the Mamba state space model architecture.
- Fixed Wan 2.1 / 2.2 video diffusion pipelines silently running without classifier-free guidance. The tokenizer gated negative-prompt tokenization on `true_cfg_scale > 1.0` (default `1.0`), so negative tokens were never produced and the executor fell back to unguided generation even when `guidance_scale > 1.0` and a negative prompt were supplied. Wan now enables classical CFG whenever `guidance_scale > 1.0` and defaults an absent negative prompt to the empty string, matching the diffusers baseline.
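A minimal sketch of the classical CFG combination and the defaults this fix restores (plain illustrative Python, not the Wan pipeline code):

```python
def apply_cfg(noise_uncond, noise_cond, guidance_scale):
    # Classical classifier-free guidance: extrapolate from the
    # unconditional prediction toward the conditional one.
    if guidance_scale > 1.0:  # CFG now enabled whenever guidance_scale > 1.0
        return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    return noise_cond

# An absent negative prompt now defaults to the empty string, so the
# unconditional branch always has tokens to work with.
negative_prompt = None
negative_prompt = negative_prompt or ""
```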
MAX framework
Inference server
- Added periodic "still building/compiling" log messages during model compilation so that long operations produce visible signs of progress.
- Consolidated KV connector CLI flags (`--host-kvcache-swap-space-gb`, `--disk-offload-dir`, `--disk-offload-max-gb`, `--disk-offload-direct-io`, `--lmcache-config-file`) into the `--kv-connector-config` JSON dict.
- Removed the `--allow-safetensors-weights-fp32-bf16-bidirectional-cast` CLI flag. Float32 <-> bfloat16 safetensors weight casting is now unconditionally enabled.
- Added the `--model-override` CLI flag for per-component `ModelManifest` overrides (e.g. `--model-override transformer.quantization_encoding=float4_e2m1fnx2`), enabling mixed quantization in diffusion pipelines.
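For migration, the consolidated `--kv-connector-config` flag takes a single JSON dict. The key names below are hypothetical (assumed here to mirror the removed flag names); check the CLI reference for the exact schema:

```python
import json

# Hypothetical keys mirroring the removed flags; the real schema may differ --
# consult the --kv-connector-config documentation before using these names.
kv_connector_config = {
    "host_kvcache_swap_space_gb": 8,
    "disk_offload_dir": "/var/tmp/max-kv",
    "disk_offload_max_gb": 64,
    "disk_offload_direct_io": True,
}

# Passed on the command line as: --kv-connector-config '<json>'
flag_value = json.dumps(kv_connector_config)
```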
max CLI
- Added sweep benchmarking capabilities to `max benchmark`: iterate over multiple concurrency and request-rate combinations, flush the prefix cache between runs, and collect per-run structured JSON results.
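The sweep idea reduces to a grid loop with one structured result per run. An illustrative sketch (the variable names here are not the CLI's actual parameter names):

```python
import itertools

# Sweep every concurrency x request-rate combination, collecting one
# structured result dict per run, as the CLI does internally.
concurrencies = [1, 8, 32]
request_rates = [10.0, 100.0]

results = []
for concurrency, rate in itertools.product(concurrencies, request_rates):
    # ...flush the prefix cache, run one benchmark pass here...
    results.append({"concurrency": concurrency, "request_rate": rate})
```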
Python API
- Added `ops.roi_align` graph op and `F.roi_align` functional wrapper for ROI Align pooling over NHWC inputs with configurable spatial scale, sampling ratio, alignment mode, and AVG/MAX pooling.
- Added `roi_align` op handler to the MO eager interpreter, enabling eager-mode execution of ROI Align pooling without graph compilation.
- Added `ConstantExternalOp` and `ConstantScalarOp` handlers to the MO eager interpreter, allowing graphs with external weights and scalar constants to run without falling back to full compilation.
- Added `ReduceRmsNormOp` handler to the MO eager interpreter, enabling eager-mode execution of RMS normalization without graph compilation.
- Added `ReduceGroupNormOp` handler to the MO eager interpreter, enabling eager-mode execution of group normalization without graph compilation.
- Fixed tensor slicing with negative integer indices (e.g. `hidden[:, -1]`), which previously raised a `RuntimeError` at compile time.
- Setting `MODULAR_MAX_DEBUG_UNINITIALIZED_READ_CHECK=true` (or the `max-debug.uninitialized-read-check` config key, or `InferenceSession.debug.uninitialized_read_check = True`) enables detection of uninitialized memory reads in Mojo kernels. `InferenceSession` automatically enables the debug allocator poison and compiles kernels with load-time poison checks for all float types. When a load matches a poison pattern, the process aborts with a descriptive message.
- Added support for the `bfloat16` data type on ARM CPU devices in MAX graphs. Previously, `session.load()` raised a `ValueError` when a graph contained bf16 tensors targeting an ARM CPU.
- Added `DevicePlacementPolicy` (`Ignore`, `Warn`, `Error`) to `Graph` to control behavior when CPU-only ops (`ops.scatter`, `ops.cumsum`, `ops.nonzero`, `ops.tile`) receive GPU tensors. The default (`Warn`) emits a `UserWarning` and falls back to CPU; `Error` raises `ValueError` instead. `ops.cond` and `ops.while_loop` always raise `ValueError` for GPU predicates.
- Fixed slow `axis=None` reductions (`mean`, `sum`, `prod`, `max`, `min`) in `max.experimental.functional`. The previous implementation flattened the tensor before reducing, serializing the work onto a single GPU block. Reductions now iterate axis-by-axis to preserve parallelism.
- Renamed `Float8Config` to `QuantConfig` (and related types/functions) to reflect that the config now covers FP8, NVFP4, and MXFP4 quantization.
- Renamed related public Python quantization APIs from `Float8*` names to `Quant*` names, including `parse_float8_config()` to `parse_quant_config()`, and the public `quant` modules in `max.nn` and `max.pipelines.lib`.
- `max.diagnostics.gpu.BackgroundRecorder`'s sampling interval can now be configured.
- Introduced `CPUMetrics` alongside the existing GPU diagnostics and open-sourced it under `max.diagnostics`.
- `max.experimental.Tensor` is now distribution-aware: it carries a tuple of per-shard storages, either `driver.Buffer`s (realized) or graph values (`TensorValue`/`BufferValue`, unrealized), paired with a `DeviceMapping` that maps those local shards onto the `DeviceMesh`.
- Reworked `max.experimental.functional` from a single `functional.py` into a `functional/` package: a new distribution- and mesh-aware dispatch layer on top of the graph-compiler Python API, split cleanly into three op categories: `creation_ops` (tensor factories), `spmd_ops` (rule-based per-op SPMD dispatch), and `collective_ops` (`allreduce_sum`, `allgather`, `reduce_scatter`, etc., now applied per device-group along a chosen mesh axis so they dispatch correctly on multi-dimensional meshes, plus a `transfer_to` convenience op between `DeviceMapping`s).
- Added `max.experimental.sharding` with the core types for distributed tensors (`DeviceMesh`; `DeviceMapping` with `PlacementMapping` and `NamedMapping`; placement primitives `Replicated`/`Sharded`/`Partial`; `DistributedTensorType`/`DistributedBufferType`; `TensorLayout`), plus a `sharding.rules` submodule of pure mapping-propagation rules (elementwise, matmul, reduction, shape, conv, pooling) that, for each op, either error out or reshard inputs to the proposed `DeviceMapping`s and derive the resulting output `DeviceMapping`.
- `max.experimental.nn.Module.compile()` now accepts `DistributedTensorType` symbolic inputs (not just `TensorType`), so distributed models can be built via the graph-compilation path in addition to running eagerly; `gemma3_modulev3` is the first multi-GPU model wired up. DTensor support in MAX is still ongoing work, and these APIs may evolve.
- Improved experimental eager interpreter performance by enabling multi-threaded CPU execution and removing unnecessary GPU device synchronization after each op dispatch.
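The `axis=None` reduction rework above relies on a basic identity: a full reduction equals reducing one axis at a time, which is what lets each per-axis pass stay parallel. A NumPy check of that equivalence:

```python
import numpy as np

x = np.arange(24, dtype=np.float64).reshape(2, 3, 4)

full = x.sum()  # axis=None: reduce everything at once
axis_by_axis = x.sum(axis=2).sum(axis=1).sum(axis=0)  # one axis at a time

assert np.isclose(full, axis_by_axis)
```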
- Added `gather` and `gather_nd` op handlers to the experimental eager interpreter with full CPU and GPU support.
- Added `argmax` and `argmin` op handlers to the experimental eager interpreter with full CPU and GPU support, returning int64 indices along a specified axis.
- Added `split` op handler to the experimental eager interpreter with full CPU and GPU support, splitting a tensor into multiple outputs along a specified axis.
- Added `scatter` op handler to the experimental eager interpreter (CPU), scattering updates into a copy of the input tensor along a specified axis.
- Added `scatter_nd` op handler to the experimental eager interpreter (CPU and GPU), scattering slices from updates into input at N-dimensional index positions via `max.experimental.functional.scatter_nd`.
- Added `scatter_nd_add` op handler to the experimental eager interpreter (CPU), accumulating slices from updates into input at N-dimensional index positions and summing duplicate indices via `max.experimental.functional.scatter_nd_add`.
- Added `conv2d` and `conv2d_transpose` op handlers to the experimental eager interpreter with CPU and GPU support.
- Added `max_pool2d` op handlers (floor and ceil mode) to the experimental eager interpreter with CPU and GPU support.
- Added `tile` op handler to the experimental eager interpreter (CPU), repeating the input tensor along each dimension.
- Added `band_part` op handler to the experimental eager interpreter with CPU and GPU support, masking tensor matrices based on a diagonal band.
- Added `avg_pool2d` op handlers (floor and ceil mode) to the experimental eager interpreter with CPU and GPU support.
- Added `top_k` op handler to the experimental eager interpreter with CPU and GPU support, returning the top-k values and their original indices along a specified axis.
- Added `bottom_k` op handler to the experimental eager interpreter with CPU and GPU support, returning the k smallest values and their original indices along a specified axis via `max.experimental.functional.bottom_k`.
- Added `nonzero` op handler to the experimental eager interpreter (CPU), returning the row-major coordinates of all nonzero elements as a `[nnz, rank]` int64 tensor via `max.experimental.functional.nonzero`.
- Added `scatter_add` op handler to the experimental eager interpreter (CPU), accumulating `updates` into a copy of `input` at `indices` along `axis` and summing duplicate indices via `max.experimental.functional.scatter_add`.
- Added `max.graph.ops.scatter_max`, `max.graph.ops.scatter_min`, and `max.graph.ops.scatter_mul` graph operations (and corresponding `max.experimental.functional` wrappers) for element-wise scatter with max, min, and multiply reductions at duplicate indices along an axis.
- Added `scatter_max`, `scatter_min`, and `scatter_mul` op handlers to the experimental eager interpreter (CPU), applying max, min, and multiply reductions at duplicate scatter indices via `max.experimental.functional.scatter_max`, `.scatter_min`, and `.scatter_mul`.
- Added `max.graph.ops.scatter_nd_max`, `max.graph.ops.scatter_nd_min`, and `max.graph.ops.scatter_nd_mul` graph operations (and corresponding `max.experimental.functional` wrappers) for N-dimensional scatter with max, min, and multiply reductions at duplicate index vectors.
- Added `scatter_nd_max`, `scatter_nd_min`, and `scatter_nd_mul` op handlers to the experimental eager interpreter (CPU), applying max, min, and multiply reductions at duplicate N-dimensional scatter indices via `max.experimental.functional.scatter_nd_max`, `.scatter_nd_min`, and `.scatter_nd_mul`.
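The duplicate-index semantics of the scatter reductions above can be illustrated with NumPy's unbuffered `ufunc.at` (a reference for the semantics only, not the MAX API):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
idx = np.array([0, 0, 3])            # index 0 appears twice
upd = np.array([5.0, 7.0, 0.5])

smax = x.copy(); np.maximum.at(smax, idx, upd)   # duplicates keep the max
smin = x.copy(); np.minimum.at(smin, idx, upd)   # duplicates keep the min
smul = x.copy(); np.multiply.at(smul, idx, upd)  # duplicates multiply together
```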
- `max.graph.ops.pad` (and `max.graph.experimental.functional.pad`) now accepts `mode='reflect'` and `mode='edge'` in addition to `mode='constant'`.
- Added `pad` op handlers (`pad.constant`, `pad.reflect`, `pad.repeat`) to the experimental eager interpreter. `pad.constant` supports CPU and GPU; `pad.reflect` and `pad.repeat` (edge padding) run on CPU.
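The three padding modes follow the same conventions as NumPy's `np.pad` (with MAX's `pad.repeat` corresponding to NumPy's `edge` mode), shown here on a 1-D example:

```python
import numpy as np

x = np.array([1, 2, 3])

constant = np.pad(x, 1, mode="constant")  # zeros outside:     [0, 1, 2, 3, 0]
reflect = np.pad(x, 1, mode="reflect")    # mirror, no repeat: [2, 1, 2, 3, 2]
edge = np.pad(x, 1, mode="edge")          # repeat edge value: [1, 1, 2, 3, 3]
```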
- Added `max.graph.ops.resize_linear` for linear (bilinear) interpolation resizing with configurable `coordinate_transform_mode` (half_pixel, align_corners, asymmetric, half_pixel_1D) and optional `antialias` downscaling support; `max.graph.ops.resize` now supports `InterpolationMode.BILINEAR` by delegating to `resize_linear`.
- Added `resize_linear` op handler to the experimental eager interpreter (CPU) via `max.experimental.functional.resize_linear`.
- Added `max.graph.ops.resize_nearest` for nearest-neighbor interpolation resizing with configurable `coordinate_transform_mode` and `round_mode`; `max.graph.ops.resize` now supports `InterpolationMode.NEAREST`.
- Added `resize_nearest` op handler to the experimental eager interpreter (CPU) via `max.experimental.functional.resize_nearest`.
- Added `max.graph.ops.resize_bicubic` for bicubic interpolation resizing (rank-4 NCHW, half_pixel coordinate mapping, a=-0.75 Catmull-Rom kernel); `max.graph.ops.resize` now delegates its `InterpolationMode.BICUBIC` path to `resize_bicubic`.
- Added `resize_bicubic` op handler to the experimental eager interpreter (CPU) via `max.experimental.functional.resize_bicubic`.
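A 1-D NumPy sketch of nearest-neighbor resizing under the `half_pixel` coordinate transform used by the resize ops above. The rounding here is simplified to NumPy's round-half-to-even; the actual op exposes a configurable `round_mode`:

```python
import numpy as np

def resize_nearest_1d(x, out_size):
    scale = x.shape[0] / out_size
    dst = np.arange(out_size)
    # half_pixel: map output pixel centers back to input coordinates
    src = (dst + 0.5) * scale - 0.5
    idx = np.clip(np.round(src).astype(int), 0, x.shape[0] - 1)
    return x[idx]

resize_nearest_1d(np.array([1.0, 2.0]), 4)  # upscale 2 -> 4 samples
```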
- Added defensive `mo.shape.from_tensor` and `mo.index.to_tensor` handlers to the experimental eager interpreter. These internal ops are typically folded away by canonicalization; the handlers prevent crashes if they survive into the interpreter.
- Added defensive `mo.buffer.create` and `mo.buffer.transfer` handlers to the experimental eager interpreter. These internal ops are typically lowered by the graph compiler; the handlers prevent crashes if they survive into the interpreter.
- Added `mo.mutable.store` and `mo.mutable.store.slice` handlers to the experimental eager interpreter. These complement the existing `mo.mutable.load` handler and enable eager execution of in-place buffer writes (full-tensor stores and slice-indexed stores).
- Rewrote the eager-interpreter `mo.mutable.store.slice` handler to write slices via a device-side Mojo kernel instead of a host numpy round-trip. GPU buffers no longer make a full-buffer D→H→D round-trip on every call, and `bfloat16` and `float8_*` dtypes are now supported. `float4_e2m1fn` remains unsupported.
- Added a defensive `mo.gather_sum` handler to the experimental eager interpreter. This fused composite op (gather along axis 0 + sum along axis 1) is used by DLRM-style multi-hot embeddings; the handler prevents crashes if the op survives into the interpreter.
- Added `distributed.allreduce.sum` op handler to the experimental eager interpreter, enabling multi-GPU eager execution of allreduce collectives.
- Added `distributed.allgather` op handler to the experimental eager interpreter, enabling multi-GPU eager execution of allgather collectives without falling back to compilation.
- Added `distributed.scatter` op handler to the experimental eager interpreter, enabling multi-GPU eager execution of scatter collectives without falling back to compilation.
- Added `distributed.broadcast` op handler to the eager interpreter, enabling multi-GPU eager execution of broadcast collectives without falling back to compilation.
- Added `non_maximum_suppression` op handler to the experimental eager interpreter (CPU), enabling NMS to run through the interpreter without falling back to compilation.
- Added `max.graph.ops.non_maximum_suppression` graph operation (and `max.experimental.functional.non_maximum_suppression` wrapper) for constructing ONNX-style non-maximum suppression in MAX graphs.
- Added `distributed.reducescatter.sum` op handler to the eager interpreter, enabling multi-GPU eager execution of reduce-scatter collectives without falling back to compilation.
- Added `max.nn.StackedLinear` for QKV-style stacked projections, with a fused (`stacked=True`) and an unfused (`stacked=False`) layout. Unfused mode opts into a new `Module._omit_module_attr_name` flag, which drops the wrapper's own attribute name from descendant weight FQNs, so `self.qkv_proj = StackedLinear(names=["q_proj", "k_proj", "v_proj"], stacked=False)` exposes weights at `self_attn.q_proj.weight` rather than `self_attn.qkv_proj.q_proj.weight`. This lets HuggingFace checkpoint names flow into models without per-architecture remapping in their `weight_adapters.py`.
- `Module.compile()` now accepts a `custom_extensions` parameter for loading custom Mojo kernel libraries at graph construction time, fixing validation failures for kernels with struct-level parameters.
- Fixed `torch.compile(fullgraph=True)` failing with an "Unsupported context manager" error when accessing `CustomOpLibrary` ops inside the compiled function. Ops are now eagerly compiled during library initialization.
Breaking changes
- Removed individual KV connector CLI flags (`--host-kvcache-swap-space-gb`, `--disk-offload-dir`, `--disk-offload-max-gb`, `--disk-offload-direct-io`, `--lmcache-config-file`). Use `--kv-connector-config` with a JSON dict instead.
- `max/python/max/benchmark/benchmark_throughput.py` has been deprecated and will be removed in a future MAX release.
- Removed the `Dim` and `DimList` types from `buffer.dimlist`. Custom kernel code using these types should migrate to `IntTuple` and `TileLayout` from the `layout` package.
Mojo API
Custom ops
MAX kernels
- Added GPU kernel examples from the Programming Massively Parallel Processors (PMPP) textbook covering reductions, scans, histograms, sorting, sparse matrix operations, graph algorithms, convolutions, FlashAttention, and more.
- Improved NVFP4 grouped matmul kernel performance, now outperforming FlashInfer across all tested decoding and prefill shapes for Kimi K2.5 on B200.
- Optimized GPU `layer_norm` kernels with SIMD reductions, gamma/beta prefetch, and a `simd_width*2` warp tiling dispatch path.
- Optimized the GPU `pad_constant` kernel with SIMD vectorization (`simd_width=4`) and added a kbench benchmark suite (`bench_pad`).
- Improved GPU `topk` and `argsort` kernel performance by nearly 2x.
- Optimized GPU `concat` with a flat-indexing kernel that avoids multi-dimensional index decomposition, using 128-bit vectorized loads with automatic fallback for unaligned shapes.
- Optimized the GPU `topk` stage-1 kernel with a per-thread register heap that caches the top-8 elements during a single scan pass, eliminating redundant global-memory re-reads for the first 8 extraction iterations.
- Moved `partial_simd_load` and `partial_simd_store` from `buffer.buffer` to `linalg.utils` and removed the `buffer` package. Update imports from `from buffer.buffer import ...` to `from linalg.utils import ...`.
🛠️ Fixed
- Fixed MAX tools aborting at startup with `std::filesystem::filesystem_error` when `$HOME` is not traversable by the running UID (common in containerized CI where the image's build-time UID differs from the runtime UID). The config search now treats permission errors as "not found" and falls through to the next candidate. (Issue #6412)
- Fixed `enqueue_fill()` taking O(N) HIP API calls for `float64` buffers on AMD GPUs when the high and low 32-bit halves of the fill value differ (e.g., `2.0`), reducing the call count to O(log N). (Issue #6417)
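One standard way to get from O(N) to O(log N) calls is the doubling trick: write the pattern once, then repeatedly copy the already-filled prefix to extend it. A pure-Python simulation of that pattern (not the actual HIP implementation):

```python
def doubling_fill(n, value):
    buf = [None] * n
    buf[0] = value
    filled, copies = 1, 0
    while filled < n:
        chunk = min(filled, n - filled)
        # one memcpy-like call per iteration doubles the filled region
        buf[filled:filled + chunk] = buf[:chunk]
        filled += chunk
        copies += 1
    return buf, copies

buf, copies = doubling_fill(1000, 2.0)  # 10 copy calls instead of 999 writes
```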
Mojo language
For all the updates to the Mojo language, standard library, and tools, including all GPU programming and Layout/LayoutTensor changes, see the Mojo changelog.