Nightly: v26.3
This version is still a work in progress.
Highlights
Documentation
MAX models
- The `residual_threshold` parameter for FLUX first-block cache (FBCache) is now a per-request runtime parameter on `ImageProviderOptions`, allowing it to be tuned without recompiling the model graph.
- Added TaylorSeer denoising cache support to the FLUX.2 Klein pipeline, enabling significant speedups for image-to-image generation by skipping redundant transformer passes during the denoising loop.
- Added the Mamba state space model architecture.
MAX framework
Inference server
- Added periodic "still building/compiling" log messages during model compilation so that long operations produce visible signs of progress.
- Consolidated KV connector CLI flags (`--host-kvcache-swap-space-gb`, `--disk-offload-dir`, `--disk-offload-max-gb`, `--disk-offload-direct-io`, `--lmcache-config-file`) into the `--kv-connector-config` JSON dict.
- Removed the `--allow-safetensors-weights-fp32-bf16-bidirectional-cast` CLI flag. Float32 <-> bfloat16 safetensors weight casting is now unconditionally enabled.
- Added `--model-override` CLI flag for per-component `ModelManifest` overrides (e.g. `--model-override transformer.quantization_encoding=float4_e2m1fnx2`), enabling mixed quantization in diffusion pipelines.
max CLI
Python API
- Fixed tensor slicing with negative integer indices (e.g. `hidden[:, -1]`), which previously raised a `RuntimeError` at compile time.
- Setting `MODULAR_MAX_UNINITIALIZED_READ_CHECK=true` enables detection of uninitialized memory reads in Mojo kernels. `InferenceSession` automatically enables the debug allocator poison and compiles kernels with load-time poison checks for all float types. When a load matches a poison pattern, the process aborts with a descriptive message.
- Added support for the `bfloat16` data type on ARM CPU devices in MAX graphs. Previously, `session.load()` raised a `ValueError` when a graph contained bf16 tensors targeting an ARM CPU.
- Added `DevicePlacementPolicy` (`Ignore`, `Warn`, `Error`) to `Graph` to control behavior when CPU-only ops (`ops.scatter`, `ops.cumsum`, `ops.nonzero`, `ops.tile`) receive GPU tensors. The default (`Warn`) emits a `UserWarning` and falls back to CPU; `Error` raises `ValueError` instead. `ops.cond` and `ops.while_loop` always raise `ValueError` for GPU predicates.
- Fixed slow `axis=None` reductions (`mean`, `sum`, `prod`, `max`, `min`) in `max.experimental.functional`. The previous implementation flattened the tensor before reducing, serializing the work onto a single GPU block. Reductions now iterate axis-by-axis to preserve parallelism.
- Renamed `Float8Config` to `QuantConfig` (and related types/functions) to reflect that the config now covers FP8, NVFP4, and MXFP4 quantization.
- Renamed related public Python quantization APIs from `Float8*` names to `Quant*` names, including `parse_float8_config()` to `parse_quant_config()`, and the public `quant` modules in `max.nn` and `max.pipelines.lib`.
- `max.diagnostics.gpu.BackgroundRecorder`'s sampling interval can now be configured.
- Introduced `CPUMetrics` alongside the existing GPU diagnostics and open-sourced it under `max.diagnostics`.
- Added experimental `max.experimental.distributed` module with `DTensor`, `DeviceMesh`, and placement types (`Replicated`, `Sharded`, `Partial`) for expressing how tensors are distributed across multiple devices. Op dispatch is not yet supported.
- Improved experimental eager interpreter performance by enabling multi-threaded CPU execution and removing unnecessary GPU device synchronization after each op dispatch.
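The uninitialized-read check noted above is driven entirely by an environment variable; a minimal shell sketch (the launch command itself is whatever MAX entry point you normally run):

```shell
# Enable poison-based detection of uninitialized memory reads in Mojo kernels.
# InferenceSession picks this up automatically and the process aborts with a
# descriptive message when a load matches a poison pattern.
export MODULAR_MAX_UNINITIALIZED_READ_CHECK=true
# ...then start your usual MAX server or script in this shell session.
```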
- Added `gather` and `gather_nd` op handlers to the experimental eager interpreter with full CPU and GPU support.
- Added `argmax` and `argmin` op handlers to the experimental eager interpreter with full CPU and GPU support, returning int64 indices along a specified axis.
- Added `split` op handler to the experimental eager interpreter with full CPU and GPU support, splitting a tensor into multiple outputs along a specified axis.
- Added `scatter` op handler to the experimental eager interpreter (CPU), scattering updates into a copy of the input tensor along a specified axis.
- Added `scatter_nd` op handler to the experimental eager interpreter (CPU and GPU), scattering slices from updates into input at N-dimensional index positions via `max.experimental.functional.scatter_nd`.
- Added `scatter_nd_add` op handler to the experimental eager interpreter (CPU), accumulating slices from updates into input at N-dimensional index positions and summing duplicate indices via `max.experimental.functional.scatter_nd_add`.
- Added `conv2d` and `conv2d_transpose` op handlers to the experimental eager interpreter with CPU and GPU support.
- Added `max_pool2d` op handlers (floor and ceil mode) to the experimental eager interpreter with CPU and GPU support.
- Added `tile` op handler to the experimental eager interpreter (CPU), repeating the input tensor along each dimension.
- Added `band_part` op handler to the experimental eager interpreter with CPU and GPU support, masking tensor matrices based on a diagonal band.
- Added `avg_pool2d` op handlers (floor and ceil mode) to the experimental eager interpreter with CPU and GPU support.
- Added `top_k` op handler to the experimental eager interpreter with CPU and GPU support, returning the top-k values and their original indices along a specified axis.
- Added `bottom_k` op handler to the experimental eager interpreter with CPU and GPU support, returning the k smallest values and their original indices along a specified axis via `max.experimental.functional.bottom_k`.
- Added `nonzero` op handler to the experimental eager interpreter (CPU), returning the row-major coordinates of all nonzero elements as a `[nnz, rank]` int64 tensor via `max.experimental.functional.nonzero`.
- Added `scatter_add` op handler to the experimental eager interpreter (CPU), accumulating `updates` into a copy of `input` at `indices` along `axis` and summing duplicate indices via `max.experimental.functional.scatter_add`.
- Added `max.graph.ops.scatter_max`, `max.graph.ops.scatter_min`, and `max.graph.ops.scatter_mul` graph operations (and corresponding `max.experimental.functional` wrappers) for element-wise scatter with max, min, and multiply reductions at duplicate indices along an axis.
- Added `scatter_max`, `scatter_min`, and `scatter_mul` op handlers to the experimental eager interpreter (CPU), applying max, min, and multiply reductions at duplicate scatter indices via `max.experimental.functional.scatter_max`, `.scatter_min`, and `.scatter_mul`.
- Added `max.graph.ops.scatter_nd_max`, `max.graph.ops.scatter_nd_min`, and `max.graph.ops.scatter_nd_mul` graph operations (and corresponding `max.experimental.functional` wrappers) for N-dimensional scatter with max, min, and multiply reductions at duplicate index vectors.
- Added `scatter_nd_max`, `scatter_nd_min`, and `scatter_nd_mul` op handlers to the experimental eager interpreter (CPU), applying max, min, and multiply reductions at duplicate N-dimensional scatter indices via `max.experimental.functional.scatter_nd_max`, `.scatter_nd_min`, and `.scatter_nd_mul`.
- `max.graph.ops.pad` (and `max.graph.experimental.functional.pad`) now accepts `mode='reflect'` and `mode='edge'` in addition to `mode='constant'`.
- Added `pad` op handlers (`pad.constant`, `pad.reflect`, `pad.repeat`) to the experimental eager interpreter. `pad.constant` supports CPU and GPU; `pad.reflect` and `pad.repeat` (edge padding) run on CPU.
- Added `max.graph.ops.resize_linear` for linear (bilinear) interpolation resizing with configurable `coordinate_transform_mode` (half_pixel, align_corners, asymmetric, half_pixel_1D) and optional `antialias` downscaling support; `max.graph.ops.resize` now supports `InterpolationMode.BILINEAR` by delegating to `resize_linear`.
- Added `resize_linear` op handler to the experimental eager interpreter (CPU) via `max.experimental.functional.resize_linear`.
- Added `max.graph.ops.resize_nearest` for nearest-neighbor interpolation resizing with configurable `coordinate_transform_mode` and `round_mode`; `max.graph.ops.resize` now supports `InterpolationMode.NEAREST`.
- Added `resize_nearest` op handler to the experimental eager interpreter (CPU) via `max.experimental.functional.resize_nearest`.
- Added `max.graph.ops.resize_bicubic` for bicubic interpolation resizing (rank-4 NCHW, half_pixel coordinate mapping, a=-0.75 Catmull-Rom kernel); `max.graph.ops.resize` now delegates its `InterpolationMode.BICUBIC` path to `resize_bicubic`.
- Added `resize_bicubic` op handler to the experimental eager interpreter (CPU) via `max.experimental.functional.resize_bicubic`.
- Added defensive `mo.shape.from_tensor` and `mo.index.to_tensor` handlers to the experimental eager interpreter. These internal ops are typically folded away by canonicalization; the handlers prevent crashes if they survive into the interpreter.
- Added defensive `mo.buffer.create` and `mo.buffer.transfer` handlers to the experimental eager interpreter. These internal ops are typically lowered by the graph compiler; the handlers prevent crashes if they survive into the interpreter.
- Added defensive `mo.gather_sum` handler to the experimental eager interpreter. This fused composite op (gather axis 0 + sum axis 1) is used by DLRM-style multi-hot embeddings; the handler prevents crashes if the op survives into the interpreter.
- Added `distributed.allreduce.sum` op handler to the experimental eager interpreter, enabling multi-GPU eager execution of allreduce collectives.
- Added `distributed.allgather` op handler to the experimental eager interpreter, enabling multi-GPU eager execution of allgather collectives without falling back to compilation.
- Added `distributed.scatter` op handler to the experimental eager interpreter, enabling multi-GPU eager execution of scatter collectives without falling back to compilation.
- Added `distributed_scatter` collective to `distributed_functional` for hardware-accelerated root-to-device tensor distribution.
- Added `distributed.broadcast` op handler to the eager interpreter, enabling multi-GPU eager execution of broadcast collectives without falling back to compilation.
- Added `distributed_broadcast` collective to `distributed_functional` for hardware-accelerated root-to-all tensor replication.
- `Module.compile()` now accepts a `custom_extensions` parameter for loading custom Mojo kernel libraries at graph construction time, fixing validation failures for kernels with struct-level parameters.
- Fixed `torch.compile(fullgraph=True)` failing with an "Unsupported context manager" error when accessing `CustomOpLibrary` ops inside the compiled function. Ops are now eagerly compiled during library initialization.
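For readers unfamiliar with the duplicate-index behavior of the new scatter reductions, here is a plain-Python sketch of the semantics in one dimension (a reference model only, not the MAX API; the function name is illustrative):

```python
def scatter_reduce_1d(data, indices, updates, reduce="add"):
    """Apply `updates` into a copy of `data` at `indices`; duplicate
    indices combine with the chosen reduction, mirroring the behavior
    described for scatter_add / scatter_max / scatter_min / scatter_mul."""
    combine = {
        "add": lambda a, b: a + b,
        "max": max,
        "min": min,
        "mul": lambda a, b: a * b,
    }[reduce]
    out = list(data)  # the input tensor is copied, never mutated in place
    for i, u in zip(indices, updates):
        out[i] = combine(out[i], u)
    return out

# Index 1 appears twice, so its updates are summed: 5 + 7.
print(scatter_reduce_1d([0, 0, 0], [1, 1, 2], [5, 7, 3]))           # [0, 12, 3]
# With "max", duplicate indices keep the largest value seen.
print(scatter_reduce_1d([1, 1, 1], [0, 0, 2], [4, 2, 9], "max"))    # [4, 1, 9]
```

Note that the reduction also folds in the original value of `data` at each written index, which is why the `"max"` example never drops below the initial `1`.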
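The difference between the new `mode='reflect'` and `mode='edge'` pad behaviors can also be shown with a 1-D plain-Python model, assuming the usual NumPy-style conventions (reflect mirrors about the endpoints without repeating them; edge repeats the border element). This is reference semantics, not the MAX API:

```python
def pad1d(x, before, after, mode, value=0):
    """Pad a 1-D sequence using constant, edge, or reflect semantics."""
    n = len(x)

    def pick(i):
        if 0 <= i < n:
            return x[i]
        if mode == "constant":
            return value
        if mode == "edge":
            return x[0] if i < 0 else x[n - 1]
        if mode == "reflect":
            period = 2 * (n - 1)
            i %= period  # Python's % yields a non-negative result here
            return x[i] if i < n else x[period - i]
        raise ValueError(f"unknown mode: {mode}")

    return [pick(i) for i in range(-before, n + after)]

print(pad1d([1, 2, 3], 2, 2, "reflect"))  # [3, 2, 1, 2, 3, 2, 1]
print(pad1d([1, 2, 3], 2, 2, "edge"))     # [1, 1, 1, 2, 3, 3, 3]
```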
Breaking changes
- Removed individual KV connector CLI flags (`--host-kvcache-swap-space-gb`, `--disk-offload-dir`, `--disk-offload-max-gb`, `--disk-offload-direct-io`, `--lmcache-config-file`). Use `--kv-connector-config` with a JSON dict instead.
- `max/python/max/benchmark/benchmark_throughput.py` has been deprecated and will be removed in a future MAX release.
- Removed `Dim` and `DimList` types from `buffer.dimlist`. Custom kernel code using these types should migrate to `IntTuple` and `TileLayout` from the `layout` package.
Mojo API
Custom ops
MAX kernels
- Added GPU kernel examples from the Programming Massively Parallel Processors (PMPP) textbook covering reductions, scans, histograms, sorting, sparse matrix operations, graph algorithms, convolutions, FlashAttention, and more.
- Improved NVFP4 grouped matmul kernel performance, now outperforming FlashInfer across all tested decoding and prefill shapes for Kimi K2.5 on B200.
- Optimized GPU `layer_norm` kernels with SIMD reductions, gamma/beta prefetch, and a `simd_width*2` warp tiling dispatch path.
- Optimized GPU `pad_constant` kernel with SIMD vectorization (`simd_width=4`) and added a kbench benchmark suite (`bench_pad`).
- Improved GPU `topk` and `argsort` kernel performance by nearly 2x.
- Optimized GPU `concat` with a flat-indexing kernel that avoids multi-dimensional index decomposition, using 128-bit vectorized loads with automatic fallback for unaligned shapes.
- Optimized GPU `topk` stage-1 kernel with a per-thread register heap that caches the top-8 elements during a single scan pass, eliminating redundant global memory re-reads for the first 8 extraction iterations.
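The stage-1 trick above, keeping a tiny per-thread cache of the current best candidates while scanning the input exactly once, has a simple sequential analogue. This Python sketch shows only the single-pass bounded-heap idea, not the GPU kernel (the function name is illustrative):

```python
import heapq

def top_k_single_pass(values, k):
    """Keep a size-k min-heap of the best (value, index) pairs while
    scanning once; no element of `values` is ever re-read."""
    heap = []  # min-heap: the weakest of the current top-k sits at heap[0]
    for i, v in enumerate(values):
        if len(heap) < k:
            heapq.heappush(heap, (v, i))
        elif v > heap[0][0]:
            heapq.heapreplace(heap, (v, i))  # evict the weakest candidate
    # Report in descending value order, like a typical top-k output.
    return sorted(heap, key=lambda t: -t[0])

print(top_k_single_pass([3, 1, 4, 1, 5, 9, 2, 6], 3))  # [(9, 5), (6, 7), (5, 4)]
```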
Mojo language
For all the updates to the Mojo language, standard library, and tools, including all GPU programming and Layout/LayoutTensor changes, see the Mojo changelog.