v26.2 (2026-03-19)
Highlights
- MAX now supports image generation with FLUX diffusion models (`FLUX.1-dev` and `FLUX.2-dev`), served through a new `/v1/responses` endpoint with the OpenResponses API. See the image generation guide to get started.
- Significant DeepSeek improvements: added support for DeepSeek V3.2 with multi-latent attention, NVFP4 quantization support for DeepSeek-R1 (with expert parallelism), and expert parallelism now supports more than 32 local experts without requiring NVSHMEM for single-node deployments.
- Major Blackwell (SM100) kernel optimizations, including SnapMLA for MLA decode, hardware-accelerated conv2d with TMA im2col for the FLUX VAE, fused epilogues in BF16 and FP8 matmul kernels, and FP8 MMA support for MLA prefill with blockwise scaling.
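A request against the new `/v1/responses` endpoint can be built with the standard library. This is an assumption-laden sketch: the endpoint path comes from these notes, but the payload fields (`model`, `input`), port, and model name shown here follow the general OpenAI Responses API convention; check the image generation guide for the exact schema.

```python
import json
from urllib import request

# Hypothetical local MAX server; payload fields are assumptions based on
# the OpenResponses convention, not a confirmed schema.
payload = {
    "model": "black-forest-labs/FLUX.1-dev",
    "input": "A watercolor painting of a lighthouse at dawn",
}
req = request.Request(
    "http://localhost:8000/v1/responses",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = request.urlopen(req)  # uncomment against a running server
```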
Documentation
- Refactored the MAX Python API reference into a flat list of module pages. Each summary page organizes APIs by conceptual group instead of source file location. All API members also include a direct link to the source code on GitHub.
- Added Basic operations to the model developer guide, covering tensor arithmetic, shape manipulation, reductions, matrix operations, activation functions, and random tensor generation.
- Added Model pipeline to the model developer guide, explaining how to connect models to MAX's serving infrastructure with inference pipelines that handle weight loading, KV cache management, request batching, and tokenization.
- Added Image generation to the inference guide, showing how to generate images from text prompts or transform existing images using the `/v1/responses` endpoint with FLUX models.
- Added the Environment variables reference, documenting all configurable MAX environment variables for server settings, logging, telemetry, debugging, performance, and Hugging Face integration.
MAX models
- Added support for FLUX image generation models (`black-forest-labs/FLUX.1-dev` and `FLUX.2-dev`). Supports fused graph compilation, batched VAE decoding, GPU-side post-processing, and first-block caching for repeated prompts.
- Added support for Kimi vision-language models (`moonshotai/Kimi-K2.5` and `Kimi-VL-A3B-Instruct`). Supports multi-GPU tensor parallelism, a custom vision processor, learnable 2D position embeddings, and the tiktoken tokenizer.
- Added support for OLMo 3 models (`Olmo3ForCausalLM`), for example `allenai/Olmo-3-7B-Instruct`.
- Added support for Qwen3-MoE models (`Qwen3MoeForCausalLM`), for example `Qwen/Qwen3-30B-A3B-Instruct`, with multi-GPU tensor parallelism and FP8 quantization support.
- DeepSeek improvements:
  - Added support for the DeepSeek V3.2 architecture with multi-latent attention and a fused FP8 paged KV cache.
  - Added NVFP4 quantization support for DeepSeek-R1, including with expert parallelism.
  - Expert parallelism now supports more than 32 local experts and no longer requires NVSHMEM for single-node deployments.
  - Improved memory estimation for NVFP4-quantized models and EP communication buffers.
  - Added FP4 quantization support for the DeepSeek MTP speculative decoding module.
  - Various fixes: decode-only mode, missing `rope_scaling` config, a DeepSeek-V2-Lite gather-index out-of-bounds error, and re-enabled multi-GPU TP for DeepSeek-V2-Lite-Chat.
- Removed the legacy Gemma 3 multimodal implementation and the `MODULAR_MAX_DISABLE_GEMMA3_VISION` environment variable.
- Fixed multi-GPU tensor parallelism for GPT-OSS MoE models.
- Common MAX models like Qwen 2.5 can now run on AMD RDNA consumer GPUs.
- Improved Mistral3 text encoder performance by compiling hidden-state selection and eliminating redundant GPU transfers.
- Fixed the prompt validator for Qwen2.5-VL models.
- Fixed the audio generator pipeline to restore audio generation support.
- Fixed multi-GPU NVFP4 inference for Llama3.
- Fixed Idefics3 chat template image placeholder ordering.
- Added MXFP4 quantization support for GPT-OSS models (such as `openai/gpt-oss-20b`).
MAX framework
- Upgraded the bundled `libnvptxcompiler` from CUDA 12.9 to CUDA 13.1, which requires NVIDIA GPU driver 580 or higher. This brings the latest bug fixes and performance improvements from NVIDIA's PTX compiler, and fully supports new hardware like the DGX Spark and Jetson Thor. To use MAX and Mojo with older NVIDIA drivers and hardware, set the `MODULAR_NVPTX_COMPILER_PATH` environment variable to point to a system `ptxas` binary instead of the bundled `libnvptxcompiler`. The Mojo `DeviceContext()` constructor now checks NVIDIA driver compatibility at creation time and provides a clear error message when the driver version is too old, matching the behavior of the Python `Accelerator()` API.
- Runtime GPU errors now include a Python source traceback showing where the failing operation was defined in your graph-building code. Build with `MODULAR_MAX_DEBUG=True` to enable source note collection; when source notes aren't available, error messages include a hint about how to enable them.
- Added the `MODULAR_DEBUG_DEVICE_ALLOCATOR` environment variable for debugging GPU memory issues. Set it to `uninitialized-poison` to fill buffers with sentinel values (qNaN for floats, `0xCD` for others) to detect use of uninitialized data, or `out-of-bounds` to enable redzone checks for buffer overflows. It accepts a comma-separated list for multiple options.
- Fixed a memory leak in CUDA graph execution where output buffers were not freed between replays, causing GPU memory to grow over time during sustained inference.
- Fixed compilation cache misses when cross-compiling GPTQ and LoRA models on machines without a GPU. Weight dtype casting now skips the actual data conversion in virtual device mode, because only compilation metadata is needed.
- Enabled peer-to-peer device memory access for AMD HIP multi-GPU configurations, enabling direct GPU-to-GPU memory transfers on AMD hardware.
- Fixed multi-GPU communication silently falling back to a slower transport on systems where `rdma-core` is installed without its dev packages (common in production containers).
- Fixed multi-GPU broadcast operations failing with "Broadcast currently requires P2P access between GPUs" due to a regression in peer-to-peer device access initialization.
- Improved Hugging Face model downloads: gated repo errors now surface clearly instead of showing a misleading "check the repo name" message.
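The idea behind the `uninitialized-poison` allocator mode can be illustrated in plain Python: fill fresh buffers with a recognizable sentinel so that reads of never-written memory stand out. This is an illustrative sketch of the technique, not MAX's allocator code.

```python
import math
import struct

def poison(num_bytes: int, dtype: str) -> bytearray:
    # Fill a fresh buffer with sentinel values, mimicking the
    # uninitialized-poison mode described above: quiet NaN for
    # float32 buffers, 0xCD bytes for everything else.
    if dtype == "float32":
        qnan = struct.pack("<f", math.nan)
        return bytearray(qnan * (num_bytes // 4))
    return bytearray(b"\xCD" * num_bytes)

def looks_uninitialized(buf: bytearray, dtype: str) -> bool:
    # Seeing the sentinel on a read strongly suggests the buffer was
    # consumed before anything was written into it.
    if dtype == "float32":
        values = struct.unpack(f"<{len(buf) // 4}f", bytes(buf))
        return any(math.isnan(v) for v in values)
    return len(buf) > 0 and all(b == 0xCD for b in buf)
```

A buffer that still tests positive after a kernel ran is a strong hint that the kernel consumed data nothing ever wrote.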
Inference server
- Added image generation support via a new `/v1/responses` endpoint implementing the OpenResponses API standard. Enable it by adding `responses` to `MAX_SERVE_API_TYPES` (for example, `MAX_SERVE_API_TYPES='["openai","responses"]'`). Currently supports FLUX diffusion models. For more information, see the image generation guide.
- Added an `output_format` parameter to image generation requests, allowing clients to choose JPEG, PNG, or WEBP output per request (the default remains JPEG).
- Overlap scheduling is now auto-enabled for select model architectures like `LlamaForCausalLM_Legacy`, and is compatible with prefix caching. This reduces CPU overhead by overlapping Python host code with GPU kernel execution. It's currently incompatible with some features such as structured outputs and CPU models. It's still experimental, and you can disable it with `--no-enable-overlap-scheduler --force`.
- Speculative decoding improvements:
  - Added typical-acceptance rejection sampling.
  - Added a `rejection-sampling-strategy` option (`greedy` or `residual`) for speculative decoding. Defaults to `residual`; use `greedy` for models that pass hidden states.
  - Applied repetition/frequency/presence penalty sampling in EAGLE.
  - Enabled weight sharing between the MTP draft and main model to reduce memory.
  - Added support for chunked prefill with EAGLE and MTP speculative decoding.
  - Fixed batch context length calculation for draft models.
  - Fixed EAGLE penalty inputs being unconditionally applied.
- EAGLE speculative decoding now reports the draft token acceptance rate in scheduler metrics output.
- Added KV cache offloading: KV cache blocks can now spill from GPU to CPU memory and disk when GPU memory is full, enabling larger effective cache capacity and warm restarts. Includes LMCache integration for sharing KV cache across model instances via external storage (CPU, disk, Redis), with multi-GPU tensor parallelism support.
- CUDA graph capture is now auto-enabled for Llama models when `max_batch_size` is set, reducing per-token latency. You can opt out with `--no-device-graph-capture --force`.
- Added FP8 quantization support for the KV cache, reducing KV cache memory usage. Configure it via `--kv-cache-format float8_e4m3fn` (also supports `float32` and `bfloat16`).
- Added a configurable batch scheduling strategy for text generation via the `MAX_SERVE_BATCH_PRIORITY` environment variable. It defines how the scheduler prioritizes between prefill (context encoding) and decode (token generation) when constructing batches. Options: `prefill_first` (minimize time-to-first-token), `decode_first` (minimize inter-token latency), `balanced` (adaptive based on global queue state), or `per_replica` (each replica decides independently; the default).
- Diffusion models can now specify a default `num_inference_steps` per architecture.
- Added the `--first-block-caching` flag to enable first-block caching (FBCache) for diffusion models like FLUX, and `--residual-threshold` for the TaylorSeer caching strategy. Both are configurable via `max serve` and `max generate`.
- Enabled `logprobs` in chat completion responses, returning per-token log probabilities.
- Non-streaming requests are now cancelled when the client disconnects, preventing zombie requests from consuming KV cache memory.
- Improved streaming performance by buffering generated tokens and detokenizing them in batches rather than one at a time, reducing CPU overhead and improving GPU utilization.
- Improved multi-GPU AllReduce performance by launching per-device kernels in parallel async tasks instead of sequentially.
- Fixed a server hang when a model worker process crashes before it finishes initializing.
- Fixed per-request seed handling in TopK/TopP sampling. Seeds are now correctly applied per request instead of using a single seed for the entire batch.
- Fixed KV cache blocks not being released after offline text generation (`generate()`/`generate_async()`), which could cause block exhaustion during sustained inference.
- Fixed three resource leaks in the disaggregated inference decode scheduler: KV cache blocks leaked on request cancellation, replica load-balancing counter drift over time, and a `KeyError` crash on stale prefill responses arriving after cancellation.
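The `residual` rejection-sampling strategy follows the standard speculative-decoding acceptance rule: accept the draft token x with probability min(1, p(x)/q(x)) under target distribution p and draft distribution q, otherwise resample from the normalized residual max(0, p − q). A minimal sketch of that rule (illustrative, not MAX's implementation):

```python
import random

def rejection_sample(p, q, draft_token, rng=random.random):
    # Residual rejection sampling: accept the draft token x with
    # probability min(1, p[x] / q[x]); on rejection, resample from the
    # normalized residual distribution max(0, p - q).
    accept_prob = (
        min(1.0, p[draft_token] / q[draft_token]) if q[draft_token] > 0 else 0.0
    )
    if rng() < accept_prob:
        return draft_token, True
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:
        # p == q everywhere: the residual is empty, so fall back to p.
        residual, total = list(p), sum(p)
    r = rng() * total
    for token, weight in enumerate(residual):
        r -= weight
        if r <= 0.0:
            return token, False
    return len(residual) - 1, False
```

This rule is what makes speculative decoding lossless: the accepted-or-resampled token is distributed exactly according to the target model's p.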
max CLI
- Added the `--device-graph-capture` flag to enable CUDA graph capture for serving, reducing per-token latency by replaying recorded GPU kernel launches. Auto-enabled for Llama and DeepSeek V3; opt out with `--no-device-graph-capture --force`.
- Added the `--debug-verify-replay` flag to run eager launch-trace verification before device graph replay, for debugging CUDA graph correctness issues.
- Added the `--kv-cache-format` flag to set the KV cache data type at runtime. Accepts `float32`, `bfloat16`, or `float8_e4m3fn` for FP8 quantized caching.
- Added the `--lmcache-config-file` flag to enable LMCache-based external KV cache tiering. Point it at an LMCache YAML config to share KV cache blocks across model instances via CPU, disk, or remote storage.
- Added the `--reasoning-parser` flag to `max serve` to enable extraction of model thinking/reasoning content into a separate `reasoning` field on the OpenAI API response. Currently supports Kimi K2.5 (`kimi-k2`), with a registry for adding additional parsers.
- Added the `--rejection-sampling-strategy` flag to select the rejection sampling method for speculative decoding. Options: `greedy`, `residual` (default for standalone), or `typical-acceptance` (default for EAGLE/MTP). Use `greedy` for models that pass hidden states.
- `max benchmark` now uses the model's default temperature when none is specified.
- `max benchmark` no longer overrides `top_p` unless the user provides a value.
- Removed the `--cache-strategy` flag.
Python API
- `Tensor.constant()` is deprecated. Use the `Tensor(data, dtype=..., device=...)` constructor directly, matching PyTorch's `torch.tensor()` semantics. For example, replace `Tensor.constant([1.0, 2.0])` with `Tensor([1.0, 2.0])`. `Tensor.constant()` will be removed in a future release.
- `DeviceEvent` now accepts an `enable_timing=True` parameter to enable GPU event timing. Use `start.elapsed_time(end)` to measure elapsed GPU time in milliseconds between two timing-enabled events.
- Added the `prod` op for computing the product of elements along an axis, available as `max.graph.ops.prod`, `max.experimental.functional.prod`, and `Tensor.prod()`.
- `Device.stats` now includes `graph_mem_reserved` and `graph_mem_used` fields for device graph memory observability.
- `Module.compile()` now validates weight names, dtypes, and shapes before loading, surfacing mismatches as Python errors instead of runtime crashes during asynchronous host-to-device transfers.
- `InferenceSession` now automatically includes the CPU in its device list, removing the need to manually add it when graphs include host-side values.
- Added `max.graph.ops.broadcast` for distributed broadcast across devices. Raises `ValueError` when `signal_buffers` is empty.
- Added a manual synchronization API (`DevicePinnedBuffer`, `DeviceEvent`) for controlling buffer readiness and reducing stream synchronization overhead.
- `Tensor.cast()` is now idempotent for same-dtype casts.
- Added `F.cond` to the experimental functional API for conditional execution.
- Added a fast path for `Tensor.to(device)` in eager mode.
- Added a `Dim`-based scalar dimension API to `Module.compile()`.
- `Module` is now device-aware via `to()` for unified device placement.
- `Module.load_state_dict()` now validates weight attribute names.
- Algebraic dims and graph/custom op construction now work without an explicit context manager, by using a global MLIR context. Threadpool-backed MAX paths now scope worker-thread MLIR usage to the default context automatically.
- Renamed `Float8Config` to `QuantConfig` (and related types/functions) to reflect that the config now covers FP8, NVFP4, and MXFP4 quantization.
- Renamed related public Python quantization APIs from `Float8*` names to `Quant*` names, including `parse_float8_config()` to `parse_quant_config()`, and the public `quant` modules in `max.nn` and `max.pipelines.lib`.
- `max.diagnostics.gpu.BackgroundRecorder`'s sampling interval can now be configured.
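The `Tensor.constant()` deprecation follows a common Python pattern: keep the old classmethod as a thin, warning forwarder to the constructor. Here is a generic sketch of that pattern using a stand-in class, not MAX's actual `Tensor`:

```python
import warnings

class Tensor:
    # Minimal stand-in illustrating the constructor-first API; the real
    # Tensor class lives in MAX and does far more than this.
    def __init__(self, data, dtype=None, device=None):
        self.data = list(data)
        self.dtype = dtype
        self.device = device

    @classmethod
    def constant(cls, data, **kwargs):
        # Deprecated path kept for compatibility: forward to the
        # constructor and emit a DeprecationWarning, mirroring the note
        # that Tensor.constant() will be removed in a future release.
        warnings.warn(
            "Tensor.constant() is deprecated; use Tensor(data, ...) instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return cls(data, **kwargs)
```

Migrating is a one-line change: `Tensor.constant([1.0, 2.0])` becomes `Tensor([1.0, 2.0])`.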
Breaking changes
- Reorganized the `max.nn` namespace. The graph-based neural network API has been restored as the default `max.nn` namespace (previously located under `max.nn.legacy`). The eager module API has moved from `max.nn` to `max.nn.module_v3`. Additionally, `max.tensor`, `max.functional`, and `max.random` have moved back under `max.experimental` (`max.experimental.tensor`, `max.experimental.functional`, `max.experimental.random`). Update imports accordingly.
- Moved experimental APIs under `max.experimental`. Two additional packages have moved under the `max.experimental` namespace to co-locate all experimental APIs:
  - `max.torch` is now `max.experimental.torch`. Update imports from `from max.torch import CustomOpLibrary, graph_op` to `from max.experimental.torch import CustomOpLibrary, graph_op`.
  - `max.nn.module_v3` is now `max.experimental.nn` (the `v3` suffix has been dropped). Update imports from `from max.nn.module_v3 import Module, Linear` to `from max.experimental.nn import Module, Linear`.
- Removed `PipelineConfig.max_length`. The `max_length` parameter now resides at the model configuration level as `MAXModelConfig.max_length` (accessible as `config.model.max_length`). This correctly places the parameter at the model level, since it describes model capacity (the maximum sequence length the model can process), not pipeline runtime behavior. Update all configurations and code to use `model.max_length` instead of the removed pipeline-level `max_length` field.
- `PipelineModel` no longer accepts the `encoding` parameter. It has been removed from `PipelineModel.__init__` and all subclasses. The encoding is now automatically inferred from `pipeline_config.model.quantization_encoding`. This eliminates redundant parameter passing and ensures a single source of truth for quantization encoding configuration.
- Device-graph APIs now require explicit caller-provided graph keys for capture/replay/verification. Update calls from `model.capture(*inputs)`, `model.replay(*inputs)`, and `model.debug_verify_replay(*inputs)` to `model.capture(graph_key, *inputs)`, `model.replay(graph_key, *inputs)`, and `model.debug_verify_replay(graph_key, *inputs)`.
- Removed `q_max_seq_len` from `KVCacheParams`; it is now accepted via graph capture instead.
- `MAXBaseModel` now uses `extra=forbid` and `strict=True`; configs with unknown fields will be rejected.
- Replaced `disable_auto_sync`/`mark_as_ready` with `DevicePinnedBuffer` and `DeviceEvent` for pinned memory management.
MAX kernels
- Blackwell (SM100) GPU performance:
- Optimized Attention on SM100 by skipping unnecessary softmax corrections when the row maximum change is small.
- Fused epilogue into SM100 BF16 and FP8 matmul kernels.
- Improved SM100 FP8 matmul dispatch for small M shapes (M <= 128).
- Fixed matmul kernel dispatch on SM100.
- Added SM100 hardware-accelerated conv2d with TMA im2col and fused residual epilogue for FLUX VAE.
- Added batched BF16 matmul support for SM100.
- Added SnapMLA implementation for SM100 MLA decode.
- Added FP8 tensorwise and block-scale MLA decode for SM100/B200.
- Added FP8 MMA support for MLA prefill with blockwise scaling and K RoPE.
- Enabled MLA attention for SM100 GPUs.
- Enabled 64x256 N split MMA for B200 MLA decode (long context).
- Used TMA for KV scale loads in attention kernels (SM100).
- AMD GPU kernel improvements:
- Tuned and optimized GEMV split-K BF16 dispatch and kernel for AMD GPUs.
- Enabled FP8 GEMV kernel on AMD GPUs.
- Reduced K buffer bank conflicts in MHA prefill on AMD via swizzle.
- Integrated AMD pingpong kernel with FP8 dispatch and fixed TP > 1.
- Fixed out-of-bounds masking and depths > 256 on AMD RDNA GPUs.
- Enabled rocSHMEM GDA backend with TCP bootstrap for multi-node AMD EP.
- Grouped matmul improvements (SM100):
- Added MMA_N=64 support for 1D1D block-scaled grouped matmul.
- Added 2SM support to structured 1D1D grouped matmul kernel.
- Enabled swapAB for block-scaled grouped matmul and block-scaled matmul on SM100.
- Added tensor scale factor to block-scaled 1D1D grouped matmul.
- Added bf16 scales support to blockwise FP8 grouped matmul.
- DeepSeek kernel optimizations:
- Added BF16 MLA prefill/decode mega-kernel.
- Enabled BF16 graph execution path for Multi-Latent Attention.
- Enabled fused QKV projection for latent attention with RoPE.
- Fused RoPE and RMSNorm into MLA custom ops.
- Fused epilogue operations in DeepSeek BF16 matmul kernels.
- Added fused dispatch and combine kernels for expert parallelism.
- Enabled Mojo BF16 matmul kernels and FP4 kernels for DeepSeek shapes.
- Fixed blockwise FP8 batched matmul for non-row-major layouts.
- Multi-GPU distributed ops:
- Added fused allreduce + RMSNorm + FP8 kernel with residual path and 2-stage allreduce for tensor-parallel workloads.
- Added distributed scatter graph op for multi-GPU DP>1 inference.
- Fixed and optimized broadcast kernel for BF16/FP16 with multimem on GPU.
- Fixed and optimized 2-stage broadcast kernel for multi-GPU.
- FLUX kernel improvements: autotuned cuDNN convolution algorithm selection and cached the results; added a multi-block GroupNorm GPU kernel; enabled high-performance Mojo matmul kernels for FLUX.2; fixed grouped conv2d on GPU incorrectly ignoring the `num_groups` parameter.
- `kbench` now runs benchmarks via a shared library (`.so`) by default, reusing persistent workers and CUDA contexts instead of spawning subprocesses. The benchmark execution phase is ~10x faster (for example, 4.25 h → 0.4 h on a tuning workload). It falls back to subprocess mode when profiling or when using custom exec wrappers.
- Added MXFP4 dequant and matmul kernels.
- Optimized FP4 matmul dispatch for Llama-style shapes and added FP4 GEMM dispatch configs for additional shape coverage.
- Used an asynchronous FP4 quantization kernel for improved throughput.
- Optimized Hopper matmul for M=256 and small M shapes via swapAB.
- Improved GEMV kernel performance. Integrated the FlashInfer TopK kernel for improved sampling performance.
- Improved layer normalization kernel performance.
- Added FP8 support to the FlashMLA decode kernel.
- Fixed the FP8 cast lambda epilogue in matmul.
- Fixed a NaN in the MLA decode split-K kernel with causal masking.
- Fixed a warpgroup deadlock in MLA decode that could cause hangs on DeepSeek models.
- Fixed incorrect MoE expert routing caused by a bitonic sort merge direction bug.
- Fixed int8 matmul dispatch on ARM64.
- Fixed Metal buffer tracking for sub-buffers and tensor slices on Apple Silicon.
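The expert-routing fix above concerns top-k selection over router scores: a wrong sort order routes tokens to the wrong experts. A reference sketch of top-k MoE routing in plain Python (illustrative only; the real implementation is a GPU kernel):

```python
import math

def route_tokens(router_logits, top_k=2):
    # Reference top-k MoE routing: softmax the router scores, keep the
    # top_k highest-probability experts per token, and renormalize their
    # weights so each token's expert weights sum to 1.
    routed = []
    for logits in router_logits:
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
        norm = sum(probs[i] for i in top)
        routed.append([(i, probs[i] / norm) for i in top])
    return routed
```

A GPU kernel replaces the `sorted` call with a parallel sort such as a bitonic sort, which is where a wrong merge direction silently picks the wrong experts.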
Mojo language
For all the updates to the Mojo language, standard library, and tools, including all GPU programming and Layout/LayoutTensor changes, see the Mojo changelog.