
v26.2 (2026-03-19)

Highlights

  • MAX now supports image generation with FLUX diffusion models (FLUX.1-dev and FLUX.2-dev), served through a new /v1/responses endpoint with the OpenResponses API. See the image generation guide to get started.

  • Significant DeepSeek improvements: added support for DeepSeekV3.2 with multi-latent attention, NVFP4 quantization support for DeepSeek-R1 (with expert parallelism), and expert parallelism now supports more than 32 local experts without requiring NVSHMEM for single-node deployments.

  • Major Blackwell (SM100) kernel optimizations, including SnapMLA for MLA decode, hardware-accelerated conv2d with TMA im2col for FLUX VAE, fused epilogues in BF16 and FP8 matmul kernels, and FP8 MMA support for MLA prefill with blockwise scaling.

Documentation

  • Refactored the MAX Python API reference into a flat list of module pages. Each summary page organizes APIs based on conceptual groups instead of source file locations. All API members also include a direct link to the source code on GitHub.

  • Added Basic operations to the model developer guide, covering tensor arithmetic, shape manipulation, reductions, matrix operations, activation functions, and random tensor generation.

  • Added Model pipeline to the model developer guide, explaining how to connect models to MAX's serving infrastructure with inference pipelines that handle weight loading, KV cache management, request batching, and tokenization.

  • Added Image generation to the inference guide, showing how to generate images from text prompts or transform existing images using the /v1/responses endpoint with FLUX models.

  • Added the Environment variables reference, documenting all configurable MAX environment variables for server settings, logging, telemetry, debugging, performance, and Hugging Face integration.

MAX models

  • Added support for FLUX image generation models (black-forest-labs/FLUX.1-dev and FLUX.2-dev). Supports fused graph compilation, batched VAE decoding, GPU-side post-processing, and first-block caching for repeated prompts.

  • Added support for Kimi vision-language models (moonshotai/Kimi-K2.5 and Kimi-VL-A3B-Instruct). Supports multi-GPU tensor parallelism, a custom vision processor, learnable 2D position embeddings, and tiktoken tokenizer.

  • Added support for OLMo 3 models (Olmo3ForCausalLM), for example allenai/Olmo-3-7B-Instruct.

  • Added support for Qwen3-MoE models (Qwen3MoeForCausalLM), for example Qwen/Qwen3-30B-A3B-Instruct, with multi-GPU tensor parallelism and FP8 quantization support.

  • DeepSeek improvements:

    • Added support for the DeepSeekV3.2 architecture with multi-latent attention and fused FP8 paged KV cache.
    • Added NVFP4 quantization support for DeepSeek-R1, including with expert parallelism.
    • Expert parallelism now supports more than 32 local experts and no longer requires NVSHMEM for single-node deployments.
    • Improved memory estimation for NVFP4-quantized models and EP communication buffers.
    • Added FP4 quantization support for the DeepSeek MTP speculative decoding module.
    • Various fixes: decode-only mode, handling of a missing rope_scaling config, a gather-index out-of-bounds error in DeepSeek-V2-Lite, and re-enabled multi-GPU tensor parallelism for DeepSeek-V2-Lite-Chat.
  • Removed legacy Gemma 3 multimodal implementation and the MODULAR_MAX_DISABLE_GEMMA3_VISION environment variable.

  • Fixed multi-GPU tensor parallelism for GPT-OSS MoE models.

  • Common MAX models like Qwen 2.5 can now run on AMD RDNA consumer GPUs.

  • Improved Mistral3 text encoder performance by compiling hidden-state selection and eliminating redundant GPU transfers.

  • Fixed prompt validator for Qwen2.5-VL models.

  • Fixed audio generator pipeline to restore audio generation support.

  • Fixed multi-GPU NVFP4 inference for Llama3.

  • Fixed Idefics3 chat template image placeholder ordering.

  • Added MXFP4 quantization support for GPT-OSS models (such as openai/gpt-oss-20b).

MAX framework

  • Upgraded the bundled libnvptxcompiler from CUDA 12.9 to CUDA 13.1, which requires NVIDIA GPU driver 580 or higher. This brings the latest bug fixes and performance improvements from NVIDIA's PTX compiler, as well as full support for new hardware like the DGX Spark and Jetson Thor.

    To use MAX and Mojo with older NVIDIA drivers and hardware, you can set the MODULAR_NVPTX_COMPILER_PATH environment variable to point to a system ptxas binary, instead of using the bundled libnvptxcompiler version.

    The Mojo DeviceContext() constructor now checks NVIDIA driver compatibility at creation time and provides a clear error message when the driver version is too old, matching the behavior of the Python Accelerator() API.

  • Runtime GPU errors now include a Python source traceback, showing where the failing operation was defined in your graph-building code. Build with MODULAR_MAX_DEBUG=True to enable source note collection; when source notes aren't available, error messages include a hint about how to enable them.

  • Added MODULAR_DEBUG_DEVICE_ALLOCATOR environment variable for debugging GPU memory issues. Set to uninitialized-poison to fill buffers with sentinel values (qNaN for floats, 0xCD for others) to detect use of uninitialized data, or out-of-bounds to enable redzone checks for buffer overflows. Accepts a comma-separated list for multiple options.
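
    For example, a minimal sketch of enabling both checks from Python before any devices are created (the variable can equally be exported in your shell; it must be set before MAX initializes the GPU):

    ```python
    import os

    # Enable both debug modes as a comma-separated list:
    # uninitialized-poison fills new buffers with sentinel values,
    # out-of-bounds adds redzone checks around allocations.
    os.environ["MODULAR_DEBUG_DEVICE_ALLOCATOR"] = "uninitialized-poison,out-of-bounds"

    # ...then create devices and load the model as usual.
    ```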

  • Fixed a memory leak in CUDA graph execution where output buffers were not freed between replays, causing GPU memory to grow over time during sustained inference.

  • Fixed compilation cache misses when cross-compiling GPTQ and LoRA models on machines without a GPU. Weight dtype casting now skips the actual data conversion in virtual device mode, because only compilation metadata is needed.

  • Enabled peer-to-peer device memory access for AMD HIP multi-GPU configurations, allowing direct GPU-to-GPU memory transfers on AMD hardware.

  • Fixed multi-GPU communication silently falling back to a slower transport on systems where rdma-core is installed without dev packages (common in production containers).

  • Fixed multi-GPU broadcast operations failing with "Broadcast currently requires P2P access between GPUs," due to a regression in peer-to-peer device access initialization.

  • Improved Hugging Face model downloads: gated repo errors now surface clearly instead of showing a misleading "check the repo name" message.

Inference server

  • Added image generation support via a new /v1/responses endpoint implementing the OpenResponses API standard. Enable it by adding responses to MAX_SERVE_API_TYPES (for example, MAX_SERVE_API_TYPES='["openai","responses"]'). Currently supports FLUX diffusion models. For more information, see the image generation guide.

  • Added output_format parameter to image generation requests, allowing clients to choose JPEG, PNG, or WEBP output per request (default remains JPEG).
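
    As a rough sketch using the requests library, assuming a local server on the default port with the responses API enabled and an OpenAI Responses-style payload (the field names and values below are illustrative; the exact request schema is in the image generation guide):

    ```python
    import requests

    payload = {
        "model": "black-forest-labs/FLUX.1-dev",   # a FLUX model served by MAX
        "input": "A watercolor lighthouse at dawn",
        "output_format": "png",                    # per-request: JPEG (default), PNG, or WEBP
    }

    resp = requests.post("http://localhost:8000/v1/responses", json=payload, timeout=600)
    resp.raise_for_status()
    print(resp.json())
    ```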

  • Overlap scheduling is now auto-enabled for select model architectures such as LlamaForCausalLM_Legacy, and is compatible with prefix caching. It reduces CPU overhead by overlapping Python host code with GPU kernel execution. It is currently incompatible with some features, such as structured outputs and CPU models, and remains experimental; you can disable it with --no-enable-overlap-scheduler --force.

  • Speculative decoding improvements:

    • Added typical-acceptance rejection sampling.
    • Added rejection-sampling-strategy option (greedy or residual) for speculative decoding. Defaults to residual; use greedy for models that pass hidden states.
    • Applied repetition/frequency/presence penalty sampling in EAGLE.
    • Enabled weight sharing between MTP draft and main model to reduce memory.
    • Added support for chunked prefill with EAGLE and MTP speculative decoding.
    • Fixed batch context length calculation for draft models.
    • Fixed EAGLE penalty inputs being unconditionally applied.
  • EAGLE speculative decoding now reports the draft token acceptance rate in scheduler metrics output.

  • Added KV cache offloading: KV cache blocks can now spill from GPU to CPU memory and disk when GPU memory is full, enabling larger effective cache capacity and warm restarts. Includes LMCache integration for sharing KV cache across model instances via external storage (CPU, disk, Redis), with multi-GPU tensor parallelism support.

  • CUDA graph capture is now auto-enabled for Llama models when max_batch_size is set, reducing per-token latency. You can opt out with --no-device-graph-capture --force.

  • Added FP8 quantization support for the KV cache, reducing KV cache memory usage. Configure via --kv-cache-format float8_e4m3fn (also supports float32 and bfloat16).

  • Added configurable batch scheduling strategy for text generation via the MAX_SERVE_BATCH_PRIORITY environment variable. It defines how the scheduler prioritizes between prefill (context encoding) and decode (token generation) when constructing batches. Options: prefill_first (minimize time-to-first-token), decode_first (minimize inter-token latency), balanced (adaptive based on global queue state), or per_replica (each replica decides independently; default).
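
    For example, to bias batching toward time-to-first-token (a minimal sketch; set the variable in whatever environment launches the server):

    ```python
    import os

    # One of: prefill_first, decode_first, balanced, per_replica (default).
    os.environ["MAX_SERVE_BATCH_PRIORITY"] = "prefill_first"
    ```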

  • Diffusion models can now specify a default num_inference_steps per architecture.

  • Added --first-block-caching flag to enable first-block caching (FBCache) for diffusion models like FLUX, and --residual-threshold for the TaylorSeer caching strategy. Both are configurable via max serve and max generate.

  • Enabled logprobs in chat completion responses, returning per-token log probabilities.
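
    For example, with the OpenAI Python client pointed at a local MAX server (the base URL, port, and model name below are illustrative):

    ```python
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    completion = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: whichever model the server hosts
        messages=[{"role": "user", "content": "Name three prime numbers."}],
        logprobs=True,   # request per-token log probabilities
        max_tokens=32,
    )

    for item in completion.choices[0].logprobs.content:
        print(item.token, item.logprob)
    ```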

  • Non-streaming requests are now cancelled when the client disconnects, preventing zombie requests from consuming KV cache memory.

  • Improved streaming performance by buffering generated tokens and detokenizing them in batches rather than one at a time, reducing CPU overhead and improving GPU utilization.

  • Improved multi-GPU AllReduce performance by launching per-device kernels in parallel async tasks instead of sequentially.

  • Fixed a server hang when a model worker process crashes before it finishes initializing.

  • Fixed per-request seed handling in TopK/TopP sampling. Seeds are now correctly applied per request instead of using a single seed for the entire batch.

  • Fixed KV cache blocks not being released after offline text generation (generate() / generate_async()), which could cause block exhaustion during sustained inference.

  • Fixed three issues in the disaggregated inference decode scheduler: KV cache blocks leaking on request cancellation, replica load-balancing counters drifting over time, and a KeyError crash when stale prefill responses arrived after cancellation.

max CLI

  • Added the --device-graph-capture flag to enable CUDA graph capture for serving, reducing per-token latency by replaying recorded GPU kernel launches. Auto-enabled for Llama and DeepSeek V3; opt out with --no-device-graph-capture --force.
  • Added the --debug-verify-replay flag to run eager launch-trace verification before device graph replay, for debugging CUDA graph correctness issues.
  • Added the --kv-cache-format flag to set the KV cache data type at runtime. Accepts float32, bfloat16, or float8_e4m3fn for FP8 quantized caching.
  • Added the --lmcache-config-file flag to enable LMCache-based external KV cache tiering. Point it at an LMCache YAML config to share KV cache blocks across model instances via CPU, disk, or remote storage.
  • Added the --reasoning-parser flag to max serve to enable extraction of model thinking/reasoning content into a separate reasoning field on the OpenAI API response. Currently supports Kimi K2.5 (kimi-k2), with a registry for adding additional parsers.
  • Added the --rejection-sampling-strategy flag to select the rejection sampling method for speculative decoding. Options: greedy, residual (default for standalone), or typical-acceptance (default for EAGLE/MTP). Use greedy for models that pass hidden states.
  • max benchmark now uses the model's default temperature when none is specified.
  • max benchmark no longer overrides top_p unless the user provides a value.
  • Removed the --cache-strategy flag.

Python API

  • Tensor.constant() is deprecated. Use the Tensor(data, dtype=..., device=...) constructor directly, matching PyTorch's torch.tensor() semantics. For example, replace Tensor.constant([1.0, 2.0]) with Tensor([1.0, 2.0]). Tensor.constant() will be removed in a future release.
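
    A minimal migration sketch (the import path reflects the namespace changes under Breaking changes; dtype and device keywords are shown only in comments):

    ```python
    from max.experimental.tensor import Tensor

    # Deprecated:
    # x = Tensor.constant([1.0, 2.0])

    # Preferred: construct directly, matching torch.tensor() semantics.
    x = Tensor([1.0, 2.0])

    # dtype and device remain optional keywords, e.g.
    # Tensor([1.0, 2.0], dtype=..., device=...).
    ```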

  • DeviceEvent now accepts an enable_timing=True parameter to enable GPU event timing. Use start.elapsed_time(end) to measure elapsed GPU time in milliseconds between two timing-enabled events.

  • Added the prod op for computing the product of elements along an axis, available as max.graph.ops.prod, max.experimental.functional.prod, and Tensor.prod().
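
    A small sketch of the eager form (the axis keyword name is assumed to match the other reduction ops; check the prod reference for the exact signature):

    ```python
    from max.experimental.tensor import Tensor

    x = Tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

    # Product of elements along the last axis: [6.0, 120.0]
    row_products = x.prod(axis=-1)
    print(row_products)
    ```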

  • Device.stats now includes graph_mem_reserved and graph_mem_used fields for device graph memory observability.

  • Module.compile() now validates weight names, dtypes, and shapes before loading, surfacing mismatches as Python errors instead of runtime crashes during asynchronous host-to-device transfers.

  • InferenceSession now automatically includes the CPU in its device list, removing the need to manually add it when graphs include host-side values.

  • Added max.graph.ops.broadcast for distributed broadcast across devices. Raises ValueError when signal_buffers is empty.

  • Added manual synchronization API (DevicePinnedBuffer, DeviceEvent) for controlling buffer readiness and reducing stream synchronization overhead.

  • Tensor.cast() is now idempotent for same-dtype casts.

  • Added F.cond to the experimental functional API for conditional execution.

  • Added fast path for Tensor.to(device) in eager mode.

  • Added Dim-based scalar dimension API to Module.compile().

  • Module is now device-aware via to() for unified device placement.

  • Module.load_state_dict() now validates weight attribute names.

  • Algebraic dims and graph/custom op construction now work without an explicit context manager, using a global MLIR context. Threadpool-backed MAX paths now scope worker-thread MLIR usage to the default context automatically.

  • Renamed Float8Config to QuantConfig (and related types/functions) to reflect that the config now covers FP8, NVFP4, and MXFP4 quantization.

  • Renamed related public Python quantization APIs from Float8* names to Quant* names, including parse_float8_config() to parse_quant_config(), and the public quant modules in max.nn and max.pipelines.lib.

  • max.diagnostics.gpu.BackgroundRecorder's sampling interval can now be configured.

Breaking changes

  • Reorganized max.nn namespace. The graph-based neural network API has been restored as the default max.nn namespace (previously located under max.nn.legacy). The eager module API has moved from max.nn to max.nn.module_v3. Additionally, max.tensor, max.functional, and max.random have moved back under max.experimental (max.experimental.tensor, max.experimental.functional, max.experimental.random). Update imports accordingly.
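
    Updated imports, sketched from the moves above:

    ```python
    # Graph-based NN API is back at the default namespace:
    from max import nn

    # Eager/experimental APIs now live under max.experimental:
    from max.experimental import functional, random
    from max.experimental.tensor import Tensor

    # The eager Module API itself now lives under max.experimental.nn
    # (see the following item).
    ```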

  • Moved experimental APIs under max.experimental. Two additional packages have moved under the max.experimental namespace to co-locate all experimental APIs:

    • max.torch is now max.experimental.torch. Update imports from from max.torch import CustomOpLibrary, graph_op to from max.experimental.torch import CustomOpLibrary, graph_op.

    • max.nn.module_v3 is now max.experimental.nn (the v3 suffix has been dropped). Update imports from from max.nn.module_v3 import Module, Linear to from max.experimental.nn import Module, Linear.

  • Removed PipelineConfig.max_length. The max_length parameter now lives on the model configuration as MAXModelConfig.max_length (accessible as config.model.max_length), since it describes model capacity (the maximum sequence length the model can process) rather than pipeline runtime behavior. Update configurations and code to use model.max_length instead of the removed pipeline-level field.
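
    A before/after sketch, assuming config is an existing PipelineConfig instance:

    ```python
    # Before:
    # max_len = config.max_length

    # After: max_length lives on the model config (MAXModelConfig).
    max_len = config.model.max_length
    ```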

  • PipelineModel no longer accepts the encoding parameter. The encoding parameter has been removed from PipelineModel.__init__ and all subclasses. The encoding is now automatically inferred from pipeline_config.model.quantization_encoding. This change eliminates redundant parameter passing and ensures a single source of truth for quantization encoding configuration.

  • Device-graph APIs now require explicit caller-provided graph keys for capture/replay/verification. Update calls from model.capture(*inputs), model.replay(*inputs), and model.debug_verify_replay(*inputs) to model.capture(graph_key, *inputs), model.replay(graph_key, *inputs), and model.debug_verify_replay(graph_key, *inputs).
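
    A migration sketch (model and inputs are assumed to already exist; the key shown is illustrative, use whatever your code keys captured graphs by):

    ```python
    # Before:
    # model.capture(*inputs)
    # out = model.replay(*inputs)

    # After: pass an explicit graph key to capture, replay, and verification.
    graph_key = "decode_bs8"  # illustrative
    model.capture(graph_key, *inputs)
    out = model.replay(graph_key, *inputs)
    model.debug_verify_replay(graph_key, *inputs)
    ```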

  • Removed q_max_seq_len from KVCacheParams; the value is now accepted at graph capture time instead.

  • MAXBaseModel now uses extra=forbid and strict=True; configs with unknown fields will be rejected.

  • Replaced disable_auto_sync/mark_as_ready with DevicePinnedBuffer and DeviceEvent for pinned memory management.

MAX kernels

  • Blackwell (SM100) GPU performance:

    • Optimized Attention on SM100 by skipping unnecessary softmax corrections when the row maximum change is small.
    • Fused epilogue into SM100 BF16 and FP8 matmul kernels.
    • Improved SM100 FP8 matmul dispatch for small M shapes (M <= 128).
    • Fixed matmul kernel dispatch on SM100.
    • Added SM100 hardware-accelerated conv2d with TMA im2col and fused residual epilogue for FLUX VAE.
    • Added batched BF16 matmul support for SM100.
    • Added SnapMLA implementation for SM100 MLA decode.
    • Added FP8 tensorwise and block-scale MLA decode for SM100/B200.
    • Added FP8 MMA support for MLA prefill with blockwise scaling and K RoPE.
    • Enabled MLA attention for SM100 GPUs.
    • Enabled 64x256 N split MMA for B200 MLA decode (long context).
    • Used TMA for KV scale loads in attention kernels (SM100).
  • AMD GPU kernel improvements:

    • Tuned and optimized GEMV split-K BF16 dispatch and kernel for AMD GPUs.
    • Enabled FP8 GEMV kernel on AMD GPUs.
    • Reduced K buffer bank conflicts in MHA prefill on AMD via swizzle.
    • Integrated AMD pingpong kernel with FP8 dispatch and fixed TP > 1.
    • Fixed out-of-bounds masking and handling of depths > 256 on AMD RDNA GPUs.
    • Enabled rocSHMEM GDA backend with TCP bootstrap for multi-node AMD EP.
  • Grouped matmul improvements (SM100):

    • Added MMA_N=64 support for 1D1D block-scaled grouped matmul.
    • Added 2SM support to structured 1D1D grouped matmul kernel.
    • Enabled swapAB for block-scaled grouped matmul and block-scaled matmul on SM100.
    • Added tensor scale factor to block-scaled 1D1D grouped matmul.
    • Added bf16 scales support to blockwise FP8 grouped matmul.
  • DeepSeek kernel optimizations:

    • Added BF16 MLA prefill/decode mega-kernel.
    • Enabled BF16 graph execution path for Multi-Latent Attention.
    • Enabled fused QKV projection for latent attention with RoPE.
    • Fused RoPE and RMSNorm into MLA custom ops.
    • Fused epilogue operations in DeepSeek BF16 matmul kernels.
    • Added fused dispatch and combine kernels for expert parallelism.
    • Enabled Mojo BF16 matmul kernels and FP4 kernels for DeepSeek shapes.
    • Fixed blockwise FP8 batched matmul for non-row-major layouts.
  • Multi-GPU distributed ops:

    • Added fused allreduce + RMSNorm + FP8 kernel with residual path and 2-stage allreduce for tensor-parallel workloads.
    • Added distributed scatter graph op for multi-GPU DP>1 inference.
    • Fixed and optimized broadcast kernel for BF16/FP16 with multimem on GPU.
    • Fixed and optimized 2-stage broadcast kernel for multi-GPU.
  • FLUX kernel improvements: Autotuned cuDNN convolution algorithm selection and cached results. Added multi-block GroupNorm GPU kernel. Enabled high-performance Mojo matmul kernels for FLUX.2. Fixed grouped conv2d on GPU incorrectly ignoring the num_groups parameter.

  • kbench now runs benchmarks via a shared library (.so) by default, reusing persistent workers and CUDA contexts instead of spawning subprocesses. The benchmark execution phase is ~10x faster (for example, 4.25 h → 0.4 h on a tuning workload). It falls back to subprocess mode when profiling or using custom exec wrappers.

  • Added MXFP4 dequant and matmul kernels.

  • Optimized FP4 matmul dispatch for Llama-style shapes and added FP4 GEMM dispatch configs for additional shape coverage.

  • Used asynchronous FP4 quantization kernel for improved throughput.

  • Optimized Hopper matmul for M=256 and small M shapes via swapAB.

  • Improved GEMV kernel performance.

  • Integrated the FlashInfer TopK kernel for improved sampling performance.

  • Improved layer normalization kernel performance.

  • Added FP8 support to FlashMLA decode kernel.

  • Fixed FP8 cast lambda epilogue in matmul.

  • Fixed NaN in MLA decode split-K kernel with causal masking.

  • Fixed warpgroup deadlock in MLA decode that could cause hangs on DeepSeek models.

  • Fixed incorrect MoE expert routing caused by bitonic sort merge direction bug.

  • Fixed int8 matmul dispatch on ARM64.

  • Fixed Metal buffer tracking for sub-buffers and tensor slices on Apple Silicon.

Mojo language

For all the updates to the Mojo language, standard library, and tools, including all GPU programming and Layout/LayoutTensor changes, see the Mojo changelog.
