What's new

Here's everything you should know about each release.

Nightly (v26.4)​

This version is still a work in progress.

MAX models​

  • Added NVFP4 quantization support for Gemma 4.
  • Added MXFP4 quantization support for MiniMax-M2.

MAX framework​

Inference server​

  • MAX Serve now emits the maxserve.num_requests_queued OTel/Prometheus metric (changed from an UpDownCounter to a synchronous Gauge). The gauge is sampled once per scheduler iteration from BatchMetrics.publish_metrics and reports the depth of the scheduler's CE / prefill queue (the same value as the Pending: N reqs line in scheduler logs). It is published by every text-path scheduler that drives BatchMetrics: TokenGenerationScheduler and PrefillScheduler (via TextBatchConstructor), and DecodeScheduler (via len(pending_reqs) + len(prefill_reqs)). Operators can use this metric to observe queue buildup during overload conditions.
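
    To spot-check the queue depth, you can scrape the serve process's Prometheus endpoint. This is a minimal sketch; the host, port, /metrics path, and the exported metric spelling (dots converted to underscores) are assumptions to adjust for your deployment:

    import urllib.request

    # Fetch the Prometheus exposition text and print the queue-depth gauge lines.
    with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
        for line in resp.read().decode().splitlines():
            if "num_requests_queued" in line and not line.startswith("#"):
                print(line)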

max CLI​

  • Added --devices=gpu:all to use every visible GPU; the flag is also supported by MAX Serve.
  • Removed the default value for --devices; omit --devices to use the model or config default.

Python API​

  • Increased the default allreduce signal buffer size from 513 MiB to 1025 MiB per GPU (max.nn.comm.allreduce.Signals.NUM_BYTES and the matching constant in max.experimental.realization_context). The previous 512 MiB scratch could not hold the per-peer allgather intermediate for models with large hidden dimensions (for example, Kimi-K2.5 at hidden_dim=20480 with max-batch-input-tokens=16384 needs 640 MiB in bf16). This adds ~512 MiB of per-GPU memory use for any multi-GPU model.
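
    As a quick sanity check on the sizing, the per-peer allgather intermediate from the Kimi-K2.5 example above can be computed directly. This is a back-of-the-envelope sketch, not an exact accounting of the signal buffer layout:

    hidden_dim = 20480              # Kimi-K2.5 hidden dimension
    max_batch_input_tokens = 16384
    bytes_per_elem = 2              # bf16

    # 640 MiB: overflows the previous scratch buffer but fits in the new one.
    intermediate_mib = hidden_dim * max_batch_input_tokens * bytes_per_elem / 2**20
    print(intermediate_mib)         # 640.0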

  • max.experimental.nn.Module.compile() now emits the same Building and compiling {ClassName}... / Still building... / Building {ClassName} graph took Ns / Compiling {ClassName} took Ms / Building and compiling {ClassName} took Ts log sequence that pipeline-level CompilationTimer produces today, and wraps the compile body in max.profiler.Tracer spans (Module.compile({ClassName}), Module.compile.trace, Module.compile.session_load) so an nsys capture with MODULAR_ENABLE_PROFILING=1 shows compilation as named ranges. Every ModuleV3 caller, including pixel-generation pipelines that previously compiled silently, now gets this observability for free. The outer CompilationTimer("model") wrappers in *_modulev3 architectures have been removed to avoid nested timing logs.

  • CPUMetricsCollector in max.diagnostics.cpu is now used as a context manager rather than via start()/stop() calls, and exposes get_stats() instead of dump_stats(), matching the interface of GPUDiagContext.
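
    A minimal usage sketch of the new interface; the no-argument constructor and querying the collector after the with-block exits are assumptions:

    from max.diagnostics.cpu import CPUMetricsCollector  # shimmed; moving to max.profiler.cpu

    with CPUMetricsCollector() as collector:
        _ = sum(i * i for i in range(10**6))  # stand-in CPU workload to measure
    stats = collector.get_stats()             # replaces dump_stats()
    print(stats)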

  • max.graph.Module is now a public class for grouping multiple Graph instances into a single compilation unit, replacing the previous alias for the underlying MLIR module. Construct one with Module() and pass it as the module= argument to each Graph; the resulting Module is what you hand to InferenceSession.load_all to compile every graph together. Graph.empty_module() has been removed in favor of Module(), and Graph now exposes a module property returning the Module it belongs to.

  • InferenceSession.load_all now returns a dict[str, Model] keyed by each model's sym_name (the name of its mo.graph op), instead of a list[Model] ordered by MEF position. The accepted input type also gained max.graph.Module, so callers can compile a pre-built module containing multiple mo.graph ops directly. Model now exposes a name property.

    Migrate positional unpacking call sites by indexing the returned dict:

    # Before
    module = Graph.empty_module()
    with Graph("vision", input_types=..., module=module): ...
    with Graph("language", input_types=..., module=module): ...
    vision_model, language_model = session.load_all(module, ...)
    
    # After
    module = Module()
    with Graph("vision", input_types=..., module=module) as vision_graph: ...
    with Graph("language", input_types=..., module=module) as language_graph: ...
    models = session.load_all(module, ...)
    vision_model = models[vision_graph.name]
    language_model = models[language_graph.name]

MAX kernels​

  • The use_blocking_impl parameter has been removed from the foreach custom op helper (and the underlying elementwise primitive), and the analogous single_thread_blocking_override parameter has been removed from the concat and concat_shape kernels and the reduction-based kernels. Work is always dispatched the same way, with a single worker used automatically when the problem size is small. The dedicated small-tensor concat fast path has been removed in favor of the existing serial/parallel dispatch.

Breaking changes​

  • KV cache management has moved from max.kv_cache to max.pipelines.kv_cache. Update imports accordingly:

    # Before
    from max.kv_cache import PagedKVCacheManager, DummyKVCache
    
    # After
    from max.pipelines.kv_cache import PagedKVCacheManager, DummyKVCache

    Deprecation shims with DeprecationWarning remain at the old path.

  • GPU and CPU diagnostic tooling has moved from max.diagnostics to max.profiler: max.diagnostics.gpu → max.profiler.gpu and max.diagnostics.cpu → max.profiler.cpu. Update imports accordingly. Deprecation shims with DeprecationWarning remain at the old paths.
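
    For example (BackgroundRecorder and CPUMetricsCollector are used here only as illustrative symbols from those modules):

    # Before
    from max.diagnostics.gpu import BackgroundRecorder
    from max.diagnostics.cpu import CPUMetricsCollector

    # After
    from max.profiler.gpu import BackgroundRecorder
    from max.profiler.cpu import CPUMetricsCollector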

  • max/python/max/benchmark/benchmark_throughput.py, deprecated in v26.3, has been removed.

Fixes​

  • MODULAR_DEBUG=ir-output-dir=<dir> (and the equivalent [max-debug] ir-output-dir = <dir> config-file entry and InferenceSession.debug.ir_output_dir = <dir> Python setter) now actually dumps per-stage MLIR files to the configured directory. The option was previously parsed but no compiler stage consulted it, so users had to fall back to the legacy MODULAR_MAX_TEMPS_DIR env var. Both spellings are now honored.
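
    A minimal sketch of the Python spelling; the InferenceSession constructor arguments are omitted, and setting the option before compilation is an assumption:

    from max.engine import InferenceSession

    session = InferenceSession()
    session.debug.ir_output_dir = "/tmp/max-ir"  # per-stage MLIR files are written here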

Mojo language​

For all the updates to the Mojo language, standard library, and tools, see the Mojo release notes.

v26.3 (2026-05-07)​

Highlights​

  • MAX now supports video generation with Wan 2.1 / 2.2 diffusion models, including image-to-video and video-to-video pipelines.

  • New API for multi-GPU model execution from Python: the max.experimental.sharding module lets a single Module.compile() call distribute a model across a DeviceMesh using Replicated, Sharded, and Partial placement primitives. Gemma 3 ModuleV3 is the first multi-GPU model on this path.

  • The MAX NVFP4 grouped matmul kernel now outperforms FlashInfer on B200 across all tested decoding and prefill shapes for Kimi K2.5.

MAX models​

  • The residual_threshold parameter for FLUX first-block cache (FBCache) is now a per-request runtime parameter on ImageProviderOptions, allowing it to be tuned without recompiling the model graph.

  • Added the Mamba state space model architecture.

  • Added the Step-3.5-Flash architecture.

  • Added the Qwen-Image and Qwen-Image-Edit text-to-image architectures.

  • Added the Z-Image and Z-Image-Turbo text-to-image architectures.

  • MiniMax-M2 and MiniMax-M2.7:

    • Added MiniMax-M2 and MiniMax-M2.7 architecture support, including FP8 weights, the lightning-attention hybrid backbone, and 4Γ—H100 multi-GPU serving.
    • Enabled DP+EP execution paths for MiniMax MoE layers, with automatic overlap scheduling and device-graph capture.
    • Added per-rank token-limit checks and reduced input-offset device round trips on the MiniMax decode path.
  • Gemma 4 and Gemma 3 ModuleV3:

    • Added the Gemma 4 architecture (ModuleV2), including multimodal vision support.
    • Added the Gemma 3 ModuleV3 implementation with multi-GPU support via the DTensor / DistributedTensorType compile path.
    • Fixed token-offset and prompt-image alignment regressions in Gemma 4 multimodal prefill, plus assorted Gemma 3 ModuleV3 performance fixes.
  • Qwen3 and Qwen3-VL:

    • Added Qwen3 and Qwen3-VL architecture support, including the MoE variant and multimodal vision input.
  • Wan video diffusion:

    • Fixed Wan 2.1 / 2.2 video diffusion pipelines silently running without classifier-free guidance. The tokenizer gated negative-prompt tokenization on true_cfg_scale > 1.0 (default 1.0), so negative tokens were never produced and the executor fell back to unguided generation even when guidance_scale > 1.0 and a negative prompt were supplied. Wan now enables classical CFG whenever guidance_scale > 1.0 and defaults an absent negative prompt to the empty string, matching the diffusers baseline.
    • Added the UniPC multistep scheduler for Wan diffusion.
    • Added Wan image-to-video and video-to-video pipeline variants, plus additional generation kwargs and prompt-handling fixes.
  • FLUX.2:

    • Added TaylorSeer denoising cache support to the FLUX.2 Klein pipeline, enabling significant speedups for image-to-image generation by skipping redundant transformer passes during the denoising loop.
    • Added TeaCache support to DiffusionPipeline as a peer of TaylorSeer.
    • Added FLUX.2 ModuleV2 pipeline, FLUX.2 Klein support, NVFP4 quantization, aspect-ratio preserving image preprocessing, and BFL checkpoint weight fixes.
  • Kimi K2.5 vision:

    • Improved Kimi K2.5 multimodal support, including vision encoder fixes and tokenizer parity with the upstream model.
  • DeepSeek V3 and Kimi K2.5 distributed execution:

    • Improved tensor-parallel and expert-parallel execution paths for DeepSeek V3 and Kimi K2.5, including subgraph deduplication, MoE dispatch tuning, and reduced compile-time overhead.

MAX framework​

Inference server​

  • Added periodic "still building/compiling" log messages during model compilation so that long operations produce visible signs of progress.

  • Consolidated KV connector CLI flags (--host-kvcache-swap-space-gb, --disk-offload-dir, --disk-offload-max-gb, --disk-offload-direct-io, --lmcache-config-file) into the --kv-connector-config JSON dict.

  • Removed the --allow-safetensors-weights-fp32-bf16-bidirectional-cast CLI flag. Float32 <-> bfloat16 safetensors weight casting is now unconditionally enabled.

  • Added --model-override CLI flag for per-component ModelManifest overrides (e.g. --model-override transformer.quantization_encoding=float4_e2m1fnx2), enabling mixed quantization in diffusion pipelines.

  • Removed jump forward decoding (compute_ff_tokens) from structured output. The bitmask constraint alone ensures valid structured output, matching the approach used by vLLM and SGLang.

  • Added json_object response-format support to MAX Serve structured output via /v1/chat/completions.

  • Improved error handling for image request failures in MAX Serve.

  • Added multi-step and overlap-scheduler support for structured output in the TextGenerationPipeline. Extended tokenizer support to include TikToken-based tokenizers, enabling structured output with Kimi K2.5.

  • Improved cached-token reporting, fixed cache hit/miss metrics to emit only on context-encoding batches, moved a subset of telemetry from detailed to basic, and added per-draft-position acceptance-rate logging for speculative decoding.

  • Tightened the MODULAR_MAX_SERVE_* environment-variable prefix; unprefixed overrides previously honored by max-serve no longer apply.

  • Added min_p and top_k sampling controls and additional chat-completion kwargs to the OpenAI-compatible routes.

  • Unified EAGLE speculative decoding:

    • Added unified EAGLE pipelines for Llama 3, DeepSeek V3 + MTP, and Kimi K2.5, sharing a single PipelineModel.
    • Added support for --num-speculative-tokens > 1 across the unified EAGLE Llama, DeepSeek+MTP, and Kimi+EAGLE paths.
    • Added overlap-scheduler support for unified EAGLE, including multi-GPU DP setups (e.g. DP Kimi).
    • Enabled CUDA graphs for EAGLE and MTP.
  • Distributed KV transfer (dKV):

    • Added the DKVConnector with NIXL transfer support for the distributed KV cache.
    • Unified KV connector configuration under --kv-connector-config.
    • Added EFA compatibility, disconnect support, parent-hash eviction, and per-connector metrics for the dKV transfer engine.
    • Added a configurable decode-stall watchdog for 1P1D deployments.
    • Added disk-location support to the Python dKV client.
  • Heterogeneous serving and overlap scheduling:

    • Added two-phase prefill execution under the overlap scheduler for the distributed-inference (DI) prefill role.
    • Auto-enabled overlap scheduling for DI pipeline roles and disabled auto device-graph capture for prefill-only workers.
    • Added support for heterogeneous TP prefill / DP decode in MLA KV transfer (e.g. tp4 prefill into a DP decode pool).

max CLI​

  • Added sweep benchmarking capabilities to max benchmark: iterate over multiple concurrency and request-rate combinations, flush the prefix cache between runs, and collect per-run structured JSON results.
  • Standardized the --model flag across max serve, max generate, max encode, and max warm-cache.
  • Improved max serve CLI flag descriptions.

Python API​

  • Added Model.release_captured_graph(), which drops a previously captured device graph identified by graph key (or per-device keys) and frees its device-side working memory once any in-flight replay completes. Releasing a key that was never captured is a no-op. Callers remain responsible for dropping any output Buffer handles returned by the corresponding Model.capture() call.
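
    A minimal lifecycle sketch; model and inputs are placeholders for a compiled Model and its device buffers, and the keyed capture/replay signatures follow the v26.2 breaking change listed later on this page:

    key = "decode_bs1"                     # caller-chosen graph key
    outputs = model.capture(key, *inputs)  # record the device graph and get its output Buffers
    model.replay(key, *inputs)             # replay on later steps
    model.release_captured_graph(key)      # drop the graph and free its device-side working memory
    del outputs                            # caller still owns, and must drop, the capture's output Buffers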

  • Added ops.roi_align (with F.roi_align functional wrapper) for ROI Align pooling over NHWC inputs, with configurable spatial scale, sampling ratio, alignment mode, and AVG/MAX pooling. Includes a matching MO eager handler.

  • Added MO eager handlers for ConstantExternalOp, ConstantScalarOp, ReduceRmsNormOp, and ReduceGroupNormOp, so graphs with external weights, scalar constants, RMS norm, or group norm run eagerly without falling back to compilation.

  • Fixed tensor slicing with negative integer indices (e.g. hidden[:, -1]) which previously raised a RuntimeError at compile time.

  • Fixed ops.reshape / TensorValue.reshape rejecting valid -1 reshapes on tensors whose leading dim is a symbolic sum-of-products (e.g. [(batch_size * num_steps) + total_seq_len, 1536] reshaped to [-1, n_heads, head_dim] with n_heads * head_dim == 1536). The inferred dim now simplifies without requiring a rebind.

  • Setting MODULAR_MAX_DEBUG_UNINITIALIZED_READ_CHECK=true (or the max-debug.uninitialized-read-check config key, or InferenceSession.debug.uninitialized_read_check = True) enables detection of uninitialized memory reads in Mojo kernels. InferenceSession automatically enables the debug allocator poison and compiles kernels with load-time poison checks for all float types. When a load matches a poison pattern, the process aborts with a descriptive message.

  • Added support for the bfloat16 data type on ARM CPU devices in MAX graphs. Previously, session.load() raised a ValueError when a graph contained bf16 tensors targeting an ARM CPU.

  • Added DevicePlacementPolicy (Ignore, Warn, Error) to Graph to control behavior when CPU-only ops (ops.scatter, ops.cumsum, ops.nonzero, ops.tile) receive GPU tensors. The default (Warn) emits a UserWarning and falls back to CPU; Error raises ValueError instead. ops.cond and ops.while_loop always raise ValueError for GPU predicates.

  • Fixed slow axis=None reductions (mean, sum, prod, max, min) in max.experimental.functional. The previous implementation flattened the tensor before reducing, serializing the work onto a single GPU block. Reductions now iterate axis-by-axis to preserve parallelism.

  • Renamed the public quantization APIs from Float8* to Quant* (including Float8Config → QuantConfig, parse_float8_config() → parse_quant_config(), and the quant modules in max.nn and max.pipelines.lib), reflecting that the config now covers FP8, NVFP4, and MXFP4 quantization.

  • max.diagnostics.gpu.BackgroundRecorder's sampling interval can now be configured.

  • Introduced CPUMetrics alongside the existing GPU diagnostics and open-sourced it under max.diagnostics.

  • Added Model.kernel_summaries for inspecting compiled kernels through the Python API.

  • Added a unified DebugConfig Python class (with nanobind bindings) and exposed DebugConfig and GraphDebugConfig in max.engine and max.graph.

  • Added a graph API for initializing and registering the runtime context (M::Context) from Python.

  • Improved max.experimental.functional.custom: compiled custom-op kernels are now cached, and eager-mode F.custom no longer recompiles on every call.

  • Fixed Module.compile() when unrealized tensors are used as weights.

  • Added the InputModality enum for specifying model input types and threaded it through the multimodal pipeline architectures.

  • Updated Tensor.to() and Module.to() to accept distributed device targets, including DeviceMapping and DeviceMesh.

  • max.experimental.Tensor is now distribution-aware: it carries a tuple of per-shard storages (driver.Buffers when realized, or TensorValue / BufferValue graph values when unrealized), paired with a DeviceMapping that maps those local shards onto the DeviceMesh.

  • Reworked max.experimental.functional from a single functional.py into a functional/ package, a distribution- and mesh-aware dispatch layer on top of the graph-compiler Python API. It is split into three op categories: creation_ops (tensor factories), spmd_ops (rule-based per-op SPMD dispatch), and collective_ops (allreduce_sum, allgather, reduce_scatter, and so on, now applied per device group along a chosen mesh axis so they dispatch correctly on multi-dimensional meshes, plus a transfer_to convenience op between DeviceMappings).

  • Added max.experimental.sharding with the core types for distributed tensors (DeviceMesh; DeviceMapping with PlacementMapping and NamedMapping; placement primitives Replicated / Sharded / Partial; DistributedTensorType / DistributedBufferType; TensorLayout), plus a sharding.rules submodule of pure mapping-propagation rules (elementwise, matmul, reduction, shape, conv, pooling) that, for each op, either error out or reshard inputs to the proposed DeviceMappings and derive the resulting output DeviceMapping.

  • max.experimental.nn.Module.compile() now accepts DistributedTensorType symbolic inputs (not just TensorType), so distributed models can be built via the graph-compilation path in addition to running eagerly; gemma3_modulev3 is the first multi-GPU model wired up. DTensor support in MAX is still ongoing work and these APIs may evolve.

  • Added new graph ops (with matching max.experimental.functional wrappers): scatter_max, scatter_min, scatter_mul, scatter_nd_max, scatter_nd_min, scatter_nd_mul, non_maximum_suppression, resize_linear, resize_nearest, and resize_bicubic. The existing max.graph.ops.resize now delegates to these for BILINEAR, NEAREST, and BICUBIC interpolation modes. max.graph.ops.pad (and the functional wrapper) also accepts mode='reflect' and mode='edge' in addition to mode='constant'.

  • Expanded experimental eager-interpreter coverage so significantly more graphs run end-to-end without falling back to compilation. Added handlers for gather, gather_nd, argmax, argmin, split, scatter, scatter_nd, scatter_nd_add, scatter_add, scatter_max, scatter_min, scatter_mul, scatter_nd_max, scatter_nd_min, scatter_nd_mul, tile, band_part, top_k, bottom_k, nonzero, non_maximum_suppression, pad (constant on CPU/GPU; reflect and edge on CPU), conv2d, conv2d_transpose, max_pool2d, avg_pool2d (floor and ceil mode), resize_linear, resize_nearest, resize_bicubic, mo.mutable.store, mo.mutable.store.slice, and the distributed collectives distributed.allreduce.sum, distributed.allgather, distributed.scatter, distributed.broadcast, and distributed.reducescatter.sum. Most run on both CPU and GPU; CPU-only handlers are noted as such.

  • Rewrote the eager-interpreter mo.mutable.store.slice handler to write slices via a device-side Mojo kernel instead of a host numpy round-trip. GPU buffers no longer round-trip D→H→D on every call, and bfloat16 and float8_* dtypes are now supported (float4_e2m1fn remains unsupported).

  • Added defensive eager-interpreter handlers for mo.shape.from_tensor, mo.index.to_tensor, mo.buffer.create, mo.buffer.transfer, and mo.gather_sum so eager runs no longer crash if these internal ops survive canonicalization.

  • Improved experimental eager-interpreter performance by enabling multi-threaded CPU execution and removing unnecessary GPU device synchronization between op dispatches.

  • Added max.nn.StackedLinear for QKV-style stacked projections, with a fused (stacked=True) and an unfused (stacked=False) layout. Unfused mode opts into a new Module._omit_module_attr_name flag, which drops the wrapper's own attribute name from descendant weight FQNs, so a self.qkv_proj = StackedLinear(names=["q_proj", "k_proj", "v_proj"], stacked=False) exposes weights at self_attn.q_proj.weight rather than self_attn.qkv_proj.q_proj.weight. This lets HuggingFace checkpoint names flow into models without per-architecture remapping in their weight_adapters.py.

  • Module.compile() now accepts a custom_extensions parameter for loading custom Mojo kernel libraries at graph construction time, fixing validation failures for kernels with struct-level parameters.

  • Fixed torch.compile(fullgraph=True) failing with an "Unsupported context manager" error when accessing CustomOpLibrary ops inside the compiled function. Ops are now eagerly compiled during library initialization.

  • Runtime and device graph performance:

    • Reduced device-graph launch overhead for single-graph models.
    • Parallelized device-graph instantiation and moved instantiation off the main execution threads.
    • Added parallel device-graph launches and a task-ID hint on AsyncRT algorithms.
    • Added a GPU health check during DeviceContext initialization.
    • Added NaN/Inf detection at compiled-region boundaries.
    • Improved Metal driver support with custom statuses and Metal log capture for Apple GPU print output.
    • Made CPUDeviceContext asynchronous and added enqueue_cpu_function / enqueue_cpu_range helpers for CPU kernel execution.
    • Auto-enabled device-graph capture for DeepSeek V3, Kimi, and Kimi K2.5 serving paths.

Custom ops​

  • Added host-function and in-place memcpy custom ops, including mo.launch_host_func, mo.inplace_memcpy, an enqueueHostFunc Mojo binding on DeviceStream, and a cuLaunchHostFunc binding for the CUDA device stream.

MAX kernels​

  • Added GPU kernel examples from the Programming Massively Parallel Processors (PMPP) textbook covering reductions, scans, histograms, sorting, sparse matrix operations, graph algorithms, convolutions, FlashAttention, and more.

  • Improved NVFP4 grouped matmul kernel performance, now outperforming FlashInfer across all tested decoding and prefill shapes for Kimi K2.5 on B200.

  • Optimized GPU layer_norm kernels with SIMD reductions, gamma/beta prefetch, and a simd_width*2 warp tiling dispatch path.

  • Optimized GPU pad_constant kernel with SIMD vectorization (simd_width=4) and added a kbench benchmark suite (bench_pad).

  • Improved GPU topk and argsort kernel performance by nearly 2x.

  • Optimized GPU concat with a flat-indexing kernel that avoids multi-dimensional index decomposition, using 128-bit vectorized loads with automatic fallback for unaligned shapes.

  • Optimized GPU topk stage-1 kernel with a per-thread register heap that caches the top-8 elements during a single scan pass, eliminating redundant global memory re-reads for the first 8 extraction iterations.

  • Moved partial_simd_load and partial_simd_store from buffer.buffer to linalg.utils and removed the buffer package. Update imports from from buffer.buffer import ... to from linalg.utils import ....

  • Blackwell (SM100) GPU performance:

    • Enabled the Mojo SM100 GEMM by default.
    • Added MXFP4 and MXFP8 block-scaled matmul on SM100, plus a KIND_MXF4 execution path.
    • Added a general grouped block-scaled matmul dispatch and MXFP4 support for the grouped path.
    • Enabled PDL for SM100 grouped NVFP4 / MXFP4 / MXFP8 GMM.
    • Improved the SM100 GEMV dispatcher and added GEMV split-K for GEMMs with small M and N.
    • Increased the SM100 GEMM C-tile N dispatch up to 64.
  • AMD GPU performance:

    • Added B300 support, including device-agnostic default block counts for allreduce and allgather.
    • Added a CDNA4 block-scaled MFMA wrapper.
    • Added MI355X TileTensor MHA (about +13% prefill at depth 128) and TileTensor-based AMD attention kernels generally.
    • Always enabled the gfx950 MHA prefill kernel and modernized AMD MHA/MLA decode with 16x16 MMA and FP8.
    • Added depth-512 paths for AMD RDNA GPUs and a 2-D convolution kernel for RDNA 3+ GPUs.
    • Added MXFP4 matmul and grouped matmul support on AMD.
  • Attention and state-space kernels:

    • Added sparse MLA decode (with qbf16 / FP8 KV variants) for SM100.
    • Added speculative-decoding sequence-length folding with numhead for the TP MLA decode dispatch.
    • Added gated delta-rule recurrence kernels for hybrid-attention models.
  • Expert-parallel (EP) kernels:

    • Added multi-device MO ops for EP dispatch and combine.
    • Added a grouped dynamic NVFP4 quantization kernel for MoE.
    • Added MXFP4 support to ep.dispatch and the mo.distributed.ep.dispatch.mxfp4 op.
    • Added a skip_a2a mode to EP dispatch and combine.
    • Fixed AMD GPU atomics in EP dispatch.
  • Collective communication kernels:

    • Unified the multimem and standard code paths in ReduceScatter.
    • Enabled PDL for allgather and updated ReduceScatter to use with_PDL().
    • Launched allgather kernels in parallel and set the allgather block count via a tuning table.
    • Added support for non-multiples of SIMD width in allreduce.
  • Fused transformer kernels:

    • Added a fused rope_split_store kernel and wired it into AttentionWithRope.
    • Added a fused RMSNorm + RoPE GPU kernel and a graph-compiler fusion pattern for mo.reduce.rms_norm.RoPE.
    • Added a GEMV + partial RMSNorm fusion path.

Breaking changes​

  • Removed individual KV connector CLI flags (--host-kvcache-swap-space-gb, --disk-offload-dir, --disk-offload-max-gb, --disk-offload-direct-io, --lmcache-config-file). Use --kv-connector-config with a JSON dict instead.

  • max/python/max/benchmark/benchmark_throughput.py has been deprecated and will be removed in a future MAX release.

  • Removed Dim and DimList types from buffer.dimlist. Custom kernel code using these types should migrate to IntTuple and TileLayout from the layout package.

  • Removed PreTrainedPipelineTokenizer. Use the standard pipeline tokenizer resolution path instead.

  • Moved DenoisingCacheConfig from PipelineConfig to PipelineRuntimeConfig. Update call sites that constructed PipelineConfig(denoising_cache_config=...) to set the field on PipelineRuntimeConfig instead.

  • Replaced FluxPipelineOutput and Flux2PipelineOutput with a unified DiffusionPipelineOutput. Code that imports the old output types must switch to DiffusionPipelineOutput.

  • PipelineConfig now expects a models=ModelManifest(...) configuration for multi-component pipelines (transformer, VAE, text encoders, etc.). Pipelines that previously passed individual model paths or configs at the top level must migrate to a ModelManifest.

  • max-serve now requires the MODULAR_MAX_SERVE_* prefix for environment overrides. Unprefixed environment variables previously honored by max-serve no longer apply.

Fixed​

  • Fixed MAX tools aborting at startup with std::filesystem::filesystem_error when $HOME is not traversable by the running UID (common in containerized CI where the image's build-time UID differs from the runtime UID). The config search now treats permission errors as "not found" and falls through to the next candidate. (Issue #6412)

  • Fixed enqueue_fill() taking O(N) HIP API calls for float64 buffers on AMD GPUs when the high and low 32-bit halves of the fill value differ (e.g., 2.0), reducing the call count to O(log N). (Issue #6417)

  • Fixed integer indexing into a graph tensor (e.g. x[0] on a (2, 3) tensor) failing graph compilation with 'mo.static.reshape' op input and output elements do not match. A reshape-through-slice optimization pattern was incorrectly rewriting the slice + squeeze pattern produced by integer indexing, generating a reshape whose element count did not match the input. (Issue #6440)

Mojo language​

For all the updates to the Mojo language, standard library, and tools, see the Mojo release notes.

v26.2 (2026-03-19)​

Highlights​

  • MAX now supports image generation with FLUX diffusion models (FLUX.1-dev and FLUX.2-dev), served through a new /v1/responses endpoint with the OpenResponses API. See the image generation guide to get started.

  • Significant DeepSeek improvements: added support for DeepSeekV3.2 with multi-latent attention, NVFP4 quantization support for DeepSeek-R1 (with expert parallelism), and expert parallelism now supports more than 32 local experts without requiring NVSHMEM for single-node deployments.

  • Major Blackwell (SM100) kernel optimizations, including SnapMLA for MLA decode, hardware-accelerated conv2d with TMA im2col for FLUX VAE, fused epilogues in BF16 and FP8 matmul kernels, and FP8 MMA support for MLA prefill with blockwise scaling.

Documentation​

  • Refactored the MAX Python API reference into a flat list of module pages. Each summary page organizes APIs based on conceptual groups instead of source file locations. All API members also include a direct link to the source code on GitHub.

  • Added Basic operations to the model developer guide, covering tensor arithmetic, shape manipulation, reductions, matrix operations, activation functions, and random tensor generation.

  • Added Model pipeline to the model developer guide, explaining how to connect models to MAX's serving infrastructure with inference pipelines that handle weight loading, KV cache management, request batching, and tokenization.

  • Added Image generation to the inference guide, showing how to generate images from text prompts or transform existing images using the v1/responses endpoint with FLUX models.

  • Added the Environment variables reference, documenting all configurable MAX environment variables for server settings, logging, telemetry, debugging, performance, and Hugging Face integration.

MAX models​

  • Added support for FLUX image generation models (black-forest-labs/FLUX.1-dev and FLUX.2-dev). Supports fused graph compilation, batched VAE decoding, GPU-side post-processing, and first-block caching for repeated prompts.

  • Added support for Kimi vision-language models (moonshotai/Kimi-K2.5 and Kimi-VL-A3B-Instruct). Supports multi-GPU tensor parallelism, a custom vision processor, learnable 2D position embeddings, and a tiktoken tokenizer.

  • Added support for OLMo 3 models (Olmo3ForCausalLM), for example allenai/Olmo-3-7B-Instruct.

  • Added support for Qwen3-MoE models (Qwen3MoeForCausalLM), for example Qwen/Qwen3-30B-A3B-Instruct, with multi-GPU tensor parallelism and FP8 quantization support.

  • DeepSeek improvements:

    • Added support for the DeepSeekV3.2 architecture with multi-latent attention and fused FP8 paged KV cache.
    • Added NVFP4 quantization support for DeepSeek-R1, including with expert parallelism.
    • Expert parallelism now supports more than 32 local experts and no longer requires NVSHMEM for single-node deployments.
    • Improved memory estimation for NVFP4-quantized models and EP communication buffers.
    • Added FP4 quantization support for the DeepSeek MTP speculative decoding module.
    • Various fixes: decode-only mode, missing rope_scaling config, DeepSeek-V2-Lite gather-index OOB, re-enabled multi-GPU TP for DeepSeek-V2-Lite-Chat.
  • Removed legacy Gemma 3 multimodal implementation and the MODULAR_MAX_DISABLE_GEMMA3_VISION environment variable.

  • Fixed multi-GPU tensor parallelism for GPT-OSS MoE models.

  • Common MAX models like Qwen 2.5 can now run on AMD RDNA consumer GPUs.

  • Improved Mistral3 text encoder performance by compiling hidden-state selection and eliminating redundant GPU transfers.

  • Fixed prompt validator for Qwen2.5-VL models.

  • Fixed audio generator pipeline to restore audio generation support.

  • Fixed multi-GPU NVFP4 inference for Llama3.

  • Fixed Idefics3 chat template image placeholder ordering.

  • Added MXFP4 quantization support for GPT-OSS models (such as openai/gpt-oss-20b).

MAX framework​

  • Upgraded the bundled libnvptxcompiler from CUDA 12.9 to CUDA 13.1, which requires NVIDIA GPU driver 580 or higher. This brings the latest bug fixes and performance improvements from NVIDIA's PTX compiler, as well as fully supporting new hardware like the DGX Spark and Jetson Thor.

    To use MAX and Mojo with older NVIDIA drivers and hardware, you can set the MODULAR_NVPTX_COMPILER_PATH environment variable to point to a system ptxas binary, instead of using the bundled libnvptxcompiler version.

    The Mojo DeviceContext() constructor now checks NVIDIA driver compatibility at creation time and provides a clear error message when the driver version is too old, matching the behavior of the Python Accelerator() API.

  • Runtime GPU errors now include a Python source traceback, showing where the failing operation was defined in your graph-building code. Build with MODULAR_MAX_DEBUG=True to enable source note collection; when source notes aren't available, error messages include a hint about how to enable them.

  • Added MODULAR_DEBUG_DEVICE_ALLOCATOR environment variable for debugging GPU memory issues. Set to uninitialized-poison to fill buffers with sentinel values (qNaN for floats, 0xCD for others) to detect use of uninitialized data, or out-of-bounds to enable redzone checks for buffer overflows. Accepts a comma-separated list for multiple options.
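
    A minimal sketch; the assumption is that the variable must be set before any device is initialized in the process:

    import os

    # Poison new buffers to catch uninitialized reads and add redzone checks for overflows.
    os.environ["MODULAR_DEBUG_DEVICE_ALLOCATOR"] = "uninitialized-poison,out-of-bounds"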

  • Fixed a memory leak in CUDA graph execution where output buffers were not freed between replays, causing GPU memory to grow over time during sustained inference.

  • Fixed compilation cache misses when cross-compiling GPTQ and LoRA models on machines without a GPU. Weight dtype casting now skips the actual data conversion in virtual device mode, because only compilation metadata is needed.

  • Enabled peer-to-peer device memory access for AMD HIP multi-GPU configurations, enabling direct GPU-to-GPU memory transfers on AMD hardware.

  • Fixed multi-GPU communication silently falling back to a slower transport on systems where rdma-core is installed without dev packages (common in production containers).

  • Fixed multi-GPU broadcast operations failing with "Broadcast currently requires P2P access between GPUs," due to a regression in peer-to-peer device access initialization.

  • Improved Hugging Face model downloads: gated repo errors now surface clearly instead of showing a misleading "check the repo name" message.

Inference server​

  • Added image generation support via a new /v1/responses endpoint implementing the OpenResponses API standard. Enable it by adding responses to MAX_SERVE_API_TYPES (for example, MAX_SERVE_API_TYPES='["openai","responses"]'). Currently supports FLUX diffusion models. For more information, see the image generation guide.

  • Added output_format parameter to image generation requests, allowing clients to choose JPEG, PNG, or WEBP output per request (default remains JPEG).

  • Overlap scheduling is now auto-enabled for select model architectures like LlamaForCausalLM_Legacy, and is compatible with prefix caching. This reduces CPU overhead by overlapping Python host code with GPU kernel execution. It's currently incompatible with some features such as structured outputs and CPU models. It's still experimental and you can disable it with --no-enable-overlap-scheduler --force.

  • Speculative decoding improvements:

    • Added typical-acceptance rejection sampling.
    • Added rejection-sampling-strategy option (greedy or residual) for speculative decoding. Defaults to residual; use greedy for models that pass hidden states.
    • Applied repetition/frequency/presence penalty sampling in EAGLE.
    • Enabled weight sharing between MTP draft and main model to reduce memory.
    • Added support for chunked prefill with EAGLE and MTP speculative decoding.
    • Fixed batch context length calculation for draft models.
    • Fixed Eagle penalty inputs being unconditionally applied.
  • EAGLE speculative decoding now reports the draft token acceptance rate in scheduler metrics output.

  • Added KV cache offloading: KV cache blocks can now spill from GPU to CPU memory and disk when GPU memory is full, enabling larger effective cache capacity and warm restarts. Includes LMCache integration for sharing KV cache across model instances via external storage (CPU, disk, Redis), with multi-GPU tensor parallelism support.

  • CUDA graph capture is now auto-enabled for Llama models when max_batch_size is set, reducing per-token latency. You can opt out with --no-device-graph-capture --force.

  • Added FP8 quantization support for the KV cache, reducing KV cache memory usage. Configure via --kv-cache-format float8_e4m3fn (also supports float32 and bfloat16).

  • Added configurable batch scheduling strategy for text generation via the MAX_SERVE_BATCH_PRIORITY environment variable. It defines how the scheduler prioritizes between prefill (context encoding) and decode (token generation) when constructing batches. Options: prefill_first (minimize time-to-first-token), decode_first (minimize inter-token latency), balanced (adaptive based on global queue state), or per_replica (each replica decides independently; default).

  • Diffusion models can now specify a default num_inference_steps per architecture.

  • Added --first-block-caching flag to enable first-block caching (FBCache) for diffusion models like FLUX, and --residual-threshold for the TaylorSeer caching strategy. Both are configurable via max serve and max generate.

  • Enabled logprobs in chat completion responses, returning per-token log probabilities.

  • Non-streaming requests are now cancelled when the client disconnects, preventing zombie requests from consuming KV cache memory.

  • Improved streaming performance by buffering generated tokens and detokenizing them in batches rather than one at a time, reducing CPU overhead and improving GPU utilization.

  • Improved multi-GPU AllReduce performance by launching per-device kernels in parallel async tasks instead of sequentially.

  • Fixed a server hang when a model worker process crashes before it finishes initializing.

  • Fixed per-request seed handling in TopK/TopP sampling. Seeds are now correctly applied per request instead of using a single seed for the entire batch.

  • Fixed KV cache blocks not being released after offline text generation (generate() / generate_async()), which could cause block exhaustion during sustained inference.

  • Fixed three resource leaks in the disaggregated inference decode scheduler: KV cache blocks leaked on request cancellation, replica load-balancing counter drift over time, and a KeyError crash on stale prefill responses arriving after cancellation.

max CLI​

  • Added the --device-graph-capture flag to enable CUDA graph capture for serving, reducing per-token latency by replaying recorded GPU kernel launches. Auto-enabled for Llama and DeepSeek V3; opt out with --no-device-graph-capture --force.
  • Added the --debug-verify-replay flag to run eager launch-trace verification before device graph replay, for debugging CUDA graph correctness issues.
  • Added the --kv-cache-format flag to set the KV cache data type at runtime. Accepts float32, bfloat16, or float8_e4m3fn for FP8 quantized caching.
  • Added the --lmcache-config-file flag to enable LMCache-based external KV cache tiering. Point it at an LMCache YAML config to share KV cache blocks across model instances via CPU, disk, or remote storage.
  • Added the --reasoning-parser flag to max serve to enable extraction of model thinking/reasoning content into a separate reasoning field on the OpenAI API response. Currently supports Kimi K2.5 (kimi-k2), with a registry for adding additional parsers.
  • Added the --rejection-sampling-strategy flag to select the rejection sampling method for speculative decoding. Options: greedy, residual (default for standalone), or typical-acceptance (default for EAGLE/MTP). Use greedy for models that pass hidden states.
  • max benchmark now uses the model's default temperature when none is specified.
  • max benchmark no longer overrides top_p unless the user provides a value.
  • Removed the --cache-strategy flag.

Python API​

  • Tensor.constant() is deprecated. Use the Tensor(data, dtype=..., device=...) constructor directly, matching PyTorch's torch.tensor() semantics. For example, replace Tensor.constant([1.0, 2.0]) with Tensor([1.0, 2.0]). Tensor.constant() will be removed in a future release.
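
    For example (the max.experimental.tensor import path follows the namespace layout described in the breaking changes below):

    from max.dtype import DType
    from max.experimental.tensor import Tensor

    # Before (deprecated)
    x = Tensor.constant([1.0, 2.0], dtype=DType.float32)

    # After
    x = Tensor([1.0, 2.0], dtype=DType.float32)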

  • DeviceEvent now accepts an enable_timing=True parameter to enable GPU event timing. Use start.elapsed_time(end) to measure elapsed GPU time in milliseconds between two timing-enabled events.

  • Added the prod op for computing the product of elements along an axis, available as max.graph.ops.prod, max.experimental.functional.prod, and Tensor.prod().
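
    A minimal eager sketch; the axis keyword name is an assumption based on the other reduction ops:

    from max.experimental.tensor import Tensor

    x = Tensor([[1.0, 2.0], [3.0, 4.0]])
    row_products = x.prod(axis=-1)  # -> [2.0, 12.0]
    print(row_products)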

  • Device.stats now includes graph_mem_reserved and graph_mem_used fields for device graph memory observability.

  • Module.compile() now validates weight names, dtypes, and shapes before loading, surfacing mismatches as Python errors instead of runtime crashes during asynchronous host-to-device transfers.

  • InferenceSession now automatically includes the CPU in its device list, removing the need to manually add it when graphs include host-side values.

  • Added max.graph.ops.broadcast for distributed broadcast across devices. Raises ValueError when signal_buffers is empty.

  • Added manual synchronization API (DevicePinnedBuffer, DeviceEvent) for controlling buffer readiness and reducing stream synchronization overhead.

  • Tensor.cast() is now idempotent for same-dtype casts.

  • Added F.cond to the experimental functional API for conditional execution.

  • Added fast path for Tensor.to(device) in eager mode.

  • Added Dim-based scalar dimension API to Module.compile().

  • Module is now device-aware via to() for unified device placement.

  • Module.load_state_dict() now validates weight attribute names.

  • Algebraic dims and graph/custom op construction now work without an explicit context manager, by using a global MLIR context. Threadpool-backed MAX paths now scope worker-thread MLIR usage to the default context automatically.

  • Renamed Float8Config to QuantConfig (and related types/functions) to reflect that the config now covers FP8, NVFP4, and MXFP4 quantization.

  • Renamed related public Python quantization APIs from Float8* names to Quant* names, including parse_float8_config() to parse_quant_config(), and the public quant modules in max.nn and max.pipelines.lib.

  • max.diagnostics.gpu.BackgroundRecorder's sampling interval can now be configured.

Breaking changes​

  • Reorganized max.nn namespace. The graph-based neural network API has been restored as the default max.nn namespace (previously located under max.nn.legacy). The eager module API has moved from max.nn to max.nn.module_v3. Additionally, max.tensor, max.functional, and max.random have moved back under max.experimental (max.experimental.tensor, max.experimental.functional, max.experimental.random). Update imports accordingly.

  • Moved experimental APIs under max.experimental. Two additional packages have moved under the max.experimental namespace to co-locate all experimental APIs:

    • max.torch is now max.experimental.torch. Update imports from from max.torch import CustomOpLibrary, graph_op to from max.experimental.torch import CustomOpLibrary, graph_op.

    • max.nn.module_v3 is now max.experimental.nn (the v3 suffix has been dropped). Update imports from from max.nn.module_v3 import Module, Linear to from max.experimental.nn import Module, Linear.

  • Removed PipelineConfig.max_length. The max_length parameter now resides at the model configuration level as MAXModelConfig.max_length (accessible as config.model.max_length). This change correctly places the parameter at the model level since it describes model capacity (maximum sequence length the model can process), not pipeline runtime behavior. Update all configurations and code to use model.max_length instead of the removed max_length field at the pipeline level.

  • PipelineModel no longer accepts the encoding parameter. The encoding parameter has been removed from PipelineModel.__init__ and all subclasses. The encoding is now automatically inferred from pipeline_config.model.quantization_encoding. This change eliminates redundant parameter passing and ensures a single source of truth for quantization encoding configuration.

  • Device-graph APIs now require explicit caller-provided graph keys for capture/replay/verification. Update calls from model.capture(*inputs), model.replay(*inputs), and model.debug_verify_replay(*inputs) to model.capture(graph_key, *inputs), model.replay(graph_key, *inputs), and model.debug_verify_replay(graph_key, *inputs).

  • Removed q_max_seq_len from KVCacheParams; the value is now accepted via graph capture instead.

  • MAXBaseModel now uses extra=forbid and strict=True; configs with unknown fields will be rejected.

  • Replaced disable_auto_sync/mark_as_ready with DevicePinnedBuffer and DeviceEvent for pinned memory management.

MAX kernels​

  • Blackwell (SM100) GPU performance:

    • Optimized Attention on SM100 by skipping unnecessary softmax corrections when the row maximum change is small.
    • Fused epilogue into SM100 BF16 and FP8 matmul kernels.
    • Improved SM100 FP8 matmul dispatch for small M shapes (M <= 128).
    • Fixed matmul kernel dispatch on SM100.
    • Added SM100 hardware-accelerated conv2d with TMA im2col and fused residual epilogue for FLUX VAE.
    • Added batched BF16 matmul support for SM100.
    • Added SnapMLA implementation for SM100 MLA decode.
    • Added FP8 tensorwise and block-scale MLA decode for SM100/B200.
    • Added FP8 MMA support for MLA prefill with blockwise scaling and K RoPE.
    • Enabled MLA attention for SM100 GPUs.
    • Enabled 64x256 N split MMA for B200 MLA decode (long context).
    • Used TMA for KV scale loads in attention kernels (SM100).
  • AMD GPU kernel improvements:

    • Tuned and optimized GEMV split-K BF16 dispatch and kernel for AMD GPUs.
    • Enabled FP8 GEMV kernel on AMD GPUs.
    • Reduced K buffer bank conflicts in MHA prefill on AMD via swizzle.
    • Integrated AMD pingpong kernel with FP8 dispatch and fixed TP > 1.
    • Fixed out-of-bounds masking and depths > 256 on AMD RDNA GPUs.
    • Enabled rocSHMEM GDA backend with TCP bootstrap for multi-node AMD EP.
  • Grouped matmul improvements (SM100):

    • Added MMA_N=64 support for 1D1D block-scaled grouped matmul.
    • Added 2SM support to structured 1D1D grouped matmul kernel.
    • Enabled swapAB for block-scaled grouped matmul and block-scaled matmul on SM100.
    • Added tensor scale factor to block-scaled 1D1D grouped matmul.
    • Added bf16 scales support to blockwise FP8 grouped matmul.
  • DeepSeek kernel optimizations:

    • Added BF16 MLA prefill/decode mega-kernel.
    • Enabled BF16 graph execution path for Multi-Latent Attention.
    • Enabled fused QKV projection for latent attention with RoPE.
    • Fused RoPE and RMSNorm into MLA custom ops.
    • Fused epilogue operations in DeepSeek BF16 matmul kernels.
    • Added fused dispatch and combine kernels for expert parallelism.
    • Enabled Mojo BF16 matmul kernels and FP4 kernels for DeepSeek shapes.
    • Fixed blockwise FP8 batched matmul for non-row-major layouts.
  • Multi-GPU distributed ops:

    • Added fused allreduce + RMSNorm + FP8 kernel with residual path and 2-stage allreduce for tensor-parallel workloads.
    • Added distributed scatter graph op for multi-GPU DP>1 inference.
    • Fixed and optimized broadcast kernel for BF16/FP16 with multimem on GPU.
    • Fixed and optimized 2-stage broadcast kernel for multi-GPU.
  • FLUX kernel improvements: Autotuned cuDNN convolution algorithm selection and cached results. Added multi-block GroupNorm GPU kernel. Enabled high-performance Mojo matmul kernels for FLUX.2. Fixed grouped conv2d on GPU incorrectly ignoring the num_groups parameter.

  • kbench now runs benchmarks via shared library (.so) by default, reusing persistent workers and CUDA contexts instead of spawning subprocesses. The benchmark execution phase is ~10x faster (for example, 4.25 h → 0.4 h on a tuning workload). It falls back to subprocess mode when profiling or using custom exec wrappers.

  • Added MXFP4 dequant and matmul kernels.

  • Optimized FP4 matmul dispatch for Llama-style shapes and added FP4 GEMM dispatch configs for additional shape coverage.

  • Used asynchronous FP4 quantization kernel for improved throughput.

  • Optimized Hopper matmul for M=256 and small M shapes via swapAB.

  • Improved GEMV kernel performance. Integrated Flash Infer TopK kernel for improved sampling performance.

  • Improved layer normalization kernel performance.

  • Added FP8 support to FlashMLA decode kernel.

  • Fixed FP8 cast lambda epilogue in matmul.

  • Fixed NaN in MLA decode split-K kernel with causal masking.

  • Fixed warpgroup deadlock in MLA decode that could cause hangs on DeepSeek models.

  • Fixed incorrect MoE expert routing caused by bitonic sort merge direction bug.

  • Fixed int8 matmul dispatch on ARM64.

  • Fixed Metal buffer tracking for sub-buffers and tensor slices on Apple Silicon.

Mojo language​

For all the updates to the Mojo language, standard library, and tools, including all GPU programming and Layout/LayoutTensor changes, see the Mojo changelog.

v26.1 (2026-01-29)​

Highlights​

The eager-style Tensor and Module APIs are now the primary API for model development, providing a PyTorch-like development experience:

from max import functional as F
from max.tensor import Tensor
from max.dtype import DType

x = Tensor.constant([1.0, -2.0, 3.0, -4.0, 5.0], dtype=DType.float16)
y = F.relu(x)
print(y)
# Tensor([1 0 3 0 5], dtype=DType.float16, device=Device(type=gpu,id=0))

If you want explicit control over the graph structure, you can still build models with the Graph APIs.

For more details, see the model developer guide.

Documentation​

MAX models​

  • Gemma3 now supports vision input (multimodal) in the 12B and 27B variants, including support for local file paths and structured output. Learn more in the image to text guide.

  • Added Qwen/Qwen3-VL-4B-Instruct and Qwen/Qwen3-VL-2B-Instruct model architectures.

  • Removed Llama 3.2 Vision (Llama-3.2-11B-Vision-Instruct) architecture support. Use other vision models such as Pixtral, InternVL, Qwen2.5-VL, and Gemma3.

MAX framework​

  • All Python wheels are now hosted at https://whl.modular.com/nightly/simple/. If using uv, change --index-url to --index, and if using pip, change to --extra-index-url. For precise commands, see the install guide.

Inference server​

  • Improved scheduling to achieve higher KVCache utilization and batch sizes. By default, MAX now schedules a context encoding (CE) request only if KVCache memory is less than 95% full after allocating blocks for that request or if no active requests exist. You can adjust this watermark value (0.95) with --kvcache-ce-watermark. Beware that increasing it causes more preemptions.

  • When running models with data parallelism (DP), the semantics of max batch size have changed. For example, with --data-parallel-degree 8 and --max-batch-size 32, this previously meant that each data-parallel replica could have at most 4 requests, for an aggregate max batch size of 32. The CLI flag now specifies the max batch size per replica, so the aggregate max batch size for the values above is 8*32=256 requests. This aligns with vLLM and other inference engines.

  • --max-ce-batch-size is now deprecated. The cap on batch size is now uniform between context encoding and token generation phases of text generation. Use --max-batch-size instead.

  • The API server now returns chunked tokens from the model worker, reducing overhead and significantly improving throughput for small models and decode-heavy workloads.

  • Server stats collection (collect_server_stats) is now enabled by default for serving benchmarks.

max CLI​

  • The max generate command now applies the model's chat template internally when using --prompt. This more closely aligns with how users typically prompt a model for testing and ensures special tokens are properly filtered from output.

  • Added tracing flags to max benchmark for nsys profiling:

    • --trace: Enable tracing of the benchmark run (currently NVIDIA GPUs only)
    • --trace-file: Path to save the trace file
    • --trace-session: Optional session name for tracing

    Requires the server to be run under nsys launch. Using --gpu-profiling detailed is recommended.

Python API​

  • The eager-style Tensor APIs are now the primary API for model development, providing a PyTorch-like development experience.

    We moved the eager-style tensor APIs out of experimental and reorganized the max.nn module to make the eager module system the primary API (nn.module_v3 is now nn.module).

    The previous max.nn components are still available for backward compatibility in max.nn.legacy.

  • Renamed max.driver.Tensor to max.driver.Buffer to clarify that it represents a low-level memory buffer, not a tensor. The max.tensor.Tensor class remains the primary tensor type.

  • Added a forward() method to Module to compute the output; it behaves the same as invoking the object as a callable (the __call__() method).

  • accelerator_count() now returns a non-zero value when called on an Apple silicon system. This means you can use this code:

    device = CPU() if accelerator_count() == 0 else Accelerator()

    This defaults to using the available Apple silicon GPU. As a consequence, MAX graphs should in most cases be dispatched to run on Apple silicon GPUs. Note that most MAX models do not yet work on Apple silicon GPUs due to missing hardware-specific kernel pathways and other support, but this is an important step towards enabling MAX more broadly on Apple silicon GPUs.

  • Added max.nn.module.rope containing rotary embedding implementations, RotaryEmbedding and TransposedRotaryEmbedding.

  • Added ArchConfig and ArchConfigWithKVCache. Going forward, models that register with the MAX architecture registry must define a config that implements this protocol.

  • Added ops.complex.mul for multiplying complex-valued tensors.

  • Added calculate_virtual_device_count(), calculate_virtual_device_count_from_cli(), load_max_buffer() to max.driver.

  • Added TokenBuffer for token management.

  • Renamed prefill_chunk_size to max_batch_input_tokens and max_batch_context_length to max_batch_total_tokens in PipelineConfig and TTSConfig classes to better reflect their purpose in batch memory management.

    The corresponding CLI flags have also been renamed: --prefill-chunk-size is now --max-batch-input-tokens and --max-batch-context-length is now --max-batch-total-tokens.

  • Fixed max.driver.Buffer.to(stream) so it does not copy (it returns a reference to the same buffer) when the stream is on the same device, even for GPU-pinned host memory.

  • Removed deprecated max.nn convolution classes: Conv2dV1, Conv1DV1, Conv3DV1. Use Conv2d, Conv1D, Conv3D instead.

  • Removed deprecated max.nn layer classes: LinearV1, QLinearV1, GPTQLinearV1, MLPV1, EmbeddingV1, LayerNormV1, RMSNormV1. Use Linear, GPTQLinear, MLP, Embedding, LayerNorm, RMSNorm instead.

  • Removed max.engine.MojoValue.

  • Removed the deprecated custom_ops_path parameter from InferenceSession.load(). Instead use the custom_extensions parameter.

  • Added graph.ops.shard_and_stack().

  • Removed unused graph.weights.PytorchWeights.

MAX kernels​

  • Improved performance for Hopper matmul when using skinny M shapes. In particular, when M is between 2 and 64, we see performance boosts of 10% to 40% for specific shapes.

  • Added a swapAB optimization to Hopper matmul, which performs B x A and does a transposed write to C. This helps when you need more granularity in the M dimension.

  • Refined create_stream API: all streams are now non-blocking (blocking argument has been removed). Explicitly use DeviceEvent and synchronize() wherever necessary.

Mojo language​

For all the updates to the Mojo language, standard library, and tools, including all GPU programming and Layout/LayoutTensor changes, see the Mojo changelog.

v25.7 (2025-11-20)​

Highlights​

Documentation​

  • New online book to build an LLM from scratch with MAX, using our experimental model APIs. This is a guided lesson on building GPT-2 with our Python API, explaining each component of the transformer model along the way. Like the Python APIs, the book is a work in progress; please report any issues in GitHub.

  • All the planned parts of GPU Puzzles are now complete! Support for Apple silicon GPUs is also making steady progress.

  • Tutorials on docs.modular.com are now integrated into the Guides section, indicated with a book icon in the left navigation.

  • The max CLI docs are now generated from the CLI source.

MAX models​

  • Gemma3 now supports logprobs.

MAX framework​

  • Added support for bfloat16 models running on GPUs with ARM-based CPU hosts, such as Grace Hopper (GH200) and Grace Blackwell (GB200).
  • Updated minimum NVIDIA GPU driver requirement to 580.

max CLI​

  • max benchmark can now run LoRA benchmarking for supported models and target modules.

  • max benchmark --collect-gpu-stats can now collect AMD GPU statistics.

  • max serve --do-penalties was renamed to --enable-penalties and enabled by default. To disable penalties, you can specify --no-enable-penalties.

Python API​

  • Added support for Python 3.14.

  • Removed support for Python 3.9.

  • All MAX Python API modules are now open-sourced. In addition to those previously released, we've added driver, dtype, engine, experimental, interfaces, kv_cache, mlir, nn, profiler, support, torch, and more in our GitHub repo.

  • Added the max.profiler module with the Tracer class to create and manage profiling spans based on runtime conditions, and the @traced() decorator to profile a whole function.
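
    A minimal sketch, assuming Tracer works as a context manager as shown; the function and span names are illustrative:

    from max.profiler import Tracer, traced

    @traced()
    def tokenize(texts: list[str]) -> list[list[str]]:
        # The whole function is recorded as a single profiling span.
        return [t.split() for t in texts]

    def run(texts: list[str]) -> list[list[str]]:
        # Tracer as a context manager wraps this block in a named span.
        with Tracer("run.tokenize"):
            return tokenize(texts)

    print(run(["hello world", "profiling spans"]))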

  • Added max.diagnostics.gpu APIs to expose common GPU statistics as might be reported by nvidia-smi or rocm-smi.

  • Added the max.kv_cache package, which provides APIs to manage key-value caches used in transformer models. Not to be confused with the existing max.nn.kv_cache package that includes kernels for KV caching.

  • Removed the KVCacheManager class and combined it with the single PagedKVCacheManager implementation. During the merge, prefetch() was renamed to maybe_reserve().

  • Added NullKVCacheManager for compile-only mode, which avoids GPU memory allocation when compiling models without a physical GPU present.

  • Added ResetPrefixCacheBackend and ResetPrefixCacheFrontend classes for coordinating prefix cache resets between frontend and backend components.

  • Added more APIs for text-to-speech (TTS) models such as AudioGenerationInputs and AudioGenerationOutput.

  • Changed LoRAConfig.max_num_loras default to 1 (was 100).

  • New RequestID class replaces previous type alias to provide better type safety and consistency across the API.

  • Removed InputContext and replaced it with the modality-output specific TextGenerationContext and EmbeddingsContext.

  • Added ImageMetadata and VLMTextGenerationContext.

  • Added max.nn.comm with Allreduce and Signals for peer-to-peer communication in allreduce.

  • ops.gather() no longer has a default axis; the axis must be specified explicitly (better matching PyTorch and NumPy).
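
    A minimal sketch of the now-required axis during graph construction; the graph name, shapes, dtypes, and device are illustrative:

    from max.dtype import DType
    from max.graph import DeviceRef, Graph, TensorType, ops

    with Graph(
        "gather_example",
        input_types=[
            TensorType(DType.float32, shape=[16, 8], device=DeviceRef.CPU()),
            TensorType(DType.int64, shape=[4], device=DeviceRef.CPU()),
        ],
    ) as graph:
        table, ids = graph.inputs
        rows = ops.gather(table, ids, axis=0)  # the axis is now required
        graph.output(rows)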

  • Graph.add_subgraph() has been updated to take a devices argument. This allows subgraphs to take advantage of device-aware work scheduling.

Mojo API​

  • Renamed the tensor_internal package to tensor and removed the previous tensor stub. The API behaves the same, but the Mojo tensor docs moved.

Mojo language​

For all the updates to the Mojo language, standard library, and tools, including all GPU programming and Layout/LayoutTensor changes, see the Mojo changelog.

v25.6.1 (2025-10-10)​

Fixes a latency regression due to a top-k algorithm change and a couple other benchmarking bugs.

v25.6 (2025-09-22)​

Highlights​

  • MAX delivers state-of-the-art performance on NVIDIA Blackwell (B200)!

    We've been describing our Blackwell bring-up over a series of blog posts, and we recently published Part 4: Breaking SOTA, in which we share our latest matmul benchmarks compared to NVIDIA's cuBLAS library.

  • MAX provides industry-leading performance on AMD MI355X!

    In a matter of weeks, we got MAX running on the brand new MI355X system and have already produced early benchmarks that go head-to-head with Blackwell. If you have access to an MI355X, you can try it yourself today by following our quickstart guide.

  • Benchmarking endpoints is easier than ever with the new max benchmark command, which accepts YAML configuration files so you can easily share and reproduce your benchmarks.

Documentation​

  • Our new quickstart guide lets you pick the model architecture and size you want, and then shows you how to deploy it and run our open-source benchmarking script, all from the max CLI.

  • We updated and simplified the benchmarking tutorial to use the new max benchmark command.

MAX models​

MAX framework​

  • Added device-aware work scheduling for AsyncRT: work items can now specify a deviceHint to route execution to specific worker threads based on device affinity, improving multi-device performance.

  • Improved code quality by enabling a large set of Ruff lints, including flake8-annotations (ANN), which now enforces Python type annotations for new contributions.

Inference server​

  • Added support for data parallelism in Llama models. To enable this feature, use the --data-parallel-degree option:

    max serve --model $MODEL_ID --data-parallel-degree 2 --devices gpu:0,1
  • Metrics for each context encoding and token generation batch are now logged to the console periodically. You can override the default frequency (3 seconds) by setting the MAX_SERVE_SCHEDULER_STATS_LOG_INTERVAL_S environment variable. For example, setting MAX_SERVE_SCHEDULER_STATS_LOG_INTERVAL_S=0 logs metrics for every batch.

  • Improved error messages when pulling a model that requires more RAM than what's available or when there won't be enough RAM left for the KV cache.

max CLI​

  • Added the max benchmark subcommand that runs a suite of benchmarks and collects performance metrics on a model server. This command provides convenient packaging/installation for our open-source benchmark_serving.py script and accepts all the same options.

  • Added --chat-template to the CLI for passing a custom chat template defined in a Jinja2 template file.

  • Renamed the --allow-safetensors-weights-float32-to-bfloat16-cast flag to --allow-safetensors-weights-fp32-bf16-bidirectional-cast, which supports automatic bidirectional dtype casts when needed.

  • The max generate command now supports --top-k, --temperature, and --seed flags.

  • Changed --num-warmups behavior. Previously, it ran the model on the prompt N times, generating until reaching a stop condition each time. Now it runs the model for N steps, generating N new tokens as a warmup.

  • Added the --model option as a preferred alternative to --model-path. They behave the same.

  • Deprecated --pad-to-multiple-of.

  • Removed the previously deprecated --model-name. Use --served-model-name instead.

Python API​

  • Removed the previously deprecated KVCacheStrategy.CONTINUOUS and all associated classes (including ContinuousBatchingKVCacheManager).

  • Added ops.fence, a pure identity operation that prevents the async runtime from reordering operations across it. This operation is essential for implementing cross-device synchronization.

  • Removed PipelineConfig.max_new_tokens. Use SamplingParams.max_new_tokens instead.

  • Added logits_processor to SamplingParams for updating logits in-place during each step of token generation.

  • Added generate() to TextGenerationPipeline and StandaloneSpeculativeDecodingPipeline, a convenience method for getting text generations. generate_async() is available for getting streamed outputs.

  • Renamed the target_num_new_tokens configuration parameter to prefill_chunk_size in PipelineConfig and TTSConfig classes to better reflect its role in chunked prefill operations.

  • Fixed ops.range to respect the dtype parameter when using Dim objects as inputs. Previously, the dtype was ignored and defaulted to int64.

  • Made the devices argument in InferenceSession() required. To maintain the previous default behavior, use InferenceSession(devices=[CPU()]).
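
    A minimal sketch that selects a GPU when one is available and otherwise keeps the previous CPU default:

    from max.driver import CPU, Accelerator, accelerator_count
    from max.engine import InferenceSession

    devices = [Accelerator()] if accelerator_count() > 0 else [CPU()]
    session = InferenceSession(devices=devices)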

  • Added an optional logging argument to InferenceSession(). When set to "op", this option enables operation launch output to stderr.

  • Added max.nn.lora, providing Low-Rank Adaptation (LoRA) support for parameter-efficient fine-tuning of neural network models.

  • Added max.nn.moe, implementing Mixture of Experts (MoE) layers for scalable model architectures.

  • Added max.nn.sampling, containing advanced sampling methods including MinP and rejection sampling techniques.

  • Added max.nn.hooks, providing debugging and inspection hooks for neural network layers.

  • Added attention submodules max.nn.attention.mask_config, max.nn.attention.multihead_attention, and max.nn.attention.multi_latent_attention for comprehensive attention mechanism configuration and implementation.

  • Moved some Mojo-related functionality to a new top-level mojo Python namespace. Specifically, max.mojo (previously used for Mojo-Python interop), some of max.support, and max.entrypoints.mojo now live under the mojo namespace and are provided in the new mojo package.

MAX kernels​

  • Added a leaky ReLU activation function kernel.

  • Added a specialized RMS norm function kernel for the common case of cols=128, bfloat16.

Mojo language​

For all the updates to the Mojo language, standard library, and tools, including all GPU programming changes, see the Mojo changelog.

v25.5 (2025-08-05)​

Highlights​

  • OpenAI-compatible batch API: The /v1/batches API is now available with Mammoth.

    We recently announced a partnership with SF Compute to make this API available through their dynamic GPU pricing marketplace. Their Large Scale Inference Batch API looks different from the /v1/batches API in Mammoth because it's a superset.

  • New mojo Conda package: For Mojo-specific projects that run on CPUs and GPUs, you can now install the bare essentials with the mojo Conda package that's less than 900 MB on disk. For example, this now works:

    pixi add mojo

    The mojo Python package is not available for pip/uv yet.

    For a complete model-development and serving toolkit, you should still install the modular package (which includes mojo as a dependency).

  • Open-source graph APIs: We've added the max.graph Python APIs to our GitHub repo. We've made great strides in recent months to simplify these APIs that help you build high-performance models you can serve with MAX.

Documentation​

MAX models​

MAX framework​

  • Removed all torch package dependencies.

    • Reduces the total installation size of modular (including dependencies) from 2.2 GB for CPUs and 6.5 GB for GPUs down to 1.5 GB, for all Python packages. Conda packages pull additional system dependencies so sizes may vary, but one example brings the size down from 9.8 GB to 2.0 GB.

    • pip install no longer requires the --extra-index-url https://download.pytorch.org/whl/cpu option (which was to avoid installing the GPU version of torch that has a lot of CUDA dependencies).

    • uv pip install no longer requires the --index-strategy unsafe-best-match option (which was to avoid package resolution issues with the above --extra-index-url option).

  • Removed HuggingFace fallback for model pipelines not natively supported in MAX (PipelineEngine.HUGGINGFACE), because it's almost never used and it creates significant tech debt.

Inference server​

  • Added the /health endpoint for service readiness checks, used by tools like lm-eval to determine when the service is ready to accept requests.

  • Prefix caching now uses a Mojo token hashing operation. Previously we used the hash() method from the Python stdlib. However, this resulted in noticeable CPU overhead and reduced GPU utilization. In this release, we migrated the token hashing operation to an accelerated Mojo implementation.

  • Re-implemented the OpenAI API's logprobs and echo request parameters to eliminate an expensive device transfer. The --enable-echo flag, which previously incurred a significant performance penalty, is now 9-12x faster.

  • Added support for file:// URIs in image inputs for multimodal models. Local file access is controlled via the MAX_SERVE_ALLOWED_IMAGE_ROOTS environment variable, which specifies a list of allowed root directories. Files are read asynchronously using aiofiles for better performance under high load.

  • Improved function calling (tool use) to more reliably extract JSON tool calling responses for Llama models in an OpenAI-compatible format.

  • Switched from XGrammar to llguidance for generating structured output (constrained decoding).

max CLI​

  • Added --vision-config-overrides CLI option to override vision model configuration parameters. For example, to decrease InternVL's maximum dynamic patches from 12 to 6:

    max serve --model-path OpenGVLab/InternVL3-38B-Instruct \
      --vision-config-overrides '{"max_dynamic_patch": 6}'
  • Removed the --ignore-eos CLI argument. The full set of OpenAI chat and completion sampling parameters is now supported in HTTP requests, so this parameter can simply be set via the HTTP payload.

Python API​

  • Added the max.interfaces module. This module is intended to be a relatively import-free home for all shared interfaces across the MAX stack, and we will gradually move more common interfaces into it. So far, we've moved the following from max.pipelines.core (see the import sketch after this list):

    • Moved TextGenerationStatus, TextResponse, TextGenerationResponse, InputContext, and PipelineTask into max.interfaces.

    • Moved all TokenGeneratorRequest-prefixed objects into max.interfaces and renamed with the TextGenerationRequest prefix.

    • Renamed TextGenerationStatus to GenerationStatus.

    • Consolidated TextResponse and TextGenerationResponse into TextGenerationOutput.

    • Renamed EmbeddingsResponse to EmbeddingsOutput.
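
    A before-and-after import sketch covering the moves and renames above (only names listed in this release are shown):

    # Before
    from max.pipelines.core import (
        TextGenerationStatus,
        TextResponse,
        PipelineTask,
    )

    # After
    from max.interfaces import (
        GenerationStatus,        # was TextGenerationStatus
        TextGenerationOutput,    # replaces TextResponse / TextGenerationResponse
        PipelineTask,
    )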

  • Added ops.scatter_nd operation for scattering updates into a tensor at specified indices.

  • Added ops.avg_pool2d and ops.max_pool2d.

  • Added max.torch.graph_op interface to make it simple to embed larger MAX computations and models inside PyTorch. These can use max.nn modules internally and may be used within torch.nn modules, allowing the use of MAX subcomponents for access to our high performance graph compiler and Mojo kernel library.

    import torch
    import numpy as np
    import max.graph
    import max.torch
    from max.dtype import DType
    from max.graph import ops
    
    @max.torch.graph_op
    def max_grayscale(pic: max.graph.TensorValue):
        scaled = pic.cast(DType.float32) * np.array([0.21, 0.71, 0.07])
        grayscaled = ops.sum(scaled, axis=-1).cast(pic.dtype)
        # max reductions don't remove the dimension, need to squeeze
        return ops.squeeze(grayscaled, axis=-1)
    
    @torch.compile
    def grayscale(pic: torch.Tensor):
        output = pic.new_empty(pic.shape[:-1])  # Remove color channel dimension
        max_grayscale(output, pic)  # Call as destination-passing style
        return output
    
    device = "cuda" if torch.cuda.is_available() else "cpu"
    img = (torch.rand(64, 64, 3, device=device) * 255).to(torch.uint8)
    result = grayscale(img)
  • Moved AlgebraicDim, Dim, StaticDim, and SymbolicDim out of max.type and into max.graph.dim. You can still import them directly from max.graph.

  • Moved Shape out of max.type and into max.graph.shape. You can still import it directly from max.graph.

  • Removed the ability to pass Python objects into models and have them returned as Mojo PythonObject types in the kernels.

  • Removed RandomWeights.

  • Removed Model.execute_legacy(). Instead use the standard execute() or __call__() methods.

  • Removed TorchScript-related helper functions and APIs, including support for .pt TorchScript files in custom extensions.

Mojo language​

For all the updates to the Mojo language, standard library, and tools, including all GPU programming changes, see the Mojo changelog.

v25.4 (2025-06-18)​

✨ Highlights​

  • AMD GPUs are officially supported!

    You can now deploy MAX with acceleration on AMD MI300X and MI325X GPUs, using the same code and container that works on NVIDIA GPUs. For the first time, you can build portable, high-performance GenAI deployments that run on any platform without vendor lock-in or platform-specific optimizations.

    For more details, including benchmarks, see our Modular + AMD blog post.

  • Now accepting GPU kernel contributions

    Last month, we open-sourced the code for the CPU and GPU kernels that power the MAX framework, and now we're accepting contributions! For information about how to contribute and the sort of kernels most interesting to us, see the MAX AI kernels contributing guide.

  • Preview: Mojo interoperability from Python

    This release includes an early version of a new Python-to-Mojo interoperability API. You can now write just the performance-critical parts of your code in Mojo and call them from Python just like you're importing another Python library. Check out our docs to call Mojo from Python.

Documentation​

We've redesigned builds.modular.com and docs.modular.com with a unified top navigation bar so you can more easily discover all the available docs and code resources.

New docs:

Major updates:

MAX models​

  • Added the OLMo 2 model architecture (olmo2).

    Try OLMo 2 now.

  • Added Google's Gemma 3 multimodal model architecture (gemma3multimodal).

    Try Gemma3 now.

  • Added the Qwen 3 model architecture (qwen3).

    Try Qwen3 now.

  • Added the InternVL3 model architecture (internvl). This is still a work in progress.

  • GGUF-quantized Llamas (q4_0, q4_k, and q6_k) are now supported with the paged KVCache strategy.

MAX framework​

Inference server​

  • Inflight batching no longer requires chunked prefill.

  • Expanded token sampling logic, including top_k, min_p, min_new_tokens, and temperature.

  • Extended sampling configuration to be per-request; for example, different requests can ask for different sampling hyperparameters.

  • Removed support for TorchScript and torch MLIR models.

max CLI​

  • Added the --use-subgraphs flag to max generate to allow for the use of subgraphs in the model.

  • Added the --port option to specify the port number with the max serve command.

Python API​

  • Lots of new APIs in the max.nn package.

  • Added the max.mojo.importer module to import Mojo code into Python. See the docs for calling Mojo from Python.
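
    A hedged sketch of the import flow; my_kernels.mojo is a hypothetical Mojo file next to the script, and add_one is an illustrative exported function:

    import sys

    import max.mojo.importer  # noqa: F401  (installs the Mojo import hook)

    sys.path.insert(0, "")  # let the hook find .mojo files in the working directory

    import my_kernels  # noqa: E402  (compiled from my_kernels.mojo on first import)

    print(my_kernels.add_one(41))  # hypothetical Mojo fn exported to Python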

  • Added Graph.add_subgraph() to allow for the addition of a subgraph to a graph.

  • Added Module.build_subgraph() to allow for the creation of a subgraph for a layer that inherits from Module.

  • Added the call op which allows for the execution of a subgraph.

  • Added the fold op for combining sliding blocks into a larger tensor.

  • Added KernelLibrary as an argument type for the Graph constructor.

  • Added QuantizationConfig to specify quantization parameters for ops such as qmatmul().

  • Added the strict argument to the Module.load_state_dict() method. When strict=True (default), an error is raised if the state_dict contains unused keys. When strict=False, extra keys are ignored. This helps model developers identify missing implementations in their models.

  • Added audio generator APIs for text-to-speech models (such as AudioGenerator, PipelineAudioTokenizer, TTSContext, and others). This is still a work in progress.

  • The ops.masked_scatter() function now requires naming the out_dim explicitly as it is data-dependent. For example:

    ops.masked_scatter(
        inputs_embeds, video_mask, video_embeds, out_dim="unmasked_inputs"
    )
  • Deprecated the CONTINUOUS KVCache strategy (KVCacheStrategy). Please use the PAGED KVCache strategy instead.

  • Removed the Settings argument from LLM constructor. The server is now automatically configured in the background without consuming an HTTP port.

  • Removed Graph.unique_symbolic_dim().

  • Removed max_to_torch_type() and torch_to_max_type() and replaced them with DType.to_torch() and DType.from_torch(), respectively. This aligns with the corresponding NumPy methods.
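
    For example, assuming the enum members shown here:

    import torch
    from max.dtype import DType

    assert DType.bfloat16.to_torch() is torch.bfloat16
    assert DType.from_torch(torch.float32) is DType.float32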

  • Removed stats_report property and reset_stats_report method from InferenceSession. This functionality was primarily used for internal PyTorch debugging and is no longer needed.

  • Removed the naive KVCache (nn.kv_cache.naive_cache).

  • Removed nn.attention and nn.naive_attention_with_rope.

  • Renamed ops.select to ops.where. This matches the name of the similar operation in torch and numpy.
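
    A minimal sketch of the renamed op inside graph construction; the graph name, shapes, and device are illustrative:

    from max.dtype import DType
    from max.graph import DeviceRef, Graph, TensorType, ops

    ttype = TensorType(DType.float32, shape=[8], device=DeviceRef.CPU())

    with Graph("where_example", input_types=[ttype, ttype]) as graph:
        x, y = graph.inputs
        result = ops.where(ops.greater(x, y), x, y)  # formerly ops.select
        graph.output(result)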

Mojo API​

  • LayoutTensor now has a size method to get the total number of elements.

  • Following our previous deprecation of the Mojo max.driver, max.graph and max.engine APIs, we've removed them from the package and API docs.

    As a result, we've also removed the Mojo max.tensor APIs (including Tensor, TensorShape, and TensorSpec). You can replace any use with LayoutTensor.

Custom ops​

  • Improved error messages when custom op parameters are provided with values that don't have the proper type.

  • The ops.custom() function now requires a device argument to specify where the operation should execute. This avoids the need for custom ops to infer their execution device, which can be error-prone.

  • Added the max.torch module with the CustomOpLibrary class for using custom Mojo kernels from PyTorch. For example, with a custom grayscale operation written in Mojo:

    @register("grayscale")
    struct Grayscale:
        @staticmethod
        fn execute[
            # The kind of device this is running on: "cpu" or "gpu"
            target: StaticString,
        ](
            img_out: OutputTensor[dtype = DType.uint8, rank=2],
            img_in: InputTensor[dtype = DType.uint8, rank=3],
            ctx: DeviceContextPtr,
        ) raises:
            ...

    You can load it with PyTorch like so:

    import torch
    from max.torch import CustomOpLibrary
    
    op_library = CustomOpLibrary("path/to/custom.mojopkg")
    
    @torch.compile
    def grayscale(pic):
        result = pic.new_empty(pic.shape[:-1])
        op_library.grayscale(result, pic)
        return result
    
    img = (torch.rand(64, 64, 3) * 255).to(torch.uint8)
    result = grayscale(img)

    See our tutorial to write custom ops for PyTorch, and our PyTorch custom operation examples, which range from a very basic "hello world" to the replacement of a layer in a full model.

GPU programming​

  • Full support for AMD CDNA3 datacenter GPUs is now available! Specifically, MI300X and MI325X.

  • Added initial support for programming on AMD RDNA3 consumer GPUs. Basic tuning parameters have been specified for AMD Radeon 780m integrated GPUs. (AMD RDNA3 support is for GPU programming only; AI models are still missing some GPU kernels for this architecture.) For details, see the GPU requirements.

  • Now accepting CPU and GPU kernel contributions. See the MAX AI kernels contributing guide.

Mojo language​

For all the updates to the Mojo language, standard library, and tools, see the Mojo changelog.

v25.3 (2025-05-06)​

✨ Highlights​

  • You can now install Modular APIs and tools with pip:

    pip install modular \
      --index-url https://download.pytorch.org/whl/cpu

    This installs the max CLI, max Python library, mojo CLI, and Mojo libraries. However, the Mojo LSP and debugger are currently not included.

    We use the --index-url argument to ensure that torch installs its CPU dependencies only, thus avoiding a lot of unnecessary GPU packages. This is a temporary workaround until we can remove our dependency on torch.

  • We open-sourced the MAX AI kernels and the rest of the Mojo standard library!

The MAX AI kernels library is a new Mojo API for writing high-performance and portable programs across CPU and GPU, but it's also the source code for our CPU/GPU kernels. You can now see the Mojo code we use in MAX to power GenAI workloads on CPUs and GPUs.

Just like the Mojo standard library, these kernels are open source under the Apache 2.0 License with LLVM exceptions. Plus, the rest of the Mojo standard library is also now open source in GitHub.

  • Learn to program GPUs with Mojo GPU Puzzles!

    This is a brand new site that offers a hands-on guide to mastering GPU programming with Mojo. Starting from basic concepts, you'll learn step-by-step how to program for GPUs by solving increasingly challenging puzzles.

Documentation​

We've restructured the documentation to unify MAX and Mojo documentation under the Modular Platform. We believe this improves content discovery with a simplified navigation and helps unify the platform story as a whole.

We've also added the following new docs:

  • REST API reference: Although it's not a new API (our serving library has supported OpenAI APIs for the last few versions), this now shows precisely which endpoints and body parameters we support.

  • Speculative decoding: An introduction to using speculative decoding to reduce latency for LLMs. This feature is still in development.

  • Offline inference: An introduction to our Python API for running inference with an LLM locally (without sending requests to a serving endpoint).

  • Introduction to layouts: A guide to working with dense multidimensional arrays on CPUs and GPUs, using new Mojo layout types that abstract-away complex memory layout patterns.

max CLI​

  • Renamed the max-pipelines CLI tool to max. We recommend re-installing it as shown in the max CLI docs.

  • Removed the previously deprecated --use-gpu, --serialized_model_path, --save_to_serialized_model_path, --max_cache_batch_size, and --huggingface-repo-id options.

  • Moved InputContext, TextContext, and TextAndVisionContext from max.pipelines to max.pipelines.context.

MAX models​

  • Added Llama4ForConditionalGeneration support, featuring new MoE layers. Currently, it is limited to text inputs. Run the model by calling:

    max generate --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --devices 0,1,2,3
  • Added support for running text generations using the Mistral 3 24B model. Run the model with:

    max generate --model-path mistralai/Mistral-Small-3.1-24B-Instruct-2503 --devices 0
  • Fixed empty textual outputs for certain Mistral models (MAX issue 4193).

  • Added support for loading a custom pipeline architecture by module. Passing --custom-architectures=folder/path/to/import:my_module loads architectures from that module; the architectures must be exposed via an ARCHITECTURES variable in the file. Once loaded, a model can be run using the new architectures. The flag can be specified multiple times to load additional modules.

MAX Serve​

  • Moved from a radix-trie to a hash-based prefix caching implementation, which has lower CPU overhead. This improves performance, particularly in workloads with high cache reuse rates.

  • Added experimental support for offloading KVCache to host memory via the --enable-kvcache-swapping-to-host and --host-kvcache-swap-space-gb flags. This allows for superior KVCache reuse through prefix caching in workloads where the reusable KVCache amount exceeds GPU VRAM.

  • Fixed the usage.prompt_tokens field in the OpenAI API Usage Info response. Previously this field was always set to Null, but now it correctly contains the number of prompt tokens in the request.

  • Switched from Python Multiprocessing Queue to ZeroMQ. This reduces latencies between frontend server process and model worker process related to networking.

  • Stray model workers on Linux now terminate more reliably when the parent process is killed.

MAX Engine & Graph​

Python API​

  • We now raise an error if there's a mismatch between the expected device of a weight on a graph and the device of the actual tensor data specified in InferenceSession.load().

  • Removed output_device argument from Model.execute().

  • Removed the copy_inputs_to_device argument in Model.execute to improve predictability of the API. Now execute() raises a TypeError if arguments are passed whose devices don't match the model.

  • Swapped the order of the dtype and shape fields of driver.Tensor. Previously, the arguments were ordered as (shape, dtype); they are now (dtype, shape), in line with other tensor-like types.

  • Replaced some instances of Tensor.zeros with Tensor.__init__ when the engine did not depend on the tensor being zero initialized. This elides the unnecessary memset to provide a minor performance improvement.

  • Added a new experimental Tensor.inplace_copy_from(). This allows users to copy the contents of one Tensor into another.

  • Weight now defaults to expecting its initial allocation on the host. A transfer to the target device is then inserted, and that value is returned when the weight generates an MLIR value. This is done due to the currently conservative ownership model around external weights.

  • Added the irfft op, which computes the inverse real fast Fourier transform (FFT).

  • Added the argmax op, which returns the index of the maximum value in an array or sequence.

  • Added the GroupNorm layer.

  • Switched layer names so that max.nn layers that are implemented with the deprecated Layer class are marked as "V1", and layers that are implemented with the new max.nn.Module are the default. That is, max.nn.LinearV2 is now max.nn.Linear, and the previous max.nn.Linear is now max.nn.LinearV1.

  • DeviceRefs in types and layers are now generally expected to be explicit rather than implicit.

Mojo API​

  • Removed some functionality from tensor.Tensor:

    • Serializing Tensor to disk (Tensor.tofile(path) and Tensor.save(path)).
    • Reading the serialized data back from disk (Tensor.load(path) and Tensor.fromfile(path)).
    • The rand and randn methods have been removed. Use the ones in the Mojo standard library if you still need to construct a new Tensor with random elements based on a particular TensorShape.
  • Deprecated the Mojo Driver, Graph, and Engine APIs

    These APIs are not currently used internally. Instead, we build graphs using the Python APIs, and our engineering efforts have been focused on making that experience as robust and user-friendly as possible. As a result, the Mojo versions of these APIs have not kept pace with new features and language improvements. These APIs will be open sourced for the community before being removed.

Custom ops API​

  • You can now pass Mojo source package paths as Graph custom extensions. The Mojo code will be compiled automatically, no need to run mojo package manually as a prior step. Previously, only pre-compiled .mojopkg paths were accepted, requiring the Mojo code to be built as a prerequisite step before running a Graph with a custom op.

    Given a project structure like:

    project
    |-- main.py
    \-- kernels
        |-- __init__.mojo
        \-- my_custom_op.mojo

    You can construct a Graph in main.py using Mojo custom op kernels simply using:

    g = Graph(
      ...,
      custom_extensions = [Path(__file__).parent / "kernels"]
    )

    A change to your Mojo source code defining a custom op will be reflected immediately the next time the Graph is constructed.

  • New image_pipeline example that demonstrates sequencing custom ops together which modify an image, leaving data on the GPU for each op, before writing it back to CPU and disk.

Kernels​

  • More compute overlap is now enabled for Hopper GPUs. This allows finer-grained scheduling of kernel operations by analyzing producer-consumer patterns within a compute kernel. As a result, there is more kernel compute overlap, especially for compute-heavy kernels with data-dependent execution paths.

GPU programming​

  • The CUDA driver requirement has been reduced to version 12.4 and the NVIDIA driver requirement to version 550. Requiring these earlier driver versions allows MAX to be more easily deployed on AWS and GCP, since these are the default versions used by those cloud providers.

  • Added support for programming NVIDIA Jetson Orin GPUs (sm_87).

Also see the Mojo changelog of GPU changes.

Mojo language​

  • We recently open-sourced the rest of the Mojo standard library, including the algorithm, benchmark, buffer, compile, complex, gpu, and layout packages. See it all in GitHub.

  • We've also open sourced all our MAX AI kernels. This new library includes kv_cache, layout, linalg, nn, nvml, and quantization.

For all the updates to the Mojo language, standard library, and tools, see the Mojo changelog.

v25.2 (2025-03-25)​

✨ Highlights​

  • Support for NVIDIA Hopper GPUs

    MAX has been optimized to run on Hopper GPUs. For more information on MAX and NVIDIA's hardware, see the MAX container documentation.

  • Multi-GPU support

    MAX uses tensor parallelism to distribute work across multiple GPUs so you can run LLMs like Llama-3.3-70B-Instruct, even with long context windows.

  • Expanded library of MAX models

    We're rapidly growing our library of base model architectures that MAX can accelerate with MAX Serve (including Phi3ForCausalLM, OlmoForCausalLM, and GraniteForCausalLM). We also now support GPTQ for the Llama models. For more information, check out our MAX model repository.

  • Advanced E2E optimizations for long context window

    In-flight batching, chunked prefill, and copy-on-write optimize execution for prefix-heavy and long-context-window scenarios.

  • GPU programming with Mojo

    Lots of new APIs are now available to enable both low-level GPU programming and abstracted programming patterns that simplify the code required to write GPU kernels for your AI models.

MAX Serve​

  • Extended MAX Serve batch scheduling to account for the prefix cache. The scheduler can now create larger batches when many prompt tokens are already cached, improving throughput up to 10% in some benchmarks.

  • Added support for in-flight batching, allowing token generation requests to be scheduled alongside context encoding requests to reduce inter-token latency. This behavior can be controlled by CLI argument --enable-in-flight-batch.

  • Added support for copy-on-write on KV blocks when using PagedAttention with Prefix Caching. This improves the prefix cache hit rate and prefill performance in some scenarios.

  • MAX Serve now supports transformers v4.49.0, with a patch to avoid graph breaks when using torch.compile() on Llama models.

  • Added support for recording HTTP traffic out to a file for diagnostics or later replay.

MAX models​

  • Added support for executing LlamaForCausalLM architecture models on multiple GPUs. The model uses tensor parallelism automatically when passing multiple device IDs to the --devices CLI argument. Try running meta-llama/Llama-3.3-70B-Instruct on 4 GPUs with the following example:

    max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
      --quantization-encoding bfloat16 \
      --devices gpu:0,1,2,3 \
      --prompt="Design a
        self-sustaining colony on Neptune's moon Triton with a myth/science
        fusion name, three quantum tech breakthroughs, one ethical debate, a
        neon-lit cultural ritual, and a hidden flaw, presented in bullet points."
  • Added support for the Phi3ForCausalLM model architecture (such as microsoft/phi-4). For example:

    max-pipelines generate \
      --model-path microsoft/phi-4 \
      --prompt "Write bubble sort in mojo"
  • Added support for the OlmoForCausalLM model architecture (such as allenai/OLMo-1B-0724-hf). For example:

    max-pipelines generate \
      --model-path allenai/OLMo-1B-0724-hf \
      --prompt "Write bubble sort in mojo"
  • Added support for the GraniteForCausalLM model architecture (such as ibm-granite/granite-3.1-8b-instruct). For example:

    max-pipelines generate \
      --model-path ibm-granite/granite-3.1-8b-instruct \
      --prompt "Write bubble sort in mojo"
  • Added support for:

  • We now support GPTQ quantization for models that run on the GPU. This is handled transparently when the model weights are specified. For example, this runs Llama 3.1 8B using int4-quantized GPTQ weights:

    max-pipelines generate \
      --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
      --prompt "Why is the sky blue?" \
      --max-batch-size 1 \
      --max-length 10000

    This reduces the total memory consumption of this model from ~16 GB to ~5 GB, allowing the model to fit in the RAM of smaller GPUs.

  • Model weights are now downloaded in parallel.

  • Added constraints on whitespace during structured output. This reduces token counts and improves model adherence.

  • Added jump-ahead decoding during structured output. This auto-completes tokens when a single path forward is identified, improving single completion times by up to ~20% for long prompts.

  • In the event of an unhandled exception, we now use the standard Python traceback format instead of using pretty-printed Rich tracebacks.

  • You now need to explicitly import LLM from max.entrypoints.llm rather than from the previous max.entrypoints location.

  • The max.pipelines.dataprocessing.tokenizer and max.pipelines.dataprocessing.gguf_utils modules have been removed.

  • The previously deprecated PipelineConfig.architecture field and its corresponding --architecture CLI argument have been removed.

max-pipelines CLI​

  • The --devices CLI argument now supports a comma-separated list of GPU IDs prefixed with gpu: like --devices=gpu:0,1,2,3. We no longer support the previous --devices=gpu-<N> format.

    max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
      --quantization-encoding bfloat16 \
      --devices gpu:0,1,2,3 \
      --prompt="Design a self-sustaining colony on Neptune's moon Triton with a myth/science fusion name, three quantum tech breakthroughs, one ethical debate, a neon-lit cultural ritual, and a hidden flaw, presented in bullet points."
  • Removed --huggingface-repo-id PipelineConfig option and CLI argument in favor of --model-path.

  • We consolidated --model-path and --weight-path. Valid --weight-path values now override --model-path, which handles both local and remote (Hugging Face) cases. If we cannot derive the weights from the --weight-path, we now fall back to the --model-path, which you must set explicitly.

  • Added --huggingface-revision option, to allow selecting a non-default branch or a specific commit in a Hugging Face model repository.

MAX Engine​

  • The MAX graph compiler now has kernel caching. This is a significant improvement to our compilation pipeline. Here are some of the highlights:

  • Up to 28% faster compilation times when making iterative changes to models

  • Improved caching between different but similar models (up to 27% faster)

  • Lays foundation for future caching optimizations

What does this mean for you? Faster development cycles! When you're working on model pipelines and making changes to the graph, the graph compiler will now intelligently reuse kernels that haven't changed, significantly reducing compilation times.

The improvements are particularly noticeable during iterative development, with compilation times dropping from ~80s to ~57s in some cases of compiling Llama3.1-8B for 4 GPUs. Even when compiling different models from the same family (like Llama/Granite variants), you'll see significant speedups on subsequent compilations.

Driver APIs​

  • Added Accelerator.can_access(other: Device) -> bool method to check if one device can directly access memory of another device.

  • Fixed a bug in max.driver.tensor.load_max_tensor() for bfloat16 dtype, which would cause an error about mmap size being too large.

  • max.driver.Tensor.item() now works on any single-element tensor (previously restricted to rank-0 tensors).

  • Added Device.synchronize(), which ensures all operations on the device complete before returning.
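
    A hedged sketch of these two driver additions; it assumes the id keyword on Accelerator and at least one visible GPU (two for the peer-access check):

    from max.driver import Accelerator, accelerator_count

    gpu0 = Accelerator(id=0)
    gpu0.synchronize()  # block until all queued work on gpu0 completes

    if accelerator_count() > 1:
        gpu1 = Accelerator(id=1)
        print(gpu0.can_access(gpu1))  # True if gpu0 can directly access gpu1's memory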

  • Removed MojoCallContextPtr in favor of DeviceContextPtr. MojoCallContextPtr only contained a DeviceContextPtr, so this change directly exposes the DeviceContextPtr. Custom ops using MojoCallContextPtr now directly take a DeviceContextPtr argument:

        @staticmethod
        fn execute[
            type: DType, rank: Int
        ](
            output: OutputTensor[type=type, rank=rank],
            input: InputTensor[type=type, rank=rank],
            ctx: MojoCallContextPtr,
        ):

    becomes

        @staticmethod
        fn execute[
            type: DType, rank: Int
        ](
            output: OutputTensor[type=type, rank=rank],
            input: InputTensor[type=type, rank=rank],
            ctx: DeviceContextPtr,
        ):
  • You can now skip compiling a GPU kernel first before enqueueing it, and pass a function directly to ctx.enqueue_function[func](...):

    fn func():
        print("Hello from GPU")
    
    @register("custom_op")
    struct CustomOp:
    
        @staticmethod
        fn execute(ctx: DeviceContextPtr) raises:
            var dev_ctx = ctx.get_device_context()
            dev_ctx.enqueue_function[func](grid_dim=1, block_dim=1)

    However, if you're reusing the same function and parameters multiple times, this incurs some overhead of around 50-500 nanoseconds per enqueue. So you can still compile the function first and pass it to ctx.enqueue_function in this scenario:

    var compiled_func = ctx.compile_function[func]()
    # Multiple kernel launches with the same function/parameters
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
  • Changed Accelerator and CPU from factory methods that created Device objects in Python (which were accelerators and CPUs in the C++ implementation) to actual Python types. This change elevates the Accelerator and CPU type concepts to Python, making them types rather than methods.

    This allows type annotations in Python. For example, a list of accelerators used to be defined like this:

    graph_devices: list[DeviceRef]

    Now it can be defined like this:

    graph_devices: list[Accelerator]
  • Elementwise operations (e.g. __add__) have been removed from Tensor (that is, tensor_internal.Tensor). This Tensor type is being phased out; please reduce usage in favor of LayoutTensor.

Graph APIs​

  • The nn package is now max.nn.

  • Added ops.chunk() to support chunking tensors along an axis.

  • Added support for while loops with ops.while_loop.

  • Added support for conditional execution with ops.cond.

  • Added axis reduction overloads for ops.min and ops.max. For example: ops.min(tensor, axis=-1).

  • The gelu() function now accepts an approximate keyword, which selects the GELU approximation; accepted values are none, tanh, and fast.
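
    A minimal sketch of the new keyword; the graph setup here follows the current-style Python graph construction and is illustrative only:

    from max.dtype import DType
    from max.graph import DeviceRef, Graph, TensorType, ops

    ttype = TensorType(DType.float32, shape=[4], device=DeviceRef.CPU())

    with Graph("gelu_example", input_types=[ttype]) as graph:
        (x,) = graph.inputs
        exact = ops.gelu(x)                       # default approximation: none
        approx = ops.gelu(x, approximate="tanh")  # tanh-based approximation
        graph.output(exact, approx)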

  • Removed the roundeven() operation from the Python API. The round() operation now has the same behavior as roundeven(), so there is no need for both to exist.

  • Added helpers to create analogous tensors from buffer types and vice versa.

  • Added max.nn.Module, a base class for writing layers and constructing networks of layers (e.g. using max.nn.Sequential). Currently, this class supports graph building by ensuring that all weight names are unique and systematically generated. This class also supports managing the weight values with the module.state_dict() and module.load_state_dict() methods. More functionality and documentation will be added in future releases.

Custom ops​

  • Changes have been made to the way that custom ops are registered: rather than using the num_dps_outputs attribute on @compiler.register to specify the number of outputs, that number is now inferred from the signature of the custom operation. Inputs to the operation now use the InputTensor type and outputs from the operation use OutputTensor, instead of the previous ManagedTensorSlice for both. This eliminates the need for a manual num_dps_outputs attribute, and makes it safer to work with these inputs and outputs by preventing accidental writes to input tensors. The new interface looks something like the following:

    @compiler.register("add_one_custom")
    struct AddOneCustom:
        @staticmethod
        fn execute[
            target: StringLiteral,
        ](
            out: OutputTensor,
            x: InputTensor[type = out.type, rank = out.rank],
            ctx: DeviceContextPtr,
        ) raises:
            @parameter
            @always_inline
            fn elementwise_add_one[
                width: Int
            ](idx: IndexList[x.rank]) -> SIMD[x.type, width]:
                return x.load[width](idx) + 1
    
            foreach[elementwise_add_one, target=target](out, ctx)
  • The foreach function now raises to be able to handle errors within an elementwise calculation.

Hopper kernels​

State-of-the-Art Kernels in Mojo for H100/H200 GPUs

  • Hopper Architecture Matrix Multiplication Kernels: The implementation achieved performance comparable to NVIDIA's highly optimized cuBLAS library. These kernels take full advantage of the Tensor Cores in Hopper architecture GPUs to accelerate the fundamental matrix multiplication operations that underpin deep learning workloads.

  • Multi-GPU AllReduce Implementation: The AllReduce operation is critical for distributed inference across multiple GPUs, as it efficiently aggregates gradients. The Mojo implementation surpassed NVIDIA's NCCL library in performance benchmarks. This improvement reduces communication overhead during distributed inference.

  • MAX Attention Kernel with Flash Attention 3: This implementation incorporates the latest Flash Attention 3 algorithm and extends it, which significantly accelerates the computation of attention mechanisms in transformer models. The MAX attention kernel optimizes memory access patterns and computational steps, reducing both the memory footprint and execution time of attention operations. This is particularly important for LLMs where attention calculations represent a substantial portion of the computational workload.

GPU programming​

  • Added the Mojo max.driver API to enable dispatching GPU functions from Mojo.

Check out examples for GPU programming in Mojo, which use this new API.

Mojo​

Mojo is a crucial component of the MAX stack that enables all of MAX's performance-oriented code across hardware. For all the updates to the Mojo language, standard library, and tools, see the Mojo changelog.

Documentation​

New examples for writing custom ops:

  • fused_attention demonstrates complex GPU programming using MAX abstractions for a practical use in AI model development.

  • matrix_multiplication includes a series of progressive optimizations for matrix multiplications on GPUs.

  • histogram shows how to implement the histogram pattern as a custom op.

  • New examples for GPU programming in Mojo using the new MAX Driver API

    • These use a Mojo programming model that should look familiar to CUDA C programmers, showing how to define and dispatch GPU functions within a single Mojo file. These examples recreate the first three samples from the popular textbook "Programming Massively Parallel Processors", showing how basic concepts translate from CUDA into Mojo. There's also a Mandelbrot set calculation example that parallels a similar one in the existing custom ops examples.
  • New MAX containers available. For more information on the base and full MAX containers, see Container contents.

v25.1.1 (2025-02-19)​

Fix performance issues in autoregressive models with paged attention by setting sensible default values for --max-num-steps that are platform-specific.

v25.1 (2025-02-13)​

✨ Highlights​

  • Custom ops for GPUs

    Our new custom op API allows you to extend MAX Engine with new graph operations written in Mojo that execute on either CPU or GPU, providing full composability and extensibility for your models. See more in the section about GPU programming.

  • Enhanced support for agentic workflows

    MAX Serve now supports function calling, which allows you to instruct your model to interact with other systems, such as retrieve data and execute external tasks. Learn more about function calling and tool use.

    MAX Serve now supports structured output (also known as constrained decoding) for MAX models on GPU. This allows you to enforce the output format from a model using an input schema that defines the output structure. Learn more about structured output.

  • Extended model architecture support

  • New max-pipelines CLI tool

    Instead of cloning our GitHub repo to access our latest GenAI models, you can instead install the max-pipelines CLI tool and quickly run an inference or deploy an endpoint.

Documentation​

New tutorials:

Other docs:

MAX Serve​

  • The /v1/completions REST endpoint now supports:

    • Pre-tokenized prompts.

    • Image inputs for multimodal models such as Llama-3.2-11B-Vision-Instruct. For an example, see how to generate image descriptions with Llama 3.2 Vision.

      Known issue: You might receive faulty results because some parts of the text prompt get ignored for certain input combinations. We've identified the problem and will have a fix in a subsequent nightly release.

    • Function calling and tool use, which allows you to instruct your model to interact with other systems, such as retrieve data and execute external tasks. Learn more about function calling and tool use.

    • Structured output (also known as constrained decoding), which allows you to enforce the output format from a model using a JSON schema and the response_format field. To enable constrained decoding pass --enable-structured-output when running the server. However, this feature currently works for MAX models on GPU only (support for PyTorch models and CPU is in progress). Learn more about structured output.

  • Added support for the /v1/embeddings API endpoint, allowing you to generate vector representations using embedding models. See how to deploy a text embedding model.

  • MAX Serve can evict requests when the number of available pages in the PagedAttention KVCache is limited. Previously, the KV cache manager would throw an OOM error when a batch that could not fit in the cache was scheduled.

MAX models​

  • Added the max-pipelines CLI tool that simplifies the process to run inference with GenAI models (specified with a Hugging Face repo ID) and deploy them to a local endpoint with MAX Serve.

    Previously, running or serving these models required cloning the modular/max GitHub repo and then running commands such as magic run llama3.

    Model-specific commands like llama3 and replit have been removed. They're now standardized and subsumed by flags like --model-path in the max-pipelines tool. Arguments such as --max-length and --weight-path are also still supported by max-pipelines.

    To view a list of supported model architectures from Hugging Face, run max-pipelines list.

  • Added support for PagedAttention, which improves memory efficiency by partitioning the KV cache into smaller blocks, reducing fragmentation and enabling larger inference batches. You can enable it with --cache-strategy=paged and --kv-cache-page-size with a value that's a multiple of 128.

  • Added support for prefix caching in all cases where PagedAttention is supported. This allows for more efficient usage of KVCache and improved prefill performance for workloads with common prefixes. You can enable it by setting --enable-prefix-caching. For more information, see Prefix caching with PagedAttention.

  • Batch size and max length are now inferred from available memory and the HF Models' default values for max length, respectively. If a configuration leads to an OOM, then we provide recommendations (to the best of our ability) to the user to fit the model into memory.

  • Added support for heterogeneous KV caches for multi-modal models, such as Llama Vision, which cache different KV states for self and cross attention layers.

  • Added support for embedding models, starting with MPNet. For example:

    max-pipelines generate \
      --model-path=sentence-transformers/all-mpnet-base-v2 \
      --prompt="Encode this sentence."

    Also see how to deploy a text embedding model.

  • Added support for image and text multimodal models:

    • max-pipelines generate now accepts image input with --image_url.

    • Added an experimental Pixtral pipeline you can run as follows:

      max-pipelines generate \
        --model-path=mistral-community/pixtral-12b \
        --prompt="What is in this image? [IMG]" \
        --image_url=http://picsum.photos/1024/1024

      The pipeline is automatically used for all models implementing the LlavaForConditionalGeneration architecture.

      The implementation currently has a limit of one image. We plan to support an arbitrary number of images of mixed sizes soon.

    • Added an experimental Llama Vision pipeline you can run as follows:

      max-pipelines generate \
        --model-path=meta-llama/Llama-3.2-11B-Vision-Instruct \
        --prompt="<|image|><|begin_of_text|>What is in this image?" \
        --image_url=http://picsum.photos/1024/1024

      The pipeline is automatically used for all models implementing the MllamaForConditionalGeneration architecture.

      Note: This model is gated and requires that you set the HF_TOKEN environment variable. See Llama-3.2-11B-Vision-Instruct.

    • See how to generate image descriptions with Llama 3.2 Vision.

  • Added support for the Qwen2ForCausalLM model architecture (such as Qwen/Qwen2.5-7B-Instruct). For example:

    max-pipelines generate \
      --model-path=Qwen/Qwen2.5-7B-Instruct \
      --prompt="Write bubble sort in python" \
      --quantization-encoding bfloat16
  • Added support for offline batched inference for text-based LLMs, allowing you to load a model and run inference with a batch of inputs directly from Python, instead of relying on an HTTP interface. For an example, see examples/offline-inference/basic.py.

  • The --max-cache-batch-size flag has been deprecated in favor of --max-batch-size. Using --max-cache-batch-size now emits a deprecation warning and will stop working in a future release.

  • The --use-gpu flag has been deprecated in favor of --devices=cpu, --devices=gpu, or --devices=gpu-0,gpu-1,.... If the device isn't specified, the model runs on the first available GPU, or CPU if no GPUs are available.

MAX Engine​

  • Improved internal kernel compilation speed by 1.5x to 4x across different models.

    We've revamped our GPU compilation process so that all kernels in a program are compiled together into a single LLVM module, then split into separate kernels afterward. This ensures shared code between kernel entry points is only compiled once. For example, we observe a 3.7x speed up for Llama3.1-8b GPU startup time.

  • Improved initial model execution speed on NVIDIA GPUs.

    Instead of compiling to PTX and performing just-in-time compilation during runtime, we now generate CUBIN binaries directly. While this increases initial compilation time, it significantly improves execution speed.

  • The kernels have been further tuned for performance on NVIDIA A100 GPUs.

Graph APIs​

  • You can now write custom operations (ops) in Mojo, and add them to a graph constructed in Python, using custom() and inplace_custom().

    For more detail, see the section below about GPU programming.

  • Cached compiled MAX graphs that make use of custom operations now get invalidated when the implementation of the custom operations change.

  • Graph.add_weight() now takes an explicit device argument. This enables explicitly passing GPU-resident weights to session.load() via the weights registry to initialize the model.

  • max.graph.Weight now inherits from TensorValue, allowing you to call weight.cast() or weight.T. As such, the TensorValue no longer accepts Weight for the value argument.

Pipeline APIs​

  • TextTokenizer.new_context() now supports tool definitions passed through its request argument (via TokenGeneratorRequest.tools).

  • Removed the default num_steps value for TokenGenerator.next_token(), ensuring users pass a value, reducing the potential for silent errors.

  • KVCacheStrategy now defaults to MODEL_DEFAULT.

    As opposed to the previous setting, which always used the "continuous" caching strategy, the KV caching strategy now defaults on an architecture-specific basis to ensure the most optimized caching strategy is used.

  • The Linear layer now has a create() class method that automatically creates specializations of Linear for non-quantized, k-quant, or GPTQ layers.

  • Added nn.Conv1D for audio models like Whisper.

GPU programming​

This release includes all new APIs to program on GPUs. The way to write code for GPUs is to create custom operations with GPU functions that you can load into a MAX graph. This foundational API includes a few key components:

  • Mojo APIs to write custom op functions:

    • The @compiler.register decorator is applied to a Mojo struct that implements a custom op in an execute() function (for either CPU or GPU) and a shape() function that defines the shape of the custom op's output tensor.

    • The max.tensor package adds essential Mojo APIs for writing custom ops, such as:

      • The foreach() function, which efficiently executes an element-wise computation in parallel on either a GPU or CPU.

      • The ManagedTensorSlice type defines the input and output tensors for the custom op.

  • Python APIs to load custom ops into a model:

    • The custom() and inplace_custom() functions allow you to add the previously-defined Mojo custom op to a MAX graph written in Python.

    • The InferenceSession constructor accepts the custom op implementation as a Mojo package in the custom_extensions argument.

For more detail, see the tutorial to build custom ops for GPUs.
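
For illustration only, the Python side might look roughly like the following sketch; the op name, package path, and exact keyword arguments are assumptions, so refer to the tutorial for a complete, working example.

    from max import engine
    from max.dtype import DType
    from max.graph import Graph, TensorType, ops

    # "add_one" is a hypothetical custom op implemented in a Mojo package.
    with Graph("custom_demo", input_types=[TensorType(DType.float32, (4,))]) as graph:
        x = graph.inputs[0]
        result = ops.custom(
            name="add_one",
            values=[x],
            out_types=[TensorType(DType.float32, (4,))],
        )[0]
        graph.output(result)

    # Point the session at the compiled Mojo package that defines the op.
    session = engine.InferenceSession(custom_extensions="path/to/custom_ops.mojopkg")
    model = session.load(graph)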

Additionally, we've added a new gpu package to the Mojo standard library that provides low-level programming constructs for working with GPUs. These APIs let you do things that you can't currently do with the high-level foreach() abstraction above. The Mojo gpu APIs allow you to manually manage interaction between the CPU host and GPU device, manage memory between devices, synchronize threads, and more. For some examples, see vector_addition.mojo and top_k.mojo.

Mojo

Mojo is a crucial component of the MAX stack that enables all of MAX's performance-oriented code across hardware. For all the updates to the Mojo language, standard library, and tools, see the Mojo changelog.

v24.6 (2024-12-17)

This is a huge update that offers a first look at our serving library for MAX on GPUs!

Also check out our blog post introducing MAX 24.6.

✨ Highlights​

  • MAX Engine on GPUs preview

    We're excited to share a preview of MAX Engine on GPUs. We've created a few tutorials that demonstrate MAX's ability to run GenAI models with our next-generation MAX graph compiler on NVIDIA GPU architectures (including A100, A10, L4, and L40 GPUs). You can experience it today by deploying Llama 3 on an A100 GPU.

  • MAX Serve preview

    This release also includes an all-new serving interface called MAX Serve. It's a Python-based serving layer that supports both native MAX models when you want a high-performance deployment, and off-the-shelf PyTorch LLMs from Hugging Face when you want to explore and experiment, all with GPU support. It provides an OpenAI-compatible REST endpoint for inference requests, and a Prometheus-compatible metrics endpoint. You can use a magic command to start a local server, or use our ready-to-deploy MAX container to start an endpoint in the cloud. Try it now with an LLM from Hugging Face.

  • Upgraded MAX models

    As we continue to build our Python-based MAX Graph API that allows you to build high-performance GenAI models, we've made a ton of performance improvements to the existing models and added a few new models to our GitHub repo. All the Python-based MAX models now support GPUs and broad model architectures. For example, llama3 adds compatibility for the LlamaForCausalLM family, which includes over 20,000 model variants and weights on Hugging Face.

Documentation

New tutorials:

Other new docs:

Also, our documentation is now available for MAX nightly builds! If you're building with a nightly release, you can switch to see the nightly docs using a toggle to the right of the search bar.

MAX Serve

This release includes a preview of our Python-based serving library called MAX Serve. It simplifies the process of deploying your own inference server with consistent and reliable performance.

MAX Serve currently includes the following features:

  • Deploys locally and to the cloud with our MAX container image, or with the magic CLI.

  • An OpenAI-compatible server with streaming /chat/completions and /completions endpoints for LLM inference requests (see the example after this list).

  • A Prometheus-compatible metrics endpoint with LLM KPIs, such as time to first token (TTFT) and inter-token latency (ITL), for monitoring and evaluating performance.

  • Supports most TextGeneration Hugging Face Hub models.

  • Multiprocess HTTP/model worker architecture to maximize CPU core utilization by distributing multiple incoming requests across multiple processes, ensuring both high throughput and responsiveness.

  • Continuous heterogeneous batching to combine multiple incoming requests into a single inference (no waiting to fill a batch size) and improve total throughput.
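
As an example of the OpenAI-compatible endpoint, any OpenAI client can talk to a locally running server; the port and model name below are placeholders for whatever you serve.

    from openai import OpenAI

    # MAX Serve exposes an OpenAI-compatible REST API, so only the base URL changes.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Summarize continuous batching in one sentence."}],
    )
    print(response.choices[0].message.content)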

There's much more still in the works for MAX Serve, but you can try it today with our tutorials to Deploy Llama 3 on GPU with MAX Serve.

Known issues:

  • While this release is enough to support typical chatbot applications, it does not yet support the function-calling portion of the OpenAI API specification needed to enable robust agentic workflows.

  • Sampling is still limited and doesn't currently respect temperature or other sampling-related API request input.

  • Structured generation is not supported.

  • Support for multi-modal models is still nascent.

MAX models

All of our Python-based GenAI models on GitHub now support GPUs!

As we add more models, we're also building a robust set of libraries and infrastructure that make it easier to build and deploy a growing library of LLMs. Some of this is available in the new max.pipelines package, and some of it lives alongside the models on GitHub. Here are just some of the highlights:

  • Deep integration with the Hugging Face ecosystem for a quick-to-deploy experience, such as using HF Model Hub tools to fetch config files, support for weights in safetensors format, support for HF tokenizers, and more. (We also support GGUF weight formats.)

  • Expanded set of model abstractions for use by different LLM architectures:

    • Attention layers (including highly optimized implementations with configurable masking, like AttentionWithRope). The optimized attention layers include variants that accept an attention mask. More memory-efficient variants that don't take a mask instead take a "mask functor" argument to the kernel, which implements masking without materializing a mask by computing a mask value from input coordinates on the fly.

    • Transformers such as Transformer and TransformerBlock. These include an initial implementation of ragged tensors: tensors for which each dimension can have a different size, avoiding the use of padding tokens by flattening a batch of sequences of differing lengths.

    • Common layers such as RMSNorm, Embedding, and Sequential.

    • KV cache management helpers, like ContinuousBatchingKVCacheManager.

    • Low-level wrappers over optimized kernels like fused_qk_ragged_rope. These are custom fused kernels that update the KV cache in place. Although they are custom, they reuse the underlying kernel implementation by passing in lambda functions used to retrieve inputs and write to outputs in place.

  • Added generalized interfaces for text generation such as TokenGenerator and PipelineModel, which provide modularity within the models and serving infrastructure. Also added a plug-in mechanism (PipelineRegistry) to more quickly define new models, tokenizers, and other reusable components. For example, anything that conforms to TokenGenerator can be served using the LLM infrastructure within MAX Serve. We then used this interface to create the following:

    • An optimized TextGenerationPipeline that can be combined with any compatible graph and has powerful performance features like graph-based multi-step scheduling, sampling, KV cache management, ragged tensor support, and more.

    • A generic HFTextGenerationPipeline that can run any Hugging Face model for which we don't yet have an optimized implementation in eager mode.

  • Models now accept weights via a weights registry, which is passed to the session.load() method's weights_registry argument. The decoupling of weights and model architecture allows implementing all of the different fine-tunes for a given model with the same graph. Furthermore, because the underlying design is decoupled, we can later expose the ability to compile a model once and swap weights out on the fly, without re-compiling the model.

  • Added generic implementations of common kernels, which allow you to plug in different batching strategies (ragged or padded), KV cache management approaches (continuous batching), masking (causal, sliding window, etc.), and position encoding (RoPE or ALiBi) without having to rewrite any kernel code. (More about this in a future release.)

  • Multi-step scheduling to run multiple token-generation steps on GPU before synchronizing to the CPU.

Updated models:

  • Significant performance upgrades for Llama 3, and expanded compatibility with the LlamaForCausalLM model family. For example, it also supports Llama 3.2 1B and 3B text models.

New models:

Known issues:

  • The Q4 quantized models currently work on CPU only.

  • Using a large setting for top-k with the Llama 3.1 model may lead to segmentation faults for certain workloads when run on NVIDIA GPUs. This should be resolved in the latest nightly MAX builds.

  • The models currently use a smaller default context window than the max_seq_len specified in the Hugging Face configuration files for a given model. This can be manually adjusted by setting the --max-length parameter to the desired context length when serving a model.

  • Some variants of the supported core models (like LlamaForCausalLM with different numbers of heads, head sizes, etc.) might not be fully optimized yet. We plan to fully generalize our implementations in a future release.

MAX Engine

MAX Engine includes much of the core infrastructure that enables MAX to accelerate AI models on any hardware, such as the graph compiler, runtime, kernels, and the APIs to interact with them. It all works without external dependencies such as PyTorch or CUDA.

This release includes a bunch of performance upgrades to our graph compiler and runtime. We've added support for NVIDIA GPU architectures (including A100, A10, L4, and L40 GPUs), and built out new infrastructure so we can quickly add support for other GPU hardware.

Engine API changes:

  • InferenceSession now accepts a custom_extensions constructor argument, same as load(), to specify model extension libraries.

  • The Model object is now callable to run an inference.

Breaking changes:

  • Model.execute() signature changed to support GPUs.

    • The execute() function currently doesn't accept keyword arguments. Instead you can pass tensors as a driver.Tensor, int, float, bool, np.generic, or DLPackArray (DLPack). Note that both PyTorch and NumPy arrays implement the DLPack protocol, which means you can also pass either of those types to execute() (see the sketch after this list).

    • execute_legacy() preserves the semantics of execute() with support for keyword arguments to help with migration, but will be removed in a future release. execute_legacy() doesn't support GPUs.

    • Calling execute() with positional arguments still works the same.
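
Here is a small sketch of the new calling convention; the model object and input shape are placeholders, with the model assumed to come from InferenceSession.load().

    import numpy as np

    # NumPy arrays implement the DLPack protocol, so they can be passed directly.
    input_array = np.random.rand(1, 3, 224, 224).astype(np.float32)

    # Positional arguments only; keyword arguments require execute_legacy().
    outputs = model.execute(input_array)

    # The Model object is also directly callable.
    outputs = model(input_array)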

Driver APIs

MAX Driver (the max.driver module) is a new component of MAX Engine that's still a work in progress. It provides primitives for working with heterogeneous hardware systems (GPUs and CPUs), such as allocating on-device memory, transferring data between host and device, querying device stats, and more. It's a foundation on which other components of MAX Engine operate (for example, the inference engine now uses driver.Tensor to handle model inputs and outputs).

Driver API changes:

  • Added CUDA() device to open an NVIDIA GPU.

  • Added support for fp16 and bfloat16 dtypes.

  • Expanded functionality for max.driver.Device, with new class methods and properties. We are still working on building this out to support more accelerator features.

  • driver.Tensor (and the InferenceSession.load() argument weights_registry) now supports zero-copy interoperability with NumPy arrays and PyTorch tensors, using DLPack / DLPackArray.

  • driver.Tensor has new methods, such as from_dlpack(), element_size(), to(), to_numpy(), view(), zeros(), and more.
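
    A quick sketch of the zero-copy interoperability (method availability may vary in this preview):

    import numpy as np
    from max import driver

    arr = np.arange(16, dtype=np.float32).reshape(4, 4)

    # Wrap the NumPy array without copying; NumPy implements the DLPack protocol.
    tensor = driver.Tensor.from_dlpack(arr)

    # View the same data back as NumPy, again without a copy.
    assert tensor.to_numpy().shape == (4, 4)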

MAX Driver APIs are still changing rapidly and not yet ready for general use. We'll publish more documentation in a future release.

Known issues:

  • MAX Driver is currently limited to managing just one NVIDIA GPU at a time (it does not yet support multi-GPU). It also does not yet support remote devices.

  • DLPack support is not complete. For example, streams are not yet supported.

Graph compiler

When you load a model into MAX Engine, the graph compiler is the component that inspects and optimizes all graph operations (ops) to deliver the best run time performance on each device.

This release includes various graph compiler improvements:

  • Major extensions to support NVIDIA GPUs (and other devices in the future), including async copies and caching of JIT'd kernels.

  • The runtime now performs scheduling to enable GPU compute overlap with the CPU.

  • New transformations to the Mojo kernels to enable a number of optimizations, including specialization on tensor dimensions, specialization on target hardware, specialization on non-tensor dimension input to kernels, automatic kernel fusion between operators, and more.

  • New algebraic simplifications and algorithms for ops such as horizontal fusion of matrix multiplications.

  • New CPU-side primitives for device management that are automatically transformed and optimized to reduce overhead (MAX does not need to use things like CUDA Graphs).

  • Updated memory planning to preallocate device memory (hoist computation from inference runtime to initialization time) and reduce per-inference overhead.

Graph APIs

The graph compiler is also exposed through the MAX Graph APIs (the max.graph package), which allow you to build high-performance GenAI models in Python.

Graph API changes:

  • Python stack traces from model execution failures now include a trace to the original op-creation, allowing for easier debugging during development.

  • The max.graph APIs now include preliminary support for symbolic algebraic expressions using AlgebraicDim, enabling more powerful support for checked dynamic shapes. This allows expressions like -Dim("x") - 4. Furthermore, the algebraic expressions simplify to a canonical form, so that, for example, -Dim("x") - 4 == -(Dim("x") + 4) holds.

  • More advanced dtype promotion now allows TensorValue math operators to just work when used with NumPy arrays and Python primitives (see the sketch after this list).

  • TensorValue has new methods, such as broadcast_to(), cast(), flatten(), permute(), and more.

  • Added BufferValue, which allows for device-resident tensors that are read and mutated within the graph.

  • DType has new methods and properties: align, size_in_bytes, and is_float().

  • Value constructor accepts more types for value.

  • TensorValue constructor accepts more types for value.

  • TensorValue.rebind() accepts a new message argument.
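
Here's a small sketch of the dtype promotion and the new TensorValue helpers; the graph name, shapes, and dimension names are illustrative only.

    import numpy as np
    from max.dtype import DType
    from max.graph import Graph, TensorType

    with Graph("promotion_demo", input_types=[TensorType(DType.float32, ("batch", 4))]) as graph:
        x = graph.inputs[0]
        # NumPy arrays and Python scalars promote automatically in math ops.
        y = x * 2.0 + np.ones(4, dtype=np.float32)
        # New helpers such as cast() and permute() are available on TensorValue.
        z = y.cast(DType.bfloat16).permute([1, 0])
        graph.output(z)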

Breaking changes:

  • Graph.add_weight() now accepts Weight and returns TensorValue. Weight is essentially a named placeholder for a tensor that knows its name, dtype, shape, and optionally device and quantization encoding. Graph.add_weight() stages an op in the graph that is populated by a named weight in the weights registry passed to session.load() (see the sketch after this list).

  • The Weight constructor arguments changed: added align, dtype, and shape; removed assign, filepath, offset, and value.

  • The ops.scalar() method was removed along with the is_static() and is_symbolic() methods from all graph.type objects.

    • Instead of ops.scalar(), use ops.constant().

    • Instead of is_static() and is_symbolic(), use isinstance(dim, SymbolicDim) and isinstance(dim, StaticDim).
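
A hedged sketch of the new weights flow; the exact Weight constructor arguments and shapes here are illustrative, so check the API reference before relying on them.

    import numpy as np
    from max import engine
    from max.dtype import DType
    from max.graph import Graph, TensorType, Weight

    with Graph("weighted", input_types=[TensorType(DType.float32, (2, 4))]) as graph:
        # Weight is a named placeholder that stages an op filled in at load time.
        w = Weight(name="linear.weight", dtype=DType.float32, shape=(4, 4))
        weight_value = graph.add_weight(w)  # returns a TensorValue
        graph.output(graph.inputs[0] @ weight_value)

    session = engine.InferenceSession()
    model = session.load(
        graph,
        # Keys must match the names of the Weights staged in the graph.
        weights_registry={"linear.weight": np.eye(4, dtype=np.float32)},
    )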

The MAX Graph APIs are not ready for general use, but you can experiment with them now by following this tutorial. We'll add more documentation when we finish some API redesigns.

Custom op registration

Although the APIs to write custom operators (ops) aren't ready for general use, this release includes a significant redesign that lays the groundwork. You might notice some associated APIs in this release and more APIs in the nightlies, so here's a little about the work in progress:

  • The custom op APIs will allow you to extend MAX Engine with new ops written in Mojo, providing full composability and extensibility for your models. It's the exact same API we use to write MAX Engine's built-in ops such as matmul. That means your custom ops can benefit from all our compiler optimization features such as kernel fusion; your ops are treated the same as all the ops included "in the box."

  • The new API requires far less adornment at the definition site to enable the MAX model compiler to optimize custom ops along with the rest of the graph (compared to our previous version that used NDBuffer).

  • Custom ops support "destination passing style" for tensors.

  • The design composes on top of Mojo's powerful metaprogramming, as well as the kernel library's abstractions for composable kernels.

We'll publish more documentation when the custom op API is ready for general use. Check out the MAX repo's nightly branch to see the latest custom op examples.

Known issues:

  • Custom ops don't have type or lifetime checking. They also don't reason about mutability. Expect lots of sharp corners and segfaults if you hold them wrong while we improve this!

Numeric kernels

The GPU kernels for MAX Engine are built from the ground up in Mojo with no dependencies on external vendor code or libraries. This release includes the following kernel improvements:

  • AttenGen: a novel way to express attention patterns that can express different attention masks, score functions, and caching strategies.

  • State-of-the-art matrix multiplication algorithms with optimizations such as the following:

    • Pipelining and double-buffering to overlap data transfer and computation and to hide memory access latency (for both global and shared memory).

    • Thread swizzling to avoid shared memory bank conflicts associated with tensor core layouts.

    • Block swizzling to increase L2 cache locality.

  • SplitK/StreamK GEMM algorithms: divide the computation along the shared K dimension into smaller matrices that can then be executed independently on streaming multiprocessors (such as CUDA cores). These algorithms are ideal for matrices with a large K dimension but small M dimension.

  • Large context length MHA: uses SplitK/StreamK to implement the attention mechanism and eliminate the need for a huge score matrix, which drastically reduces memory usage and traffic to enable large context lengths.

  • DualGemm: accelerates the multi-layer perceptron (MLP) layers where the left-hand side (LHS) is shared between two matrix multiplications.

Known issues:

  • The MAX kernels are optimized for bfloat16 on GPUs.

  • Convolution on GPU is not performance optimized yet.

  • Although v24.6 technically runs on H100, it doesn't include performance-optimized kernels for that device yet and it isn't recommended.

Mojo

Mojo is a crucial component of the MAX stack that enables all of MAX's performance-oriented code across hardware. For all the updates to the Mojo language, standard library, and tools, see the Mojo changelog.

v24.5 (2024-09-13)

✨ Highlights​

  • Mojo and MAX are magical! We've created a new package and virtual environment manager, magic, for MAX and Mojo.

  • New Llama3.1 pipeline built with the new MAX Graph Python API.

  • We have not one, but two new Python APIs that we're introducing in this release: the MAX Graph Python API and the MAX Driver Python API, described below.

⭐️ New​

  • Added repeat_interleave graph op.

  • Added caching for MAX graph models. This means that graph compilation is cached and the executable model is retrieved from cache on the 2nd and subsequent runs. Note that the model cache is architecture specific and isn't portable across different targets.

  • Support for Python 3.12.

MAX Graph Python API

This Python API will ultimately provide the same low-level programming interface for high-performance inference graphs as the Mojo API. As with the Mojo API, it's an API for graph-building only, and it does not implement support for training.

You can take a look at how the API works in the MAX Graph Python API reference.

MAX Driver Python API

The MAX Driver API allows you to interact with devices (such as CPUs and GPUs) and allocate memory directly onto them. With this API, you interact with this memory as tensors.

Note that this API is still under development, with support for non-host devices, such as GPUs, planned for a future release.

To learn more, check out the MAX Driver Python API reference.

MAX C API

New APIs for adding torch metadata libraries:

  • M_setTorchMetadataLibraryPath
  • M_setTorchMetadataLibraryPtr

🦋 Changed

MAX Engine performance

  • Compared to v24.4, MAX Engine v24.5 generates tokens for Llama an average of 15%-48% faster.

MAX C API

Simplified the API for adding torch library paths, which now only takes one path per API call, but can be called multiple times to add paths to the config:

  • M_setTorchLibraries -> M_setTorchLibraryPath

⚠️ Deprecated​

  • The max command line tool is no longer supported and will be removed in a future release.

❌ Removed​

  • Dropped support for Ubuntu 20.04. If you're using Ubuntu, we currently support Ubuntu 22.04 LTS only.
  • Dropped support for Python 3.8.
  • Removed built-in PyTorch libraries from the max package. See the FAQ for information on supported torch versions.

v24.4 (2024-06-07)

🔥 Legendary

  • MAX is now available on macOS! Try it now.

  • New quantization APIs for MAX Graph. You can now build high-performance graphs in Mojo that use the latest quantization techniques, enabling even faster performance and more system compatibility for large models.

    Learn more in the guide to quantize your graph weights.

⭐️ New​

MAX Mojo APIs

  • Added AI pipeline examples in the max repo, with Mojo implementations for common transformer layers, including quantization support.

    • New Llama3 pipeline built with MAX Graph.

    • New Replit Code pipeline built with MAX Graph.

    • New TinyStories pipeline (based on TinyLlama) that offers a simple demo of the MAX Graph quantization API.

  • Added max.graph.checkpoint package to save and load model weights.

    All weights are stored in a TensorDict. You can save and load a TensorDict to disk with save() and load() functions.

  • Added MAX Graph quantization APIs:

    • Added quantization encodings BFloat16Encoding, Q4_0Encoding, Q4_KEncoding, and Q6_KEncoding.
    • Added the QuantizationEncoding trait so you can build custom quantization encodings.
    • Added Graph.quantize() to create a quantized tensor node.
    • Added qmatmul() to perform matrix-multiplication with a float32 and a quantized matrix.
  • Added some MAX Graph ops:

    • avg_pool()
    • max_pool()
    • conv2d()
    • conv3d()
    • layer_norm()
    • tile()
    • select()
  • Added a layer() context manager and current_layer() function to aid in debugging during graph construction. For example:

    with graph.layer("foo"):
        with graph.layer("bar"):
            print(graph.current_layer())  # prints "foo.bar"
            x = graph.constant[DType.int64](1)
            graph.output(x)

    This adds a path foo.bar to the added nodes, which will be reported during errors.

  • Added format_system_stack() function to format the stack trace, which we use to print better error messages from error().

  • Added TensorMap.keys() to get all the tensor key names.

MAX C API

Miscellaneous new APIs:

  • M_cloneCompileConfig()
  • M_copyAsyncTensorMap()
  • M_tensorMapKeys() and M_deleteTensorMapKeys()
  • M_setTorchLibraries()

🦋 Changed

MAX Mojo API

  • EngineNumpyView.data() and EngineTensorView.data() functions that return a type-erased pointer were renamed to unsafe_ptr().

  • TensorMap now conforms to CollectionElement trait to be copyable and movable.

  • custom_nv() was removed, and its functionality moved into custom() as a function overload, so it can now output a list of tensor symbols.

v24.3 (2024-05-02)

🔥 Legendary

  • You can now write custom ops for your models with Mojo!

    Learn more about MAX extensibility.

🦋 Changed

  • Added support for named dynamic dimensions. This means you can specify when two or more dimensions in your model's input are dynamic but their sizes at run time must match each other. By specifying each of these dimension sizes with a name (instead of using None to indicate a dynamic size), the MAX Engine compiler can perform additional optimizations. See the notes below for the corresponding API changes that support named dimensions.

  • Simplified all the APIs to load input specs for models, making them more consistent.

MAX Engine performance

  • Compared to v24.2, MAX Engine v24.3 shows an average speedup of 10% on PyTorch models, and an average 20% speedup on dynamically quantized ONNX transformers.

MAX Graph API

The max.graph APIs are still changing rapidly, but starting to stabilize.

  • AnyMoType renamed to Type, MOTensor renamed to TensorType, and MOList renamed to ListType.

  • Removed ElementType in favor of using DType.

  • Removed TypeTuple in favor of using List[Type].

  • Removed the Module type so you can now start building a graph by directly instantiating a Graph.

  • Some new ops in max.graph.ops, including support for custom ops.

    See how to create a custom op in MAX Graph.

MAX Engine Python API

  • Redesigned InferenceSession.load() to replace the confusing options argument with a custom_ops_path argument.

    As a result, CommonLoadOptions, TorchLoadOptions, and TensorFlowLoadOptions have all been removed.

  • TorchInputSpec now supports named dynamic dimensions (previously, dynamic dimension sizes could be specified only as None). This lets you tell MAX which dynamic dimensions are required to have the same size, which helps MAX better optimize your model.
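
    A hedged sketch; the module paths, the input_specs keyword, and the TorchScript file are assumptions, while "batch" and "seq_len" are the shared dynamic dimension names.

    from max import engine
    from max.dtype import DType

    session = engine.InferenceSession()
    model = session.load(
        "model.torchscript",
        input_specs=[
            # Dimensions with the same name must have matching sizes at run time.
            engine.TorchInputSpec(shape=["batch", "seq_len"], dtype=DType.int64),
            engine.TorchInputSpec(shape=["batch", "seq_len"], dtype=DType.int64),
        ],
    )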

MAX Engine Mojo API

  • InferenceSession.load_model() was renamed to load().

  • Redesigned InferenceSession.load() to replace the confusing config argument with a custom_ops_path argument for use when loading a custom op, and an input_specs argument for use when loading TorchScript models.

    Doing so removed LoadOptions and introduced the new InputSpec type to define the input shape/type of a model (instead of LoadOptions).

  • New ShapeElement type to allow for named dynamic dimensions (in InputSpec).

  • max.engine.engine module was renamed to max.engine.info.

MAX Engine C API

❌ Removed​

  • Removed TensorFlow support in the MAX SDK, so you can no longer load a TensorFlow SavedModel for inference. However, TensorFlow is still available for enterprise customers.

    We removed TensorFlow because industry-wide TensorFlow usage has declined significantly, especially for the latest AI innovations. Removing TensorFlow also cuts our package size by over 50% and accelerates the development of other customer-requested features. If you have a production use-case for a TensorFlow model, please contact us.

  • Removed the Python CommonLoadOptions, TorchLoadOptions, and TensorFlowLoadOptions classes. See note above about InferenceSession.load() changes.

  • Removed the Mojo LoadOptions type. See the note above about InferenceSession.load() changes.

v24.2.1 (2024-04-11)

  • You can now import more MAX Graph functions from max.graph.ops instead of using max.graph.ops.elementwise. For example:

    from max.graph import ops
    
    var relu = ops.relu(matmul)

v24.2 (2024-03-28)

  • MAX Engine now supports TorchScript models with dynamic input shapes.

    No matter what the input shapes are, you still need to specify the input specs for all TorchScript models.

  • The Mojo standard library is now open source!

    Read more about it in this blog post.

  • And, of course, lots of Mojo updates, including implicit traits, support for keyword arguments in Python calls, a new List type (previously DynamicVector), some refactoring that might break your code, and much more.

    For details, see the Mojo changelog.

v24.1.1 (2024-03-18)

This is a minor release that improves error reports.

v24.1 (2024-02-29)

The first release of the MAX platform is here! 🚀

This is a preview version of the MAX platform. That means it is not ready for production deployment and is designed only for local development and evaluation.

Because this is a preview, some API libraries are still in development and subject to change, and some features that we previously announced are not quite ready yet. But there is a lot that you can do in this release!

This release includes our flagship developer tools, currently for Linux only:

  • MAX Engine: Our state-of-the-art graph compiler and runtime library that executes models from PyTorch and ONNX, with incredible inference speed on a wide range of hardware.

    • API libraries in Python, C, and Mojo to run inference with your existing models. See the API references.

    • The max benchmark tool, which runs MLPerf benchmarks on any compatible model without writing any code.

    • The max visualize tool, which allows you to visualize your model in Netron after partially lowering in MAX Engine.

    • An early look at the MAX Graph API, our low-level library for building high-performance inference graphs.

  • MAX Serving: A preview of our serving wrapper for MAX Engine that provides full interoperability with existing AI serving systems (such as Triton) and that seamlessly deploys within existing container infrastructure (such as Kubernetes).

    • A Docker image that runs MAX Engine as a backend for NVIDIA Triton Inference Server.
  • Mojo: The world's first programming language built from the ground-up for AI developers, with cutting-edge compiler technology that delivers unparalleled performance and programmability for any hardware.

    • The latest version of Mojo, the standard library, and the mojo command line tool. These are always included in MAX, so you don't need to download any separate packages.

    • The Mojo changes in each release are often quite long, so we're going to continue sharing those in the existing Mojo changelog.

Additionally, we've started a new GitHub repo for MAX, where we currently share a bunch of code examples for our API libraries, including some large model pipelines. You can also use this repo to report issues with MAX.

Model Architecture Support

  • Added support for the following model architectures:

    • OlmoForCausalLM (such as allenai/OLMo-1B-0724-hf)
    • GraniteForCausalLM (such as ibm-granite/granite-3.1-8b-instruct)
    • Phi3ForCausalLM (for Microsoft Phi-3 models)
    • Qwen2ForCausalLM (such as Qwen2 models)

    Example usage:

    max-pipelines generate \
      --model-path allenai/OLMo-1B-0724-hf \
      --prompt "Write bubble sort in mojo"
  • The max.pipelines.dataprocessing.tokenizer and max.pipelines.dataprocessing.gguf_utils modules have been removed.

  • The previously deprecated PipelineConfig.architecture field and its corresponding --architecture CLI argument have been removed.

max-pipelines CLI

  • The --devices CLI argument now supports a comma-separated list of GPU IDs prefixed with gpu: like --devices=gpu:0,1,2,3. We no longer support the previous --devices=gpu-<N> format.

    max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
      --quantization-encoding bfloat16 \
      --devices gpu:0,1,2,3 \
      --prompt="Design a self-sustaining colony on Neptune's moon Triton with a myth/science fusion name, three quantum tech breakthroughs, one ethical debate, a neon-lit cultural ritual, and a hidden flawβ€”presented in bullet points."
  • Removed --huggingface-repo-id PipelineConfig option and CLI argument in favor of --model-path.

  • Consolidated --model-path and --weight-path. If valid --weight-path(s) are provided, they'll now override --model-path, which in turn handles both local and remote (Hugging Face) cases. If we cannot derive the weights from the --weight-path(s), we'll now fall back to the --model-path, which has to be set explicitly by the user.

  • Added --huggingface-revision option, to allow selecting a non-default branch or a specific commit in a Hugging Face model repository.
