Here's everything you should know about each release.

## Nightly (v26.4)

This version is still a work in progress.

### MAX models

* Added support for the Tencent Hunyuan Hy3-preview (`HYV3ForCausalLM`)
  architecture: a decoder-only mixture-of-experts model (192 routed experts,
  top-8 plus one shared expert) with sigmoid plus correction-bias routing,
  per-head query/key RMSNorm, and split-half RoPE. Runs multi-GPU with
  tensor-parallel attention and expert-parallel MoE.
* Added NVFP4 quantization support for Gemma 4.
* Gemma 4 can now run native FP8 attention with an FP8 KV cache on B200
  (SM100): Q, K, and V are `float8_e4m3fn` read directly from the paged cache,
  and both Q\@K^T and P\@V execute as raw FP8 matmuls at tensorwise scale = 1
  (no per-block scales, no dequantization staging) with a bf16 attention
  output. This roughly matches bf16 accuracy while improving decode throughput
  and roughly doubling KV cache capacity at the same memory.
* Added MXFP4 quantization support for MiniMax-M2.
* Added tensor-parallel attention + expert-parallel MoE (TP+EP) support for
  MiniMax-M2. Set `data_parallel_degree: 1` with `runtime.ep_size > 1` to
  shard attention heads across GPUs while distributing MoE experts via
  expert parallelism. Both reduce-scatter (default) and allreduce
  (`runtime.ep_use_allreduce: true`) collective strategies are supported.
* Kimi K2.5 tool calling now supports interleaved thinking: a single
  assistant turn may interleave multiple `<think>...</think>` reasoning
  blocks with multiple tool-call sections and end with `<|im_end|>`. The
  constrained-decoding grammar (used for `tool_choice` and JSON
  `response_format`) admits up to eight tool-call sections with an optional
  reasoning block before each, and lets the model stop before the cap. This
  fixes a `tool_choice=auto` failure where a second tool-call section
  disabled grammar enforcement for the rest of the request.
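
The KV cache capacity claim in the Gemma 4 FP8 attention item above follows from element size: `float8_e4m3fn` stores one byte per element where bf16 stores two. The sketch below works through that arithmetic. The layer count, head count, head dimension, and memory budget are hypothetical placeholders rather than Gemma 4's real configuration, and per-block metadata is ignored.

```python
# Back-of-the-envelope KV cache capacity. Every dimension below is a
# hypothetical placeholder, not Gemma 4's actual configuration.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
CACHE_BUDGET_BYTES = 16 * 1024**3  # 16 GiB reserved for the KV cache


def tokens_that_fit(bytes_per_element: int) -> int:
    """Return how many tokens fit in the budget at a given element size."""
    # Each token stores one K vector and one V vector per layer and KV head.
    bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_element
    return CACHE_BUDGET_BYTES // bytes_per_token


bf16_tokens = tokens_that_fit(2)  # bf16: 2 bytes per element
fp8_tokens = tokens_that_fit(1)   # float8_e4m3fn: 1 byte per element
print(bf16_tokens, fp8_tokens, fp8_tokens / bf16_tokens)
```

Under these assumptions the FP8 cache holds twice as many tokens in the same memory, which is consistent with the "roughly doubling" figure quoted above.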

### MAX framework

#### Inference server

* Chat completion responses now emit reasoning only under `reasoning`,
  aligning with OpenAI's Responses API naming. The `reasoning_content` alias
  (previously emitted alongside `reasoning` for compatibility with vLLM,
  SGLang, and the DeepSeek API) is no longer included in responses. vLLM has
  deprecated `reasoning_content` in favor of `reasoning`; see
  <https://github.com/vllm-project/vllm/pull/33402>. Clients should read
  chain-of-thought tokens from the `reasoning` field.

* `response_format` JSON schemas with a non-object root are now accepted when
  the root `type` is missing (any) or a type union that includes `object`
  (for example `{"type": ["object", "array", "string"]}`); these are valid
  JSON Schema and compile to a constraining grammar. A root pinned to a single
  non-object type (for example `{"type": "string"}`) is still rejected,
  matching OpenAI's structured-outputs contract.

* Added a per-phase startup breakdown to the `maxserve.model_load_time`
  Prometheus histogram (milliseconds), previously only available in the
  server logs. In addition to the existing untagged model-load aggregate,
  the model worker now records each startup phase on the same metric split
  by a `component` tag (`build`, `compile`, `init`, `graph_capture`,
  `pinned_memory`, `spawn`, and `total`), so a single metric can be plotted
  broken down by startup phase to track pod startup time in production.
  This replaces the `maxserve.startup_time` histogram (seconds) added
  earlier in this nightly cycle.

* Added a `maxserve.time_per_output_token` Prometheus histogram (milliseconds).
  Emitted once per request, it reports the mean decode-phase latency per
  generated token (`decode_time / (num_generated_tokens - 1)`), excluding the
  first token and prefill time. Because the denominator counts the tokens the
  model actually produced, the metric accounts for speculative decoding.

* The `maxserve.batch_size` Prometheus histogram is now labeled by
  `batch_type` (`CE` for prefill, `TG` for decode), so the token-generation
  (decode) batch size can be observed separately from prefill. For the
  prefill token-count view, use `maxserve.batch_input_tokens` (also labeled
  by `batch_type`). Existing aggregate queries over `maxserve.batch_size`
  continue to work; selectors that pin a single series now gain the
  `batch_type` dimension.

* Added Prometheus metrics for the API-server ingress backlog: requests accepted
  by the API server but not yet handed off to the model worker (still API-side,
  for example in tokenization). `maxserve.num_requests_awaiting_admission` is an
  up/down counter with the live value (incremented on arrival, decremented at
  handoff), and `maxserve.requests_awaiting_admission` is a companion histogram
  that captures the distribution / tail (p50/p99) over time. A persistently high
  value points at a backlog in the API server rather than in the scheduler queue
  (the latter is visible via `maxserve.num_requests_queued`).

* Added Prometheus metrics for the egress (response) path, which show whether
  the API server is shipping tokens back to clients slower than the model
  produces them: `maxserve.num_responses_buffered` (a gauge sampling the total
  model-worker responses received but not yet streamed to clients) with a
  companion `maxserve.responses_buffered` distribution histogram, and
  `maxserve.response_queue_time` (a millisecond histogram of how long a
  response waits in the API server's per-request output queue before the
  streaming layer consumes it). Together they surface API-side egress
  bottlenecks (detokenization, serialization, slow clients) and the associated
  unbounded-output-queue memory growth.

* MAX Serve now returns a clearer 400 Bad Request with the underlying
  message when a prompt is too long for the model, instead of a generic
  "Value error." response (or, for streaming completions, a 500 Internal
  Server Error). All architectures now raise a structured
  `PromptTooLongError` exposing `num_tokens` and `max_length` attributes
  so callers can handle the failure programmatically. The user-facing
  message identifies the relevant limit (LLM context window vs. diffusion
  text encoder sequence length): for example, "Prompt is too long: N
  tokens exceeds the configured maximum context length of M tokens.
  Please shorten your prompt."

* Fixed an FP8 dynamic-quantization bug that mis-quantized near-zero groups on
  NVIDIA GPUs (writing NaN into FP8 activations and the FP8 KV cache, surfacing
  downstream as non-finite logits). When a quantization group was near zero, its
  dynamic scale `max_abs / fp8_max` underflowed to a tiny denormal whose
  reciprocal overflowed to infinity; multiplying lanes by that infinity produced
  `+inf` (and `0 * inf = NaN` on zero lanes) *before* the FP8 cast. This is
  upstream of, and not addressed by, the saturating FP8 cast: clamping the
  result would turn the near-zero group into `±max_finite` garbage rather than
  the correct zero. The reciprocal is now guarded to be finite, so a near-zero
  group quantizes to a clean FP8 zero. Fixes the shared dynamic-scale helper
  (used by FP8 quantization, fused RMSNorm, and the residual-add AllReduce
  RMSNorm) and the fused RoPE plus KV-store path.

* Fixed a KV cache offloading correctness bug that corrupted output for
  multi-cache models (such as Gemma 4's interleaved sliding-window plus
  global attention) when the `local` or `tiered` KV connector was enabled.
  These models share one block pool across all of their caches, but the
  connector only offloaded and reloaded the primary cache, so a prefix-cache
  block served from host or disk restored only the primary cache's data and
  left the other caches' halves stale, degrading accuracy. The connector now
  offloads and restores every cache.

* Fixed JSON `response_format` and tool-call grammars not being enforced for
  Kimi K2.5 vision-language checkpoints. The Kimi K2.5 tokenizer did not carry
  grammar enforcement state onto the request context, so constrained-decoding
  requests fell back to an unenforced state and decoded freely (e.g. a
  `response_format=json_schema` request returned prose instead of
  schema-conformant JSON). The tokenizer now derives enforcement state from the
  response format, matching the text tokenizers.

* Fixed an intermittent constrained-decoding correctness bug under EAGLE
  speculative decoding. On the first decode step after a prefill (and after any
  batch that did not verify draft tokens), the speculative token bitmask was
  built from placeholder draft tokens instead of the real drafts being
  verified, leaving the bonus and later speculative slots unconstrained. A
  grammar-illegal token could then be sampled and committed, producing
  occasional JSON `response_format` or tool-call grammar violations. The bitmask
  is now built from the realized drafts.

* MAX Serve now accepts `role: "developer"` on `/v1/chat/completions`,
  normalizing it to `system` at the OpenAI-compat route layer. The OpenAI
  o1/o3 chat-completion spec uses `developer` in place of `system`, and
  recent OpenAI SDKs emit it by default. The previous behavior rejected
  the request with a 422 (`literal_error` on the message role).

* Fixed `CreateChatCompletionRequest` rejecting explicit `null` values for
  optional fields such as `tool_choice`, `tools`, and `response_format`.
  OpenAI-compatible clients (LangChain, JS SDKs, anything that serializes
  a dataclass with a `None` field) that emit `"tool_choice": null` instead
  of omitting the key are now accepted, matching the behavior of other
  OpenAI-compatible inference servers.

* Added two opt-in server flags for accepting OpenAI-compatible requests
  that the strict default behavior would reject:

  * `--allow-unsupported-logprobs`: when a request asks for `logprobs`
    against a runtime that cannot honor them (today, the overlap
    scheduler), MAX Serve logs a warning and serves the request without
    logprobs instead of returning a `400`.

  * `--allow-extra-request-fields`: unknown top-level fields on
    `/v1/chat/completions` and `/v1/completions` request bodies are
    dropped (with a warning) before pydantic validation, instead of
    returning a `400`. Useful when an upstream proxy sends vendor-specific
    fields that MAX Serve does not need to honor.

  Both flags default to `False`; the existing strict behavior is
  unchanged. The corresponding `400` error messages now reference the new
  flags. As a side effect, the legacy `/v1/completions` route now surfaces
  `InputError` detail strings to the client instead of the generic
  `"Value error."` message.

* MAX Serve now emits the `maxserve.num_requests_queued` OTel/Prometheus
  metric (changed from an `UpDownCounter` to a synchronous `Gauge`). The
  gauge is sampled once per scheduler iteration from
  `BatchMetrics.publish_metrics` and reports the depth of the scheduler's
  CE / prefill queue (the same value as the `Pending: N reqs` line in
  scheduler logs). It is published by every text-path scheduler that
  drives `BatchMetrics`: `TokenGenerationScheduler` and `PrefillScheduler`
  (via `TextBatchConstructor`), and `DecodeScheduler` (via
  `len(pending_reqs) + len(prefill_reqs)`). Operators can use this metric
  to observe queue buildup during overload conditions.

* Added a `"none"` option for `runtime.tool_parser` and
  `runtime.reasoning_parser` in `PipelineConfig` (CLI flags `--tool-parser`
  and `--reasoning-parser`). Pass `none` (case-insensitive) to explicitly
  disable the parser, overriding any architecture-declared default. Leaving
  the field unset still applies the architecture default as before.

* Added the `nemotron-opencode` benchmark dataset backed by
  `nvidia/Nemotron-SFT-OpenCode-v1`. Each row is a full Qwen3-Coder OpenCode
  trace (system prompt, multi-turn user/assistant/tool messages, and tool
  schemas). Each subset is multiple gigabytes, so the loader streams via
  `datasets.load_dataset(..., streaming=True)` and pulls only enough rows to
  satisfy `--num-prompts`. Tool definitions per row are surfaced on
  `NemotronOpenCodeBenchmarkDataset.last_loaded_tool_schemas` and (for
  single-turn) attached to `SampledRequest.tools`.

* Benchmark request payloads now forward an OpenAI-style `tools=[...]` field
  on chat-completions requests. `SampledRequest` and `RequestFuncInput` gained
  a `tools: list[dict] | None = None` field;
  `OpenAIChatCompletionsRequestDriver` serializes it into the POST body when
  set. Datasets that supply per-row tool schemas (currently
  `nemotron-opencode`) now exercise the server's tool-call grammar /
  structured-output path end-to-end. Pass `enable_tool_calls=False` on
  Nemotron-OpenCode to suppress forwarding.

* Removed multi-step decode from the text-generation pipelines. The flag
  `--max-num-steps` no longer works.
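
The FP8 dynamic-quantization fix listed above (near-zero groups producing NaN) can be reproduced in plain Python. This is a minimal sketch, not the kernel code. It uses Python's float64 as a stand-in for the kernel's floating-point type, and it assumes the guard replaces a non-finite reciprocal with zero; the notes only state that the reciprocal "is now guarded to be finite". The names `ieee_reciprocal` and `scale_group` are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def ieee_reciprocal(x: float) -> float:
    """Reciprocal with IEEE semantics (Python raises on 1 / 0.0)."""
    return math.inf if x == 0.0 else 1.0 / x


def scale_group(group: list[float], guard_reciprocal: bool) -> list[float]:
    """Scale one quantization group into FP8 range (values before the cast)."""
    max_abs = max(abs(v) for v in group)
    scale = max_abs / FP8_E4M3_MAX   # dynamic scale; denormal for tiny groups
    recip = ieee_reciprocal(scale)   # overflows to inf when scale is denormal
    if guard_reciprocal and not math.isfinite(recip):
        recip = 0.0                  # assumed guard: quantize the group to zero
    return [v * recip for v in group]


near_zero_group = [1e-320, 0.0, -1e-320]
unguarded = scale_group(near_zero_group, guard_reciprocal=False)
guarded = scale_group(near_zero_group, guard_reciprocal=True)
print(unguarded)
print(guarded)
```

The unguarded path yields infinities on the nonzero lanes and NaN on the zero lane (`0 * inf`), all before any FP8 cast happens, which is why a saturating cast alone cannot repair it. The guarded path yields zeros on every lane.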

#### `max` CLI

* The serving benchmark now reports a per-turn KV cache retention percentile
  metric for multi-turn workloads. For each turn after the first, it compares
  the server-reported cached prefix against the block-aligned prefix carried
  over from the previous turn, surfacing when cached tokens are dropped between
  turns (distinct from the existing cached-token-rate metrics, whose denominator
  includes new and uncacheable tokens). The KV cache block size used to align
  the expected prefix is configurable via `--kv-block-size` (default `128`);
  match it to the server's `--kv-cache-page-size`.
* Added `--devices=gpu:all` to use every visible GPU (also supported by MAX Serve).
* Removed the `default` value for `--devices`; omit `--devices` to use the model
  or config default.
* The serving benchmark entrypoint (`benchmark_serving`) now defaults `--seed`
  to a fixed value instead of drawing a fresh random seed on each run. The seed
  drives the workload generator (input/output lengths, session structure,
  content), so a fixed default makes repeated and scheduled runs reproducible
  and keeps run-to-run deltas reflecting the change under test rather than
  workload-draw variance. To opt back into a fresh seed, pass `--seed none` on
  the CLI (or `seed: null` in a workload/config YAML); the drawn seed is logged
  and recorded with the results so the run stays reproducible after the fact.
* Added `--profile` to `max pipelines generate` for rudimentary,
  one-command profiling. With Nsight Systems (`nsys`) on `PATH` and an
  NVIDIA GPU, the timed run is captured into an `.nsys-rep` file and a
  ranked top-N GPU kernel summary is printed. Without `nsys`, a Python/CPU
  profile is produced from `cProfile`. The capture window is bounded by
  `cudaProfilerStart`/`Stop` so warmup and graph-compile time are excluded.
  Use `--profile-output` to override the report path.
* Added `--profile` to `max pipelines benchmark` as a synonym for
  `--trace` that also prints a ranked top-N GPU kernel summary at the end
  of the run. The server still needs to be launched under `nsys launch`
  (matching the existing `--trace` requirement); `--profile` removes the
  "now run `nsys stats` by hand" step.
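
The per-turn KV cache retention metric described in the first item of this list compares the cached prefix against a block-aligned expectation. The sketch below shows one way that comparison could be computed. The exact formula used by the benchmark is not given here, so the rounding rule, the ratio, and the function names are assumptions for illustration only.

```python
from typing import Optional


def expected_cached_prefix(prev_turn_tokens: int, kv_block_size: int = 128) -> int:
    """Round the previous turn's token count down to a whole number of blocks."""
    # Assumption: only complete KV blocks can be reused by the next turn.
    return (prev_turn_tokens // kv_block_size) * kv_block_size


def retention(cached_tokens: int, prev_turn_tokens: int,
              kv_block_size: int = 128) -> Optional[float]:
    """Fraction of the expected block-aligned prefix the server reported as cached."""
    expected = expected_cached_prefix(prev_turn_tokens, kv_block_size)
    if expected == 0:
        return None  # nothing cacheable was carried over from the previous turn
    return min(cached_tokens, expected) / expected


# A 1000-token previous turn carries over 7 full blocks (896 tokens) at block size 128.
print(expected_cached_prefix(1000))
print(retention(cached_tokens=896, prev_turn_tokens=1000))
print(retention(cached_tokens=448, prev_turn_tokens=1000))
```

Because the expectation depends on the block size, a `--kv-block-size` that differs from the server's `--kv-cache-page-size` would shift the denominator, which is why the two should match.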

#### Python API

* Reduced the default signal buffer size from 1025 MiB to 257 MiB per GPU and
  fixed a miscalculation of the required space in `MOGGKernelAPI.mojo`. The
  previous calculation was off by a factor of `1/num_devices`: each device only
  needs scratch for its own portion of the collective problem. This reduces the
  footprint for the current heaviest workload (Kimi-K2.5 with
  `BlockCopyEngine`) from 16 GB to 4 GB.

* Added `max.driver.CompletionFlag`, an 8-byte completion flag in pinned host
  memory mapped into a device's address space. Lets host code signal a GPU
  stream (or peer host observer) by writing a 64-bit value to a single
  location visible to both. Currently CUDA-only; constructing against any
  other backend raises `RuntimeError`.

* Added `Device.__unsafe_enqueue_async_py_host_func(fn, flag, value, cpu)`
  and `DeviceStream.wait_for_host_value(flag, value)` for dispatching a
  Python callable onto an explicit AsyncRT worker pool from a host-function
  node and gating the GPU stream on its completion (via the
  `CompletionFlag`). The kickoff trampoline returns immediately, letting
  the GPU stream proceed concurrently with the worker; a downstream
  `wait_for_host_value` blocks the stream until the worker stores `value`.
  The `__unsafe_` prefix marks that the API has no safety net for
  callbacks that capture state outliving the compiled graph.

* Added the `mo.wait_host_value` graph op and the
  `max.nn.kernels.wait_host_value()` Python helper that wraps it. Stalls
  the device stream until a 64-bit host-visible flag reaches a given
  value; lowers to CUDA's `cuStreamWaitValue64` and captures cleanly into
  a CUDA graph as a wait-value node. Lets a captured forward graph gate
  a downstream consumer kernel on CPU-produced data while the rest of
  the forward body runs concurrently. Pair with `mo.launch_host_func`
  or `Device.__unsafe_enqueue_async_py_host_func` to issue the host
  work whose completion the consumer waits on.

* Added two new nanobind types to `max._core.engine` that split the
  compile-and-load pipeline at the type level:

  * `CompiledModels` represents the compile artifact returned by
    `compile_from_path` / `compile_from_object` on the
    `max._core.engine.InferenceSession` binding (these methods don't exist on
    the public `max.engine.InferenceSession` class). It holds the MEF bytes
    and one or more sub-models; it is not directly executable.
  * `ModelMetadata` exposes per-sub-model metadata (`name`,
    `input_metadata`, `output_metadata`) and is yielded by iterating a
    `CompiledModels` or indexing it with `[i]`.

  `Model` continues to represent the runnable, post-init handle (still
  produced by `InferenceSession._load_all`). The high-level
  `max.engine.CompiledModel` wrapper now holds a `CompiledModels` instance
  internally.

* Increased the default allreduce signal buffer size from 513 MiB to 1025 MiB
  per GPU (`max.nn.comm.allreduce.Signals.NUM_BYTES` and the matching constant
  in `max.experimental.realization_context`). The previous 512 MiB scratch
  could not hold the per-peer allgather intermediate for models with large
  hidden dimensions (for example, Kimi-K2.5 at `hidden_dim=20480` with
  `max-batch-input-tokens=16384` needs 640 MiB in bf16). This adds \~512 MiB
  of per-GPU memory use for any multi-GPU model.

* Added `max.experimental.functional.ceil`, an element-wise unary op that
  rounds each element of a floating-point tensor up toward positive infinity.
  Complements the existing `floor`, `round`, and `trunc` ops.

* `max.experimental.functional.while_loop` now passes `Tensor` (not
  `TensorValue`) into its `predicate` and `body` callbacks. Callbacks can
  use ordinary `Tensor` operations directly, without wrapping arguments
  via `Tensor.from_graph_value(...)` or reaching for the
  underscore-prefixed `_graph_value` attribute on returns.

* `max.experimental.nn.Module.compile()` now emits the same
  `Building and compiling {ClassName}... / Still building... / Building
  {ClassName} graph took Ns / Compiling {ClassName} took Ms / Building and
  compiling {ClassName} took Ts` log sequence that pipeline-level
  `CompilationTimer` produces today, and wraps the compile body in
  `max.profiler.Tracer` spans (`Module.compile({ClassName})`,
  `Module.compile.trace`, `Module.compile.session_load`) so an `nsys` capture
  with `MODULAR_ENABLE_PROFILING=1` shows compilation as named ranges.
  Every ModuleV3 caller — including pixel-generation pipelines that previously
  compiled silently — now gets this observability for free. The outer
  `CompilationTimer("model")` wrappers in `*_modulev3` architectures have been
  removed to avoid nested timing logs.

* `max.experimental.nn.Module.load_state_dict` and
  `Module.compile(weights=...)` now accept an `auto_cast` keyword
  (default `False`). The framework remains strict by default. When
  `auto_cast=True` is passed, loaded weights are automatically cast
  between `float32` and `bfloat16` when shapes match, logging a single
  summary message per load instead of raising. Other dtype mismatches
  (`float16`, `fp8`, `fp4`, integers, etc.) continue to raise as before.
  This removes the need for per-adapter `astype` shims when checkpoint
  dtypes differ from the module's declared parameter dtype. MAX
  pipelines opt in via the `MODULAR_AUTO_CAST_WEIGHTS` environment
  variable (default `true`, parsed by
  `max.pipelines.lib.weight_loading.auto_cast_weights_from_env`).

* `CPUMetricsCollector` in `max.diagnostics.cpu` is now used as a context
  manager instead of `start`/`stop` and now exposes `get_stats()` instead of
  `dump_stats()`, matching the interface of `GPUDiagContext`.

* `max.graph.Module` is now a public class for grouping multiple `Graph`
  instances into a single compilation unit, replacing the previous alias
  for the underlying MLIR module. Construct one with `Module()` and pass
  it as the `module=` argument to each `Graph`; the resulting `Module` is
  what you hand to `InferenceSession.load_all` to compile every graph
  together. `Graph.empty_module()` has been removed in favor of `Module()`,
  and `Graph` now exposes a `module` property returning the `Module` it
  belongs to.

* `InferenceSession.load_all` now returns a `dict[str, Model]` keyed by each
  model's `sym_name` (the name of its `mo.graph` op), instead of a
  `list[Model]` ordered by MEF position. The accepted input type also gained
  `max.graph.Module`, so callers can compile a pre-built module containing
  multiple `mo.graph` ops directly. `Model` now exposes a `name` property.

  Migrate positional unpacking call sites by indexing the returned dict:

  ```python
  # Before
  module = Graph.empty_module()
  with Graph("vision", input_types=..., module=module): ...
  with Graph("language", input_types=..., module=module): ...
  vision_model, language_model = session.load_all(graph, ...)

  # After
  module = Module()
  with Graph("vision", input_types=..., module=module) as vision_graph: ...
  with Graph("language", input_types=..., module=module) as language_graph: ...
  models = session.load_all(module, ...)
  vision_model = models[vision_graph.name]
  language_model = models[language_graph.name]
  ```
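
The 640 MiB figure quoted in the allreduce signal buffer item above (the increase from 513 MiB to 1025 MiB) can be reproduced from the numbers given there. That the intermediate is sized as tokens times hidden dimension times element size is an assumption, although this product does match the quoted value.

```python
MIB = 1024**2

hidden_dim = 20480               # from the release note
max_batch_input_tokens = 16384   # from the release note
bytes_per_bf16 = 2

# Assumed sizing rule: one bf16 element per (token, hidden unit) pair.
intermediate_mib = hidden_dim * max_batch_input_tokens * bytes_per_bf16 / MIB

fits_old_scratch = intermediate_mib <= 512    # previous scratch size
fits_new_scratch = intermediate_mib <= 1024   # new scratch size
print(intermediate_mib, fits_old_scratch, fits_new_scratch)
```

The result exceeds the previous 512 MiB scratch and fits in the new 1024 MiB scratch.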

### MAX kernels

* The `use_blocking_impl` parameter has been removed from the `foreach` custom
  op helper (and the underlying `elementwise` primitive), and the analogous
  `single_thread_blocking_override` parameter has been removed from the `concat`
  and `concat_shape` kernels and the reduction-based kernels. Work is always
  dispatched the same way, with a single worker used automatically when the
  problem size is small. The dedicated small-tensor `concat` fast path has been
  removed in favor of the existing serial/parallel dispatch.
* Updated `elementwise` call sites across MAX kernels and benchmarks to use
  `Coord`-native indexing, fixing compile failures caused by invalid
  `Coord`/`IndexList` conversions.
* Enabled Programmatic Dependent Launch (PDL) for the SM100 (Blackwell)
  FlashAttention-4 prefill kernel, letting back-to-back attention grids in a
  stream overlap launch and prologue latency. This reduces per-launch overhead
  most for shorter sequences (measured \~1.05x–1.5x faster on B200, bf16,
  head\_dim=128 across seq lengths 128–2048). On by default; disable with
  `-D MHA_PDL=false`.
* Added a simdgroup-tiled matmul kernel for the Apple M5 GPU, bringing
  neural-accelerator-backed matmul to the MAX framework. In-range MAX matmuls
  (`m >= 64`, `n >= 64`, `k >= 16`; ragged K supported) now use it: fp16/bf16
  always, and fp32 a/b by default (accepting the simdgroup MMA's fp19
  truncation). Set `MODULAR_APPLE_M5_ALLOW_LOSSY_F32_MATMUL=0` for the precise
  naive fp32 path.

### Breaking changes

* KV cache management has moved from `max.kv_cache` to `max.pipelines.kv_cache`.
  Update imports accordingly:

  ```python
  # Before
  from max.kv_cache import PagedKVCacheManager, DummyKVCache

  # After
  from max.pipelines.kv_cache import PagedKVCacheManager, DummyKVCache
  ```

  Deprecation shims with `DeprecationWarning` remain at the old path.

* Custom Mojo ops used through `max.experimental.torch.CustomOpLibrary` (and
  the rest of the graph-compiler custom-op path) must now declare their
  `ctx` parameter as `DeviceContext` instead of `DeviceContextPtr`. The
  `DeviceContextPtr` type has been removed from the Mojo standard library;
  see the [Mojo nightly
  changelog](https://docs.modular.com/mojo/changelog.md) entry under
  *Removed* for the full migration. Multi-device ops should declare their
  variadic context argument as `DeviceContextList[N]` (also new — see the
  Mojo changelog *GPU programming* section).

* GPU and CPU diagnostic tooling has moved from `max.diagnostics` to
  `max.profiler`: `max.diagnostics.gpu` → `max.profiler.gpu` and
  `max.diagnostics.cpu` → `max.profiler.cpu`. Update imports accordingly.
  Deprecation shims with `DeprecationWarning` remain at the old paths.

* `max/python/max/benchmark/benchmark_throughput.py`, deprecated in v26.3,
  has been removed.

### Fixes

* Fixed structured output (`response_format: json_schema` and grammar-guided
  tool calling) intermittently emitting raw control characters inside JSON
  string values on models that use a byte-level BPE (TikToken) tokenizer,
  producing invalid JSON. The constrained-decoding adapter fed llguidance the
  tokens' byte->unicode *surface* bytes (e.g. a raw newline rendered as `Ċ`)
  instead of their true bytes, so the grammar mask admitted control-char
  tokens as legal string content. Token bytes are now recovered via the
  tokenizer's `byte_decoder`, so raw control characters are correctly
  excluded. Fast-tokenizer checkpoints were unaffected.

* Fixed an expert-parallelism dispatch assertion (`Cannot dispatch EP
  kernel with N input tokens when the maximum tokens per rank is N-1`)
  that fired whenever `--max-batch-input-tokens` was not evenly
  divisible by the tensor-parallel degree. The EP per-rank cap now uses
  ceiling division to match the ragged binning of `reducescatter` in
  TP-attention + EP-MoE mode, so the largest shard fits in the
  dispatch buffer. Affects DeepSeek-V3, Kimi-K2.5, MiniMax-M2, Qwen3,
  and Step3.5 deployments configured with non-divisible batch sizes.

* `MODULAR_DEBUG=ir-output-dir=<dir>` (and the equivalent
  `[max-debug] ir-output-dir = <dir>` config-file entry and
  `InferenceSession.debug.ir_output_dir = <dir>` Python setter) now
  actually dumps per-stage MLIR files to the configured directory. The
  option was previously parsed but no compiler stage consulted it, so
  users had to fall back to the legacy `MODULAR_MAX_TEMPS_DIR` env var.
  Both spellings are now honored.
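
The expert-parallelism dispatch fix above comes down to floor versus ceiling division. The sketch below shows why a floor-based per-rank cap is one token too small when the batch does not divide evenly. The helper names are hypothetical, and the ragged split (leading ranks absorb the remainder) is an assumption about how `reducescatter` bins tokens.

```python
def floor_cap(max_batch_input_tokens: int, tp_degree: int) -> int:
    """Per-rank cap using floor division (too small for uneven batches)."""
    return max_batch_input_tokens // tp_degree


def ceil_cap(max_batch_input_tokens: int, tp_degree: int) -> int:
    """Per-rank cap using ceiling division."""
    return -(-max_batch_input_tokens // tp_degree)


def ragged_shards(total_tokens: int, tp_degree: int) -> list[int]:
    """Split tokens across ranks; assumes leading ranks take the remainder."""
    base, extra = divmod(total_tokens, tp_degree)
    return [base + 1 if rank < extra else base for rank in range(tp_degree)]


shards = ragged_shards(10, 4)
print(shards, max(shards))
print(floor_cap(10, 4))  # smaller than the largest shard
print(ceil_cap(10, 4))   # large enough for the largest shard
```

With 10 tokens over 4 ranks, the largest shard holds 3 tokens while the floor-based cap is 2, matching the "N input tokens when the maximum tokens per rank is N-1" shape of the assertion message. When the batch divides evenly, both caps agree.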

### Mojo language

For all the updates to the Mojo language, standard library, and tools,
see the [Mojo release notes](https://mojolang.org/releases).

## v26.3 (2026-05-07)

* [Highlights](#26-3-highlights)
* [Documentation](#26-3-docs)
* [MAX models](#26-3-models)
* [MAX framework](#26-3-max)
  * [Inference server](#26-3-max-serve)
  * [`max` CLI](#26-3-max-cli)
  * [Python API](#26-3-max-python)
  * [Custom ops](#26-3-custom-ops)
* [MAX kernels](#26-3-max-kernels)
* [Breaking changes](#26-3-breaking)
* [Fixed](#26-3-fixed)
* [Mojo language](#26-3-mojo)

### Highlights {#26-3-highlights}

* MAX now supports **video generation** with Wan 2.1 / 2.2 diffusion
  models, including image-to-video and video-to-video pipelines.

* New API for **multi-GPU model execution from Python**: the
  [`max.experimental.sharding`](https://docs.modular.com/max/api/python/generated/max.experimental.sharding)
  module lets a single `Module.compile()` call distribute a model across a
  `DeviceMesh` using `Replicated`, `Sharded`, and `Partial` placement
  primitives. Gemma 3 ModuleV3 is the first multi-GPU model on this path.

* The MAX NVFP4 grouped matmul kernel now **outperforms FlashInfer on
  B200** across all tested decoding and prefill shapes for Kimi K2.5.

### Documentation {#26-3-docs}

* Restructured the [MAX LLM book](https://llm.modular.com) around how to
  deploy a custom model with `max serve`.
* Added new model developer guides covering
  [broadcasting](https://docs.modular.com/max/develop/broadcasting.md),
  [indexing](https://docs.modular.com/max/develop/indexing.md), and the
  [model bring-up workflow](https://docs.modular.com/max/develop/model-bringup-workflow.md).
* Added a [graph overview](https://docs.modular.com/max/develop/graph.md) and a new
  [graph and modules guide](https://docs.modular.com/max/develop/modules.md).
* Added [model debugging guides](https://docs.modular.com/max/develop/debugging.md) for accuracy, errors,
  GPU, and tracing.
* Updated the [speculative decoding](https://docs.modular.com/max/serve/speculative-decoding.md) guide.
* Updated the guide to
  [serve custom models](https://docs.modular.com/max/develop/serve-custom-model-architectures.md).
* Added API docs for
  [`max.pipelines.architectures`](https://docs.modular.com/max/api/python/pipelines.architectures).
* Redesigned [REST API reference](https://docs.modular.com/max/rest-api.md), now built with Scalar.

### MAX models {#26-3-models}

* The `residual_threshold` parameter for FLUX first-block cache (FBCache) is
  now a per-request runtime parameter on `ImageProviderOptions`, allowing it
  to be tuned without recompiling the model graph.

* Added the Mamba state space model architecture.

* Added the Step-3.5-Flash architecture.

* Added the Qwen-Image and Qwen-Image-Edit text-to-image architectures.

* Added the Z-Image and Z-Image-Turbo text-to-image architectures.

* **MiniMax-M2 and MiniMax-M2.7:**
  * Added MiniMax-M2 and MiniMax-M2.7 architecture support, including FP8
    weights, the lightning-attention hybrid backbone, and 4×H100 multi-GPU
    serving.
  * Enabled DP+EP execution paths for MiniMax MoE layers, with automatic
    overlap scheduling and device-graph capture.
  * Added per-rank token-limit checks and reduced input-offset device round
    trips on the MiniMax decode path.

* **Gemma 4 and Gemma 3 ModuleV3:**
  * Added the Gemma 4 architecture (ModuleV2), including multimodal vision
    support.
  * Added the Gemma 3 ModuleV3 implementation with multi-GPU support via
    the DTensor / `DistributedTensorType` compile path.
  * Fixed token-offset and prompt-image alignment regressions in Gemma 4
    multimodal prefill, plus assorted Gemma 3 ModuleV3 performance fixes.

* **Qwen3 and Qwen3-VL:**
  * Added Qwen3 and Qwen3-VL architecture support, including the MoE variant
    and multimodal vision input.

* **Wan video diffusion:**
  * Fixed Wan 2.1 / 2.2 video diffusion pipelines silently running without
    classifier-free guidance. The tokenizer gated negative-prompt tokenization
    on `true_cfg_scale > 1.0` (default `1.0`), so negative tokens were never
    produced and the executor fell back to unguided generation even when
    `guidance_scale > 1.0` and a negative prompt were supplied. Wan now enables
    classical CFG whenever `guidance_scale > 1.0` and defaults an absent
    negative prompt to the empty string, matching the diffusers baseline.
  * Added the UniPC multistep scheduler for Wan diffusion.
  * Added Wan image-to-video and video-to-video pipeline variants, plus
    additional generation kwargs and prompt-handling fixes.

* **FLUX.2:**
  * Added TaylorSeer denoising cache support to the FLUX.2 Klein pipeline,
    enabling significant speedups for image-to-image generation by skipping
    redundant transformer passes during the denoising loop.
  * Added TeaCache support to `DiffusionPipeline` as a peer of TaylorSeer.
  * Added FLUX.2 ModuleV2 pipeline, FLUX.2 Klein support, NVFP4 quantization,
    aspect-ratio preserving image preprocessing, and BFL checkpoint weight
    fixes.

* **Kimi K2.5 vision:**
  * Improved Kimi K2.5 multimodal support, including vision encoder fixes
    and tokenizer parity with the upstream model.

* **DeepSeek V3 and Kimi K2.5 distributed execution:**
  * Improved tensor-parallel and expert-parallel execution paths for
    DeepSeek V3 and Kimi K2.5, including subgraph deduplication, MoE dispatch
    tuning, and reduced compile-time overhead.
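
The Wan classifier-free guidance fix listed above comes down to which setting gates negative-prompt tokenization. The following is a minimal sketch of the old and new gating logic in plain Python. The helper names `needs_negative_tokens_old` and `resolve_cfg` are hypothetical and not part of the MAX API; only `true_cfg_scale`, `guidance_scale`, and the empty-string default come from the note.

```python
from typing import Optional, Tuple


def needs_negative_tokens_old(true_cfg_scale: float = 1.0) -> bool:
    """Old gate (hypothetical helper): keyed on true_cfg_scale.

    With the default of 1.0 the comparison never passes, so negative
    tokens were never produced.
    """
    return true_cfg_scale > 1.0


def resolve_cfg(
    guidance_scale: float, negative_prompt: Optional[str]
) -> Tuple[bool, Optional[str]]:
    """New gate (hypothetical helper): CFG is on when guidance_scale > 1.0.

    Returns (use_cfg, negative_prompt_to_tokenize).
    """
    use_cfg = guidance_scale > 1.0
    if not use_cfg:
        return False, None
    # An absent negative prompt defaults to the empty string.
    return True, negative_prompt if negative_prompt is not None else ""
```

Under these assumptions, a request with `guidance_scale=5.0` and no negative prompt resolves to guided generation against an empty negative prompt, where the old gate would have skipped negative tokenization entirely.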

### MAX framework {#26-3-max}

#### Inference server {#26-3-max-serve}

* Added periodic "still building/compiling" log messages during model
  compilation so that long operations produce visible signs of progress.

* Consolidated KV connector CLI flags (`--host-kvcache-swap-space-gb`,
  `--disk-offload-dir`, `--disk-offload-max-gb`, `--disk-offload-direct-io`,
  `--lmcache-config-file`) into the `--kv-connector-config` JSON dict.

* Removed the `--allow-safetensors-weights-fp32-bf16-bidirectional-cast` CLI
  flag. Float32 <-> bfloat16 safetensors weight casting is now unconditionally
  enabled.

* Added `--model-override` CLI flag for per-component `ModelManifest` overrides
  (e.g. `--model-override transformer.quantization_encoding=float4_e2m1fnx2`),
  enabling mixed quantization in diffusion pipelines.

* Removed jump-forward decoding (`compute_ff_tokens`) from structured output.
  The bitmask constraint alone ensures valid structured output, matching the
  approach used by vLLM and SGLang.

* Added `json_object` response-format support to MAX Serve structured output
  via `/v1/chat/completions`.

* Improved error handling for image request failures in MAX Serve.

* Added multi-step and overlap-scheduler support for structured output in the
  `TextGenerationPipeline`. Extended tokenizer support to include TikToken-based
  tokenizers, enabling structured output with Kimi K2.5.

* Improved cached-token reporting, fixed cache hit/miss metrics to emit only
  on context-encoding batches, moved a subset of telemetry from detailed to
  basic, and added per-draft-position acceptance-rate logging for speculative
  decoding.

* Tightened the `MODULAR_MAX_SERVE_*` environment-variable prefix; unprefixed
  overrides previously honored by `max-serve` no longer apply.

* Added `min_p` and `top_k` sampling controls and additional
  chat-completion kwargs to the OpenAI-compatible routes.

* **Unified EAGLE speculative decoding:**
  * Added unified EAGLE pipelines for Llama 3, DeepSeek V3 + MTP, and Kimi
    K2.5, sharing a single PipelineModel.
  * Added support for `--num-speculative-tokens > 1` across the unified EAGLE
    Llama, DeepSeek+MTP, and Kimi+EAGLE paths.
  * Added overlap-scheduler support for unified EAGLE, including multi-GPU
    DP setups (e.g. DP Kimi).
  * Enabled CUDA graphs for EAGLE and MTP.

* **Distributed KV transfer (dKV):**
  * Added the `DKVConnector` with NIXL transfer support for the distributed
    KV cache.
  * Unified KV connector configuration under `--kv-connector-config`.
  * Added EFA compatibility, disconnect support, parent-hash eviction, and
    per-connector metrics for the dKV transfer engine.
  * Added a configurable decode-stall watchdog for 1P1D deployments.
  * Added disk-location support to the Python dKV client.

* **Heterogeneous serving and overlap scheduling:**
  * Added two-phase prefill execution under the overlap scheduler for the
    distributed-inference (DI) prefill role.
  * Auto-enabled overlap scheduling for DI pipeline roles and disabled
    auto device-graph capture for prefill-only workers.
  * Added support for heterogeneous TP prefill / DP decode in MLA KV
    transfer (e.g. `tp4` prefill into a DP decode pool).
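
The `min_p` and `top_k` sampling controls mentioned above follow widely used definitions: `top_k` keeps the k most probable tokens, and `min_p` keeps tokens whose probability is at least `min_p` times the highest token probability. The sketch below illustrates that filtering in plain Python. The function name and the order in which the two filters are applied are assumptions for illustration, not a description of MAX's sampler.

```python
import math
from typing import Dict


def filter_probs(
    logits: Dict[str, float], top_k: int = 0, min_p: float = 0.0
) -> Dict[str, float]:
    """Return renormalized probabilities after top_k and min_p filtering."""
    # Softmax over the raw logits (shifted by the max for stability).
    peak = max(logits.values())
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # top_k: keep only the k most probable tokens (0 disables the filter).
    ranked = sorted(probs, key=probs.get, reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]

    # min_p: drop tokens below min_p * (probability of the best token).
    floor = min_p * probs[ranked[0]]
    kept = [tok for tok in ranked if probs[tok] >= floor]

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(probs[tok] for tok in kept)
    return {tok: probs[tok] / norm for tok in kept}
```

Because the `min_p` threshold scales with the top probability, it prunes more aggressively when the model is confident and less when the distribution is flat.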

#### `max` CLI {#26-3-max-cli}

* Added sweep benchmarking capabilities to `max benchmark`: iterate over
  multiple concurrency and request-rate combinations, flush the prefix cache
  between runs, and collect per-run structured JSON results.
* Standardized the `--model` flag across `max serve`, `max generate`,
  `max encode`, and `max warm-cache`.
* Improved `max serve` CLI flag descriptions.
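
The sweep behavior described for `max benchmark` (iterate over concurrency and request-rate combinations, flush the prefix cache between runs, collect per-run JSON) can be pictured with the following sketch. It does not call `max benchmark`; `run_once` and `flush_prefix_cache` are hypothetical callbacks, and the record field names are assumptions chosen for illustration.

```python
import itertools
import json
from typing import Callable, Dict, Iterable, List


def sweep(
    concurrencies: Iterable[int],
    request_rates: Iterable[float],
    run_once: Callable[[int, float], Dict[str, float]],
    flush_prefix_cache: Callable[[], None],
) -> List[str]:
    """Run every (concurrency, request_rate) pair and return JSON lines."""
    results = []
    for concurrency, rate in itertools.product(concurrencies, request_rates):
        # Start each run cold so earlier runs cannot inflate cache hits.
        flush_prefix_cache()
        metrics = run_once(concurrency, rate)
        record = {"concurrency": concurrency, "request_rate": rate, **metrics}
        results.append(json.dumps(record, sort_keys=True))
    return results
```

Flushing before every run keeps the results comparable across the grid, since a warm prefix cache from one configuration would otherwise skew the next.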

#### Python API {#26-3-max-python}

* Added `Model.release_captured_graph()`, which drops a previously captured
  device graph identified by graph key (or per-device keys) and frees its
  device-side working memory once any in-flight replay completes. Releasing a
  key that was never captured is a no-op. Callers remain responsible for
  dropping any output `Buffer` handles returned by the corresponding
  `Model.capture()` call.

* Added `ops.roi_align` (with `F.roi_align` functional wrapper) for ROI Align
  pooling over NHWC inputs, with configurable spatial scale, sampling ratio,
  alignment mode, and AVG/MAX pooling. Includes a matching MO eager handler.

* Added MO eager handlers for `ConstantExternalOp`, `ConstantScalarOp`,
  `ReduceRmsNormOp`, and `ReduceGroupNormOp`, so graphs with external
  weights, scalar constants, RMS norm, or group norm run eagerly without
  falling back to compilation.

* Fixed tensor slicing with negative integer indices (e.g. `hidden[:, -1]`)
  which previously raised a `RuntimeError` at compile time.

* Fixed `ops.reshape` / `TensorValue.reshape` rejecting valid `-1` reshapes
  on tensors whose leading dim is a symbolic sum-of-products (e.g.
  `[(batch_size * num_steps) + total_seq_len, 1536]` reshaped to
  `[-1, n_heads, head_dim]` with `n_heads * head_dim == 1536`). The inferred
  dim now simplifies without requiring a `rebind`.

* Setting `MODULAR_MAX_DEBUG_UNINITIALIZED_READ_CHECK=true` (or the
  `max-debug.uninitialized-read-check` config key, or
  `InferenceSession.debug.uninitialized_read_check = True`) enables detection
  of uninitialized memory reads in Mojo kernels. `InferenceSession`
  automatically enables the debug allocator poison and compiles kernels with
  load-time poison checks for all float types. When a load matches a poison
  pattern, the process aborts with a descriptive message.

* Added support for the `bfloat16` data type on ARM CPU devices in MAX graphs.
  Previously, `session.load()` raised a `ValueError` when a graph contained
  bf16 tensors targeting an ARM CPU.

* Added `DevicePlacementPolicy` (`Ignore`, `Warn`, `Error`) to `Graph` to
  control behavior when CPU-only ops (`ops.scatter`, `ops.cumsum`,
  `ops.nonzero`, `ops.tile`) receive GPU tensors. The default (`Warn`) emits a
  `UserWarning` and falls back to CPU; `Error` raises `ValueError` instead.
  `ops.cond` and `ops.while_loop` always raise `ValueError` for GPU predicates.

* Fixed slow `axis=None` reductions (`mean`, `sum`, `prod`, `max`, `min`) in
  `max.experimental.functional`. The previous implementation flattened the
  tensor before reducing, serializing the work onto a single GPU block.
  Reductions now iterate axis-by-axis to preserve parallelism.

* Renamed the public quantization APIs from `Float8*` to `Quant*` (including
  `Float8Config` → `QuantConfig`, `parse_float8_config()` →
  `parse_quant_config()`, and the `quant` modules in `max.nn` and
  `max.pipelines.lib`), reflecting that the config now covers FP8, NVFP4,
  and MXFP4 quantization.

* `max.diagnostics.gpu.BackgroundRecorder`'s sampling interval can now be
  configured.

* Introduced `CPUMetrics` alongside the existing GPU diagnostics and
  open-sourced it under `max.diagnostics`.

* Added `Model.kernel_summaries` for inspecting compiled kernels through the
  Python API.

* Added a unified `DebugConfig` Python class (with nanobind bindings) and
  exposed `DebugConfig` and `GraphDebugConfig` in `max.engine` and
  `max.graph`.

* Added a graph API for initializing and registering the runtime context
  (`M::Context`) from Python.

* Improved `max.experimental.functional.custom`: compiled custom-op kernels
  are now cached, and eager-mode `F.custom` no longer recompiles on every
  call.

* Fixed `Module.compile()` when unrealized tensors are used as weights.

* Added the `InputModality` enum for specifying model input types and
  threaded it through the multimodal pipeline architectures.

* Updated `Tensor.to()` and `Module.to()` to accept distributed device
  targets, including `DeviceMapping` and `DeviceMesh`.

* `max.experimental.Tensor` is now distribution-aware: it carries a
  tuple of per-shard storages, `driver.Buffer`s (realized) or graph
  values (`TensorValue` / `BufferValue`, unrealized), paired with a
  `DeviceMapping` that maps those local shards onto the
  `DeviceMesh`.

* Reworked `max.experimental.functional` from a single `functional.py`
  into a `functional/` package, a new distribution- and mesh-aware
  dispatch layer on top of the graph-compiler Python API, split cleanly
  into three op categories: `creation_ops` (tensor factories), `spmd_ops`
  (rule-based per-op SPMD dispatch), and `collective_ops`
  (`allreduce_sum`, `allgather`, `reduce_scatter` etc., now applied per
  device-group along a chosen mesh axis so they dispatch correctly on
  multi-dimensional meshes, plus a `transfer_to` convenience op
  between `DeviceMapping`s).

* Added `max.experimental.sharding` with the core types for distributed
  tensors (`DeviceMesh`; `DeviceMapping` with `PlacementMapping` and
  `NamedMapping`; placement primitives `Replicated` / `Sharded` /
  `Partial`; `DistributedTensorType` / `DistributedBufferType`;
  `TensorLayout`), plus a `sharding.rules` submodule of pure
  mapping-propagation rules (elementwise, matmul, reduction, shape,
  conv, pooling) that, for each op, either error out or reshard inputs
  to the proposed `DeviceMapping`s and derive the resulting output
  `DeviceMapping`.

* `max.experimental.nn.Module.compile()` now accepts
  `DistributedTensorType` symbolic inputs (not just `TensorType`), so
  distributed models can be built via the graph-compilation path in
  addition to running eagerly; `gemma3_modulev3` is the first multi-GPU
  model wired up. DTensor support in MAX is still ongoing work and
  these APIs may evolve.

* Added new graph ops (with matching `max.experimental.functional` wrappers):
  `scatter_max`, `scatter_min`, `scatter_mul`, `scatter_nd_max`,
  `scatter_nd_min`, `scatter_nd_mul`, `non_maximum_suppression`,
  `resize_linear`, `resize_nearest`, and `resize_bicubic`. The existing
  `max.graph.ops.resize` now delegates to these for `BILINEAR`, `NEAREST`,
  and `BICUBIC` interpolation modes. `max.graph.ops.pad` (and the functional
  wrapper) also accepts `mode='reflect'` and `mode='edge'` in addition to
  `mode='constant'`.

* Expanded experimental eager-interpreter coverage so significantly more
  graphs run end-to-end without falling back to compilation. Added handlers
  for `gather`, `gather_nd`, `argmax`, `argmin`, `split`, `scatter`,
  `scatter_nd`, `scatter_nd_add`, `scatter_add`, `scatter_max`, `scatter_min`,
  `scatter_mul`, `scatter_nd_max`, `scatter_nd_min`, `scatter_nd_mul`, `tile`,
  `band_part`, `top_k`, `bottom_k`, `nonzero`, `non_maximum_suppression`,
  `pad` (constant on CPU/GPU; reflect and edge on CPU), `conv2d`,
  `conv2d_transpose`, `max_pool2d`, `avg_pool2d` (floor and ceil mode),
  `resize_linear`, `resize_nearest`, `resize_bicubic`, `mo.mutable.store`,
  `mo.mutable.store.slice`, and the distributed collectives
  `distributed.allreduce.sum`, `distributed.allgather`, `distributed.scatter`,
  `distributed.broadcast`, and `distributed.reducescatter.sum`. Most run on
  both CPU and GPU; CPU-only handlers are noted as such.

* Rewrote the eager-interpreter `mo.mutable.store.slice` handler to write
  slices via a device-side Mojo kernel instead of a host numpy round-trip.
  GPU buffers no longer round-trip D→H→D on every call, and `bfloat16` and
  `float8_*` dtypes are now supported (`float4_e2m1fn` remains unsupported).

* Added defensive eager-interpreter handlers for `mo.shape.from_tensor`,
  `mo.index.to_tensor`, `mo.buffer.create`, `mo.buffer.transfer`, and
  `mo.gather_sum` so eager runs no longer crash if these internal ops survive
  canonicalization.

* Improved experimental eager-interpreter performance by enabling
  multi-threaded CPU execution and removing unnecessary GPU device
  synchronization between op dispatches.

* Added `max.nn.StackedLinear` for QKV-style stacked projections, with a
  fused (`stacked=True`) and an unfused (`stacked=False`) layout. Unfused
  mode opts into a new `Module._omit_module_attr_name` flag, which drops
  the wrapper's own attribute name from descendant weight FQNs, so a
  `self.qkv_proj = StackedLinear(names=["q_proj", "k_proj", "v_proj"],
  stacked=False)` exposes weights at `self_attn.q_proj.weight` rather
  than `self_attn.qkv_proj.q_proj.weight`. This lets HuggingFace
  checkpoint names flow into models without per-architecture remapping
  in their `weight_adapters.py`.

* `Module.compile()` now accepts a `custom_extensions` parameter for loading
  custom Mojo kernel libraries at graph construction time, fixing validation
  failures for kernels with struct-level parameters.

* Fixed `torch.compile(fullgraph=True)` failing with an "Unsupported context
  manager" error when accessing `CustomOpLibrary` ops inside the compiled
  function. Ops are now eagerly compiled during library initialization.

* **Runtime and device graph performance:**
  * Reduced device-graph launch overhead for single-graph models.
  * Parallelized device-graph instantiation and moved instantiation off the
    main execution threads.
  * Added parallel device-graph launches and a task-ID hint on AsyncRT
    algorithms.
  * Added a GPU health check during `DeviceContext` initialization.
  * Added NaN/Inf detection at compiled-region boundaries.
  * Improved Metal driver support with custom statuses and Metal log capture
    for Apple GPU print output.
  * Made `CPUDeviceContext` asynchronous and added `enqueue_cpu_function` /
    `enqueue_cpu_range` helpers for CPU kernel execution.
  * Auto-enabled device-graph capture for DeepSeek V3, Kimi, and Kimi K2.5
    serving paths.
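
The weight-naming effect of the `Module._omit_module_attr_name` flag described for `max.nn.StackedLinear` above can be illustrated with a toy module tree. This is a sketch only: `Node`, `weight_names`, and `build` are hypothetical stand-ins written for this example and do not reflect how `max.nn.Module` is implemented.

```python
from typing import Dict, Iterator, List, Optional, Tuple


class Node:
    """Toy stand-in for a module: named children plus an omit-name flag."""

    def __init__(
        self,
        children: Optional[Dict[str, "Node"]] = None,
        omit_attr_name: bool = False,
    ):
        self.children = children or {}
        self.omit_attr_name = omit_attr_name


def weight_names(node: Node, prefix: Tuple[str, ...] = ()) -> Iterator[str]:
    """Yield a `<path>.weight` name for every leaf under `node`."""
    if not node.children:
        yield ".".join(prefix + ("weight",))
        return
    for attr, child in node.children.items():
        # A child that opts in is skipped in the path, so its descendants
        # attach directly to the parent's prefix.
        child_prefix = prefix if child.omit_attr_name else prefix + (attr,)
        yield from weight_names(child, child_prefix)


def build(omit: bool) -> List[str]:
    """Names for a wrapper stored as `qkv_proj`, with or without the flag."""
    projections = {name: Node() for name in ("q_proj", "k_proj", "v_proj")}
    qkv = Node(projections, omit_attr_name=omit)
    root = Node({"self_attn": Node({"qkv_proj": qkv})})
    return list(weight_names(root))
```

With the flag set, `build(True)` yields names such as `self_attn.q_proj.weight`; without it, the wrapper's attribute name stays in the path as `self_attn.qkv_proj.q_proj.weight`.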

#### Custom ops {#26-3-custom-ops}

* Added host-function and in-place memcpy custom ops, including
  `mo.launch_host_func`, `mo.inplace_memcpy`, an `enqueueHostFunc` Mojo
  binding on `DeviceStream`, and a `cuLaunchHostFunc` binding for the
  CUDA device stream.

### MAX kernels {#26-3-max-kernels}

* Added GPU kernel examples from the *Programming Massively Parallel Processors*
  (PMPP) textbook covering reductions, scans, histograms, sorting, sparse
  matrix operations, graph algorithms, convolutions, FlashAttention, and more.

* Improved NVFP4 grouped matmul kernel performance, now outperforming FlashInfer
  across all tested decoding and prefill shapes for Kimi K2.5 on B200.

* Optimized GPU `layer_norm` kernels with SIMD reductions, gamma/beta
  prefetch, and a `simd_width*2` warp tiling dispatch path.

* Optimized GPU `pad_constant` kernel with SIMD vectorization (`simd_width=4`)
  and added a kbench benchmark suite (`bench_pad`).

* Improved GPU `topk` and `argsort` kernel performance by nearly 2x.

* Optimized GPU `concat` with a flat-indexing kernel that avoids
  multi-dimensional index decomposition, using 128-bit vectorized loads with
  automatic fallback for unaligned shapes.

* Optimized GPU `topk` stage-1 kernel with a per-thread register heap that
  caches the top-8 elements during a single scan pass, eliminating redundant
  global memory re-reads for the first 8 extraction iterations.

* Moved `partial_simd_load` and `partial_simd_store` from
  `buffer.buffer` to `linalg.utils` and removed the `buffer` package. Update
  imports from `from buffer.buffer import ...` to
  `from linalg.utils import ...`.

* **Blackwell (SM100) GPU performance:**
  * Enabled the Mojo SM100 GEMM by default.
  * Added MXFP4 and MXFP8 block-scaled matmul on SM100, plus a `KIND_MXF4`
    execution path.
  * Added a general grouped block-scaled matmul dispatch and MXFP4 support
    for the grouped path.
  * Enabled PDL for SM100 grouped NVFP4 / MXFP4 / MXFP8 GMM.
  * Improved the SM100 GEMV dispatcher and added GEMV split-K for GEMMs with
    small `M` and `N`.
  * Increased the SM100 GEMM `C`-tile `N` dispatch up to 64.

* **AMD GPU performance:**
  * Added B300 support, including device-agnostic default block counts for
    allreduce and allgather.
  * Added a CDNA4 block-scaled MFMA wrapper.
  * Added MI355X TileTensor MHA (about +13% prefill at depth 128) and
    TileTensor-based AMD attention kernels generally.
  * Always enabled the gfx950 MHA prefill kernel and modernized AMD MHA/MLA
    decode with 16x16 MMA and FP8.
  * Added depth-512 paths for AMD RDNA GPUs and a 2-D convolution kernel for
    RDNA 3+ GPUs.
  * Added MXFP4 matmul and grouped matmul support on AMD.

* **Attention and state-space kernels:**
  * Added sparse MLA decode (with qbf16 / FP8 KV variants) for SM100.
  * Added speculative-decoding sequence-length folding with `numhead` for the
    TP MLA decode dispatch.
  * Added gated delta-rule recurrence kernels for hybrid-attention models.

* **Expert-parallel (EP) kernels:**
  * Added multi-device MO ops for EP dispatch and combine.
  * Added a grouped dynamic NVFP4 quantization kernel for MoE.
  * Added MXFP4 support to `ep.dispatch` and the
    `mo.distributed.ep.dispatch.mxfp4` op.
  * Added a `skip_a2a` mode to EP dispatch and combine.
  * Fixed AMD GPU atomics in EP dispatch.

* **Collective communication kernels:**
  * Unified the multimem and standard code paths in `ReduceScatter`.
  * Enabled PDL for allgather and updated `ReduceScatter` to use `with_PDL()`.
  * Launched allgather kernels in parallel and set the allgather block count
    via a tuning table.
  * Added support for non-multiples of SIMD width in allreduce.

* **Fused transformer kernels:**
  * Added a fused `rope_split_store` kernel and wired it into
    `AttentionWithRope`.
  * Added a fused RMSNorm + RoPE GPU kernel and a graph-compiler fusion
    pattern for `mo.reduce.rms_norm.RoPE`.
  * Added a GEMV + partial RMSNorm fusion path.

### Breaking changes {#26-3-breaking}

* Removed individual KV connector CLI flags (`--host-kvcache-swap-space-gb`,
  `--disk-offload-dir`, `--disk-offload-max-gb`, `--disk-offload-direct-io`,
  `--lmcache-config-file`). Use `--kv-connector-config` with a JSON dict
  instead.

* `max/python/max/benchmark/benchmark_throughput.py` has been deprecated and
  will be removed in a future MAX release.

* Removed `Dim` and `DimList` types from `buffer.dimlist`. Custom kernel code
  using these types should migrate to `IntTuple` and `TileLayout` from the
  `layout` package.

* Removed `PreTrainedPipelineTokenizer`. Use the standard pipeline tokenizer
  resolution path instead.

* Moved `DenoisingCacheConfig` from `PipelineConfig` to
  `PipelineRuntimeConfig`. Update call sites that constructed
  `PipelineConfig(denoising_cache_config=...)` to set the field on
  `PipelineRuntimeConfig` instead.

* Replaced `FluxPipelineOutput` and `Flux2PipelineOutput` with a unified
  `DiffusionPipelineOutput`. Code that imports the old output types must
  switch to `DiffusionPipelineOutput`.

* `PipelineConfig` now expects a `models=ModelManifest(...)` configuration
  for multi-component pipelines (transformer, VAE, text encoders, etc.).
  Pipelines that previously passed individual model paths or configs at the
  top level must migrate to a `ModelManifest`.

* `max-serve` now requires the `MODULAR_MAX_SERVE_*` prefix for environment
  overrides. Unprefixed environment variables previously honored by
  `max-serve` no longer apply.
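
The effect of the stricter `MODULAR_MAX_SERVE_*` prefix can be sketched as a filter over the process environment. The function below is illustrative only: `serve_overrides` is a hypothetical helper, the lowercase key mapping is an assumption, and the variable names used in the usage note are placeholders rather than documented settings.

```python
from typing import Dict, Mapping

PREFIX = "MODULAR_MAX_SERVE_"


def serve_overrides(environ: Mapping[str, str]) -> Dict[str, str]:
    """Keep only variables carrying the required prefix, keyed by suffix."""
    return {
        # Lowercasing the suffix is an assumption made for this sketch.
        name[len(PREFIX):].lower(): value
        for name, value in environ.items()
        if name.startswith(PREFIX)
    }
```

For example, given a placeholder environment of `{"MODULAR_MAX_SERVE_PORT": "8000", "PORT": "9000"}`, only the prefixed entry survives and the unprefixed `PORT` is ignored.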

### Fixed {#26-3-fixed}

* Fixed MAX tools aborting at startup with
  `std::filesystem::filesystem_error` when `$HOME` is not traversable by the
  running UID (common in containerized CI where the image's build-time UID
  differs from the runtime UID). The config search now treats permission
  errors as "not found" and falls through to the next candidate.
  ([Issue #6412](https://github.com/modular/modular/issues/6412))

* Fixed `enqueue_fill()` taking O(N) HIP API calls for `float64` buffers on
  AMD GPUs when the high and low 32-bit halves of the fill value differ (e.g.,
  `2.0`), reducing the call count to O(log N).
  ([Issue #6417](https://github.com/modular/modular/issues/6417))

* Fixed integer indexing into a graph tensor (e.g. `x[0]` on a `(2, 3)`
  tensor) failing graph compilation with
  `'mo.static.reshape' op input and output elements do not match`. A
  reshape-through-slice optimization pattern was incorrectly rewriting
  the slice + squeeze pattern produced by integer indexing, generating a
  reshape whose element count did not match the input.
  ([Issue #6440](https://github.com/modular/modular/issues/6440))
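
The integer-indexing fix above concerns the slice + squeeze pattern that `x[0]` lowers to. The shape arithmetic can be sketched in plain Python to show why a reshape is only valid when element counts agree. The helper names are hypothetical and this is not the graph compiler's code.

```python
from math import prod
from typing import Tuple

Shape = Tuple[int, ...]


def index_first_axis(shape: Shape) -> Tuple[Shape, Shape]:
    """Shapes after the slice step and the squeeze step of `x[0]`."""
    sliced = (1,) + shape[1:]  # x[0:1] keeps the axis with extent 1
    squeezed = sliced[1:]      # squeeze drops that unit axis
    return sliced, squeezed


def reshape_is_valid(src: Shape, dst: Shape) -> bool:
    """A static reshape is only legal when element counts agree."""
    return prod(src) == prod(dst)
```

For a `(2, 3)` tensor the slice yields `(1, 3)` and the squeeze yields `(3,)`, both with 3 elements. A rewrite that reshapes the original 6-element input straight to `(3,)` fails the element-count check, which matches the error message quoted in the fix.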

### Mojo language {#26-3-mojo}

For all the updates to the Mojo language, standard library, and tools, see the
[Mojo release notes](https://mojolang.org/releases/).

## v26.2 (2026-03-19)

* [Highlights](#26-2-highlights)
* [Documentation](#26-2-docs)
* [MAX models](#26-2-models)
* [MAX framework](#26-2-max)
  * [Inference server](#26-2-max-serve)
  * [`max` CLI](#26-2-max-cli)
  * [Python API](#26-2-max-python)
  * [Breaking changes](#26-2-breaking)
* [MAX kernels](#26-2-max-kernels)
* [Mojo language](#26-2-mojo)

### Highlights {#26-2-highlights}

* MAX now supports **image generation** with FLUX diffusion models
  (`FLUX.1-dev` and `FLUX.2-dev`), served through a new `/v1/responses` endpoint
  with the [OpenResponses API](https://www.openresponses.org/reference). See the
  [image generation guide](https://docs.modular.com/max/inference/image-generation.md) to get started.

* Significant **DeepSeek improvements**: added support for DeepSeekV3.2 with
  multi-latent attention, NVFP4 quantization support for DeepSeek-R1 (with expert
  parallelism), and expert parallelism now supports more than 32 local experts
  without requiring NVSHMEM for single-node deployments.

* Major **Blackwell (SM100) kernel optimizations**, including
  [SnapMLA](https://arxiv.org/abs/2602.10718) for MLA decode, hardware-accelerated
  conv2d with TMA im2col for FLUX VAE, fused epilogues in BF16 and FP8 matmul
  kernels, and FP8 MMA support for MLA prefill with blockwise scaling.

### Documentation {#26-2-docs}

* Refactored the [MAX Python API reference](https://docs.modular.com/max/api/python.md) into a flat list
  of module pages. Each summary page organizes APIs based on conceptual groups
  instead of source file locations. All API members also include a direct link to
  the source code on GitHub.

* Added [Basic operations](https://docs.modular.com/max/develop/basic-ops.md) to the model developer
  guide, covering tensor arithmetic, shape manipulation, reductions, matrix
  operations, activation functions, and random tensor generation.

* Added [Model pipeline](https://docs.modular.com/max/develop/pipelines.md) to the model developer
  guide, explaining how to connect models to MAX's serving infrastructure with
  inference pipelines that handle weight loading, KV cache management, request
  batching, and tokenization.

* Added [Image generation](https://docs.modular.com/max/inference/image-generation.md) to the inference
  guide, showing how to generate images from text prompts or transform existing
  images using the `/v1/responses` endpoint with FLUX models.

* Added the [Environment variables](https://docs.modular.com/max/environment-variables.md) reference,
  documenting all configurable MAX environment variables for server settings,
  logging, telemetry, debugging, performance, and Hugging Face integration.

### MAX models {#26-2-models}

* Added support for FLUX image generation models
  (`black-forest-labs/FLUX.1-dev` and `FLUX.2-dev`). Supports fused graph
  compilation, batched VAE decoding, GPU-side post-processing, and first-block
  caching for repeated prompts.

* Added support for Kimi vision-language models (`moonshotai/Kimi-K2.5` and
  `Kimi-VL-A3B-Instruct`). Supports multi-GPU tensor parallelism, a custom vision
  processor, learnable 2D position embeddings, and a tiktoken tokenizer.

* Added support for OLMo 3 models (`Olmo3ForCausalLM`), for example
  `allenai/Olmo-3-7B-Instruct`.

* Added support for Qwen3-MoE models (`Qwen3MoeForCausalLM`), for example
  `Qwen/Qwen3-30B-A3B-Instruct`, with multi-GPU tensor parallelism and FP8
  quantization support.

* DeepSeek improvements:
  * Added support for the DeepSeekV3.2 architecture with multi-latent attention
    and fused FP8 paged KV cache.
  * Added NVFP4 quantization support for DeepSeek-R1, including with expert
    parallelism.
  * Expert parallelism now supports more than 32 local experts and no longer
    requires NVSHMEM for single-node deployments.
  * Improved memory estimation for NVFP4-quantized models and EP communication
    buffers.
  * Added FP4 quantization support for the DeepSeek MTP speculative decoding
    module.
  * Fixed decode-only mode, a missing `rope_scaling` config, and a
    DeepSeek-V2-Lite gather-index out-of-bounds error, and re-enabled
    multi-GPU TP for DeepSeek-V2-Lite-Chat.

* Removed legacy Gemma 3 multimodal implementation and the
  `MODULAR_MAX_DISABLE_GEMMA3_VISION` environment variable.

* Fixed multi-GPU tensor parallelism for GPT-OSS MoE models.

* Common MAX models like Qwen 2.5 can now run on AMD RDNA consumer GPUs.

* Improved Mistral3 text encoder performance by compiling hidden-state selection
  and eliminating redundant GPU transfers.

* Fixed prompt validator for Qwen2.5-VL models.

* Fixed audio generator pipeline to restore audio generation support.

* Fixed multi-GPU NVFP4 inference for Llama3.

* Fixed Idefics3 chat template image placeholder ordering.

* Added MXFP4 quantization support for GPT-OSS models (such as
  `openai/gpt-oss-20b`).

### MAX framework {#26-2-max}

* Upgraded the bundled `libnvptxcompiler` from CUDA 12.9 to CUDA 13.1, which
  requires NVIDIA GPU driver 580 or higher. This brings the latest bug fixes and
  performance improvements from NVIDIA's PTX compiler and adds full support for
  new hardware like the DGX Spark and Jetson Thor.

  To use MAX and Mojo with older NVIDIA drivers and
  hardware, you can set the `MODULAR_NVPTX_COMPILER_PATH` environment
  variable to point to a system `ptxas` binary, instead of using the bundled
  `libnvptxcompiler` version.

  The Mojo `DeviceContext()` constructor now checks NVIDIA driver compatibility
  at creation time and provides a clear error message when the driver version
  is too old, matching the behavior of the Python `Accelerator()` API.

* Runtime GPU errors now include a Python source traceback, showing where
  the failing operation was defined in your graph-building code. Build with
  `MODULAR_MAX_DEBUG=True` to enable source note collection; when source notes
  aren't available, error messages include a hint about how to enable them.

* Added `MODULAR_DEBUG_DEVICE_ALLOCATOR` environment variable for debugging
  GPU memory issues. Set to `uninitialized-poison` to fill buffers with
  sentinel values (qNaN for floats, `0xCD` for others) to detect use of
  uninitialized data, or `out-of-bounds` to enable redzone checks for
  buffer overflows. Accepts a comma-separated list for multiple options.

* Fixed a memory leak in CUDA graph execution where output buffers were not
  freed between replays, causing GPU memory to grow over time during
  sustained inference.

* Fixed compilation cache misses when cross-compiling GPTQ and LoRA models
  on machines without a GPU. Weight dtype casting now skips the actual data
  conversion in virtual device mode, because only compilation metadata is
  needed.

* Enabled peer-to-peer device memory access for AMD HIP multi-GPU
  configurations, enabling direct GPU-to-GPU memory transfers on AMD
  hardware.

* Fixed multi-GPU communication silently falling back to a slower transport on
  systems where `rdma-core` is installed without dev packages (common in
  production containers).

* Fixed multi-GPU broadcast operations failing with "Broadcast currently
  requires P2P access between GPUs," due to a regression in peer-to-peer
  device access initialization.

* Improved Hugging Face model downloads: gated repo errors now surface
  clearly instead of showing a misleading "check the repo name" message.
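
The `MODULAR_DEBUG_DEVICE_ALLOCATOR` variable described above accepts a comma-separated list of options. A small sketch of composing that value from Python follows. The helper `allocator_debug_value` is hypothetical, and the comment about setting the variable before MAX initializes is an assumption rather than documented behavior.

```python
import os
from typing import Iterable

# The two option names documented in the note above.
KNOWN_OPTIONS = ("uninitialized-poison", "out-of-bounds")


def allocator_debug_value(options: Iterable[str]) -> str:
    """Join allocator debug options into the comma-separated form."""
    chosen = list(options)
    unknown = [opt for opt in chosen if opt not in KNOWN_OPTIONS]
    if unknown:
        raise ValueError(f"unknown allocator debug option(s): {unknown}")
    return ",".join(chosen)


# Assumption: the variable should be set before MAX creates a device
# context in this process, since it is typically read at startup.
os.environ["MODULAR_DEBUG_DEVICE_ALLOCATOR"] = allocator_debug_value(
    ["uninitialized-poison", "out-of-bounds"]
)
```

Validating the option names up front catches typos that might otherwise leave the debug allocator without the intended checks.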

#### Inference server {#26-2-max-serve}

* Added image generation support via a new `/v1/responses` endpoint
  implementing the [OpenResponses](https://www.openresponses.org/reference) API
  standard. Enable it by adding `responses` to `MAX_SERVE_API_TYPES` (for example,
  `MAX_SERVE_API_TYPES='["openai","responses"]'`). Currently supports FLUX
  diffusion models. For more information, see the [image generation
  guide](https://docs.modular.com/max/inference/image-generation.md).

* Added `output_format` parameter to image generation requests, allowing clients
  to choose JPEG, PNG, or WEBP output per request (default remains JPEG).

* Overlap scheduling is now auto-enabled for select model architectures
  like `LlamaForCausalLM_Legacy`, and is compatible with prefix caching. This
  reduces CPU overhead by overlapping Python host code with GPU kernel execution.
  It's currently incompatible with some features such as structured outputs and
  CPU models. It's still experimental and you can disable it with
  `--no-enable-overlap-scheduler --force`.

* Speculative decoding improvements:
  * Added typical-acceptance rejection sampling.
  * Added `rejection-sampling-strategy` option (`greedy` or `residual`) for
    speculative decoding. Defaults to `residual`; use `greedy` for models
    that pass hidden states.
  * Applied repetition/frequency/presence penalty sampling in EAGLE.
  * Enabled weight sharing between MTP draft and main model to reduce memory.
  * Added support for chunked prefill with EAGLE and MTP speculative decoding.
  * Fixed batch context length calculation for draft models.
  * Fixed EAGLE penalty inputs being unconditionally applied.

* EAGLE speculative decoding now reports the draft token acceptance rate in
  scheduler metrics output.

* Added KV cache offloading: KV cache blocks can now spill from GPU to CPU
  memory and disk when GPU memory is full, enabling larger effective cache
  capacity and warm restarts. Includes
  [LMCache](https://github.com/LMCache/LMCache) integration for sharing
  KV cache across model instances via external storage (CPU, disk, Redis),
  with multi-GPU tensor parallelism support.

* CUDA graph capture is now auto-enabled for Llama models when
  `max_batch_size` is set, reducing per-token latency. You can opt out with
  `--no-device-graph-capture --force`.

* Added FP8 quantization support for the KV cache, reducing KV cache memory
  usage. Configure via `--kv-cache-format float8_e4m3fn` (also supports
  `float32` and `bfloat16`).

* Added configurable batch scheduling strategy for text generation via the
  `MAX_SERVE_BATCH_PRIORITY` environment variable. It defines how the
  scheduler prioritizes between prefill (context encoding) and decode
  (token generation) when constructing batches. Options: `prefill_first`
  (minimize time-to-first-token), `decode_first` (minimize inter-token
  latency), `balanced` (adaptive based on global queue state), or
  `per_replica` (each replica decides independently; default).

* Diffusion models can now specify a default `num_inference_steps` per
  architecture.

* Added `--first-block-caching` flag to enable first-block caching (FBCache) for
  diffusion models like FLUX, and `--residual-threshold` for the TaylorSeer
  caching strategy. Both are configurable via `max serve` and `max generate`.

* Enabled `logprobs` in chat completion responses, returning per-token
  log probabilities.

* Non-streaming requests are now cancelled when the client disconnects,
  preventing zombie requests from consuming KV cache memory.

* Improved streaming performance by buffering generated tokens and
  detokenizing them in batches rather than one at a time, reducing CPU
  overhead and improving GPU utilization.

* Improved multi-GPU AllReduce performance by launching per-device
  kernels in parallel async tasks instead of sequentially.

* Fixed a server hang when a model worker process crashes before it finishes
  initializing.

* Fixed per-request seed handling in TopK/TopP sampling. Seeds are now
  correctly applied per request instead of using a single seed for the
  entire batch.
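
  The property this fix restores can be shown with a small stand-in sampler.
  The sketch below uses Python's `random` module rather than the MAX sampling
  kernels, and `sample_batch` is a hypothetical name: with one generator per
  request, a request's result does not depend on what else is in the batch.

  ```python
  import random


  def sample_batch(requests: list[dict]) -> dict[str, int]:
      """Sample one token per request, each from its own seeded generator.

      Toy stand-in for a sampler: `probs` is a categorical distribution
      over token IDs and `seed` is the per-request seed.
      """
      out = {}
      for req in requests:
          rng = random.Random(req["seed"])  # one generator per request
          tokens = range(len(req["probs"]))
          out[req["id"]] = rng.choices(tokens, weights=req["probs"], k=1)[0]
      return out


  a = {"id": "a", "seed": 7, "probs": [0.1, 0.2, 0.3, 0.4]}
  b = {"id": "b", "seed": 99, "probs": [0.25, 0.25, 0.25, 0.25]}

  # Request "a" gets the same token whether it runs alone or alongside "b".
  assert sample_batch([a])["a"] == sample_batch([b, a])["a"]
  ```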

* Fixed KV cache blocks not being released after offline text generation
  (`generate()` / `generate_async()`), which could cause block exhaustion
  during sustained inference.

* Fixed three resource leaks in the disaggregated inference decode scheduler: KV
  cache blocks leaked on request cancellation, replica load-balancing counter
  drift over time, and a `KeyError` crash on stale prefill responses arriving
  after cancellation.

#### `max` CLI {#26-2-max-cli}

* Added the `--device-graph-capture` flag to enable CUDA graph capture for
  serving, reducing per-token latency by replaying recorded GPU kernel launches.
  Auto-enabled for Llama and DeepSeek V3; opt out with
  `--no-device-graph-capture --force`.
* Added the `--debug-verify-replay` flag to run eager launch-trace verification
  before device graph replay, for debugging CUDA graph correctness issues.
* Added the `--kv-cache-format` flag to set the KV cache data type at runtime.
  Accepts `float32`, `bfloat16`, or `float8_e4m3fn` for FP8 quantized caching.
* Added the `--lmcache-config-file` flag to enable
  [LMCache](https://github.com/LMCache/LMCache)-based external KV cache
  tiering. Point it at an LMCache YAML config to share KV cache blocks across
  model instances via CPU, disk, or remote storage.
* Added the `--reasoning-parser` flag to `max serve` to enable extraction of
  model thinking/reasoning content into a separate `reasoning` field on the
  OpenAI API response. Currently supports Kimi K2.5 (`kimi-k2`), with a
  registry for adding additional parsers.
* Added the `--rejection-sampling-strategy` flag to select the rejection
  sampling method for speculative decoding. Options: `greedy`, `residual`
  (default for standalone), or `typical-acceptance` (default for EAGLE/MTP).
  Use `greedy` for models that pass hidden states.
* `max benchmark` now uses the model's default temperature when none is
  specified.
* `max benchmark` no longer overrides `top_p` unless the user provides a value.
* Removed the `--cache-strategy` flag.

#### Python API {#26-2-max-python}

* `Tensor.constant()` is deprecated. Use the `Tensor(data, dtype=...,
  device=...)` constructor directly, matching PyTorch's `torch.tensor()`
  semantics. For example, replace `Tensor.constant([1.0, 2.0])` with
  `Tensor([1.0, 2.0])`. `Tensor.constant()` will be removed in a future
  release.

* `DeviceEvent` now accepts an `enable_timing=True` parameter to enable GPU
  event timing. Use `start.elapsed_time(end)` to measure elapsed GPU time in
  milliseconds between two timing-enabled events.

* Added the `prod` op for computing the product of elements along an axis,
  available as `max.graph.ops.prod`, `max.experimental.functional.prod`, and
  `Tensor.prod()`.

* `Device.stats` now includes `graph_mem_reserved` and `graph_mem_used` fields
  for device graph memory observability.

* `Module.compile()` now validates weight names, dtypes, and shapes before
  loading, surfacing mismatches as Python errors instead of runtime
  crashes during asynchronous host-to-device transfers.

* `InferenceSession` now automatically includes the CPU in its device list,
  removing the need to manually add it when graphs include host-side values.

* Added `max.graph.ops.broadcast` for distributed broadcast across devices.
  Raises `ValueError` when `signal_buffers` is empty.

* Added manual synchronization API (`DevicePinnedBuffer`, `DeviceEvent`) for
  controlling buffer readiness and reducing stream synchronization overhead.

* `Tensor.cast()` is now idempotent for same-dtype casts.

* Added `F.cond` to the experimental functional API for conditional execution.

* Added fast path for `Tensor.to(device)` in eager mode.

* Added `Dim`-based scalar dimension API to `Module.compile()`.

* `Module` is now device-aware via `to()` for unified device placement.

* `Module.load_state_dict()` now validates weight attribute names.

* Algebraic dims and graph/custom op construction now work without an explicit
  context manager, by using a global MLIR context. Threadpool-backed MAX paths
  now scope worker-thread MLIR usage to the default context automatically.

* Renamed `Float8Config` to `QuantConfig` (and related types/functions)
  to reflect that the config now covers FP8, NVFP4, and MXFP4 quantization.

* Renamed related public Python quantization APIs from `Float8*` names to
  `Quant*` names, including `parse_float8_config()` to
  `parse_quant_config()`, and the public `quant` modules in `max.nn` and
  `max.pipelines.lib`.

* `max.diagnostics.gpu.BackgroundRecorder`'s sampling interval can now be
  configured.

#### Breaking changes {#26-2-breaking}

* Reorganized `max.nn` namespace. The graph-based neural network API has
  been restored as the default `max.nn` namespace (previously located under
  `max.nn.legacy`). The eager module API has moved from `max.nn` to
  `max.nn.module_v3`. Additionally, `max.tensor`, `max.functional`, and
  `max.random` have moved back under `max.experimental`
  (`max.experimental.tensor`, `max.experimental.functional`,
  `max.experimental.random`). Update imports accordingly.

* Moved experimental APIs under `max.experimental`. Two additional packages
  have moved under the `max.experimental` namespace to co-locate all
  experimental APIs:

  * `max.torch` is now `max.experimental.torch`. Update imports from
    `from max.torch import CustomOpLibrary, graph_op` to
    `from max.experimental.torch import CustomOpLibrary, graph_op`.

  * `max.nn.module_v3` is now `max.experimental.nn` (the `v3` suffix has been
    dropped). Update imports from
    `from max.nn.module_v3 import Module, Linear` to
    `from max.experimental.nn import Module, Linear`.
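
  If you have many files to update, a small script can apply the renames.
  The following is a minimal sketch, not an official migration tool: the
  mapping is copied from the two notes above, and `rewrite_imports` is a
  hypothetical helper that only touches `import` and `from ... import` lines.

  ```python
  import re

  # Old module path -> new module path, taken from the notes above.
  MOVED_MODULES = {
      "max.nn.module_v3": "max.experimental.nn",
      "max.torch": "max.experimental.torch",
      "max.tensor": "max.experimental.tensor",
      "max.functional": "max.experimental.functional",
      "max.random": "max.experimental.random",
  }

  # Match an old path only as a complete dotted name: not preceded by an
  # identifier character or a dot, and not followed by an identifier character.
  _PATTERN = re.compile(
      r"(?<![\w.])(" + "|".join(re.escape(old) for old in MOVED_MODULES) + r")(?!\w)"
  )


  def rewrite_imports(source: str) -> str:
      """Rewrite moved module paths on import lines only."""
      lines = []
      for line in source.splitlines(keepends=True):
          if line.lstrip().startswith(("import ", "from ")):
              line = _PATTERN.sub(lambda m: MOVED_MODULES[m.group(1)], line)
          lines.append(line)
      return "".join(lines)


  old = (
      "from max.torch import CustomOpLibrary, graph_op\n"
      "from max.nn.module_v3 import Module, Linear\n"
  )
  new = rewrite_imports(old)
  assert new == (
      "from max.experimental.torch import CustomOpLibrary, graph_op\n"
      "from max.experimental.nn import Module, Linear\n"
  )
  assert rewrite_imports(new) == new  # running it twice changes nothing
  ```

  The mapping deliberately leaves plain `max.nn` imports alone. That namespace
  now refers to the graph-based API, so code that imported the eager module
  API from `max.nn` needs a manual decision.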

* Removed `PipelineConfig.max_length`. The `max_length` parameter now resides at
  the model configuration level as `MAXModelConfig.max_length` (accessible as
  `config.model.max_length`). This change correctly places the parameter at the
  model level since it describes model capacity (maximum sequence length the
  model can process), not pipeline runtime behavior. Update all configurations
  and code to use `model.max_length` instead of the removed `max_length` field
  at the pipeline level.

* `PipelineModel` no longer accepts the `encoding` parameter. The
  `encoding` parameter has been removed from `PipelineModel.__init__` and all
  subclasses. The encoding is now automatically inferred from
  `pipeline_config.model.quantization_encoding`. This change eliminates
  redundant parameter passing and ensures a single source of truth for
  quantization encoding configuration.

* Device-graph APIs now require explicit caller-provided graph keys for
  capture/replay/verification. Update calls from
  `model.capture(*inputs)`, `model.replay(*inputs)`, and
  `model.debug_verify_replay(*inputs)` to
  `model.capture(graph_key, *inputs)`, `model.replay(graph_key, *inputs)`,
  and `model.debug_verify_replay(graph_key, *inputs)`.

* Removed `q_max_seq_len` from `KVCacheParams`; accepted via graph capture
  instead.

* `MAXBaseModel` now uses `extra=forbid` and `strict=True`; configs with
  unknown fields will be rejected.

* Replaced `disable_auto_sync`/`mark_as_ready` with `DevicePinnedBuffer` and
  `DeviceEvent` for pinned memory management.

### MAX kernels {#26-2-max-kernels}

* **Blackwell (SM100) GPU performance**:
  * Optimized Attention on SM100 by skipping unnecessary softmax
    corrections when the row maximum change is small.
  * Fused epilogue into SM100 BF16 and FP8 matmul kernels.
  * Improved SM100 FP8 matmul dispatch for small M shapes (M <= 128).
  * Fixed matmul kernel dispatch on SM100.
  * Added SM100 hardware-accelerated conv2d with TMA im2col and fused residual
    epilogue for FLUX VAE.
  * Added batched BF16 matmul support for SM100.
  * Added [SnapMLA](https://arxiv.org/abs/2602.10718) implementation for
    SM100 MLA decode.
  * Added FP8 tensorwise and block-scale MLA decode for SM100/B200.
  * Added FP8 MMA support for MLA prefill with blockwise scaling and K RoPE.
  * Enabled MLA attention for SM100 GPUs.
  * Enabled 64x256 N split MMA for B200 MLA decode (long context).
  * Used TMA for KV scale loads in attention kernels (SM100).

* **AMD GPU kernel improvements**:
  * Tuned and optimized GEMV split-K BF16 dispatch and kernel for AMD GPUs.
  * Enabled FP8 GEMV kernel on AMD GPUs.
  * Reduced K buffer bank conflicts in MHA prefill on AMD via swizzle.
  * Integrated AMD pingpong kernel with FP8 dispatch and fixed TP > 1.
  * Fixed out-of-bounds masking and depths > 256 on AMD RDNA GPUs.
  * Enabled rocSHMEM GDA backend with TCP bootstrap for multi-node AMD EP.

* **Grouped matmul improvements** (SM100):
  * Added MMA\_N=64 support for 1D1D block-scaled grouped matmul.
  * Added 2SM support to structured 1D1D grouped matmul kernel.
  * Enabled swapAB for block-scaled grouped matmul and block-scaled matmul on
    SM100.
  * Added tensor scale factor to block-scaled 1D1D grouped matmul.
  * Added bf16 scales support to blockwise FP8 grouped matmul.

* **DeepSeek kernel optimizations**:
  * Added BF16 MLA prefill/decode mega-kernel.
  * Enabled BF16 graph execution path for Multi-Latent Attention.
  * Enabled fused QKV projection for latent attention with RoPE.
  * Fused RoPE and RMSNorm into MLA custom ops.
  * Fused epilogue operations in DeepSeek BF16 matmul kernels.
  * Added fused dispatch and combine kernels for expert parallelism.
  * Enabled Mojo BF16 matmul kernels and FP4 kernels for DeepSeek shapes.
  * Fixed blockwise FP8 batched matmul for non-row-major layouts.

* **Multi-GPU distributed ops**:
  * Added fused allreduce + RMSNorm + FP8 kernel with residual path and 2-stage
    allreduce for tensor-parallel workloads.
  * Added distributed scatter graph op for multi-GPU DP>1 inference.
  * Fixed and optimized broadcast kernel for BF16/FP16 with multimem on GPU.
  * Fixed and optimized 2-stage broadcast kernel for multi-GPU.

* **FLUX kernel improvements**: Autotuned cuDNN convolution algorithm selection
  and cached results. Added multi-block GroupNorm GPU kernel. Enabled
  high-performance Mojo matmul kernels for FLUX.2. Fixed grouped conv2d on GPU
  incorrectly ignoring the `num_groups` parameter.

* `kbench` now runs benchmarks via shared library (`.so`) by default, reusing
  persistent workers and CUDA contexts instead of spawning subprocesses.
  Benchmark execution phase is \~10x faster (for example, 4.25 h → 0.4 h on a
  tuning workload). Falls back to subprocess mode when profiling or using custom
  exec wrappers.

* Added MXFP4 dequant and matmul kernels.

* Optimized FP4 matmul dispatch for Llama-style shapes and added FP4 GEMM
  dispatch configs for additional shape coverage.

* Used asynchronous FP4 quantization kernel for improved throughput.

* Optimized Hopper matmul for M=256 and small M shapes via swapAB.

* Improved GEMV kernel performance. Integrated the FlashInfer TopK kernel for
  improved sampling performance.

* Improved layer normalization kernel performance.

* Added FP8 support to FlashMLA decode kernel.

* Fixed FP8 cast lambda epilogue in matmul.

* Fixed NaN in MLA decode split-K kernel with causal masking.

* Fixed warpgroup deadlock in MLA decode that could cause hangs on DeepSeek
  models.

* Fixed incorrect MoE expert routing caused by bitonic sort merge direction bug.

* Fixed int8 matmul dispatch on ARM64.

* Fixed Metal buffer tracking for sub-buffers and tensor slices on Apple
  Silicon.

### Mojo language {#26-2-mojo}

For all the updates to the Mojo language, standard library, and tools,
including all GPU programming and `Layout`/`LayoutTensor` changes, see the [Mojo
changelog](https://mojolang.org/releases).

## v26.1 (2026-01-29)

* [Highlights](#26-1-highlights)
* [Documentation](#26-1-docs)
* [MAX models](#26-1-models)
* [MAX framework](#26-1-max)
  * [Inference server](#26-1-max-serve)
  * [`max` CLI](#26-1-max-cli)
  * [Python API](#26-1-max-python)
* [MAX kernels](#26-1-max-kernels)
* [Mojo language](#26-1-mojo)

### Highlights {#26-1-highlights}

The eager-style [`Tensor`](https://docs.modular.com/max/api/python/tensor.md#max.tensor.Tensor) and
[`Module`](https://docs.modular.com/max/api/python/generated/max.nn.Module) APIs are
now the primary API for model development, providing a PyTorch-like development
experience:

```python
from max import functional as F
from max.tensor import Tensor
from max.dtype import DType

x = Tensor.constant([1.0, -2.0, 3.0, -4.0, 5.0], dtype=DType.float16)
y = F.relu(x)
print(y)
# Tensor([1 0 3 0 5], dtype=DType.float16, device=Device(type=gpu,id=0))
```

If you want explicit control over the graph structure, you can
still build models with the
[`Graph`](https://docs.modular.com/max/api/python/generated/max.graph.Graph) APIs.

For more details, see the [model developer guide](https://docs.modular.com/max/develop.md).

### Documentation {#26-1-docs}

* The fully refactored [MAX LLM book](https://llm.modular.com/) is now designed
  so the code you write in each exercise incrementally builds upon the last one,
  until you've built an executable GPT-2 model with the MAX Python API.

* New model developer guide introduces [eager-style
  programming](https://docs.modular.com/max/develop.md), [tensor APIs](https://docs.modular.com/max/develop/tensors.md), and [data
  types](https://docs.modular.com/max/develop/dtypes.md). Much more is coming soon.

* New guide to [profile MAX on GPUs with `nsys`](https://docs.modular.com/max/gpu-system-profiling.md).

* Extended [documentation for
  `kbench`](https://github.com/modular/modular/tree/main/max/kernels/benchmarks/autotune#kbench-a-benchmarking-toolkit-for-mojo-kernels),
  a Python tool to benchmark, autotune, and analyze MAX kernel performance.

### MAX models {#26-1-models}

* [Gemma3](https://builds.modular.com/models/gemma-3-it/27B) now supports
  vision input (multimodal) in the 12B and 27B variants, including support for
  local file paths and structured output. Learn more in the [image to text
  guide](https://docs.modular.com/max/inference/image-to-text.md).

* Added `Qwen/Qwen3-VL-4B-Instruct` and `Qwen/Qwen3-VL-2B-Instruct`
  model architectures.

* Removed Llama 3.2 Vision (`Llama-3.2-11B-Vision-Instruct`) architecture
  support. Use other vision models such as Pixtral, InternVL, Qwen2.5-VL, and
  Gemma3.

### MAX framework {#26-1-max}

* All Python wheels are now hosted at `https://whl.modular.com/nightly/simple/`.
  If using `uv`, change `--index-url` to `--index`, and if using `pip`, change
  to `--extra-index-url`. For precise commands, see the
  [install guide](https://docs.modular.com/max/packages.md#install).

#### Inference server {#26-1-max-serve}

* Improved scheduling to achieve higher KVCache utilization and batch sizes. By
  default, MAX now schedules a context encoding (CE) request only if KVCache
  memory is less than 95% full *after* allocating blocks for that request or if no
  active requests exist. You can adjust this watermark value (`0.95`) with
  [`--kvcache-ce-watermark`](https://docs.modular.com/max/cli/serve.md#--kvcache-ce-watermark-kvcache_ce_watermark).
  Beware that increasing it causes more preemptions.
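
  The admission rule can be written out as a small predicate. This is a
  sketch of the rule as described here, not the scheduler's code: the
  function name and the block-count parameters are assumptions made for
  illustration.

  ```python
  def can_schedule_ce(
      used_blocks: int,
      needed_blocks: int,
      total_blocks: int,
      active_requests: int,
      watermark: float = 0.95,
  ) -> bool:
      """Return True if a context-encoding (CE) request may be scheduled.

      Admit the request when the cache would stay below the watermark after
      allocating its blocks, or when no other request is active.
      """
      if active_requests == 0:
          return True
      utilization_after = (used_blocks + needed_blocks) / total_blocks
      return utilization_after < watermark


  # 940 of 1000 blocks after allocation: below the watermark, so admit.
  assert can_schedule_ce(900, 40, 1000, active_requests=3) is True
  # 960 of 1000 blocks after allocation: above the watermark, so wait.
  assert can_schedule_ce(900, 60, 1000, active_requests=3) is False
  # Same request on an idle server: always admitted.
  assert can_schedule_ce(900, 60, 1000, active_requests=0) is True
  ```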

* When running models with data parallelism (DP), the meaning of the max batch
  size has changed. Previously, specifying `--data-parallel-degree 8` and
  `--max-batch-size 32` meant that each data-parallel replica could have at
  most 4 requests, for an aggregate max batch size of 32. Now the CLI flag
  specifies the max batch size per replica, so the aggregate max batch size
  for the values above is 8\*32=256 requests. This aligns with vLLM and other
  inference engines.
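
  The arithmetic for both interpretations, as a quick check (the helper name
  is hypothetical and exists only for this example):

  ```python
  def aggregate_max_batch(max_batch_size: int, dp_degree: int, per_replica: bool) -> int:
      """Aggregate request cap across replicas under either interpretation."""
      if per_replica:
          # Current behavior: the flag is the cap for each replica.
          return max_batch_size * dp_degree
      # Previous behavior: the flag was the cap across all replicas.
      return max_batch_size


  old_total = aggregate_max_batch(32, 8, per_replica=False)
  new_total = aggregate_max_batch(32, 8, per_replica=True)

  assert (old_total // 8, old_total) == (4, 32)   # 4 per replica, 32 overall
  assert (new_total // 8, new_total) == (32, 256)  # 32 per replica, 256 overall
  ```

  To keep the previous aggregate cap after upgrading, divide your old
  `--max-batch-size` value by the data-parallel degree.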

* `--max-ce-batch-size` is now deprecated. The cap on batch size is now uniform
  between context encoding and token generation phases of text generation. Use
  `--max-batch-size` instead.

* The API server now returns chunked tokens from the model worker, reducing
  overhead and significantly improving throughput for small models and
  decode-heavy workloads.

* Server stats collection (`collect_server_stats`) is now enabled by default for
  serving benchmarks.

#### `max` CLI {#26-1-max-cli}

* The `max generate` command now applies the model's chat template internally
  when using `--prompt`. This more closely aligns with how users typically
  prompt a model for testing and ensures special tokens are properly filtered
  from output.

* Added tracing flags to `max benchmark` for `nsys` profiling:

  * `--trace`: Enable tracing of the benchmark run (currently NVIDIA GPUs only)
  * `--trace-file`: Path to save the trace file
  * `--trace-session`: Optional session name for tracing

  Requires the server to be run under `nsys launch`. Using
  `--gpu-profiling detailed` is recommended.

#### Python API {#26-1-max-python}

* The eager-style [`Tensor`](https://docs.modular.com/max/api/python/tensor.md#max.tensor.Tensor) APIs are
  now the primary API for model development, providing a PyTorch-like development
  experience.

  We moved the eager-style tensor APIs out of `experimental` and
  reorganized the `max.nn` module to make the eager module
  system the primary API (`nn.module_v3` is now `nn.module`).

  The previous [`max.nn`](https://docs.modular.com/max/api/python/nn.md) components are still available
  for backward compatibility in [`max.nn.legacy`](https://docs.modular.com/max/api/python/nn.md).

* Renamed `max.driver.Tensor` to
  [`max.driver.Buffer`](https://docs.modular.com/max/api/python/driver.md#max.driver.Buffer) to clarify that
  it represents a low-level memory buffer, not a tensor. The
  [`max.tensor.Tensor`](https://docs.modular.com/max/api/python/tensor.md#max.tensor.Tensor) class remains
  the primary tensor type.

* Added `forward()` method to
  [`Module`](https://docs.modular.com/max/api/python/generated/max.nn.Module) to compute the
  output—it behaves the same as invoking the object as a callable (the
  `__call__()` method).

* `accelerator_count()` now returns a non-zero value when called on an Apple
  silicon system. This means you can use this code:

  ```python
  device = CPU() if accelerator_count() == 0 else Accelerator()
  ```

  This code defaults to using the available Apple silicon GPU. As a consequence,
  MAX graphs should in most cases be dispatched to run on Apple silicon GPUs.
  Note that most MAX models do not yet work on Apple silicon GPUs due to
  missing hardware-specific kernel pathways and other support, but this is an
  important step towards enabling MAX more broadly on Apple silicon GPUs.

* Added `max.nn.module.rope` containing rotary embedding implementations,
  [`RotaryEmbedding`](https://docs.modular.com/max/api/python/generated/max.nn.RotaryEmbedding) and
  [`TransposedRotaryEmbedding`](https://docs.modular.com/max/api/python/nn.md).

* Added
  [`ArchConfig`](https://docs.modular.com/max/api/python/pipelines.lib.interfaces#max.pipelines.lib.interfaces.ArchConfig)
  and `ArchConfigWithKVCache`. Going forward, models that register with the MAX
  architecture registry must define a config that implements this protocol.

* Added `ops.complex.mul` for multiplying complex-valued tensors.

* Added `calculate_virtual_device_count()`,
  `calculate_virtual_device_count_from_cli()`, `load_max_buffer()` to
  [`max.driver`](https://docs.modular.com/max/api/python/driver.md).

* Added [`TokenBuffer`](https://docs.modular.com/max/api/python/interfaces.md#max.interfaces.TokenBuffer)
  for token management.

* Renamed `prefill_chunk_size` to `max_batch_input_tokens`
  and `max_batch_context_length` to `max_batch_total_tokens`
  in [`PipelineConfig`](https://docs.modular.com/max/api/python/generated/max.pipelines.PipelineConfig)
  and `TTSConfig` classes to better reflect their purpose in batch memory
  management.

  The corresponding CLI flags have also been renamed:
  `--prefill-chunk-size` is now `--max-batch-input-tokens` and
  `--max-batch-context-length` is now `--max-batch-total-tokens`.

* Fixed `max.driver.Buffer.to(stream)` to not copy (it returns a reference to
  the same buffer) when the stream is on the same device, even for GPU-pinned
  host memory.

* Removed deprecated `max.nn` convolution classes: `Conv2dV1`, `Conv1DV1`,
  `Conv3DV1`. Use `Conv2d`, `Conv1D`, `Conv3D` instead.

* Removed deprecated `max.nn` layer classes: `LinearV1`, `QLinearV1`,
  `GPTQLinearV1`, `MLPV1`, `EmbeddingV1`, `LayerNormV1`, `RMSNormV1`. Use
  `Linear`, `GPTQLinear`, `MLP`, `Embedding`, `LayerNorm`, `RMSNorm` instead.

* Removed `max.engine.MojoValue`.

* Removed the deprecated `custom_ops_path` parameter from
  [`InferenceSession.load()`](https://docs.modular.com/max/api/python/engine.md#max.engine.InferenceSession.load).
  Instead use the `custom_extensions` parameter.

* Added `graph.ops.shard_and_stack()`.

* Removed unused `graph.weights.PytorchWeights`.

### MAX kernels {#26-1-max-kernels}

* Improved performance for Hopper Matmul when using skinny M shapes. In
  particular, when M is between 2 and 64, we see a significant performance
  boost of 10-40% for specific shapes.

* Added a swapAB optimization to Hopper Matmul, which performs B x A and does
  a transposed write to C. This helps when you need more granularity in the M
  dimension.
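
  The trick rests on the identity (A·B)ᵀ = Bᵀ·Aᵀ: swapping the operand roles
  and transposing the write yields the same C. The sketch below checks that
  identity on nested lists. It is only a reference for the math, with
  hypothetical function names, and says nothing about how the GPU kernel
  tiles or schedules the work.

  ```python
  def matmul(a, b):
      """Plain triple-loop matmul on nested lists: (m x k) @ (k x n)."""
      m, k, n = len(a), len(b), len(b[0])
      return [
          [sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
          for i in range(m)
      ]


  def transpose(x):
      return [list(row) for row in zip(*x)]


  def matmul_swap_ab(a, b):
      """Compute C = A @ B by forming C^T = B^T @ A^T, then writing it transposed."""
      c_t = matmul(transpose(b), transpose(a))  # (n x k) @ (k x m) -> n x m
      return transpose(c_t)                     # transposed write back to m x n


  a = [[1, 2, 3]]               # skinny M: 1 x 3
  b = [[1, 0], [0, 1], [2, 2]]  # 3 x 2
  assert matmul_swap_ab(a, b) == matmul(a, b) == [[7, 8]]
  ```

  With the operands swapped, the small M dimension takes the place of N in
  the inner product, which is where the extra granularity comes from.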

* Refined `create_stream` API: all streams are now non-blocking (`blocking`
  argument has been removed). Explicitly use `DeviceEvent` and `synchronize()`
  wherever necessary.

### Mojo language {#26-1-mojo}

For all the updates to the Mojo language, standard library, and tools,
including all GPU programming and `Layout`/`LayoutTensor` changes, see the [Mojo
changelog](https://mojolang.org/releases).

## v25.7 (2025-11-20)

* [Highlights](#25-7-highlights)
* [Documentation](#25-7-docs)
* [MAX models](#25-7-models)
* [MAX framework](#25-7-max)
  * [`max` CLI](#25-7-max-cli)
  * [Python API](#25-7-max-python)
  * [Mojo API](#25-7-max-mojo)
* [Mojo language](#25-7-mojo)

### Highlights {#25-7-highlights}

* The MAX Python API is now [fully open-sourced on
  GitHub](https://github.com/modular/modular/tree/main/max/python/max)!

  As we expand our [model
  repository](https://builds.modular.com/?category=models), we're making
  significant progress on these APIs to simplify the effort to build
  production-ready GenAI models in Python. Some APIs are still experimental,
  but you can [build an LLM with it today](https://llm.modular.com).

### Documentation {#25-7-docs}

* New online book to [build an LLM from scratch with
  MAX](https://llm.modular.com), using our **experimental model APIs**. This is a
  guided lesson to building GPT-2 with our Python API, explaining each component
  of the transformer model along the way. Like the Python APIs, the book is a
  work in progress—please [report any issues in
  GitHub](https://github.com/modular/max-llm-book/issues).

* All the planned parts of [GPU Puzzles](https://puzzles.modular.com/) are now
  complete! Support for Apple silicon GPUs is also making [steady
  progress](https://puzzles.modular.com/howto.html#gpu-support-matrix).

* Tutorials on docs.modular.com are now integrated into the
  [Guides](https://docs.modular.com/max/intro.md) section, indicated with a book icon in the left
  navigation.

* The [`max` CLI docs](https://docs.modular.com/max/cli.md) are now generated from [the CLI
  source](https://github.com/modular/modular/blob/main/max/python/max/entrypoints/pipelines.py).

### MAX models {#25-7-models}

* Gemma3 now supports logprobs.

### MAX framework {#25-7-max}

* Added support for bfloat16 models running on GPUs with ARM-based CPU hosts,
  such as Grace Hopper (GH200) and Grace Blackwell (GB200).
* Updated minimum NVIDIA GPU driver requirement to 580.

#### `max` CLI {#25-7-max-cli}

* [`max benchmark`](https://docs.modular.com/max/cli/benchmark.md) can now run LoRA benchmarking for
  supported models and target modules.

* `max benchmark --collect-gpu-stats` can now collect AMD
  GPU statistics.

* `max serve --do-penalties` was renamed to `--enable-penalties` and enabled by
  default. To disable penalties, you can specify
  [`--no-enable-penalties`](https://docs.modular.com/max/cli/serve.md#--enable-penalties---no-enable-penalties).

#### Python API {#25-7-max-python}

* Added support for Python 3.14.

* Removed support for Python 3.9.

* All MAX Python API modules are now **open-sourced**. In addition to those
  previously released, we've added `driver`, `dtype`, `engine`, `experimental`,
  `interfaces`, `kv_cache`, `mlir`, `nn`, `profiler`, `support`, `torch`, and
  more [in our GitHub
  repo](https://github.com/modular/modular/tree/main/max/python/max).

* Added [`max.profiler`](https://docs.modular.com/max/api/python/profiler.md) module with the
  [`Tracer`](https://docs.modular.com/max/api/python/profiler.md#max.profiler.Tracer) class to create and
  manage profiling spans based on runtime conditions, and the
  `@traced()` decorator to profile a whole function.

* Added [`max.diagnostics.gpu`](https://docs.modular.com/max/api/python/diagnostics.gpu) APIs to expose
  common GPU statistics as might be reported by `nvidia-smi` or `rocm-smi`.

* Added the [`max.kv_cache`](https://docs.modular.com/max/api/python/kv_cache.md) package, which provides
  APIs to manage key-value caches used in transformer models. Not to be confused
  with the existing [`max.nn.kv_cache`](https://docs.modular.com/max/api/python/nn.kv_cache) package that
  includes kernels for KV caching.

* Removed the `KVCacheManager` class and combined it with the single
  [`PagedKVCacheManager`](https://docs.modular.com/max/api/python/generated/max.kv_cache.PagedKVCacheManager)
  implementation. During the merge, `prefetch()` was renamed to `maybe_reserve()`.

* Added
  [`NullKVCacheManager`](https://docs.modular.com/max/api/python/generated/max.kv_cache.DummyKVCache)
  for compile-only mode, which avoids GPU memory allocation when compiling models
  without a physical GPU present.

* Added
  [`ResetPrefixCacheBackend`](https://docs.modular.com/max/api/python/kv_cache.md)
  and
  [`ResetPrefixCacheFrontend`](https://docs.modular.com/max/api/python/kv_cache.md)
  classes for coordinating prefix cache resets between frontend and backend
  components.

* Added more APIs for text-to-speech (TTS) models such as
  [`AudioGenerationInputs`](https://docs.modular.com/max/api/python/interfaces.md#max.interfaces.AudioGenerationInputs)
  and
  [`AudioGenerationOutput`](https://docs.modular.com/max/api/python/interfaces.md#max.interfaces.AudioGenerationOutput)

* Changed
  [`LoRAConfig.max_num_loras`](https://docs.modular.com/max/api/python/generated/max.pipelines.LoRAConfig#max.pipelines.LoRAConfig.max_num_loras)
  default to `1` (was `100`).

* New [`RequestID`](https://docs.modular.com/max/api/python/interfaces.md#max.interfaces.RequestID) class
  replaces previous type alias to provide better type safety and consistency
  across the API.

* Removed `InputContext` and replaced it with the modality-output specific
  [`TextGenerationContext`](https://docs.modular.com/max/api/python/interfaces.md#max.interfaces.TextGenerationContext)
  and
  [`EmbeddingsContext`](https://docs.modular.com/max/api/python/interfaces.md#max.interfaces.EmbeddingsContext).

* Added
  [`ImageMetadata`](https://docs.modular.com/max/api/python/interfaces.md#max.interfaces.ImageMetadata) and
  [`VLMTextGenerationContext`](https://docs.modular.com/max/api/python/interfaces.md#max.interfaces.VLMTextGenerationContext).

* Added [`max.nn.comm`](https://docs.modular.com/max/api/python/nn.md) with `Allreduce` and
  `Signals` for peer-to-peer communication in allreduce.

* [`ops.gather()`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.gather) no longer
  has a default `axis`; it must be specified explicitly (better matching PyTorch
  and NumPy).

* [`Graph.add_subgraph()`](https://docs.modular.com/max/api/python/generated/max.graph.Graph#max.graph.Graph.add_subgraph)
  has been updated to take a `devices` argument. This allows subgraphs to take
  advantage of device-aware work scheduling.

#### Mojo API {#25-7-max-mojo}

* Renamed the `tensor_internal` package to `tensor` and removed the
  previous `tensor` stub—the API behaves the same but the [Mojo `tensor`
  docs](https://docs.modular.com/max/api/kernels/extensibility/tensor.md) moved.

### Mojo language {#25-7-mojo}

For all the updates to the Mojo language, standard library, and tools,
including all GPU programming and `Layout`/`LayoutTensor` changes, see the [Mojo
changelog](https://mojolang.org/releases).

## v25.6.1 (2025-10-10)

Fixes a latency regression due to a top-k algorithm change and a couple
of other benchmarking bugs.

## v25.6 (2025-09-22)

* [Highlights](#25-6-highlights)
* [Documentation](#25-6-docs)
* [MAX models](#25-6-models)
* [MAX framework](#25-6-max)
  * [Inference server](#25-6-max-serve)
  * [`max` CLI](#25-6-max-cli)
  * [Python API](#25-6-max-python)
* [MAX kernels](#25-6-kernels)
* [Mojo language](#25-6-mojo)

### Highlights {#25-6-highlights}

* MAX delivers **state-of-the-art performance on NVIDIA Blackwell** (B200)!

  We've been describing our Blackwell bring-up over a series of blog posts, and
  we recently published
  [Part 4: Breaking SOTA](https://www.modular.com/blog/matrix-multiplication-on-blackwell-part-4---breaking-sota),
  in which we share our latest matmul benchmarks compared to NVIDIA's cuBLAS
  library.

* MAX provides **industry-leading performance on AMD MI355X**!

  In a matter of weeks, we got MAX running on the brand new MI355X system and
  have already produced early benchmarks that go head-to-head with Blackwell.
  If you have access to an MI355X, you can try it yourself today by following
  our [quickstart guide](https://docs.modular.com/max/get-started.md).

* Benchmarking endpoints is easier than ever with the new [`max
  benchmark`](https://docs.modular.com/max/cli/benchmark.md) command, which accepts YAML
  configuration files so you can easily share and reproduce your benchmarks.

### Documentation {#25-6-docs}

* Our new [quickstart guide](https://docs.modular.com/max/get-started.md) lets you pick the model
  architecture and size you want, and then shows you how to deploy it and run our
  open-source benchmarking script, all from the `max` CLI.

* We updated and simplified the [benchmarking
  tutorial](https://docs.modular.com/max/deploy/benchmark.md) to use the new `max benchmark`
  command.

### MAX models {#25-6-models}

* Added the
  [gpt-oss](https://github.com/modular/modular/tree/modular/v25.6.0/max/pipelines/architectures/gpt_oss)
  model architecture (GPU, bfloat16).
  [Try GPT-OSS now](https://builds.modular.com/models/gpt-oss-20b-BF16/20B).

### MAX framework {#25-6-max}

* Added device-aware work scheduling for AsyncRT: work items can now specify a
  `deviceHint` to route execution to specific worker threads based on device
  affinity, improving multi-device performance.

* Improved code quality by enabling a large set of Ruff lints, including
  [flake8-annotations (ANN)](https://docs.astral.sh/ruff/rules/#flake8-annotations-ann)
  which now enforces Python type annotations for new contributions.

#### Inference server {#25-6-max-serve}

* Added support for data parallelism in Llama models. To enable this feature,
  use the `--data-parallel-degree` option:

  ```sh
  max serve --model $MODEL_ID --data-parallel-degree 2 --devices gpu:0,1
  ```

* Metrics for each context encoding and token generation batch are now logged
  to the console periodically. You can override the default frequency
  (3 seconds) of these logs by setting the
  `MAX_SERVE_SCHEDULER_STATS_LOG_INTERVAL_S` flag. For example, setting
  `MAX_SERVE_SCHEDULER_STATS_LOG_INTERVAL_S=0` logs metrics for all batches.

* Improved error messages when pulling a model that requires more RAM than
  what's available or when there won't be enough RAM left for the KV cache.
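
The effect of the `MAX_SERVE_SCHEDULER_STATS_LOG_INTERVAL_S` setting described
above can be pictured with a small plain-Python sketch. `ThrottledBatchLogger`
and `simulate` are hypothetical names, not part of MAX, and the scheduler's real
implementation may differ. The sketch only shows why an interval of `0` logs
every batch while the default of `3` logs at most one batch every three seconds.

```python
from __future__ import annotations


class ThrottledBatchLogger:
    """Hypothetical helper: records at most one batch per `interval_s` seconds."""

    def __init__(self, interval_s: float) -> None:
        self.interval_s = interval_s
        self.last_logged_at: float | None = None
        self.logged: list[str] = []

    def maybe_log(self, now: float, metrics: str) -> bool:
        # A batch is logged if nothing was logged yet or the interval elapsed.
        due = (
            self.last_logged_at is None
            or now - self.last_logged_at >= self.interval_s
        )
        if due:
            self.last_logged_at = now
            self.logged.append(metrics)
        return due


def simulate(interval_s: float, batch_times: list[float]) -> list[str]:
    """Feed batches that finish at the given times through the logger."""
    logger = ThrottledBatchLogger(interval_s)
    for t in batch_times:
        logger.maybe_log(t, f"batch@{t}")
    return logger.logged


default_logs = simulate(3, [0, 1, 2, 3, 4])  # default 3-second interval
all_logs = simulate(0, [0, 1, 2, 3, 4])      # interval of 0
```

With one batch per second, the 3-second interval records only the batches at
`t=0` and `t=3`, while the zero interval records all five.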

#### `max` CLI {#25-6-max-cli}

* Added the `max benchmark` subcommand that runs a suite of benchmarks and
  collects performance metrics on a model server. This command provides
  convenient packaging/installation for our open-source
  [`benchmark_serving.py`](https://github.com/modular/modular/tree/main/benchmark#benchmark-max)
  script and accepts all the same options.

* Added `--chat-template` to the CLI for passing custom chat templates
  defined in Jinja2 template files.

* Renamed the `--allow-safetensors-weights-float32-to-bfloat16-cast` flag to
  `--allow-safetensors-weights-fp32-bf6-bidirectional-cast`, which supports
  automatic bidirectional dtype casts when needed.

* The `max generate` command now supports `--top-k`, `--temperature`, and
  `--seed` flags.

* Changed `--num-warmups` behavior. Previously, it ran the model on the prompt
  `N` times, generating until reaching a stop condition each time. Now it runs
  the model for `N` steps, generating `N` new tokens as a warmup.

* Added the `--model` option as a preferred alternative to `--model-path`. They
  behave the same.

* Deprecated `--pad-to-multiple-of`.

* Removed the previously deprecated `--model-name`. Use `--served-model-name`
  instead.
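
To illustrate the `--num-warmups` change described above, here is a minimal
sketch in plain Python. `ToyModel`, `EOS`, `warmup_old`, and `warmup_new` are
hypothetical stand-ins rather than MAX APIs. The only thing taken from the
release note is the difference in how `N` is interpreted.

```python
from __future__ import annotations

EOS = 0  # hypothetical end-of-sequence token id


class ToyModel:
    """Stand-in model: emits 3, 2, 1, then EOS, and counts decode steps."""

    def __init__(self) -> None:
        self.steps = 0

    def generate(self, prompt: list[int]):
        # The prompt is ignored; the toy always emits the same sequence.
        for token in (3, 2, 1, EOS):
            self.steps += 1
            yield token


def warmup_old(model: ToyModel, prompt: list[int], n: int) -> None:
    """Previous behavior: N full generations, each run to a stop condition."""
    for _ in range(n):
        for token in model.generate(prompt):
            if token == EOS:
                break


def warmup_new(model: ToyModel, prompt: list[int], n: int) -> None:
    """New behavior: N decode steps in total, producing N new tokens."""
    for _step, _token in zip(range(n), model.generate(prompt)):
        pass


old, new = ToyModel(), ToyModel()
warmup_old(old, [1, 2], n=2)
warmup_new(new, [1, 2], n=2)
print(old.steps, new.steps)
```

In this toy, two old-style warmups cost eight decode steps (two full
generations of four tokens each), while two new-style warmups cost exactly two.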

#### Python API {#25-6-max-python}

* Removed the previously deprecated `KVCacheStrategy.CONTINUOUS` and all
  associated classes (including `ContinuousBatchingKVCacheManager`).

* Added `ops.fence`, a pure
  identity operation that prevents the async runtime from reordering operations
  across it. This operation is essential for implementing cross-device
  synchronization.

* Removed `PipelineConfig.max_new_tokens`. Use
  [`SamplingParams.max_new_tokens`](https://docs.modular.com/max/api/python/pipelines.md#max.pipelines.SamplingParams)
  instead.

* Added
  [`logits_processor`](https://docs.modular.com/max/api/python/interfaces.md#max.interfaces.SamplingParams.logits_processors)
  to
  [`SamplingParams`](https://docs.modular.com/max/api/python/interfaces.md#max.interfaces.SamplingParams)
  for updating logits in-place during each step of token generation.

* Added `generate()` to
  [`TextGenerationPipeline`](https://docs.modular.com/max/api/python/generated/max.pipelines.TextGenerationPipeline)
  and
  [`StandaloneSpeculativeDecodingPipeline`](https://docs.modular.com/max/api/python/generated/max.pipelines.lib.StandaloneSpeculativeDecodingPipeline),
  a convenience method for getting text generations. `generate_async()` is
  available for getting streamed outputs.

* Renamed the `target_num_new_tokens` configuration parameter to
  `prefill_chunk_size`
  in
  [`PipelineConfig`](https://docs.modular.com/max/api/python/generated/max.pipelines.PipelineConfig)
  and `TTSConfig` classes to better reflect its role in chunked prefill
  operations.

* Fixed [`ops.range`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.range) to respect
  the `dtype` parameter when using [`Dim`](https://docs.modular.com/max/api/python/graph.md) objects as
  inputs. Previously, the dtype was ignored and defaulted to int64.

* Made the `devices` argument in
  [`InferenceSession()`](https://docs.modular.com/max/api/python/engine.md#max.engine.InferenceSession)
  required. To maintain the previous default behavior, use
  `InferenceSession(devices=[CPU()])`.

* Added an optional `logging` argument to
  [`InferenceSession()`](https://docs.modular.com/max/api/python/engine.md#max.engine.InferenceSession).
  When set to `"op"`, this option enables operation launch output to stderr.

* Added [`max.nn.lora`](https://docs.modular.com/max/api/python/nn.md), providing
  Low-Rank Adaptation (LoRA) support for parameter-efficient fine-tuning of
  neural network models.

* Added [`max.nn.moe`](https://docs.modular.com/max/api/python/nn.md), implementing
  Mixture of Experts (MoE) layers for scalable model architectures.

* Added [`max.nn.sampling`](https://docs.modular.com/max/api/python/nn.md),
  containing advanced sampling methods including MinP and rejection sampling
  techniques.

* Added [`max.nn.hooks`](https://docs.modular.com/max/api/python/nn.md), providing
  debugging and inspection hooks for neural network layers.

* Added attention submodules
  [`max.nn.attention.mask_config`](https://docs.modular.com/max/api/python/nn.attention),
  [`max.nn.attention.multihead_attention`](https://docs.modular.com/max/api/python/nn.attention),
  and
  [`max.nn.attention.multi_latent_attention`](https://docs.modular.com/max/api/python/nn.attention)
  for comprehensive attention mechanism configuration and implementation.

* Moved some Mojo-related functionality to a new top-level `mojo` Python
  namespace. Specifically, `max.mojo` (previously used for Mojo-Python interop),
  some of `max.support`, and `max.entrypoints.mojo` now live under the `mojo`
  namespace and are provided in the new [`mojo`
  package](https://mojolang.org/docs/manual/install#whats-included).
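
The idea behind the logits processors mentioned above (updating logits in place
at each generation step) can be sketched without MAX at all. The following uses
plain Python lists. `ban_tokens` and `greedy_step` are hypothetical helpers,
and the callable signature that MAX actually passes to a processor may differ,
so treat this as an illustration of the concept only.

```python
from __future__ import annotations

import math
from typing import Callable


def ban_tokens(banned: set[int]) -> Callable[[list[float]], None]:
    """Build a processor that masks the given token ids in place."""

    def process(logits: list[float]) -> None:
        for token_id in banned:
            logits[token_id] = -math.inf

    return process


def greedy_step(
    logits: list[float],
    processors: list[Callable[[list[float]], None]],
) -> int:
    """Apply each processor in order, then pick the highest-scoring token."""
    for process in processors:
        process(logits)  # mutates `logits`; nothing is returned
    return max(range(len(logits)), key=logits.__getitem__)


step_logits = [0.1, 2.5, 1.7]
chosen = greedy_step(step_logits, [ban_tokens({1})])
```

Token `1` has the highest raw score, but the processor masks it before
sampling, so the step selects token `2` instead.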

### MAX kernels {#25-6-kernels}

* Added a leaky ReLU activation function kernel.

* Added a specialized [RMS norm](https://docs.modular.com/max/api/kernels/nn/normalization/rms_norm.md)
  function kernel for the common case of `cols=128`, `bfloat16`.

### Mojo language {#25-6-mojo}

For all the updates to the Mojo language, standard library, and tools,
including all GPU programming changes, see the [Mojo
changelog](https://mojolang.org/releases).

## v25.5 (2025-08-05)

* [Highlights](#25-5-highlights)
* [Documentation](#25-5-docs)
* [MAX models](#25-5-models)
* [MAX framework](#25-5-max)
  * [Inference server](#25-5-max-serve)
  * [`max` CLI](#25-5-max-cli)
  * [Python API](#25-5-max-python)
* [Mojo language](#25-5-mojo)

### Highlights {#25-5-highlights}

* **OpenAI-compatible batch API**: The [`/v1/batches`
  API](https://docs.modular.com/max/rest-api.md#POST/v1/batches) is now available with
  [Mammoth](https://docs.modular.com/mammoth.md).

  We recently announced a [partnership with SF
  Compute](https://www.modular.com/blog/sf-compute) to make this API available
  through their dynamic GPU pricing marketplace. Their Large Scale Inference
  Batch API looks different from the `/v1/batches` API in Mammoth because it's
  a superset.

* **New `mojo` Conda package**: For Mojo-specific projects that run on CPUs and
  GPUs, you can now install the bare essentials with the `mojo` Conda package
  that's less than 900 MB on disk. For example, this now works:

  ```sh
  pixi add mojo
  ```

  The `mojo` Python package is not available for pip/uv yet.

  For a complete model-development and serving toolkit, you should still install
  the `modular` package (which includes `mojo` as a dependency).

* **Open-source graph APIs**: We've added the `max.graph` Python APIs to our
  [GitHub
  repo](https://github.com/modular/modular/tree/modular/v25.5.0/max/graph). We've
  made great strides in recent months to simplify these APIs that help you build
  high-performance models you can [serve with
  MAX](https://docs.modular.com/max/develop/serve-custom-model-architectures.md).

### Documentation {#25-5-docs}

* New [Serve custom model architectures
  tutorial](https://docs.modular.com/max/develop/serve-custom-model-architectures.md), with
  [example code on GitHub](https://github.com/modular/modular/tree/main/max/examples/custom-models).

* New guide for [using LoRA adapters with MAX](https://docs.modular.com/max/serve/lora-adapters.md).

* Updated the [Deploy Llama 3 on GPU
  tutorial](https://docs.modular.com/max/deploy/local-to-cloud.md) with instructions using
  AMD MI300X (on Azure).

* Added [Pixi basics](https://docs.modular.com/pixi.md), which is where we redirect all the now-removed
  Magic docs (see our [announcement migrating Magic to
  Pixi](https://forum.modular.com/t/migrating-from-magic-to-pixi/1530)).

### MAX models {#25-5-models}

* Added support for
  [Idefics3](https://github.com/modular/modular/tree/modular/v25.5.0/max/pipelines/architectures/idefics3)
  model.

### MAX framework {#25-5-max}

* Removed all `torch` package dependencies.

  * Reduces the total installation size of `modular` (including
    dependencies) from 2.2 GB for CPUs and 6.5 GB for GPUs **down to 1.5 GB**, for
    all Python packages. Conda packages pull additional system dependencies so
    sizes may vary, but one example brings the size down from 9.8 GB to 2.0 GB.

  * `pip install` no longer requires the `--extra-index-url
    https://download.pytorch.org/whl/cpu` option (which was to avoid installing
    the GPU version of `torch` that has a lot of CUDA dependencies).

  * `uv pip install` no longer requires the `--index-strategy unsafe-best-match`
    option (which was to avoid package resolution issues with the above
    `--extra-index-url` option).

* Removed HuggingFace fallback for model pipelines not natively supported in
  MAX (`PipelineEngine.HUGGINGFACE`), because it's almost never used and it
  creates significant tech debt.

#### Inference server {#25-5-max-serve}

* Added the [`/health` endpoint](https://docs.modular.com/max/rest-api.md#GET/health) for service
  readiness checks, used by tools like lm-eval to determine when the service is
  ready to accept requests.

* [Prefix caching](https://docs.modular.com/max/serve/prefix-caching.md) now uses a Mojo token hashing
  operation. Previously we used Python's built-in `hash()` function, which
  resulted in noticeable CPU overhead and reduced GPU utilization. In this
  release, we migrated the token hashing operation to an accelerated Mojo
  implementation.

* Re-implemented the OpenAI API's `logprobs` and `echo` request
  parameters to eliminate an expensive device transfer.
  The `--enable-echo` flag, which previously incurred a significant performance
  penalty, is now 9-12x faster.

* Added support for `file://` URIs in image inputs for multimodal models. Local
  file access is controlled via the `MAX_SERVE_ALLOWED_IMAGE_ROOTS` environment
  variable, which specifies a list of allowed root directories. Files are read
  asynchronously using aiofiles for better performance under high load.

* Improved [function calling](https://docs.modular.com/max/serve/function-calling.md) (tool use) to more
  reliably extract JSON tool calling responses for Llama models in an
  OpenAI-compatible format.

* Switched from XGrammar to
  [llguidance](https://github.com/guidance-ai/llguidance) for generating
  structured output (constrained decoding).
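
The allowed-roots check for `file://` image inputs described above follows a
common pattern, sketched here with the Python standard library.
`resolve_file_uri` is a hypothetical helper, not the MAX implementation, and it
takes the roots as a Python list because the exact format of
`MAX_SERVE_ALLOWED_IMAGE_ROOTS` isn't specified here.

```python
from __future__ import annotations

from pathlib import Path
from urllib.parse import unquote, urlparse


def resolve_file_uri(uri: str, allowed_roots: list[str]) -> Path:
    """Return the local path for a file:// URI if it is under an allowed root."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError(f"not a file:// URI: {uri}")
    # resolve() collapses ".." segments and symlinks before the check.
    path = Path(unquote(parsed.path)).resolve()
    for root in allowed_roots:
        root_path = Path(root).resolve()
        if path == root_path or root_path in path.parents:
            return path
    raise PermissionError(f"{path} is outside the allowed image roots")
```

Resolving the path before comparing it is the important design choice: a URI
such as `file:///data/images/../secrets.png` is rejected when only
`/data/images` is allowed.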

#### `max` CLI {#25-5-max-cli}

* Added `--vision-config-overrides` CLI option to override
  vision model configuration parameters. For example, to decrease InternVL's
  maximum dynamic patches from 12 to 6:

  ```bash
  max serve --model-path OpenGVLab/InternVL3-38B-Instruct \
    --vision-config-overrides '{"max_dynamic_patch": 6}'
  ```

* Removed the `--ignore-eos` CLI argument. The full set of OpenAI chat and
  completion sampling parameters is now supported in HTTP requests, so you can
  set this parameter in the request payload instead.

#### Python API {#25-5-max-python}

* Added the [`max.interfaces`](https://docs.modular.com/max/api/python/interfaces.md) module. This module
  is intended to be a relatively import-free home for the interfaces shared
  across the MAX stack, and we'll gradually move common interfaces into it. So
  far, we've moved the following from `max.pipelines.core`:

  * Moved `TextGenerationStatus`, `TextResponse`, `TextGenerationResponse`,
    `InputContext`, and `PipelineTask` into `max.interfaces`.

  * Moved all `TokenGeneratorRequest`-prefixed objects into `max.interfaces`
    and renamed with the `TextGenerationRequest` prefix.

  * Moved `TextGenerationStatus` to
    [`GenerationStatus`](https://docs.modular.com/max/api/python/interfaces.md#max.interfaces.GenerationStatus).

  * Moved `TextResponse` and `TextGenerationResponse` to
    [`TextGenerationOutput`](https://docs.modular.com/max/api/python/interfaces.md#max.interfaces.TextGenerationOutput).

  * Moved `EmbeddingsResponse` to
    [`EmbeddingsOutput`](https://docs.modular.com/max/api/python/interfaces.md#max.interfaces.EmbeddingsOutput).

* Added [`ops.scatter_nd`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.scatter_nd)
  operation for scattering updates into a tensor at specified indices.

* Added [`ops.avg_pool2d`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.avg_pool2d)
  and [`ops.max_pool2d`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.max_pool2d).

* Added [`max.torch.graph_op`](https://docs.modular.com/max/api/python/torch.md#max.torch.graph_op)
  interface to make it simple to embed larger MAX computations and models inside
  PyTorch. These can use `max.nn` modules internally and may be used within
  `torch.nn` modules, allowing the use of MAX subcomponents for access to our
  high performance graph compiler and Mojo kernel library.

  ```python
  import torch
  import numpy as np
  import max
  from max.dtype import DType
  from max.graph import ops

  @max.torch.graph_op
  def max_grayscale(pic: max.graph.TensorValue):
      scaled = pic.cast(DType.float32) * np.array([0.21, 0.71, 0.07])
      grayscaled = ops.sum(scaled, axis=-1).cast(pic.dtype)
      # max reductions don't remove the dimension, need to squeeze
      return ops.squeeze(grayscaled, axis=-1)

  @torch.compile
  def grayscale(pic: torch.Tensor):
      output = pic.new_empty(pic.shape[:-1])  # Remove color channel dimension
      max_grayscale(output, pic)  # Call as destination-passing style
      return output

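  # Assumption: use a GPU when PyTorch can see one, otherwise fall back to CPU.
  device = "cuda" if torch.cuda.is_available() else "cpu"
  img = (torch.rand(64, 64, 3, device=device) * 255).to(torch.uint8)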
  result = grayscale(img)
  ```

* Moved `AlgebraicDim`, `Dim`, `StaticDim`, and `SymbolicDim` out of `max.type`
  and into [`max.graph.dim`](https://docs.modular.com/max/api/python/graph.md). You can still import
  them directly from `max.graph`.

* Moved `Shape` out of `max.type` and into
  [`max.graph.shape`](https://docs.modular.com/max/api/python/graph.md). You can still import it
  directly from `max.graph`.

* Removed the ability to pass Python objects into models and have them returned
  as Mojo `PythonObject` types in the kernels.

* Removed `RandomWeights`.

* Removed `Model.execute_legacy()`. Instead use the
  standard [`execute()`](https://docs.modular.com/max/api/python/engine.md#max.engine.Model.execute) or
  [`__call__()`](https://docs.modular.com/max/api/python/engine.md#max.engine.Model.__call) methods.

* Removed TorchScript-related helper functions and APIs, including support for
  `.pt` TorchScript files in custom extensions.
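
For readers unfamiliar with the scatter operation behind `ops.scatter_nd`
(mentioned above), the following plain-Python reference shows the conventional
semantics on nested lists. It is a sketch of the general technique, not the MAX
op: the function below is hypothetical, its argument order is an assumption,
and it resolves duplicate indices by letting the last update win.

```python
from __future__ import annotations

import copy


def scatter_nd(data: list, indices: list[list[int]], updates: list) -> list:
    """Return a copy of `data` with updates[i] written at position indices[i].

    Each index may address a single element (full-rank index) or a whole
    sub-list (partial index).
    """
    out = copy.deepcopy(data)
    for index, update in zip(indices, updates):
        target = out
        # Walk down to the container that holds the addressed slot.
        for axis_pos in index[:-1]:
            target = target[axis_pos]
        target[index[-1]] = copy.deepcopy(update)
    return out


grid = [[0, 0, 0], [0, 0, 0]]
elementwise = scatter_nd(grid, [[0, 1], [1, 2]], [5, 7])
rowwise = scatter_nd(grid, [[1]], [[9, 9, 9]])
```

Here `elementwise` is `[[0, 5, 0], [0, 0, 7]]` and `rowwise` is
`[[0, 0, 0], [9, 9, 9]]`; the input `grid` is left unchanged.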

### Mojo language {#25-5-mojo}

For all the updates to the Mojo language, standard library, and tools,
including all GPU programming changes, see the [Mojo
changelog](https://mojolang.org/releases).

## v25.4 (2025-06-18)

* [Highlights](#25-4-highlights)
* [Documentation](#25-4-docs)
* [MAX models](#25-4-models)
* [MAX framework](#25-4-max)
  * [Inference server](#25-4-max-serve)
  * [`max` CLI](#25-4-max-cli)
  * [Python API](#25-4-max-python)
  * [Mojo API](#25-4-max-mojo)
  * [Custom ops](#25-4-custom-ops)
  * [GPU programming](#25-4-gpu-programming)
* [Mojo language](#25-4-mojo)

### ✨ Highlights {#25-4-highlights}

* **AMD GPUs are officially supported!**

  You can now deploy MAX with acceleration on AMD MI300X and MI325X GPUs, using
  the same code and container that works on NVIDIA GPUs. For the first time, you
  can build portable, high-performance GenAI deployments that run on any
  platform without vendor lock-in or platform-specific optimizations.

  For more details, including benchmarks, see our
  [Modular + AMD blog post](https://www.modular.com/blog/modular-x-amd-unleashing-ai-performance-on-amd-gpus).

* **Now accepting GPU kernel contributions**

  Last month, we open-sourced the code for the CPU and GPU kernels that power
  the MAX framework, and now we're accepting contributions! For information
  about how to contribute and the sort of kernels most interesting to us, see
  the
  [MAX AI kernels contributing guide](https://github.com/modular/modular/blob/main/max/kernels/CONTRIBUTING.md).

* **Preview: Mojo interoperability from Python**

  This release includes an early version of a new Python-to-Mojo
  interoperability API. You can now write just the performance-critical parts
  of your code in Mojo and call it from Python just like you're importing another
  Python library. Check out our docs to [call Mojo from
  Python](https://mojolang.org/docs/manual/python/mojo-from-python).

### Documentation {#25-4-docs}

We've redesigned [builds.modular.com](https://builds.modular.com) and
[docs.modular.com](https://docs.modular.com) with a unified top navigation bar
so you can more easily discover all the available docs and code resources.

New docs:

* [GPU Puzzles](https://builds.modular.com/puzzles/introduction.html): Several
  new puzzles, including: 1D convolution op, softmax op, attention op,
  embedding op, kernel fusion, custom backward pass, GPU functional programming
  patterns, and warp fundamentals.

* [Using AI coding assistants guide](https://docs.modular.com/max/coding-assistants.md): Learn how to use
  large language models (LLMs) and coding assistants (such as Cursor and Claude
  Code) to accelerate your development with Modular.

* [Build an MLP block as a graph module tutorial](https://docs.modular.com/max/develop/build-an-mlp-block.md):
  Learn how to create reusable `Module` components in your MAX graphs.

* [Write custom ops for PyTorch
  tutorial](https://docs.modular.com/max/develop/custom-kernels-pytorch.md) (Beta feature): Learn to write
  high-performance GPU kernels for your PyTorch models with Mojo.

* [Profile MAX kernel
  performance](https://github.com/modular/modular/blob/main/max/docs/kernel-profiling.md):
  Learn how to set up Nsight Compute to profile your Mojo-based kernels on NVIDIA
  GPUs.

Major updates:

* [Build custom ops for GPUs tutorial](https://docs.modular.com/max/develop/build-custom-ops.md):
  Now includes how to write hardware-specific functions for CPUs and GPUs.

* [Optimize a matrix multiply custom op
  tutorial](https://docs.modular.com/max/develop/custom-ops-matmul.md): Migrated from a Recipe with
  revisions to help you improve the performance of your GPU custom ops.

### MAX models {#25-4-models}

* Added the OLMo 2 model architecture
  ([`olmo2`](https://github.com/modular/modular/tree/modular/v25.4.0/max/pipelines/architectures/olmo2)).

  [Try OLMo 2 now](https://builds.modular.com/models/OLMo-2-1124/7B).

* Added Google's Gemma 3 multimodal model architecture
  ([`gemma3multimodal`](https://github.com/modular/modular/tree/modular/v25.4.0/max/pipelines/architectures/gemma3)).

  [Try Gemma3 now](https://builds.modular.com/models/gemma-3-it/1B).

* Added the Qwen 3 model architecture
  ([`qwen3`](https://github.com/modular/modular/tree/modular/v25.4.0/max/pipelines/architectures/qwen3)).

  [Try Qwen3 now](https://builds.modular.com/models/Qwen3/1.7B).

* Added the InternVL3 model architecture
  ([`internvl`](https://github.com/modular/modular/tree/modular/v25.4.0/max/pipelines/architectures/internvl)).
  This is still a work in progress.

* GGUF quantized Llamas (q4\_0, q4\_k, and q6\_k) are now supported with paged
  KVCache strategy.

### MAX framework {#25-4-max}

#### Inference server {#25-4-max-serve}

* Inflight batching no longer requires chunked prefill.

* Expanded token sampling logic, including top\_k, min\_p, min\_new\_tokens,
  and temperature.

* Extended sampling configuration to be per-request, e.g. different requests
  can ask for different sampling hyperparameters.

* Removed support for TorchScript and torch MLIR models.

#### `max` CLI {#25-4-max-cli}

* Added the `--use-subgraphs` flag to `max generate` to allow for the use of
  subgraphs in the model.

* Added the `--port` option to specify the port number with the `max serve`
  command.

#### Python API {#25-4-max-python}

* Lots of new APIs in the [`max.nn`](https://docs.modular.com/max/api/python/nn.md) package.

* Added `max.mojo.importer` module to import Mojo code into Python. See the
  docs for
  [calling Mojo from Python](https://mojolang.org/docs/manual/python/mojo-from-python).

* Added
  [`Graph.add_subgraph()`](https://docs.modular.com/max/api/python/generated/max.graph.Graph#max.graph.Graph.add_subgraph)
  to allow for the addition of a subgraph to a graph.

* Added
  [`Module.build_subgraph()`](https://docs.modular.com/max/api/python/generated/max.nn.Module#max.nn.Module.build_subgraph)
  to allow for the creation of a subgraph for a layer that inherits from `Module`.

* Added the [`call`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.call) op
  which allows for the execution of a subgraph.

* Added the [`fold`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.fold) op for
  combining sliding blocks into a larger tensor.

* Added
  [`KernelLibrary`](https://docs.modular.com/max/api/python/generated/max.graph.KernelLibrary)
  as an argument type for the
  [`Graph`](https://docs.modular.com/max/api/python/generated/max.graph.Graph) constructor.

* Added
  [`QuantizationConfig`](https://docs.modular.com/max/api/python/generated/max.graph.quantization.QuantizationConfig)
  to specify quantization parameters for ops such as
  [`qmatmul()`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.qmatmul).

* Added the `strict` argument to the
  [`Module.load_state_dict()`](https://docs.modular.com/max/api/python/generated/max.nn.Module#max.nn.Module.load_state_dict)
  method. When `strict=True` (default), an error is raised if the `state_dict`
  contains unused keys. When `strict=False`, extra keys are ignored. This helps
  model developers identify missing implementations in their models.

* Added audio generator APIs for text-to-speech models (such as
  `AudioGenerator`,
  `PipelineAudioTokenizer`,
  [`TTSContext`](https://docs.modular.com/max/api/python/generated/max.pipelines.TTSContext),
  and others). This is still a work in progress.

* The
  [`ops.masked_scatter()`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.masked_scatter)
  function now requires naming the `out_dim` explicitly as it is data-dependent.
  For example:

  ```python
  ops.masked_scatter(
      inputs_embeds, video_mask, video_embeds, out_dim="unmasked_inputs"
  )
  ```

* Deprecated the `CONTINUOUS` KVCache strategy
  (`KVCacheStrategy`).
  Please use `PAGED` KVCache strategy instead.

* Removed the `Settings` argument from
  [`LLM`](https://docs.modular.com/max/api/python/entrypoints.md#max.entrypoints.llm.LLM) constructor. The
  server is now automatically configured in the background without consuming an
  HTTP port.

* Removed `Graph.unique_symbolic_dim()`.

* Removed `max_to_torch_type()` and `torch_to_max_type()` and replaced them with
  [`DType.to_torch()`](https://docs.modular.com/max/api/python/dtype.md#max.dtype.DType.to_torch) and
  [`DType.from_torch()`](https://docs.modular.com/max/api/python/dtype.md#max.dtype.DType.from_torch),
  respectively. This aligns with the corresponding NumPy methods.

* Removed `stats_report` property and `reset_stats_report` method from
  [`InferenceSession`](https://docs.modular.com/max/api/python/engine.md#max.engine.InferenceSession). This
  functionality was primarily used for internal PyTorch debugging and is no
  longer needed.

* Removed the naive KVCache (`nn.kv_cache.naive_cache`).

* Removed `nn.attention` and `nn.naive_attention_with_rope`.

* Renamed `ops.select` to
  [`ops.where`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.where). This matches the
  name of the similar operation in torch and numpy.
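
The `strict` behavior of `Module.load_state_dict()` described above can be
mimicked with a toy function over plain dictionaries. This `load_state_dict` is
a hypothetical stand-in, not the MAX method. The release note only describes
unused keys, so how this sketch treats parameters that are absent from the
`state_dict` (it leaves them untouched) is an assumption.

```python
from __future__ import annotations


def load_state_dict(
    params: dict[str, float],
    state_dict: dict[str, float],
    strict: bool = True,
) -> None:
    """Copy matching entries from `state_dict` into `params`."""
    unused = sorted(set(state_dict) - set(params))
    if strict and unused:
        # Extra keys usually mean the model is missing an implementation.
        raise ValueError(f"state_dict contains unused keys: {unused}")
    for name in params:
        if name in state_dict:
            params[name] = state_dict[name]


weights = {"linear.weight": 0.0}
checkpoint = {"linear.weight": 1.5, "linear.bias": 0.2}

load_state_dict(weights, checkpoint, strict=False)  # extra key is ignored

try:
    load_state_dict(weights, checkpoint)  # strict=True is the default
except ValueError as err:
    strict_error = str(err)
```

The strict call fails because `linear.bias` has no matching parameter, which is
the signal that the toy model doesn't implement a bias yet.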

#### Mojo API {#25-4-max-mojo}

* [`LayoutTensor`](https://mojolang.org/docs/layout/layout_tensor/LayoutTensor/)
  now has a
  `size` method to get the total number of elements.

* Following our [previous deprecation](https://docs.modular.com/max/changelog.md#25-3-engine-mojo-api) of
  the Mojo `max.driver`, `max.graph` and `max.engine` APIs, we've removed them
  from the package and API docs.

  As a result, we've also removed Mojo `max.tensor` APIs (including `Tensor`,
  `TensorShape`, and `TensorSpec`). You can replace any use with
  [`LayoutTensor`](https://mojolang.org/docs/layout/layout_tensor/LayoutTensor/).

#### Custom ops {#25-4-custom-ops}

* Improved error messages when custom op parameters are provided with values
  that don't have the proper type.

* The [`ops.custom()`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.custom) function
  now requires a `device` argument to specify where the operation should execute.
  This avoids the need for custom ops to infer their execution device, which can
  be error-prone.

* Added the [`max.torch`](https://docs.modular.com/max/api/python/torch.md) module with the
  `CustomOpLibrary` class for using custom Mojo kernels from PyTorch. For
  example, with a custom `grayscale` operation written in Mojo:

  ```mojo
  @register("grayscale")
  struct Grayscale:
      @staticmethod
      fn execute[
          # The kind of device this is running on: "cpu" or "gpu"
          target: StaticString,
      ](
          img_out: OutputTensor[dtype = DType.uint8, rank=2],
          img_in: InputTensor[dtype = DType.uint8, rank=3],
          ctx: DeviceContextPtr,
      ) raises:
          ...
  ```

  You can load it with PyTorch like so:

  ```python
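  import torch
  from max.torch import CustomOpLibrary

  op_library = CustomOpLibrary("path/to/custom.mojopkg")

  # Assumption: "inductor" is PyTorch's default compile backend; use whichever
  # backend your setup requires.
  backend = "inductor"

  @torch.compile(backend=backend)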
  def grayscale(pic):
      result = pic.new_empty(pic.shape[:-1])
      op_library.grayscale(result, pic)
      return result

  img = (torch.rand(64, 64, 3) * 255).to(torch.uint8)
  result = grayscale(img)
  ```

  See our
  [tutorial to write custom ops for PyTorch](https://docs.modular.com/max/develop/custom-kernels-pytorch.md),
  and our
  [PyTorch custom operation examples](https://github.com/modular/modular/tree/main/max/examples/pytorch_custom_ops),
  which range from a very basic "hello world" to the replacement of a layer in a
  full model.

#### GPU programming {#25-4-gpu-programming}

* Full support for AMD CDNA3 datacenter GPUs is now available! Specifically,
  MI300X and MI325X.

* Added initial support for programming on AMD RDNA3 consumer GPUs. Basic
  tuning parameters have been specified for AMD Radeon 780M integrated GPUs. (AMD
  RDNA3 support is for GPU programming only; AI models are still missing some GPU
  kernels for this architecture.) For details, see the [GPU
  requirements](https://docs.modular.com/max/packages.md#gpu-compatibility).

* Now accepting CPU and GPU kernel contributions. See the [MAX AI kernels
  contributing
  guide](https://github.com/modular/modular/blob/main/max/kernels/CONTRIBUTING.md).

### Mojo language {#25-4-mojo}

For all the updates to the Mojo language, standard library, and tools, see the
[Mojo changelog](https://mojolang.org/releases).

## v25.3 (2025-05-06)

* [Highlights](#25-3-highlights)
* [Documentation](#25-3-docs)
* [`max` CLI](#25-3-max-cli)
* [MAX models](#25-3-models)
* [MAX Serve](#25-3-serve)
* [MAX Engine & Graph](#25-3-engine)
  * [Python API](#25-3-engine-python-api)
  * [Mojo API](#25-3-engine-mojo-api)
  * [Custom ops](#25-3-custom-ops)
* [Kernels](#25-3-kernels)
* [GPU programming](#25-3-gpu-programming)
* [Mojo language](#25-3-mojo)

### ✨ Highlights {#25-3-highlights}

* You can now **install Modular APIs and tools with pip**:

  ```sh
  pip install modular \
    --index-url https://download.pytorch.org/whl/cpu
  ```

  This installs the `max` CLI, `max` Python library, `mojo` CLI, and Mojo
  libraries. However, the Mojo LSP and debugger are currently not included.

  We use the `--index-url` argument to ensure that `torch` installs its CPU
  dependencies only, thus avoiding a lot of unnecessary GPU packages. This is a
  temporary workaround until we can remove our dependency on `torch`.

* We **open-sourced the MAX AI kernels** and the rest of the **Mojo standard
  library**!

  The
  [MAX AI kernels library](https://mojolang.org/docs/lib#max-ai-kernels-library)
  is a new Mojo API for writing high-performance and portable programs across CPU
  and GPU, but it's also
  [the source code for our CPU/GPU kernels](https://github.com/modular/modular/tree/main/max/kernels/src).
  You can now see the Mojo code we use in MAX to power GenAI workloads on CPUs and
  GPUs.

  Just like the Mojo standard library, these kernels are open source under the
  Apache 2.0 License with LLVM exceptions. Plus, the rest of the Mojo standard
  library is also [now open source in
  GitHub](https://github.com/modular/modular/tree/main/mojo/std/src).

* **Learn to program GPUs** with
  [Mojo GPU Puzzles](https://builds.modular.com/puzzles)!

  This is a brand new site that offers a hands-on guide to mastering GPU
  programming with Mojo. Starting from basic concepts, you'll learn step-by-step
  how to program for GPUs by solving increasingly challenging puzzles.

### Documentation {#25-3-docs}

We've restructured the documentation to unify MAX and Mojo documentation
under the Modular Platform. We believe this improves content discovery with a
simplified navigation and helps unify the platform story as a whole.

We've also added the following new docs:

* [REST API reference](https://docs.modular.com/max/rest-api.md): Although it's not a new API (our
  serving library has supported OpenAI APIs for the last few versions), this
  now shows precisely which endpoints and body parameters we support.

* [Speculative decoding](https://docs.modular.com/max/serve/speculative-decoding.md): An introduction to
  using speculative decoding to reduce latency for LLMs. This feature is still in
  development.

* [Offline inference](https://docs.modular.com/max/serve/offline-inference.md): An introduction to our
  Python API for running inference with an LLM locally (without sending requests
  to a serving endpoint).

* [Introduction to layouts](https://mojolang.org/docs/manual/layout/layouts):
  A guide to working
  with dense multidimensional arrays on CPUs and GPUs, using new Mojo `layout`
  types that abstract away complex memory layout patterns.

### `max` CLI {#25-3-max-cli}

* Renamed the `max-pipelines` CLI tool to `max`. We recommend re-installing
  it as shown in the [`max` CLI docs](https://docs.modular.com/max/cli.md).

* Removed the previously deprecated `--use-gpu`, `--serialized_model_path`,
  `--save_to_serialized_model_path`, `--max_cache_batch_size`, and
  `--huggingface-repo-id` options.

* Moved `InputContext`, `TextContext`, and `TextAndVisionContext` from
  `max.pipelines` to `max.pipelines.context`.

### MAX models {#25-3-models}

* Added `Llama4ForConditionalGeneration` support, featuring new MoE layers.
  Currently, it is limited to text inputs. Run the model by calling:

  ```sh
  max generate --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --devices 0,1,2,3
  ```

* Added support for running text generations using the Mistral 3 24B model. Run
  the model with:

  ```sh
  max generate --model-path mistralai/Mistral-Small-3.1-24B-Instruct-2503 --devices 0
  ```

* Fixed empty textual outputs for certain Mistral models
  ([MAX issue 4193](https://github.com/modular/modular/issues/4193)).

* Added support for loading a custom pipeline architecture by module. Passing
  `--custom-architectures=folder/path/to/import:my_module` loads architectures
  from that module. The architectures must be exposed via an `ARCHITECTURES`
  variable in the module's file. Once loaded, a model can be run using the new
  architectures. The flag can be specified multiple times to load more
  modules.
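
To illustrate the `folder:module` convention used by the `--custom-architectures` flag above, here is a minimal sketch of how such a value could be resolved with the Python standard library. The `load_architectures` helper is hypothetical and is not part of MAX; it only assumes that the text before the colon is an importable folder and the text after it is a module exposing `ARCHITECTURES`.

```python
import importlib
import sys
from pathlib import Path


def load_architectures(spec: str) -> list:
    """Resolve a 'folder/path:module_name' value (hypothetical helper, not MAX's loader)."""
    folder, _, module_name = spec.rpartition(":")
    if not folder or not module_name:
        raise ValueError(f"expected 'folder:module', got {spec!r}")
    # Temporarily make the folder importable, then import the module by name.
    sys.path.insert(0, str(Path(folder).resolve()))
    try:
        importlib.invalidate_caches()
        module = importlib.import_module(module_name)
    finally:
        sys.path.pop(0)
    if not hasattr(module, "ARCHITECTURES"):
        raise AttributeError(f"{module_name} does not define ARCHITECTURES")
    return list(module.ARCHITECTURES)
```

Calling the helper once per flag occurrence mirrors the note that the flag can be repeated to load more modules.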

### MAX Serve {#25-3-serve}

* Moved from a radix trie to a hash-based prefix caching implementation, which
  has lower CPU overhead. This improves performance, particularly in workloads
  with high cache reuse rates.

* Added experimental support for offloading KVCache to host memory via the
  `--enable-kvcache-swapping-to-host` and `--host-kvcache-swap-space-gb` flags.
  This allows for superior KVCache reuse through prefix caching in workloads
  where the reusable KVCache amount exceeds GPU VRAM.

* Fixed the `usage.prompt_tokens` field in the OpenAI API Usage Info response.
  Previously this field was always set to `null`, but now it correctly
  contains the number of prompt tokens in the request.

* Switched from Python's multiprocessing `Queue` to ZeroMQ. This reduces
  communication latency between the frontend server process and the model
  worker process.

* Stray model workers on Linux now terminate more reliably when the parent
  process is killed.
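
As a rough illustration of the hash-based prefix caching mentioned above, the following sketch hashes each full block of tokens together with the hash of the preceding block, so a cache lookup is a sequence of dictionary probes rather than a trie walk. The block size, the `PrefixCache` class, and the choice of SHA-256 are illustrative assumptions, not details of the MAX Serve implementation.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (illustrative value)


def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each full block together with the hash of the block before it."""
    hashes, parent = [], ""
    full_len = len(tokens) - len(tokens) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        block = tokens[start:start + BLOCK_SIZE]
        parent = hashlib.sha256(f"{parent}|{block}".encode()).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    def __init__(self) -> None:
        self.blocks: dict[str, int] = {}  # block hash -> KV block id

    def insert(self, tokens: list[int]) -> None:
        for h in block_hashes(tokens):
            self.blocks.setdefault(h, len(self.blocks))

    def cached_prefix_len(self, tokens: list[int]) -> int:
        """Number of leading tokens whose KV blocks are already cached."""
        hit = 0
        for h in block_hashes(tokens):
            if h not in self.blocks:
                break
            hit += BLOCK_SIZE
        return hit


cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(cache.cached_prefix_len([1, 2, 3, 4, 5, 6, 7, 8, 42, 43, 44, 45]))  # 8
print(cache.cached_prefix_len([9, 2, 3, 4, 5, 6, 7, 8]))                  # 0
```

Because each hash folds in its parent, two blocks with identical tokens but different prefixes never collide on the same key.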

### MAX Engine & Graph {#25-3-engine}

#### Python API {#25-3-engine-python-api}

* We now raise an error if there's a mismatch between the expected device of a
  weight on a graph and the device of the actual tensor data specified in
  [`InferenceSession.load()`](https://docs.modular.com/max/api/python/engine.md#max.engine.InferenceSession.load).

* Removed `output_device` argument from
  [`Model.execute()`](https://docs.modular.com/max/api/python/engine.md#max.engine.Model.execute).

* Removed the `copy_inputs_to_device` argument in
  [`Model.execute`](https://docs.modular.com/max/api/python/engine.md#max.engine.Model.execute) to improve
  predictability of the API. Now `execute()` raises a `TypeError` if arguments
  are passed whose devices don't match the model.

* Swapped the order of the `dtype` and `shape` fields of
  [`driver.Tensor`](https://docs.modular.com/max/api/python/driver.md#max.driver.Tensor).
  Previously, the arguments were ordered as `(shape, dtype)`. They are now
  swapped to `(dtype, shape)` to be in line with other tensor-like types.

* Replaced some instances of
  [`Tensor.zeros`](https://docs.modular.com/max/api/python/driver.md#max.driver.Tensor.zeros)
  with `Tensor.__init__` when the engine did not depend on the tensor being zero
  initialized. This elides the unnecessary memset to provide a minor performance
  improvement.

* Added a new experimental
  [`Tensor.inplace_copy_from()`](https://docs.modular.com/max/api/python/driver.md#max.driver.Tensor.inplace_copy_from).
  This allows users to copy the contents of one `Tensor` into another.

* [`Weight`](https://docs.modular.com/max/api/python/generated/max.graph.Weight)
  now expects its initial allocation to be on the host by default. A transfer
  to the target device is then inserted, and the transferred value is returned
  when the weight generates an MLIR value. This is due to the current
  conservative ownership model around external weights.

* Added the [`irfft`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.irfft) op, which
  computes the inverse real fast Fourier transform (FFT).

* Added the [`argmax`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.argmax) op,
  which returns the index of the maximum value in an array or sequence.

* Added the [`GroupNorm`](https://docs.modular.com/max/api/python/nn.md) layer.

* Switched layer names so that `max.nn` layers that are implemented with the
  deprecated `Layer` class are marked as "V1", and layers that are implemented
  with the new [`max.nn.Module`](https://docs.modular.com/max/api/python/generated/max.nn.Module)
  are the default. That is, `max.nn.LinearV2` is now
  [`max.nn.Linear`](https://docs.modular.com/max/api/python/generated/max.nn.Linear), and the
  previous `max.nn.Linear` is now
  `max.nn.LinearV1`.

* `DeviceRef`s in types and layers are now generally expected to be explicit
  rather than implicit.

#### Mojo API {#25-3-engine-mojo-api}

* Removed some functionality from
  [`tensor.Tensor`](https://docs.modular.com/max/api/kernels/extensibility/tensor/tensor/Tensor.md):

  * Serializing `Tensor` to disk (`Tensor.tofile(path)` and
    `Tensor.save(path)`).
  * Reading the serialized data back from disk (`Tensor.load(path)` and
    `Tensor.fromfile(path)`).
  * `rand` and `randn` methods have been removed. Use the ones in the Mojo
    standard library if you still need access for constructing a new `Tensor`
    with random elements based on a particular `TensorShape`.

* **Deprecated the Mojo Driver, Graph, and Engine APIs**

  These APIs are not currently used internally. Instead, we build graphs using
  the Python APIs, and our engineering efforts have been focused on making that
  experience as robust and user-friendly as possible. As a result, the Mojo
  versions of these APIs have not kept pace with new features and language
  improvements. These APIs will be open sourced for the community before being
  removed.

#### Custom ops API {#25-3-custom-ops}

* You can now pass Mojo source package paths as
  [`Graph`](https://docs.modular.com/max/api/python/generated/max.graph.Graph)
  custom extensions. The Mojo code is compiled automatically, so there's no
  need to run `mojo package` manually as a prior step. Previously, only
  pre-compiled `.mojopkg` paths were accepted, requiring the Mojo code to be
  built before running a `Graph` with a custom op.

  Given a project structure like:

  ```text
  project
  |-- main.py
  \-- kernels
      |-- __init__.mojo
      \-- my_custom_op.mojo
  ```

  You can construct a `Graph` in `main.py` using Mojo custom op kernels simply
  using:

  ```python
  g = Graph(
    ...,
    custom_extensions = [Path(__file__).parent / "kernels"]
  )
  ```

  A change to your Mojo source code defining a custom op will be reflected
  immediately the next time the `Graph` is constructed.

* New
  [image\_pipeline example](https://github.com/modular/modular/tree/main/max/examples/custom_ops)
  that demonstrates sequencing custom ops together which modify an image,
  leaving data on the GPU for each op, before writing it back to CPU and disk.

### Kernels {#25-3-kernels}

* More compute overlap is now enabled for Hopper GPUs. This allows finer-grained
  scheduling of kernel operations by analyzing producer-consumer patterns within
  a compute kernel. As a result, there is more kernel compute overlap,
  especially for compute-heavy kernels with data-dependent execution paths.

### GPU programming {#25-3-gpu-programming}

* Reduced the CUDA driver requirement to version 12.4 and the NVIDIA driver
  requirement to version 550. Requiring these earlier driver versions makes MAX
  easier to deploy on AWS and GCP, since these are the default versions used by
  those cloud providers.

* Added support for programming NVIDIA Jetson Orin GPUs (`sm_87`).

Also see the
[Mojo changelog of GPU changes](https://mojolang.org/releases#gpu-changes).

### Mojo language {#25-3-mojo}

* We recently open-sourced the rest of the Mojo standard library, including the
  `algorithm`, `benchmark`, `buffer`, `compile`, `complex`, `gpu`, and `layout`
  packages. [See it all in
  GitHub](https://github.com/modular/modular/tree/main/mojo/std/src).

* We've also open sourced [all our MAX AI
  kernels](https://github.com/modular/modular/tree/main/max/kernels/src).
  This new library includes `kv_cache`, `layout`, `linalg`, `nn`, `nvml`, and
  `quantization`.

For all the updates to the Mojo language, standard library, and tools, see the
[Mojo changelog](https://mojolang.org/releases).

## v25.2 (2025-03-25)

* [Highlights](#25-2-highlights)
* [MAX Serve](#25-2-serve)
* [MAX models](#25-2-models)
  * [`max-pipelines` CLI](#25-2-pipelines-cli)
* [MAX Engine](#25-2-engine)
  * [Driver APIs](#25-2-driver)
  * [Graph APIs](#25-2-graph)
  * [Custom ops](#25-2-custom-ops)
  * [Hopper Kernels](#25-2-hopper-kernels)
* [GPU programming](#25-2-gpu-programming)
* [Mojo](#25-2-mojo)
* [Documentation](#25-2-documentation)

### ✨ Highlights {#25-2-highlights}

* **Support for NVIDIA Hopper GPUs**

  MAX has been optimized to run on Hopper GPUs. For more information on MAX and
  NVIDIA's hardware, see the [MAX
  container](https://docs.modular.com/max/container.md#recommended-cloud-instances) documentation.

* **Multi-GPU support**

  MAX uses tensor parallelism to distribute work across multiple GPUs so you can
  run LLMs like
  [`Llama-3.3-70B-Instruct`](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct),
  even with long context windows.

* **Expanded library of MAX models**

  We're rapidly growing our library of base model architectures that MAX can
  accelerate with MAX Serve (including `Phi3ForCausalLM`, `OlmoForCausalLM`,
  and `GraniteForCausalLM`). We also now support GPTQ for the Llama models.
  For more information, check out our [MAX model
  repository](https://builds.modular.com/?category=models).

* **Advanced E2E optimizations for long context windows**

  In-flight batching, chunked prefill, and copy-on-write optimize execution
  for prefix-heavy and long-context-window scenarios.

* **GPU programming with Mojo**

  Lots of new APIs are now available to enable both low-level GPU programming
  and abstracted programming patterns that simplify the code required to write
  GPU kernels for your AI models.

### MAX Serve {#25-2-serve}

* Extended MAX Serve batch scheduling to account for the prefix cache. The
  scheduler can now create larger batches when many prompt tokens are already
  cached, improving throughput up to 10% in some benchmarks.

* Added support for in-flight batching, allowing token generation requests to be
  scheduled alongside context encoding requests to reduce inter-token latency.
  This behavior can be controlled by CLI argument `--enable-in-flight-batch`.

* Added support for copy-on-write on KV blocks when using PagedAttention with
  Prefix Caching. This improves the prefix cache hit rate and prefill performance
  in some scenarios.

* MAX Serve now supports `transformers` v4.49.0, with a patch
  to avoid graph breaks when using `torch.compile()` on Llama models.

* Added support for recording HTTP traffic out to a file for diagnostics or
  later replay.
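
The copy-on-write behavior for KV blocks noted above can be pictured with a small reference-counting sketch. The `BlockPool` class and its methods are hypothetical and exist only to show the idea: a shared block is copied only when one of its users needs to write to it.

```python
class BlockPool:
    """Reference-counted KV blocks with copy-on-write (illustrative only)."""

    def __init__(self) -> None:
        self.data: dict[int, list[int]] = {}
        self.refs: dict[int, int] = {}
        self._next_id = 0

    def allocate(self, tokens: list[int]) -> int:
        block_id = self._next_id
        self._next_id += 1
        self.data[block_id] = list(tokens)
        self.refs[block_id] = 1
        return block_id

    def share(self, block_id: int) -> int:
        """A second sequence reuses the block instead of recomputing it."""
        self.refs[block_id] += 1
        return block_id

    def append(self, block_id: int, token: int) -> int:
        """Append to a block, copying it first if another sequence shares it."""
        if self.refs[block_id] > 1:
            self.refs[block_id] -= 1
            block_id = self.allocate(self.data[block_id])
        self.data[block_id].append(token)
        return block_id


pool = BlockPool()
seq_a = pool.allocate([10, 11])  # partially filled block from request A
seq_b = pool.share(seq_a)        # request B hits the same cached prefix
seq_b = pool.append(seq_b, 99)   # B diverges, so it gets a private copy
print(pool.data[seq_a], pool.data[seq_b])  # [10, 11] [10, 11, 99]
```

Sharing partially filled blocks this way is what lets more requests count as prefix cache hits, at the cost of one block copy when they diverge.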

### MAX models {#25-2-models}

* Added support for executing `LlamaForCausalLM` architecture models on multiple
  GPUs. The model uses tensor parallelism automatically when passing multiple
  device IDs to the `--devices` CLI argument. Try running
  [`meta-llama/Llama-3.3-70B-Instruct`](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)
  on 4 GPUs with the following example:

  ```sh
  max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
    --quantization-encoding bfloat16 \
    --devices gpu:0,1,2,3 \
    --prompt="Design a
      self-sustaining colony on Neptune's moon Triton with a myth/science
      fusion name, three quantum tech breakthroughs, one ethical debate, a
      neon-lit cultural ritual, and a hidden flaw—presented in bullet points."
  ```

* Added support for the `Phi3ForCausalLM` model architecture (such as
  [`microsoft/phi-4`](https://huggingface.co/microsoft/phi-4)). For example:

  ```sh
  max-pipelines generate \
    --model-path microsoft/phi-4 \
    --prompt "Write bubble sort in mojo"
  ```

* Added support for the `OlmoForCausalLM` model architecture (such as
  [`allenai/OLMo-1B-0724-hf`](https://huggingface.co/allenai/OLMo-1B-0724-hf)).
  For example:

  ```sh
  max-pipelines generate \
    --model-path allenai/OLMo-1B-0724-hf \
    --prompt "Write bubble sort in mojo"
  ```

* Added support for the `GraniteForCausalLM` model architecture (such as
  [`ibm-granite/granite-3.1-8b-instruct`](https://huggingface.co/ibm-granite/granite-3.1-8b-instruct)).
  For example:

  ```sh
  max-pipelines generate \
    --model-path ibm-granite/granite-3.1-8b-instruct \
    --prompt "Write bubble sort in mojo"
  ```

* Added support for:

  * [`microsoft/Phi-3.5-mini-instruct`](https://huggingface.co/microsoft/Phi-3.5-mini-instruct)
  * [`microsoft/phi-4`](https://huggingface.co/microsoft/phi-4)
  * [`LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
  * [`LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct`](https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct)

* We now support GPTQ quantization for models that run on the GPU. This is
  handled transparently when the model weights are specified. For example, this
  runs Llama 3.1 8B using int4-quantized GPTQ weights:

  ```sh
  max-pipelines generate \
    --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
    --prompt "Why is the sky blue?" \
    --max-batch-size 1 \
    --max-length 10000
  ```

  This reduces the total memory consumption of this model from \~16 GB to \~5 GB,
  allowing the model to fit in the RAM of smaller GPUs.

* Model weights are now downloaded in parallel.

* Added constraints on whitespace during [Structured
  Output](https://docs.modular.com/max/serve/structured-output.md). This reduces
  token counts and improves model adherence.

* Added jump ahead decoding during Structured Output. This auto-completes tokens
  when a singular path forward is identified, improving single completion times by
  up to \~20% for long prompts.

* In the event of an unhandled exception, we now use the standard Python
  traceback format instead of using pretty-printed Rich tracebacks.

* We now need to explicitly import `LLM` from
  [`max.entrypoints.llm`](https://docs.modular.com/max/api/python/entrypoints.md) rather than the previous
  `max.entrypoints` import.

* The `max.pipelines.dataprocessing.tokenizer` and
  `max.pipelines.dataprocessing.gguf_utils` modules have been removed.

* The previously deprecated `PipelineConfig.architecture` field and its
  corresponding `--architecture` CLI argument have been removed.
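
The jump-ahead decoding entry above can be sketched with a toy example: whenever the output constraints leave exactly one legal next token, that token is emitted without consulting the model. The "grammar" here is just a list of allowed token sequences, and `pick` stands in for a model forward pass; both are simplifying assumptions rather than how MAX represents constraints.

```python
def allowed_next(prefix: tuple, valid: list) -> set:
    """Tokens that keep `prefix` consistent with at least one valid sequence."""
    n = len(prefix)
    return {seq[n] for seq in valid if len(seq) > n and seq[:n] == prefix}


def decode(valid: list, pick) -> tuple:
    """Generate one valid sequence, calling `pick` only when there is a real choice."""
    out = ()
    model_calls = 0
    while True:
        options = allowed_next(out, valid)
        if not options:
            return out, model_calls
        if len(options) == 1:
            token = next(iter(options))    # jump ahead: the token is forced
        else:
            token = pick(sorted(options))  # stand-in for a model forward pass
            model_calls += 1
        out += (token,)


# Toy "grammar": two JSON-like outputs that differ in a single token.
VALID = [
    ("{", '"ok"', ":", "true", "}"),
    ("{", '"ok"', ":", "false", "}"),
]
tokens, calls = decode(VALID, pick=lambda options: options[0])
print("".join(tokens), calls)  # {"ok":false} 1
```

Five tokens are produced with a single model call, which is the effect the changelog entry describes for outputs with long forced stretches.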

### `max-pipelines` CLI {#25-2-pipelines-cli}

* The `--devices` CLI argument now supports a comma-separated list of GPU IDs
  prefixed with `gpu:` like `--devices=gpu:0,1,2,3`. We no longer support the
  previous `--devices=gpu-<N>` format.

  ```sh
  max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
    --quantization-encoding bfloat16 \
    --devices gpu:0,1,2,3 \
    --prompt="Design a self-sustaining colony on Neptune's moon Triton with a myth/science fusion name, three quantum tech breakthroughs, one ethical debate, a neon-lit cultural ritual, and a hidden flaw—presented in bullet points."
  ```

* Removed `--huggingface-repo-id`
  [PipelineConfig](https://docs.modular.com/max/api/python/generated/max.pipelines.PipelineConfig)
  option and CLI argument in favor of `--model-path`.

* We consolidated `--model-path` and `--weight-path`. Valid `--weight-path`
  values now override `--model-path`, which handles both local and remote
  (Hugging Face) cases. If we cannot derive the weights from the
  `--weight-path`, we now fall back to the `--model-path`, which you must set
  explicitly.

* Added `--huggingface-revision` option, to allow selecting a non-default branch
  or a specific commit in a Hugging Face model repository.
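
For scripts that build or validate the `--devices` value themselves, the `gpu:`-prefixed format described above is easy to parse. The `parse_devices` helper below is hypothetical and covers only the comma-separated GPU form shown in this section; any other values the CLI may accept are out of scope.

```python
def parse_devices(value: str) -> list[int]:
    """Parse a 'gpu:0,1,2,3' style value into GPU IDs (hypothetical helper)."""
    prefix, sep, ids = value.partition(":")
    if prefix != "gpu" or not sep or not ids:
        raise ValueError(f"expected 'gpu:<id>[,<id>...]', got {value!r}")
    return [int(part) for part in ids.split(",")]


print(parse_devices("gpu:0,1,2,3"))  # [0, 1, 2, 3]
```

The older `gpu-<N>` spelling fails the prefix check and raises `ValueError`, matching the note that it is no longer supported.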

### MAX Engine {#25-2-engine}

* The MAX graph compiler now has kernel caching. This is a significant
  improvement to our compilation pipeline. Here are some of the highlights:

  * Up to 28% faster compilation times when making iterative changes to models

  * Improved caching between different but similar models (up to 27% faster)

  * Lays the foundation for future caching optimizations

  What does this mean for you? Faster development cycles! When you're working on
  model pipelines and making changes to the graph, the graph compiler will now
  intelligently reuse kernels that haven't changed, significantly reducing
  compilation times.

  The improvements are particularly noticeable during iterative development, with
  compilation times dropping from \~80s to \~57s in some cases of compiling
  Llama3.1-8B for 4 GPUs. Even when compiling different models from the same
  family (like Llama/Granite variants), you'll see significant speedups on
  subsequent compilations.
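
The general idea behind kernel caching can be shown with a content-addressed cache: each kernel is keyed by a hash of its source, so only kernels whose source changed are recompiled. This is a conceptual sketch with made-up kernel strings and a fake `compile_kernel` step; it does not describe the graph compiler's actual cache keys or storage.

```python
import hashlib

compile_count = 0


def compile_kernel(source: str) -> str:
    """Stand-in for an expensive kernel compilation step."""
    global compile_count
    compile_count += 1
    return f"binary<{hashlib.sha256(source.encode()).hexdigest()[:8]}>"


def compile_graph(kernels: dict[str, str], cache: dict[str, str]) -> dict[str, str]:
    """Compile every kernel, reusing cached binaries for unchanged sources."""
    binaries = {}
    for name, source in kernels.items():
        key = hashlib.sha256(source.encode()).hexdigest()
        if key not in cache:
            cache[key] = compile_kernel(source)
        binaries[name] = cache[key]
    return binaries


kernel_cache: dict[str, str] = {}
compile_graph({"matmul": "A @ B", "softmax": "exp(x) / sum(exp(x))"}, kernel_cache)
# Second build: only the softmax source changed, so matmul is reused.
compile_graph({"matmul": "A @ B", "softmax": "exp(x - max(x)) / sum(...)"}, kernel_cache)
print(compile_count)  # 3
```

Keying on content rather than on the model name is also what allows reuse across different but similar models.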

### Driver APIs {#25-2-driver}

* Added `Accelerator.can_access(other: Device) -> bool` method to check if one
  device can directly access memory of another device.

* Fixed a bug in `max.driver.tensor.load_max_tensor()` for `bfloat16` dtype,
  which would cause an error about mmap size being too large.

* `max.driver.Tensor.item()` now works on any single-element tensor (previously
  restricted to rank-0 tensors).

* Added
  [`Device.synchronize()`](https://docs.modular.com/max/api/python/driver.md#max.driver.Device.synchronize),
  which ensures all operations on the device complete before returning.

* Removed `MojoCallContextPtr` in favor of `DeviceContextPtr`.
  `MojoCallContextPtr` only contained a `DeviceContextPtr`, so this change
  directly exposes the `DeviceContextPtr`. Custom ops using `MojoCallContextPtr`
  now directly take a `DeviceContextPtr` argument:

  ```mojo
      @staticmethod
      fn execute[
          type: DType, rank: Int
      ](
          output: OutputTensor[type=type, rank=rank],
          input: InputTensor[type=type, rank=rank],
          ctx: MojoCallContextPtr,
      ):
  ```

  becomes

  ```mojo
      @staticmethod
      fn execute[
          type: DType, rank: Int
      ](
          output: OutputTensor[type=type, rank=rank],
          input: InputTensor[type=type, rank=rank],
          ctx: DeviceContextPtr,
      ):
  ```

* You can now skip compiling a GPU kernel before enqueueing it and instead pass
  a function directly to `ctx.enqueue_function[func](...)`:

  ```mojo
  fn func():
      print("Hello from GPU")

  @register("custom_op")
  struct CustomOp:

      @staticmethod
      fn execute(ctx: DeviceContextPtr) raises:
          var dev_ctx = ctx.get_device_context()
          dev_ctx.enqueue_function[func](grid_dim=1, block_dim=1)
  ```

  However, if you're reusing the same function and parameters multiple times,
  this incurs some overhead of around 50-500 nanoseconds per enqueue. So you
  can still compile the function first and pass it to `ctx.enqueue_function`
  in this scenario:

  ```mojo
  var compiled_func = ctx.compile_function[func]()
  # Multiple kernel launches with the same function/parameters
  ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
  ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
  ```

* Changed `Accelerator` and `CPU` from factory methods that created `Device`
  objects in Python (which were accelerators and CPUs in the C++ implementation)
  to actual Python types. This change elevates the `Accelerator` and `CPU` type
  concepts to Python, making them types rather than methods.

  This allows type annotations in Python. For example, a list of accelerators
  used to be defined like this:

  ```python
  graph_devices: list[DeviceRef]
  ```

  Now it can be defined like this:

  ```python
  graph_devices: list[Accelerator]
  ```

* Elementwise operations (e.g. `__add__`) have been removed from `Tensor` (that
  is, `tensor_internal.Tensor`). This `Tensor` type is being phased out; please
  reduce usage in favor of `LayoutTensor`.

### Graph APIs {#25-2-graph}

* The `nn` package is now [`max.nn`](https://docs.modular.com/max/api/python/nn.md).

* Added [`ops.chunk`](https://docs.modular.com/max/api/python/graph.md#max.graphs.ops.chunk) to support
  chunking tensors along an axis.

* Added support for while loops with
  [`ops.while_loop`](https://docs.modular.com/max/api/python/graph.md#max.graphs.ops.while_loop).

* Added support for conditional execution with
  [`ops.cond`](https://docs.modular.com/max/api/python/graph.md#max.graph.ops.cond).

* Added axis reduction overloads for
  [`ops.min`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.min) and
  [`ops.max`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.max). For example:
  `ops.min(tensor, axis=-1)`.

* The [`gelu()`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.gelu) function now
  accepts an `approximate` keyword. The keyword controls the `gelu`
  approximation, with `none`, `tanh`, and `fast` accepted.

* Removed the `roundeven()` operation from the Python API. The
  [`round()`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.round) operation now has the
  same behavior as `roundeven()`, so there is no need for both to exist.

* Added helpers to create analogous tensors from buffer types and vice versa.

* Added `max.nn.Module`, a base class for writing layers and constructing
  networks of layers (e.g. using `max.nn.Sequential`). Currently, this class
  supports graph building by ensuring that all weight names are unique and
  systematically generated. This class also supports managing the weight values
  with the `module.state_dict()` and `module.load_state_dict()` methods. More
  functionality and documentation will be added in future releases.
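
For context on the `approximate` keyword of `gelu()` listed above, the sketch below compares the exact GELU definition with the widely used tanh approximation in plain Python. It assumes `none` corresponds to the exact erf-based form and `tanh` to the standard tanh formula; the changelog does not spell out the formulas, and the `fast` variant is not covered here.

```python
import math


def gelu_exact(x: float) -> float:
    """GELU defined through the Gaussian CDF: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Widely used tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{x:+.1f}  exact={gelu_exact(x):+.5f}  tanh={gelu_tanh(x):+.5f}")
```

The two forms agree closely over typical activation ranges, so the choice mostly trades a small numerical difference for a cheaper kernel.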

### Custom ops {#25-2-custom-ops}

* Changes have been made to the way that custom ops are registered: rather
  than using the `num_dps_outputs` attribute on `@compiler.register` to specify
  the number of outputs, that number is now inferred from the signature of the
  custom operation. Inputs to the operation now use the `InputTensor` type and
  outputs from the operation use `OutputTensor`, instead of the previous
  `ManagedTensorSlice` for both. This eliminates the need for a manual
  `num_dps_outputs` attribute, and makes it safer to work with these inputs and
  outputs by preventing accidental writes to input tensors. The new interface
  looks something like the following:

  ```mojo
  @compiler.register("add_one_custom")
  struct AddOneCustom:
      @staticmethod
      fn execute[
          target: StringLiteral,
      ](
          out: OutputTensor,
          x: InputTensor[type = out.type, rank = out.rank],
          ctx: DeviceContextPtr,
      ) raises:
          @parameter
          @always_inline
          fn elementwise_add_one[
              width: Int
          ](idx: IndexList[x.rank]) -> SIMD[x.type, width]:
              return x.load[width](idx) + 1

          foreach[elementwise_add_one, target=target](out, ctx)
  ```

* The `foreach` function now `raises` to be able to handle errors within an
  elementwise calculation.

### Hopper kernels {#25-2-hopper-kernels}

State-of-the-Art Kernels in Mojo for H100/H200 GPUs

* **Hopper Architecture Matrix Multiplication Kernels**: The implementation
  achieved performance comparable to NVIDIA's highly optimized cuBLAS library.
  These kernels take full advantage of the Tensor Cores in Hopper architecture
  GPUs to accelerate the fundamental matrix multiplication operations that
  underpin deep learning workloads.

* **Multi-GPU AllReduce Implementation**: The AllReduce operation is critical
  for distributed inference across multiple GPUs, as it efficiently aggregates
  gradients. The Mojo implementation surpassed NVIDIA's NCCL library in
  performance benchmarks. This improvement reduces communication overhead during
  distributed inference.

* **MAX Attention Kernel with Flash Attention 3**: This implementation
  incorporates the latest Flash Attention 3 algorithm and extends it, which
  significantly accelerates the computation of attention mechanisms in transformer
  models. The MAX attention kernel optimizes memory access patterns and
  computational steps, reducing both the memory footprint and execution time of
  attention operations. This is particularly important for LLMs where attention
  calculations represent a substantial portion of the computational workload.
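
To show why AllReduce sits on the critical path of tensor-parallel inference, here is a small CPU-only simulation: a weight matrix is split across two pretend devices, each computes a partial matrix-vector product, and an all-reduce sum gives every device the full result. The helper names are illustrative, and the sketch says nothing about how the Mojo kernel is implemented.

```python
def matvec(weight: list[list[float]], x: list[float]) -> list[float]:
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def all_reduce_sum(partials: list[list[float]]) -> list[list[float]]:
    """Every device ends up with the elementwise sum of all partial results."""
    total = [sum(values) for values in zip(*partials)]
    return [list(total) for _ in partials]


# Full layer: y = W @ x with a 2x4 weight matrix.
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
x = [1.0, 1.0, 2.0, 2.0]

# Shard the input dimension across two simulated devices.
shards = [([row[:2] for row in W], x[:2]), ([row[2:] for row in W], x[2:])]
partials = [matvec(w_shard, x_shard) for w_shard, x_shard in shards]

print(all_reduce_sum(partials))  # [[17.0, 41.0], [17.0, 41.0]]
print(matvec(W, x))              # [17.0, 41.0]
```

Each sharded layer needs one such reduction per forward pass, so its latency directly affects end-to-end token generation time.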

### GPU programming {#25-2-gpu-programming}

* Added the Mojo `max.driver` API to enable dispatching
  GPU functions from Mojo.

Check out [examples for GPU programming in
Mojo](https://github.com/modular/modular/tree/main/mojo/examples/gpu-functions),
which use this new API.

### Mojo {#25-2-mojo}

Mojo is a crucial component of the MAX stack that enables all of MAX's
performance-oriented code across hardware. For all the updates to the Mojo
language, standard library, and tools, see the [Mojo
changelog](https://mojolang.org/releases).

### Documentation {#25-2-documentation}

New examples for writing custom ops:

* [`fused_attention`](https://github.com/modular/modular/blob/main/max/examples/custom_ops/kernels/fused_attention.mojo)
  demonstrates complex GPU programming using MAX abstractions for a
  practical use in AI model development.

* [`matrix_multiplication`](https://github.com/modular/modular/blob/main/max/examples/custom_ops/kernels/matrix_multiplication.mojo)
  includes a series of progressive optimizations for matrix multiplications
  on GPUs.

* [`histogram`](https://github.com/modular/modular/blob/main/max/examples/custom_ops/kernels/histogram.mojo)
  shows how to implement the histogram pattern as a custom op.

* New
  [examples for GPU programming in Mojo](https://github.com/modular/modular/tree/main/mojo/examples/gpu-functions)
  using the new MAX Driver API

  * These use a Mojo programming model that should look familiar to CUDA C
    programmers, showing how to define and dispatch GPU functions within a
    single Mojo file. These examples recreate the first three samples from the
    popular textbook
    ["Programming Massively Parallel Processors"](https://www.amazon.com/Programming-Massively-Parallel-Processors-Hands/dp/0323912311),
    showing how basic concepts translate from CUDA into Mojo. Additionally, a
    Mandelbrot set calculation example parallels a similar one in the
    existing custom ops examples.

* New [MAX containers](https://docs.modular.com/max/container.md) available. For
  more information on the base and full MAX containers, see [Container
  contents](https://docs.modular.com/max/container.md#container-contents).

## v25.1.1 (2025-02-19)

Fixed performance issues in autoregressive models with paged attention
by setting sensible default values for `--max-num-steps` that are
platform-specific.

## v25.1 (2025-02-13)

* [Highlights](#25-1-highlights)
* [Documentation](#25-1-docs)
* [MAX Serve](#25-1-serve)
* [MAX models](#25-1-max-models)
* [MAX Engine](#25-1-engine)
  * [Graph APIs](#25-1-graph)
  * [Pipeline APIs](#25-1-pipelines)
  * [GPU programming](#25-1-gpus)
* [Mojo](#25-1-mojo)

### ✨ Highlights {#25-1-highlights}

* **Custom ops for GPUs**

  Our new custom op API allows you to extend MAX Engine with new graph
  operations written in Mojo that execute on either CPU or GPU, providing full
  composability and extensibility for your models. See more in the section
  about [GPU programming](#25-1-gpus).

* **Enhanced support for agentic workflows**

  MAX Serve now supports function calling, which allows you to instruct your
  model to interact with other systems, for example to retrieve data and
  execute external tasks.
  [Learn more about function calling and tool use](https://docs.modular.com/max/serve/function-calling.md).

  MAX Serve now supports structured output (also known as constrained decoding)
  for MAX models on GPU. This allows you to enforce the output format from a
  model using an input schema that defines the output structure.
  [Learn more about structured output](https://docs.modular.com/max/serve/structured-output.md).

* **Extended model architecture support**

  * MAX Serve now supports multimodal models that take both text and image
    inputs. For example, see [how to deploy Llama 3.2
    Vision](https://docs.modular.com/max/develop/deploy-llama-vision.md).

  * MAX Serve now supports text embedding models. Learn how to [deploy a text
    embedding model](https://docs.modular.com/max/develop/run-embeddings-with-max-serve.md).

* **New `max-pipelines` CLI tool**

  Instead of cloning our GitHub repo to access our latest GenAI models, you can
  instead install the `max-pipelines` CLI tool and quickly run an inference or
  deploy an endpoint.

### Documentation {#25-1-docs}

New tutorials:

* [Build custom ops for GPUs](https://docs.modular.com/max/develop/build-custom-ops.md)

* [Serverless GPU inference on Google Cloud
  Run](https://docs.modular.com/max/develop/deploy-serverless-cloud-run.md)

* [Generate image descriptions with Llama 3.2
  Vision](https://docs.modular.com/max/develop/deploy-llama-vision.md)

* [Deploy a text embedding model](https://docs.modular.com/max/develop/run-embeddings-with-max-serve.md)

Other docs:

* [Function calling and tool use](https://docs.modular.com/max/serve/function-calling.md)

* [Structured output](https://docs.modular.com/max/serve/structured-output.md)

* [Prefix caching with PagedAttention](https://docs.modular.com/max/serve/prefix-caching.md)

* `max-pipelines` CLI

### MAX Serve {#25-1-serve}

* The `/v1/completions` REST endpoint now supports:

  * Pre-tokenized prompts.

  * Image inputs for multimodal models such as `Llama-3.2-11B-Vision-Instruct`.
    For an example, see [how to generate image
    descriptions with Llama 3.2 Vision](https://docs.modular.com/max/develop/deploy-llama-vision.md).

    **Known issue:** You might receive faulty results because some parts of the
    text prompt get ignored for certain input combinations. We've identified
    the problem and will have a fix in a subsequent nightly
    release.

  * Function calling and tool use, which allows you to instruct your
    model to interact with other systems, for example to retrieve data and
    execute external tasks. [Learn more about function calling and tool
    use](https://docs.modular.com/max/serve/function-calling.md).

  * Structured output (also known as constrained decoding), which allows you to
    enforce the output format from a model using a JSON schema and the
    `response_format` field. To enable constrained decoding, pass
    `--enable-structured-output` when running the server. This feature
    currently works for MAX models on GPU only (support for PyTorch models and
    CPU is in progress). [Learn more about structured
    output](https://docs.modular.com/max/serve/structured-output.md).

* Added support for the `/v1/embeddings` API endpoint, allowing you to generate
  vector representations using embedding models. See how to [deploy a text
  embedding model](https://docs.modular.com/max/develop/run-embeddings-with-max-serve.md).

* MAX Serve can now evict requests when the number of available pages in the
  PagedAttention KVCache is limited. Previously, the KV manager would throw an
  OOM error when a scheduled batch could not fit in the cache.
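
The request eviction behavior described above can be sketched as a scheduling step that frees pages by evicting a request instead of failing. The newest-first eviction policy, the `grow_batch` function, and the request dictionaries are assumptions made for illustration; the changelog does not state which policy MAX Serve uses.

```python
def grow_batch(running: list[dict], free_pages: int):
    """Give each running request one more KV page, evicting the newest when full.

    `running` is ordered oldest first; each request is a dict with `id` and `pages`.
    Returns (still_running, evicted, free_pages).
    """
    running = [dict(request) for request in running]  # leave the input untouched
    evicted = []
    i = 0
    while i < len(running):
        if free_pages == 0:
            if len(running) == 1:
                raise MemoryError("one request alone exceeds the KV cache")
            victim = running.pop()  # the newest request gives up its pages
            free_pages += victim["pages"]
            evicted.append({**victim, "pages": 0})
            continue  # retry request i, if it was not the victim
        running[i]["pages"] += 1
        free_pages -= 1
        i += 1
    return running, evicted, free_pages


batch = [{"id": "a", "pages": 2}, {"id": "b", "pages": 1}, {"id": "c", "pages": 1}]
still_running, evicted, free = grow_batch(batch, free_pages=1)
print([r["id"] for r in still_running], [r["id"] for r in evicted], free)
# ['a', 'b'] ['c'] 0
```

An evicted request would be rescheduled later and its prompt re-encoded, which is cheaper than aborting the whole batch with an out-of-memory error.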

### MAX models {#25-1-max-models}

* Added the `max-pipelines` CLI tool that simplifies the
  process to run inference with GenAI models (specified with a Hugging Face repo
  ID) and deploy them to a local endpoint with MAX Serve.

  Previously, running or serving these models required cloning the
  [modular/max](https://github.com/modular/max) GitHub repo and then running
  commands such as `magic run llama3`.

  These model-specific commands, like `llama3` and `replit`, have been
  removed. They're now standardized and subsumed by flags like
  `--model-path` in the `max-pipelines` tool. Arguments such as
  `--max-length` and `--weight-path` are also still supported by
  `max-pipelines`.

  To view a list of supported model architectures from Hugging Face, run
  `max-pipelines list`.

* Added support for PagedAttention, which improves memory efficiency by
  partitioning the KV cache into smaller blocks, reducing fragmentation and
  enabling larger inference batches. You can enable it with
  `--cache-strategy=paged` and `--kv-cache-page-size` with a value that's a
  multiple of 128.

* Added support for prefix caching in all cases where PagedAttention is
  supported. This allows for more efficient usage of KVCache and improved prefill
  performance for workloads with common prefixes. You can enable it by setting
  `--enable-prefix-caching`. For more information, see [Prefix caching with
  PagedAttention](https://docs.modular.com/max/serve/prefix-caching.md).

* Batch size and max length are now inferred from available memory and the
  Hugging Face model's default max length, respectively. If a configuration
  leads to an OOM, we provide best-effort recommendations to help you fit the
  model into memory.

* Added support for heterogeneous KV caches for multi-modal models, such as
  Llama Vision, which cache different KV states for self and cross attention
  layers.

* Added support for embedding models, starting with MPNet. For example:

  ```shell
  max-pipelines generate \
    --model-path=sentence-transformers/all-mpnet-base-v2 \
    --prompt="Encode this sentence."
  ```

  Also see [how to deploy a text
  embedding model](https://docs.modular.com/max/develop/run-embeddings-with-max-serve.md).

* Added support for image and text multimodal models:

  * `max-pipelines generate` now accepts image input with `--image_url`.

  * Added an experimental Pixtral pipeline you can run as follows:

    ```shell
    max-pipelines generate \
      --model-path=mistral-community/pixtral-12b \
      --prompt="What is in this image? [IMG]" \
      --image_url=http://picsum.photos/1024/1024
    ```

    The pipeline is automatically used for all models implementing the
    `LlavaForConditionalGeneration` architecture.

    The implementation currently has a limit of one image. We plan to support
    an arbitrary number of images of mixed sizes soon.

  * Added an experimental Llama Vision pipeline you can run as follows:

    ```shell
    max-pipelines generate \
      --model-path=meta-llama/Llama-3.2-11B-Vision-Instruct \
      --prompt="<|image|><|begin_of_text|>What is in this image?" \
      --image_url=http://picsum.photos/1024/1024
    ```

    The pipeline is automatically used for all models implementing the
    `MllamaForConditionalGeneration` architecture.

    Note: This model is gated and requires that you set the
    [`HF_TOKEN`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hftoken)
    environment variable. See
    [Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct).

  * See [how to generate image
    descriptions with Llama 3.2 Vision](https://docs.modular.com/max/develop/deploy-llama-vision.md).

* Added support for the `Qwen2ForCausalLM` model architecture (such as
  `Qwen/Qwen2.5-7B-Instruct`). For example:

  ```shell
  max-pipelines generate \
    --model-path=Qwen/Qwen2.5-7B-Instruct \
    --prompt="Write bubble sort in python" \
    --quantization-encoding bfloat16
  ```

* Added support for offline batched inference for text-based LLMs, allowing you
  to load a model and run inference with a batch of inputs directly from Python,
  instead of relying on an HTTP interface. For an example, see
  [`examples/offline-inference/basic.py`](https://github.com/modular/modular/blob/main/max/examples/offline-inference/basic.py).

* The `--max-cache-batch-size` flag has been deprecated in favor of
  `--max-batch-size`. Using `--max-cache-batch-size` now emits a deprecation
  warning and will stop working in a future release.

* The `--use-gpu` flag has been deprecated in favor of `--devices=cpu`,
  `--devices=gpu`, or `--devices=gpu-0,gpu-1,...`. If the device isn't specified,
  the model runs on the first available GPU, or CPU if no GPUs are available.
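
To make the PagedAttention item above more concrete, here is a minimal sketch of the bookkeeping behind a paged KV cache: the cache is split into fixed-size pages, and each sequence holds only as many pages as its tokens need. The `PagedKVCache` class and its methods are hypothetical names for illustration and are not the MAX implementation. The only detail taken from the notes is that the page size is a multiple of 128.

```python
import math


class PagedKVCache:
    """Toy page allocator: each sequence owns a list of fixed-size pages."""

    def __init__(self, num_pages: int, page_size: int) -> None:
        # The release note says the page size should be a multiple of 128.
        if page_size <= 0 or page_size % 128 != 0:
            raise ValueError("page_size must be a positive multiple of 128")
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_table: dict[str, list[int]] = {}

    def pages_needed(self, num_tokens: int) -> int:
        return math.ceil(num_tokens / self.page_size)

    def allocate(self, seq_id: str, num_tokens: int) -> list[int]:
        """Reserve enough pages for num_tokens, or raise if the cache is full."""
        needed = self.pages_needed(num_tokens)
        if needed > len(self.free_pages):
            raise MemoryError("not enough free pages")
        pages = [self.free_pages.pop() for _ in range(needed)]
        self.page_table[seq_id] = pages
        return pages

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's pages to the free list."""
        self.free_pages.extend(self.page_table.pop(seq_id))


cache = PagedKVCache(num_pages=8, page_size=128)
cache.allocate("a", num_tokens=300)  # needs 3 pages
cache.allocate("b", num_tokens=128)  # needs 1 page
print(len(cache.free_pages))
cache.release("a")
print(len(cache.free_pages))
```

Because pages are handed out on demand, a short sequence never reserves space sized for the longest possible sequence, which is where the reduced fragmentation comes from.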

### MAX Engine {#25-1-engine}

* Improved internal kernel compilation speed by 1.5x to 4x across different models.

  We've revamped our GPU compilation process so that all kernels in a program
  are compiled together into a single LLVM module, then split into separate
  kernels afterward. This ensures shared code between kernel entry points is
  only compiled once. For example, we observe a 3.7x speed up for Llama3.1-8b
  GPU startup time.

* Improved initial model execution speed on NVIDIA GPUs.

  Instead of compiling to PTX and performing just-in-time compilation during
  runtime, we now generate CUBIN binaries directly. While this increases
  initial compilation time, it significantly improves execution speed.

* The kernels have been further tuned for performance on NVIDIA A100 GPUs.

#### Graph APIs {#25-1-graph}

* You can now write custom operations (ops) in Mojo, and add them to a graph
  constructed in Python, using
  [`custom()`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.custom) and
  [`inplace_custom()`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.inplace_custom).

  For more detail, see the section below about [GPU programming](#25-1-gpus).

* Cached compiled MAX graphs that make use of custom operations now get
  invalidated when the implementation of the custom operations changes.

* [`Graph.add_weight()`](https://docs.modular.com/max/api/python/generated/max.graph.Graph#max.graph.Graph.add_weight)
  now takes an explicit `device` argument. This enables explicitly passing
  GPU-resident weights to
  [`session.load()`](https://docs.modular.com/max/api/python/engine.md#max.engine.InferenceSession.load) via
  the weights registry to initialize the model.

* [`max.graph.Weight`](https://docs.modular.com/max/api/python/generated/max.graph.Weight) now inherits
  from `TensorValue`, allowing you to call `weight.cast()` or `weight.T`. As such,
  the
  [`TensorValue`](https://docs.modular.com/max/api/python/generated/max.graph.TensorValue#max.graph.TensorValue)
  no longer accepts `Weight` for the `value` argument.

#### Pipeline APIs {#25-1-pipelines}

* `TextTokenizer.new_context()`
  now supports tool definitions passed through its `request` argument (via
  `TokenGeneratorRequest.tools`).

  * It also now supports JSON schemas passed through its `request` argument (via
    [`TokenGeneratorRequest.response_format`](https://docs.modular.com/max/api/python/pipelines.lib.interfaces#max.pipelines.interfaces.TokenGeneratorRequest.response_format)).

* Removed the default `num_steps` value for
  [`TokenGenerator.next_token()`](https://docs.modular.com/max/api/python/pipelines.lib.interfaces#max.pipelines.interfaces.TokenGenerator.next_token),
  ensuring users pass a value, reducing the potential for silent errors.

* `KVCacheStrategy`
  now defaults to `MODEL_DEFAULT`.

  Previously, the "continuous" caching strategy was always used. The default
  is now chosen on an architecture-specific basis so that each model uses the
  caching strategy best optimized for it.

* The
  [`Linear`](https://docs.modular.com/max/api/python/generated/max.nn.Linear)
  layer now has a `create()` class method that automatically creates
  specializations of `Linear` for non-quantized, k-quant, or GPTQ layers.

* Added
  [`nn.Conv1D`](https://docs.modular.com/max/api/python/generated/max.nn.Conv1D)
  for audio models like Whisper.

#### GPU programming {#25-1-gpus}

This release includes all-new APIs to program on GPUs. The way to write code
for GPUs is to create custom operations with GPU functions that you can load
into a MAX graph. This foundational API includes a few key components:

* Mojo APIs to write custom op functions:

  * The [`@compiler.register`](https://docs.modular.com/max/api/mojo-decorators/compiler-register.md)
    decorator is applied to a Mojo struct that implements a custom op in an
    `execute()` function—for either CPU or GPU—and a `shape()` function that
    defines the custom op's output tensor.

  * The [`max.tensor`](https://docs.modular.com/max/api/kernels/extensibility/tensor.md) package adds
    essential Mojo APIs for writing custom ops, such as:

    * The
      [`foreach()`](https://docs.modular.com/max/api/kernels/extensibility/tensor/managed_tensor_slice/foreach.md)
      function, which efficiently executes an element-wise computation in parallel
      on either a GPU or CPU.

    * The
      [`ManagedTensorSlice`](https://docs.modular.com/max/api/kernels/extensibility/tensor/managed_tensor_slice/ManagedTensorSlice.md)
      type defines the input and output tensors for the custom op.

* Python APIs to load custom ops into a model:

  * The [`custom()`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.custom) and
    `inplace_custom()`
    functions allow you to add the previously-defined Mojo custom op to a MAX
    graph written in Python.

  * The [`InferenceSession`](https://docs.modular.com/max/api/python/engine.md#max.engine.InferenceSession)
    constructor accepts the custom op implementation as a
    [Mojo package](https://mojolang.org/docs/manual/packages#mojo-packages) in
    the `custom_extensions` argument.

For more detail, see the [tutorial to build custom ops for
GPUs](https://docs.modular.com/max/develop/build-custom-ops.md).

Additionally, we've added a new
[`gpu` package](https://mojolang.org/docs/std/gpu/) to the Mojo standard library
that provides low-level programming constructs for working with GPUs. These APIs
let you do things that you can't currently do with the high-level `foreach()`
abstraction above. The Mojo `gpu` APIs allow you to manually manage interaction
between the CPU host and GPU device, manage memory between devices, synchronize
threads, and more. For some examples, see
[`vector_addition.mojo`](https://github.com/modular/modular/blob/main/max/examples/custom_ops/kernels/vector_addition.mojo)
and
[`top_k.mojo`](https://github.com/modular/modular/blob/main/max/examples/custom_ops/kernels/top_k.mojo).

### Mojo {#25-1-mojo}

Mojo is a crucial component of the MAX stack that enables all of MAX's
performance-oriented code across hardware. For all the updates to the Mojo
language, standard library, and tools, see the [Mojo
changelog](https://mojolang.org/releases).

## v24.6 (2024-12-17)

This is a huge update that offers a first look at our serving library for
MAX on GPUs!

* [Highlights](#24-6-highlights)
* [Documentation](#24-6-docs)
* [MAX Serve](#24-6-serve)
* [MAX models](#24-6-models)
* [MAX Engine](#24-6-engine)
  * [Driver APIs](#24-6-driver-api)
  * [Graph compiler](#24-6-graph-compiler)
  * [Graph APIs](#24-6-graph-api)
  * [Custom op registration](#24-6-custom-ops)
  * [Numeric kernels](#24-6-kernels)
* [Mojo](#24-6-mojo)

Also check out our
[blog post introducing MAX 24.6](https://www.modular.com/blog/introducing-max-24-6-a-gpu-native-generative-ai-platform).

### ✨ Highlights {#24-6-highlights}

* **MAX Engine on GPUs preview**

  We're excited to share a preview of MAX Engine on GPUs. We've created a few
  tutorials that demonstrate MAX's ability to run GenAI models with our
  next-generation MAX graph compiler on NVIDIA GPU architectures (including
  A100, A10, L4, and L40 GPUs). You can experience it today by [deploying
  Llama 3 on an A100 GPU](https://docs.modular.com/max/deploy/local-to-cloud.md).

* **MAX Serve preview**

  This release also includes an all-new serving interface called MAX
  Serve. It's a Python-based serving layer that supports both
  native MAX models when you want a high-performance deployment, and
  off-the-shelf PyTorch LLMs from Hugging Face when you want to explore and
  experiment—all with GPU support. It provides an OpenAI-compatible REST
  endpoint for inference requests, and a Prometheus-compatible metrics
  endpoint. You can use a `magic` command to start a local server, or use our
  ready-to-deploy MAX container to start an endpoint in the cloud. Try it now
  [with an LLM from Hugging Face](https://docs.modular.com/max/deploy/local-to-cloud.md).

* **Upgraded MAX models**

  As we continue to build our Python-based MAX Graph API that allows you to
  build high-performance GenAI models, we've made a ton of performance
  improvements to the existing models and added a few new models to our GitHub
  repo. All the Python-based MAX models now support GPUs and broad model
  architectures. For example,
  [`llama3`](https://github.com/modular/modular/tree/main/max/pipelines/architectures/llama3)
  adds compatibility for the LlamaForCausalLM family, which includes over
  20,000 model variants and weights on Hugging Face.

### Documentation {#24-6-docs}

New tutorials:

* [Deploy Llama 3 on GPU with MAX
  Serve](https://docs.modular.com/max/deploy/local-to-cloud.md)

* [Deploy Llama 3.1 on GPU-powered Kubernetes
  clusters](https://docs.modular.com/max/develop/deploy-max-serve-on-kubernetes.md)

* [Get started with MAX Graph in
  Python](https://docs.modular.com/max/develop/get-started-with-max-graph-in-python.md)

Other new docs:

* [MAX container](https://docs.modular.com/max/container.md)

* [Benchmark MAX
  Serve](https://github.com/modular/modular/tree/main/benchmark)

Also, our documentation is now available for **MAX nightly builds**! If you're
building with a nightly release, you can switch to see the nightly docs using a
toggle to the right of the search bar.

### MAX Serve {#24-6-serve}

This release includes a preview of our Python-based serving library called MAX
Serve. It simplifies the process to deploy your own inference
server with consistent and reliable performance.

MAX Serve currently includes the following features:

* Deploys locally and to the cloud with our [MAX container
  image](https://docs.modular.com/max/container.md), or with the `magic` CLI.

* An OpenAI-compatible server with streaming `/v1/chat/completions` and
  `/v1/completions` endpoints for LLM inference requests.

* Prometheus-compatible [metrics endpoint](https://docs.modular.com/max/container.md#metrics) with LLM
  KPIs (TTFT and ITL) for monitoring and evaluating performance.

* Supports most `TextGeneration` Hugging Face Hub models.

* Multiprocess HTTP/model worker architecture to maximize CPU core utilization
  by distributing multiple incoming requests across multiple processes, ensuring
  both high throughput and responsiveness.

* Continuous heterogeneous batching to combine multiple incoming requests into
  a single inference (no waiting to fill a batch size) and improve total
  throughput.
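
The continuous batching idea in the list above can be sketched with a small simulation. This is a toy scheduler, not MAX Serve's implementation: the function name `continuous_batching` and the request IDs are made up, and each "step" stands for one decode iteration that produces one token per active request.

```python
from collections import deque


def continuous_batching(
    requests: dict[str, int], max_batch_size: int
) -> list[list[str]]:
    """Simulate decode steps; `requests` maps a request ID to tokens to generate.

    Returns the IDs that were active in each step.
    """
    waiting = deque(requests.items())
    active: dict[str, int] = {}
    steps: list[list[str]] = []
    while waiting or active:
        # Fill any free slots right away instead of waiting for a full batch.
        while waiting and len(active) < max_batch_size:
            req_id, remaining = waiting.popleft()
            active[req_id] = remaining
        steps.append(sorted(active))
        # Each step yields one token per active request; finished ones leave.
        active = {r: n - 1 for r, n in active.items() if n > 1}
    return steps


schedule = continuous_batching({"a": 3, "b": 1, "c": 2}, max_batch_size=2)
for step, batch in enumerate(schedule):
    print(step, batch)
```

In this run, request `c` joins the batch as soon as `b` finishes, rather than waiting for `a` to complete as it would with static batching.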

There's much more still in the works for MAX Serve, but you can try it today
with our tutorials to [Deploy Llama 3 on GPU with MAX
Serve](https://docs.modular.com/max/deploy/local-to-cloud.md).

**Known issues:**

* This release is enough to support typical chatbot applications, but it
  does not yet support the function-calling portion of the OpenAI API
  specification needed to enable robust agentic workflows.

* Sampling is still limited and doesn't currently respect temperature or
  other sampling-related API request input.

* Structured generation is not supported.

* Support for multi-modal models is still nascent.

### MAX models {#24-6-models}

All of our Python-based GenAI
[models on GitHub](https://github.com/modular/modular/tree/main/max/pipelines/architectures)
now support GPUs!

As we add more models, we're also building a robust set of libraries and
infrastructure that make it easier to build and deploy a growing library of
LLMs. Some of it is available in a new
[`max.pipelines`](https://docs.modular.com/max/api/python/pipelines.md) package,
and some of it lives alongside the
[models on GitHub](https://github.com/modular/modular/tree/main/max/pipelines/architectures).
Here are just some of the highlights:

* Deep integration with the Hugging Face ecosystem for a quick-to-deploy
  experience, such as using HF Model Hub tools to fetch config files, support for
  weights in [safetensor](https://github.com/huggingface/safetensors) format,
  support for HF tokenizers, and more. (We also support GGUF weight formats.)

* Expanded set of model abstractions for use by different LLM architectures:

  * Attention layers (including highly optimized implementations with
    configurable masking, like
    [`AttentionWithRope`](https://github.com/modular/modular/tree/main/max/nn/attention/attention_with_rope.py)).
    The optimized attention layers include variants that accept an attention
    mask. More memory-efficient variants that don't take a mask instead take a
    "mask functor" argument to the kernel, which implements masking without
    materializing a mask by computing a mask value from input coordinates on the
    fly.

  * Transformers such as
    [`Transformer` and `TransformerBlock`](https://github.com/modular/modular/tree/main/max/nn/transformer/transformer.py).
    These include an initial implementation of ragged tensors—tensors for which
    each dimension can have a different size, avoiding the use of padding tokens
    by flattening a batch of sequences of differing lengths.

  * Common layers such as
    [`RMSNorm`](https://github.com/modular/modular/tree/main/max/nn/norm/rms_norm.py),
    [`Embedding`](https://github.com/modular/modular/tree/main/max/nn/embedding.py),
    and
    [`Sequential`](https://github.com/modular/modular/tree/main/max/nn/sequential.py).

  * KV cache management helpers, like
    `ContinuousBatchingKVCacheManager`.

  * Low-level wrappers over optimized kernels like
    [`fused_qk_ragged_rope`](https://github.com/modular/modular/tree/main/max/nn/kernels.py).
    These are custom fused kernels that update the KV cache in place. Although
    they are custom, they reuse the underlying kernel implementation by passing
    in lambda functions used to retrieve inputs and write to outputs in place.

* Added generalized interfaces for text generation such as
  [`TokenGenerator`](https://docs.modular.com/max/api/python/pipelines.lib.interfaces#max.pipelines.interfaces.TokenGenerator)
  and [`PipelineModel`](https://docs.modular.com/max/api/python/generated/max.pipelines.PipelineModel),
  which provide modularity within the models and serving infrastructure. Also
  added a plug-in mechanism
  ([`PipelineRegistry`](https://docs.modular.com/max/api/python/generated/max.pipelines.lib.registry.PipelineRegistry))
  to more quickly define new models, tokenizers, and other reusable components.
  For example, anything that conforms to
  [`TokenGenerator`](https://docs.modular.com/max/api/python/pipelines.lib.interfaces#max.pipelines.interfaces.TokenGenerator)
  can be served using the LLM infrastructure within MAX Serve. We then used this
  interface to create the following:

  * An optimized
    [`TextGenerationPipeline`](https://docs.modular.com/max/api/python/generated/max.pipelines.TextGenerationPipeline)
    that can be combined with any compatible graph and has powerful performance
    features like graph-based multi-step scheduling, sampling, KV cache
    management, ragged tensor support, and more.

  * A generic
    `HFTextGenerationPipeline`
    that can run any Hugging Face model for which we don't yet have an optimized
    implementation in eager mode.

* Models now accept weights via a weights registry, which is passed to the
  [`session.load()`](https://docs.modular.com/max/api/python/engine.md#max.engine.InferenceSession.load)
  method's `weights_registry` argument. The decoupling of weights and model
  architecture allows implementing all of the different fine-tunes for a given
  model with the same graph. Furthermore, because the underlying design is
  decoupled, we can later expose the ability to compile a model once and swap
  weights out on the fly, without re-compiling the model.

* Added generic implementations of common kernels, which allow you to plug in
  different batching strategies (ragged or padded), KV cache management
  approaches (continuous batching), masking (causal, sliding window, etc.), and
  position encoding (RoPE or ALiBi) without having to rewrite any kernel code.
  (More about this in a future release.)

* Multi-step scheduling to run multiple token-generation steps on GPU before
  synchronizing to the CPU.
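
Two of the ideas in the list above, ragged batches and mask functors, can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: `flatten_ragged`, `causal_mask`, and `masked_scores` are hypothetical helpers, not MAX APIs, and real kernels apply the functor inside the attention computation rather than on Python lists.

```python
def flatten_ragged(sequences: list[list[int]]) -> tuple[list[int], list[int]]:
    """Concatenate sequences without padding; return (tokens, row_offsets)."""
    tokens: list[int] = []
    offsets = [0]
    for seq in sequences:
        tokens.extend(seq)
        offsets.append(len(tokens))
    return tokens, offsets


def causal_mask(q_idx: int, k_idx: int) -> float:
    """Mask functor: derive the mask value from coordinates on the fly."""
    return 0.0 if k_idx <= q_idx else float("-inf")


def masked_scores(scores: list[list[float]], mask_fn) -> list[list[float]]:
    """Apply mask_fn per element instead of materializing a mask tensor."""
    return [
        [s + mask_fn(q, k) for k, s in enumerate(row)]
        for q, row in enumerate(scores)
    ]


tokens, offsets = flatten_ragged([[11, 12, 13], [21], [31, 32]])
print(tokens, offsets)
print(masked_scores([[1.0, 2.0], [3.0, 4.0]], causal_mask))
```

The row offsets mark where each sequence starts and ends in the flattened buffer, so no padding tokens are stored, and the functor replaces a full mask tensor with a function call per coordinate pair.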

**Updated models:**

* Significant performance upgrades for [Llama
  3](https://github.com/modular/modular/tree/main/max/pipelines/architectures/llama3),
  and expanded compatibility with the `LlamaForCausalLM` model family. For
  example, it also supports Llama 3.2 1B and 3B text models.

**New models:**

* [Mistral
  NeMo](https://github.com/modular/modular/tree/main/max/pipelines/architectures/mistral)
  (and other `MistralForCausalLM` models)

* [Replit Code V1.5
  3B](https://github.com/modular/modular/tree/main/max/pipelines/architectures/replit)

**Known issues:**

* The Q4 quantized models currently work on CPU only.

* Using a large setting for `top-k` with the Llama 3.1 model may lead to
  segmentation faults for certain workloads when run on NVIDIA GPUs. This should
  be resolved in the latest nightly MAX builds.

* The models currently use a smaller default context window than the
  `max_seq_len` specified in the Hugging Face configuration files for a given
  model. This can be manually adjusted by setting the `--max-length` parameter to
  the desired context length when serving a model.

* Some variants of the supported core models (like `LlamaForCausalLM` with a
  different number of heads, head sizes, etc.) might not be fully optimized yet.
  We plan to fully generalize our implementations in a future release.

### MAX Engine {#24-6-engine}

MAX Engine includes a lot of the
core infrastructure that enables MAX to accelerate AI models on any hardware,
such as the graph compiler, runtime, kernels, and the APIs to interact with it
all, and it all works without external dependencies such as PyTorch or CUDA.

This release includes a bunch of performance upgrades to our graph compiler and
runtime. We've added support for NVIDIA GPU architectures (including A100, A10,
L4, and L40 GPUs), and built out new infrastructure so we can quickly add
support for other GPU hardware.

**Engine API changes:**

* [`InferenceSession`](https://docs.modular.com/max/api/python/engine.md#max.engine.InferenceSession)
  now accepts a `custom_extensions` constructor argument, same as `load()`, to
  specify model extension libraries.

* The [`Model`](https://docs.modular.com/max/api/python/engine.md#max.engine.Model) object is now callable
  to run an inference.

**Breaking changes:**

* `Model.execute()` signature changed to support GPUs.

  * The [`execute()`](https://docs.modular.com/max/api/python/engine.md#max.engine.Model.execute) function
    currently doesn't accept keyword arguments. Instead, you can pass tensors as
    a [`driver.Tensor`](https://docs.modular.com/max/api/python/driver.md#max.driver.Tensor), `int`,
    `float`, `bool`,
    [`np.generic`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.generic),
    or [`DLPackArray`](https://docs.modular.com/max/api/python/driver.md#max.driver.DLPackArray)
    ([DLPack](https://github.com/dmlc/dlpack)). Note that both PyTorch and NumPy
    arrays implement the DLPack protocol, which means you can also pass either
    of those types to `execute()`.

  * [`execute_legacy()`](https://docs.modular.com/max/api/python/engine.md#max.engine.Model.execute_legacy)
    preserves the semantics of `execute()` with support for keyword arguments to
    help with migration, but will be removed in a future release.
    `execute_legacy()` doesn't support GPUs.

  * Calling `execute()` with positional arguments still works the same.

#### Driver APIs {#24-6-driver-api}

MAX Driver (the [`max.driver`](https://docs.modular.com/max/api/python/driver.md) module) is a new
component of MAX Engine that's still a work in progress. It provides primitives
for working with heterogeneous hardware systems (GPUs and CPUs), such as to
allocate on-device memory, transfer data between host and device, query device
stats, and more. It's a foundation on which other components of MAX Engine
operate (for example, `InferenceSession` now uses
[`driver.Tensor`](https://docs.modular.com/max/api/python/driver.md#max.driver.Tensor) to handle model
inputs and outputs).

**Driver API changes:**

* Added `CUDA()` device to open an NVIDIA GPU.

* Added support for fp16 and bfloat16 dtypes.

* Expanded functionality for `max.driver.Device`, with new class methods and
  properties. We are still working on building this out to support more
  accelerator features.

* [`driver.Tensor`](https://docs.modular.com/max/api/python/driver.md#max.driver.Tensor) (and the
  `InferenceSession.load()` argument `weights_registry`) now supports zero-copy
  interoperability with NumPy arrays and PyTorch tensors, using
  [DLPack](https://github.com/dmlc/dlpack) /
  [`DLPackArray`](https://docs.modular.com/max/api/python/driver.md#max.driver.DLPackArray).

* [`driver.Tensor`](https://docs.modular.com/max/api/python/driver.md#max.driver.Tensor) has new methods,
  such as `from_dlpack()`, `element_size()`, `to()`, `to_numpy()`, `view()`,
  `zeros()`, and more.

MAX Driver APIs are still changing rapidly and not yet ready for general use.
We'll publish more documentation in a future release.

**Known issues:**

* MAX Driver is currently limited to managing just one NVIDIA GPU at a time (it
  does not yet support multi-GPU). It also does not yet support remote devices.

* DLPack support is not complete. For example, streams are not yet supported.

#### Graph compiler {#24-6-graph-compiler}

When you load a model into MAX Engine, the graph compiler is the component that
inspects and optimizes all graph operations (ops) to deliver the best run time
performance on each device.

This release includes various graph compiler improvements:

* Major extensions to support NVIDIA GPUs (and other devices in the future),
  including async copies and caching of JIT'd kernels.

* The runtime now performs scheduling to enable GPU compute overlap with the
  CPU.

* New transformations to the Mojo kernels to enable a number of optimizations,
  including specialization on tensor dimensions, specialization on target
  hardware, specialization on non-tensor dimension input to kernels, automatic
  kernel fusion between operators, and more.

* New algebraic simplifications and algorithms for ops such as horizontal
  fusion of matrix multiplications.

* New CPU-side primitives for device management that are automatically
  transformed and optimized to reduce overhead (MAX does not need to use things
  like CUDA Graphs).

* Updated memory planning to preallocate device memory (hoist computation from
  inference runtime to initialization time) and reduce per-inference overhead.
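
The horizontal fusion of matrix multiplications mentioned above can be illustrated with a small sketch: when two products share the same left-hand input, the weights can be concatenated column-wise so that one larger product replaces two smaller ones. The helpers `matmul` and `fused_matmul` are hypothetical and use plain Python lists. The graph compiler's actual transformation is not shown in these notes.

```python
def matmul(a, b):
    """Naive matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fused_matmul(x, w1, w2):
    """Compute x @ w1 and x @ w2 with one product over [w1 | w2]."""
    split = len(w1[0])
    # Concatenate the weight matrices column-wise.
    fused = [r1 + r2 for r1, r2 in zip(w1, w2)]
    out = matmul(x, fused)
    # Slice the fused output back into the two original results.
    return [row[:split] for row in out], [row[split:] for row in out]


x = [[1, 2], [3, 4]]
w1 = [[1, 0], [0, 1]]
w2 = [[2], [5]]
y1, y2 = fused_matmul(x, w1, w2)
print(y1, y2)
```

The fused version reads `x` once and launches a single product, which is the kind of saving this class of optimization targets.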

#### Graph APIs {#24-6-graph-api}

The graph compiler is also exposed through the MAX Graph APIs (the
[`max.graph`](https://docs.modular.com/max/api/python/graph.md) package), which allow you to build
high-performance GenAI models in Python.

**Graph API changes:**

* Python stack traces from model execution failures now include a trace to the
  original op-creation, allowing for easier debugging during development.

* The [`max.graph`](https://docs.modular.com/max/api/python/graph.md) APIs now include preliminary
  support for symbolic algebraic expressions using
  [`AlgebraicDim`](https://docs.modular.com/max/api/python/generated/max.graph.AlgebraicDim),
  enabling more powerful support for checked dynamic shapes. This allows
  expressions such as `-Dim("x") - 4`. Furthermore, the algebraic expressions
  simplify to a canonical form, so that for example
  `-Dim("x") - 4 == -(Dim("x") + 4)` holds.

* More advanced dtype promotion now allows
  [`TensorValue`](https://docs.modular.com/max/api/python/generated/max.graph.TensorValue)
  math operators to just work when used with NumPy arrays and
  Python primitives.

* [`TensorValue`](https://docs.modular.com/max/api/python/generated/max.graph.TensorValue)
  has new methods, such as `broadcast_to()`, `cast()`,
  `flatten()`, `permute()`, and more.

* Added
  [`BufferValue`](https://docs.modular.com/max/api/python/generated/max.graph.BufferValue),
  which allows for device-resident tensors that are read and
  mutated within the graph.

* [`DType`](https://docs.modular.com/max/api/python/dtype.md#max.dtype.DType) has new methods/properties,
  `align`, `size_in_bytes`, and `is_float()`.

* [`Value`](https://docs.modular.com/max/api/python/generated/max.graph.Value) constructor
  accepts more types for `value`.

* [`TensorValue`](https://docs.modular.com/max/api/python/generated/max.graph.TensorValue)
  constructor accepts more types for `value`.

* [`TensorValue.rebind()`](https://docs.modular.com/max/api/python/generated/max.graph.TensorValue#max.graph.TensorValue.rebind)
  accepts a new `message` argument.
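
To show what "simplify to a canonical form" can mean for the symbolic dimensions described above, here is a toy linear-expression type in which two expressions compare equal whenever their normalized coefficients match. `LinExpr` is a hypothetical stand-in written for this example. It is not `AlgebraicDim` and makes no claim about how MAX represents these expressions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinExpr:
    """Toy linear expression: sum(coeff * symbol) + const, in canonical form."""

    coeffs: tuple[tuple[str, int], ...] = ()
    const: int = 0

    @staticmethod
    def sym(name: str) -> "LinExpr":
        return LinExpr(((name, 1),), 0)

    @staticmethod
    def _lift(value) -> "LinExpr":
        return value if isinstance(value, LinExpr) else LinExpr((), int(value))

    def __add__(self, other) -> "LinExpr":
        other = LinExpr._lift(other)
        merged = dict(self.coeffs)
        for name, c in other.coeffs:
            merged[name] = merged.get(name, 0) + c
        # Canonical form: sorted symbols, zero coefficients dropped.
        canon = tuple(sorted((n, c) for n, c in merged.items() if c != 0))
        return LinExpr(canon, self.const + other.const)

    def __neg__(self) -> "LinExpr":
        return LinExpr(tuple((n, -c) for n, c in self.coeffs), -self.const)

    def __sub__(self, other) -> "LinExpr":
        return self + (-LinExpr._lift(other))


x = LinExpr.sym("x")
print(-x - 4 == -(x + 4))
print(-x - 4)
```

Because every operation returns the normalized representation, equality is a plain field comparison, and no separate simplification pass is needed before comparing shapes.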

**Breaking changes:**

* [`Graph.add_weight()`](https://docs.modular.com/max/api/python/generated/max.graph.Graph#max.graph.Graph.add_weight)
  now accepts
  [`Weight`](https://docs.modular.com/max/api/python/generated/max.graph.Weight#max.graph.Weight)
  and returns
  [`TensorValue`](https://docs.modular.com/max/api/python/generated/max.graph.TensorValue).
  [`Weight`](https://docs.modular.com/max/api/python/generated/max.graph.Weight#max.graph.Weight)
  is essentially a
  named placeholder for a tensor that knows its name, dtype, shape, and
  optionally device and quantization encoding. `Graph.add_weight()` stages an op
  in the graph that is populated by a named weight in the weights registry passed
  to `session.load`.

* The [`Weight`](https://docs.modular.com/max/api/python/generated/max.graph.Weight#max.graph.Weight)
  constructor
  arguments changed; added `align`, `dtype`, and `shape`; removed `assign`,
  `filepath`, `offset`, and `value`.

* The `ops.scalar()` method was removed along with the `is_static()` and
  `is_symbolic()` methods from all `graph.type` objects.

  * Instead of `ops.scalar()`, use
    [`ops.constant()`](https://docs.modular.com/max/api/python/graph.ops#max.graph.ops.constant).

  * Instead of `is_static()` and `is_symbolic()`, use
    `isinstance(dim, StaticDim)` and `isinstance(dim, SymbolicDim)`,
    respectively.

The MAX Graph APIs are not ready for general use, but you can [experiment with
them now by following this
tutorial](https://docs.modular.com/max/develop/get-started-with-max-graph-in-python.md). We'll add more
documentation when we finish some API redesigns.

#### Custom op registration {#24-6-custom-ops}

Although the APIs to write custom operators (ops) aren't ready for general use,
this release includes a significant redesign that lays the groundwork. You
might notice some associated APIs in this release and more APIs in the
nightlies, so here's a little about the work in progress:

* The custom op APIs will allow you to extend MAX Engine with new ops written
  in Mojo, providing full composability and extensibility for your models. It's
  the exact same API we use to write MAX Engine's built-in ops such as `matmul`.
  That means your custom ops can benefit from all our compiler optimization
  features such as kernel fusion—your ops are treated the same as all the ops
  included "in the box."

* The new API requires far less adornment at the definition site to enable the
  MAX model compiler to optimize custom ops along with the rest of the graph
  (compared to our previous version that used `NDBuffer`).

* Custom ops support "destination passing style" for tensors.

* The design composes on top of Mojo's powerful metaprogramming, as well as
  the kernel libraries' abstractions for composable kernels.

We'll publish more documentation when the custom op API is ready for general
use. Check out the MAX repo's `nightly` branch to see the latest [custom op
examples](https://github.com/modular/modular/tree/main/max/examples/custom_ops).

**Known issues:**

* Custom ops don't have type or lifetime checking. They also don't reason about
  mutability. Expect lots of sharp corners and segfaults if you hold them wrong
  while we improve this!

#### Numeric kernels {#24-6-kernels}

The GPU kernels for MAX Engine are built from the ground up in Mojo with no
dependencies on external vendor code or libraries. This release includes the
following kernel improvements:

* AttenGen: a novel way to express attention patterns, including different
  attention masks, score functions, and caching strategies.

* State-of-the-art matrix multiplication algorithms with optimizations such as
  the following:

  * Pipelining and double-buffering to overlap data transfer and computation
    and to hide memory access latency (for both global and shared memory).

  * Thread swizzling to avoid shared memory bank conflicts associated with
    tensor core layouts.

  * Block swizzling to increase L2 cache locality.

* SplitK/StreamK GEMM algorithms: divide the computation along the shared K
  dimension into smaller matrices, which can then be executed independently on
  streaming multiprocessors (such as CUDA cores). These algorithms are ideal for
  matrices with a large K dimension but a small M dimension.

* Large context length MHA: uses SplitK/StreamK to implement the attention
  mechanism and eliminate the need for a huge score matrix, which drastically
  reduces memory usage/traffic to enable large context lengths.

* DualGemm: accelerates the multi-layer perceptron (MLP) layers where the
  left-hand side (LHS) is shared between two matrix multiplications.
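
As a rough illustration of the Split-K idea in the list above, here is a minimal pure-Python sketch. It is not the MAX kernel code: the function names (`matmul`, `split_k_matmul`) and the `num_splits` parameter are made up for this example, and the loop over K-slices only stands in for work that a GPU would run in parallel.

```python
def matmul(a, b):
    """Plain reference matmul on lists of lists: (M x K) @ (K x N)."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def split_k_matmul(a, b, num_splits):
    """Split the shared K dimension into chunks, compute the partial
    products independently, then sum the partial results."""
    m, k, n = len(a), len(b), len(b[0])
    chunk = -(-k // num_splits)  # ceiling division
    partials = []
    for start in range(0, k, chunk):
        stop = min(start + chunk, k)
        # Each partial product only touches the K-slice [start, stop).
        a_slice = [row[start:stop] for row in a]
        b_slice = b[start:stop]
        partials.append(matmul(a_slice, b_slice))
    # Reduction step: element-wise sum of all partial M x N results.
    out = [[0] * n for _ in range(m)]
    for part in partials:
        for i in range(m):
            for j in range(n):
                out[i][j] += part[i][j]
    return out


# Small M, large K: the shape this strategy targets.
a = [[(i + p) % 5 for p in range(64)] for i in range(2)]
b = [[(p * j) % 7 for j in range(3)] for p in range(64)]
assert split_k_matmul(a, b, num_splits=4) == matmul(a, b)
```

With a small M there are few output tiles to distribute, so splitting along K is what creates enough independent work; the price is the extra reduction over partial results at the end.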

**Known issues:**

* The MAX kernels are optimized for bfloat16 on GPUs.

* Convolution on GPU is not performance optimized yet.

* Although v24.6 technically runs on H100, it doesn't include
  performance-optimized kernels for that device yet and it isn't recommended.

### Mojo {#24-6-mojo}

Mojo is a crucial component of the MAX stack that enables all of MAX's
performance-oriented code across hardware. For all the updates to the Mojo
language, standard library, and tools, see the [Mojo
changelog](https://mojolang.org/releases#v246-2024-12-17).

## v24.5 (2024-09-13)

### ✨ Highlights

* Mojo and MAX are magical! We've created a new package and virtual environment
  manager, `magic`, for MAX and Mojo.

* New
  [Llama3.1 pipeline](https://github.com/modular/modular/tree/main/max/pipelines/architectures)
  built with the new MAX Graph Python API.

* We have not one, but two new Python APIs that we're introducing in this
  release:
  * [MAX Graph Python API](#max-graph-python-api)
  * [MAX Driver Python API](#max-driver-python-api)

### ⭐️ New

* Added `repeat_interleave` graph op.

* Added caching for MAX graph models.
  This means that graph compilation is cached and the executable model is
  retrieved from the cache on the second and subsequent runs.
  Note that the model cache is architecture specific and isn't portable across
  different targets.

* Support for Python 3.12.
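
To show why an architecture-specific cache is not portable across targets, here is a small conceptual sketch in Python. It does not reflect how MAX implements its model cache: `cache_key`, `CompileCache`, and the `target` strings are hypothetical names chosen for this example, and the cache lives in memory rather than on disk.

```python
import hashlib
import platform


def cache_key(graph_bytes, target=None):
    """Hypothetical cache key: hash of the serialized graph plus the target
    architecture, so one target never reuses another target's artifact."""
    target = target or platform.machine()
    digest = hashlib.sha256()
    digest.update(target.encode("utf-8"))
    digest.update(b"\0")
    digest.update(graph_bytes)
    return digest.hexdigest()


class CompileCache:
    """In-memory stand-in for an on-disk model cache."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._store = {}
        self.compile_count = 0

    def load(self, graph_bytes, target=None):
        key = cache_key(graph_bytes, target)
        if key not in self._store:
            # Cache miss: pay the compilation cost once.
            self._store[key] = self._compile_fn(graph_bytes)
            self.compile_count += 1
        return self._store[key]


cache = CompileCache(compile_fn=lambda g: ("compiled", g))
cache.load(b"my-graph", target="x86_64")   # first run: compiles
cache.load(b"my-graph", target="x86_64")   # second run: cache hit
cache.load(b"my-graph", target="aarch64")  # different target: compiles again
print(cache.compile_count)  # prints 2
```

Because the target is part of the key, the same graph compiled for a different architecture is treated as a separate entry.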

#### MAX Graph Python API

This Python API
will ultimately provide the same low-level programming interface for
high-performance inference graphs as the Mojo API. As with the Mojo API, it's an
API for graph-building only, and it does not implement support for training.

You can take a look at how the API works in the
[MAX Graph Python API reference](https://docs.modular.com/max/api/python/graph.md).

#### MAX Driver Python API

The MAX Driver API allows you to interact with devices (such as CPUs and GPUs)
and allocate memory directly onto them. With this API, you interact with
this memory as tensors.

Note that this API is still under development, with support for non-host
devices, such as GPUs, planned for a future release.

To learn more, check out the
[MAX Driver Python API reference](https://docs.modular.com/max/api/python/driver.md).

#### MAX C API

New APIs for adding torch metadata libraries:

* `M_setTorchMetadataLibraryPath`
* `M_setTorchMetadataLibraryPtr`

### 🦋 Changed

#### MAX Engine performance

* Compared to v24.4, MAX Engine v24.5 generates tokens for Llama an average of
  15%-48% faster.

#### MAX C API

Simplified the API for adding torch library paths, which now only takes one path
per API call, but can be called multiple times to add paths to the config:

* `M_setTorchLibraries` -> `M_setTorchLibraryPath`

### ⚠️ Deprecated

* The `max` command line tool is no longer supported and will be removed
  in a future release.

### ❌ Removed

* Dropped support for Ubuntu 20.04. If you're using Ubuntu, we currently
  support Ubuntu 22.04 LTS only.
* Dropped support for Python 3.8.
* Removed built-in PyTorch libraries from the max package. See the
  [FAQ](https://docs.modular.com/max/faq.md) for information on supported torch versions.

## v24.4 (2024-06-07)

### 🔥 Legendary

* MAX is now available on macOS! [Try it now](https://docs.modular.com/max.md).

* New quantization APIs for MAX Graph. You can now build high-performance
  graphs in Mojo that use the latest quantization techniques, enabling even
  faster performance and more system compatibility for large models.

  Learn more in the guide to [quantize your graph weights](https://docs.modular.com/max/graph/quantize.md).

### ⭐️ New

#### MAX Mojo APIs

* Added AI pipeline examples in the `max` repo, with Mojo implementations for
  common transformer layers, including quantization support.

  * New Llama3 pipeline built with MAX Graph.

  * New Replit Code pipeline built with MAX Graph.

  * New TinyStories pipeline (based on TinyLlama) that offers a simple demo of
    the MAX Graph quantization API.

* Added `max.graph.checkpoint` package
  to save and load model weights.

  All weights are stored in a
  `TensorDict`.
  You can save and load a `TensorDict` to disk with
  `save()` and
  `load()` functions.

* Added MAX Graph quantization APIs:

  * Added quantization encodings
    `BFloat16Encoding`,
    `Q4_0Encoding`,
    `Q4_KEncoding`,
    and
    `Q6_KEncoding`.
  * Added the
    `QuantizationEncoding`
    trait so you can build custom quantization encodings.
  * Added `Graph.quantize()`
    to create a quantized tensor node.
  * Added `qmatmul()` to
    perform matrix-multiplication with a float32 and a quantized matrix.

* Added some MAX Graph ops:

  * `avg_pool()`
  * `max_pool()`
  * `conv2d()`
  * `conv3d()`
  * `layer_norm()`
  * `tile()`
  * `select()`

* Added a `layer()` context
  manager and
  `current_layer()`
  function to aid in debugging during graph construction. For example:

  ```mojo
  with graph.layer("foo"):
      with graph.layer("bar"):
          print(graph.current_layer())  # prints "foo.bar"
          x = graph.constant[DType.int64](1)
          graph.output(x)
  ```

  This adds a path `foo.bar` to the added nodes, which will
  be reported during errors.

* Added
  `format_system_stack()`
  function to format the stack trace, which we use to print better error
  messages from `error()`.

* Added
  `TensorMap.keys()` to
  get all the tensor key names.

#### MAX C API

Miscellaneous new APIs:

* `M_cloneCompileConfig()`
* `M_copyAsyncTensorMap()`
* `M_tensorMapKeys()` and `M_deleteTensorMapKeys()`
* `M_setTorchLibraries()`

### 🦋 Changed

#### MAX Mojo API

* `EngineNumpyView.data()`
  and `EngineTensorView.data()`
  functions that return a type-erased pointer were renamed to `unsafe_ptr()`.

* `TensorMap` now conforms
  to `CollectionElement` trait to be copyable and movable.

* `custom_nv()` was removed, and its functionality moved into
  `custom()` as a function
  overload, so it can now output a list of tensor symbols.

## v24.3 (2024-05-02)

### 🔥 Legendary

* You can now write custom ops for your models with Mojo!

  Learn more about [MAX extensibility](https://docs.modular.com/max/develop/custom-ops.md).

### 🦋 Changed

* Added support for named dynamic dimensions. This means you can specify when
  two
  or more dimensions in your model's input are dynamic but their sizes at run
  time must match each other. By specifying each of these dimension sizes with a
  name (instead of using `None` to indicate a dynamic size), the MAX Engine
  compiler can perform additional optimizations. See the notes below for the
  corresponding API changes that support named dimensions.

* Simplified all the APIs to load input specs for models, making them more
  consistent.
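
The constraint that named dynamic dimensions express can be sketched in plain Python. This is only an illustration of the rule, not the MAX API: `bind_dims` and the list-based spec format (an `int` for a static size, `None` for an unnamed dynamic size, a `str` for a named dynamic size) are assumptions made for this example.

```python
def bind_dims(specs, shapes):
    """Match runtime shapes against input specs.

    Every occurrence of the same dimension name must resolve to the same
    runtime size; unnamed dynamic dimensions (None) are unconstrained.
    """
    bound = {}
    for spec, shape in zip(specs, shapes):
        if len(spec) != len(shape):
            raise ValueError(f"rank mismatch: {spec} vs {shape}")
        for dim, size in zip(spec, shape):
            if dim is None:
                continue  # dynamic and unnamed: nothing to check
            if isinstance(dim, int):
                if dim != size:
                    raise ValueError(f"expected size {dim}, got {size}")
            elif bound.setdefault(dim, size) != size:
                raise ValueError(
                    f"dimension '{dim}' is {bound[dim]} in one input "
                    f"but {size} in another"
                )
    return bound


specs = [["batch", "seq"], ["batch", "seq"]]
print(bind_dims(specs, [(4, 128), (4, 128)]))  # prints {'batch': 4, 'seq': 128}
```

With `None` in place of the names, two inputs of shape `(4, 128)` and `(4, 64)` would both be accepted; with names, the mismatch in `seq` is rejected, and a compiler can rely on that equality when optimizing.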

#### MAX Engine performance

* Compared to v24.2, MAX Engine v24.3 shows an average speedup of 10% on PyTorch
  models, and an average 20% speedup on dynamically quantized ONNX transformers.

#### MAX Graph API

The `max.graph` APIs are still changing
rapidly, but starting to stabilize.

* `AnyMoType` renamed to `Type`,
  `MOTensor` renamed to
  `TensorType`, and `MOList`
  renamed to `ListType`.

* Removed `ElementType` in favor of using `DType`.

* Removed `TypeTuple` in favor of using `List[Type]`.

* Removed the `Module` type so you can now start building a graph by directly
  instantiating a `Graph`.

* Some new ops in `max.ops`, including
  support for custom ops.

  See how to [create a custom op in MAX
  Graph](https://docs.modular.com/max/develop/build-custom-ops.md).

#### MAX Engine Python API

* Redesigned
  [`InferenceSession.load()`](https://docs.modular.com/max/api/python/engine.md#max.engine.InferenceSession.load)
  to replace the confusing `options` argument with a `custom_ops_path` argument.

  As a result, `CommonLoadOptions`, `TorchLoadOptions`, and
  `TensorFlowLoadOptions` have all been removed.

* [`TorchInputSpec`](https://docs.modular.com/max/api/python/engine.md#max.engine.TorchInputSpec)
  now supports named dynamic dimensions (previously, dynamic dimension sizes
  could be specified only as `None`). This lets you tell MAX which dynamic
  dimensions are required to have the same size, which helps MAX better optimize
  your model.

#### MAX Engine Mojo API

* `InferenceSession.load_model()` was renamed to
  `load()`.

* Redesigned
  `InferenceSession.load()`
  to replace the confusing `config` argument with a `custom_ops_path` argument
  for use when [loading a custom op](https://docs.modular.com/max/develop/build-custom-ops.md), and an
  `input_specs` argument for use when loading TorchScript models.

  Doing so removed `LoadOptions` and introduced the new
  `InputSpec` type to define
  the input shape/type of a model (instead of `LoadOptions`).

* New `ShapeElement`
  type to allow for named dynamic dimensions (in `InputSpec`).

* `max.engine.engine` module was renamed to
  `max.engine.info`.

#### MAX Engine C API

* [`M_newTorchInputSpec()`](https://docs.modular.com/max/api/c/pytorch/config.md#m_newtorchinputspec)
  now supports named dynamic dimensions (via new `dimNames` argument).

### ❌ Removed

* Removed TensorFlow support in the MAX SDK, so you can no longer load a
  TensorFlow SavedModel for inference. However, TensorFlow is still available for
  enterprise customers.

  We removed TensorFlow because industry-wide TensorFlow usage has declined
  significantly, especially for the latest AI innovations. Removing TensorFlow
  also cuts our package size by over 50% and accelerates the development of
  other customer-requested features. If you have a production use-case for a
  TensorFlow model, please [contact
  us](https://www.modular.com/request-demo).

* Removed the Python `CommonLoadOptions`, `TorchLoadOptions`, and
  `TensorFlowLoadOptions` classes. See note above about
  `InferenceSession.load()` changes.

* Removed the Mojo `LoadOptions` type. See the note above about
  `InferenceSession.load()` changes.

## v24.2.1 (2024-04-11)

* You can now import more MAX Graph functions from `max.graph.ops` instead of
  using `max.graph.ops.elementwise`. For example:

  ```mojo
  from max.graph import ops

  var relu = ops.relu(matmul)
  ```

## v24.2 (2024-03-28)

* MAX Engine now supports TorchScript models with dynamic input shapes.

  No matter what the input shapes are, you still need to specify the input
  specs for all TorchScript models.

* The Mojo standard library is now open source!

  Read more about it in [this blog
  post](https://www.modular.com/blog/the-next-big-step-in-mojo-open-source).

* And, of course, lots of Mojo updates, including implicit traits, support for
  keyword arguments in Python calls, a new `List` type (previously
  `DynamicVector`), some refactoring that might break your code, and much more.

  For details, see the
  [Mojo changelog](https://mojolang.org/releases#v242-2024-03-28).

## v24.1.1 (2024-03-18)

This is a minor release that improves error reports.

## v24.1 (2024-02-29)

The first release of the MAX platform is here! 🚀

This is a **preview version** of the MAX platform. That means it
is not ready for production deployment and is designed only for local
development and evaluation.

Because this is a preview, some API libraries are still in development and
subject to change, and some features that we previously announced are not quite
ready yet. But there is a lot that you can do in this release!

This release includes our flagship developer tools, currently for **Linux
only**:

* **MAX Engine**: Our state-of-the-art graph compiler and runtime library that
  executes models from PyTorch and ONNX, with incredible inference
  speed on a wide range of hardware.

  * API libraries in Python, C, and Mojo to run inference with your existing
    models. [See the API references](https://docs.modular.com/max/api.md).

  * The `max benchmark` tool, which runs MLPerf
    benchmarks on any compatible model without writing any code.

  * The `max visualize` tool, which allows you to visualize
    your model in Netron after partially lowering in MAX Engine.

  * An early look at the MAX Graph API, our
    low-level library for building high-performance inference graphs.

* **MAX Serving**: A preview of our serving wrapper for MAX Engine that
  provides full interoperability with existing AI serving systems (such as
  Triton) and that seamlessly deploys within existing container infrastructure
  (such as Kubernetes).

  * A Docker image that runs MAX Engine as a backend for NVIDIA Triton
    Inference Server.

* **Mojo**: The world's first programming language built from the ground up for
  AI developers, with cutting-edge compiler technology that delivers
  unparalleled performance and programmability for any hardware.

  * The latest version of Mojo, the standard library, and the `mojo` command
    line tool. These are always included in MAX, so you don't need to download
    any separate packages.

  * The Mojo changes in each release are often quite long, so we're going to
    continue sharing those in the existing
    [Mojo changelog](https://mojolang.org/releases).

Additionally, we've started a new [GitHub repo for
MAX](https://github.com/modular/max), where we currently share a bunch of
code examples for our API libraries, including some large model pipelines.
You can also use this repo to [report issues with
MAX](https://github.com/modular/modular/issues/new/choose).

### Model Architecture Support

* Added support for the following model architectures:

  * `OlmoForCausalLM` (such as `allenai/OLMo-1B-0724-hf`)
  * `GraniteForCausalLM` (such as `ibm-granite/granite-3.1-8b-instruct`)
  * `Phi3ForCausalLM` (for Microsoft Phi-3 models)
  * `Qwen2ForCausalLM` (such as Qwen2 models)

  Example usage:

  ```sh
  max-pipelines generate \
    --model-path allenai/OLMo-1B-0724-hf \
    --prompt "Write bubble sort in mojo"
  ```

* The `max.pipelines.dataprocessing.tokenizer` and
  `max.pipelines.dataprocessing.gguf_utils` modules have been removed.

* The previously deprecated `PipelineConfig.architecture` field and its
  corresponding `--architecture` CLI argument have been removed.

### `max-pipelines` CLI

* The `--devices` CLI argument now supports a comma-separated list of GPU IDs
  prefixed with `gpu:` like `--devices=gpu:0,1,2,3`. We no longer support the
  previous `--devices=gpu-<N>` format.

  ```sh
  max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
    --quantization-encoding bfloat16 \
    --devices gpu:0,1,2,3 \
    --prompt="Design a self-sustaining colony on Neptune's moon Triton with a myth/science fusion name, three quantum tech breakthroughs, one ethical debate, a neon-lit cultural ritual, and a hidden flaw—presented in bullet points."
  ```

* Removed the `--huggingface-repo-id` PipelineConfig option and CLI argument in
  favor of `--model-path`.

* Consolidated `--model-path` and `--weight-path`. If valid `--weight-path`(s) are
  provided, they'll now override `--model-path`, which in turn handles both local
  and remote (Hugging Face) cases. If we cannot derive the weights from the
  `--weight-path`(s), we'll now fall back to the `--model-path`, which has to be
  set explicitly by the user.

* Added `--huggingface-revision` option, to allow selecting a non-default branch
  or a specific commit in a Hugging Face model repository.
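
For readers scripting around the CLI, the `gpu:`-prefixed `--devices` format described above could be validated with something like the following sketch. It is not the `max-pipelines` implementation: `parse_devices` is a hypothetical helper, and it models only the `gpu:<id>,<id>,...` form mentioned in these notes, so any other values the real CLI accepts are rejected here.

```python
def parse_devices(value):
    """Parse a `gpu:<id>[,<id>...]` string into a list of GPU IDs."""
    prefix = "gpu:"
    if not value.startswith(prefix):
        # Covers the old `gpu-<N>` form, which these notes say is unsupported.
        raise ValueError(f"unsupported --devices value: {value!r}")
    ids = value[len(prefix):].split(",")
    if not all(part.isdecimal() for part in ids):
        raise ValueError(f"GPU IDs must be non-negative integers: {value!r}")
    return [int(part) for part in ids]


print(parse_devices("gpu:0,1,2,3"))  # prints [0, 1, 2, 3]
```

Passing the old form, such as `parse_devices("gpu-0")`, raises a `ValueError` instead of being silently reinterpreted.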
