Nightly (v26.4)
This version is still a work in progress.
MAX models
- Added support for the Tencent Hunyuan Hy3-preview (`HYV3ForCausalLM`) architecture: a decoder-only mixture-of-experts model (192 routed experts, top-8 plus one shared expert) with sigmoid plus correction-bias routing, per-head query/key RMSNorm, and split-half RoPE. Runs multi-GPU with tensor-parallel attention and expert-parallel MoE.
- Added NVFP4 quantization support for Gemma 4.
- Gemma 4 can now run native FP8 attention with an FP8 KV cache on B200 (SM100): Q, K, and V are `float8_e4m3fn`, read directly from the paged cache, and both Q@K^T and P@V execute as raw FP8 matmuls at tensorwise scale = 1 (no per-block scales, no dequantization staging) with a bf16 attention output. This roughly matches bf16 accuracy while improving decode throughput and roughly doubling KV cache capacity at the same memory.
- Added MXFP4 quantization support for MiniMax-M2.
- Added tensor-parallel attention + expert-parallel MoE (TP+EP) support for MiniMax-M2. Set `data_parallel_degree: 1` with `runtime.ep_size > 1` to shard attention heads across GPUs while distributing MoE experts via expert parallelism. Both reduce-scatter (default) and allreduce (`runtime.ep_use_allreduce: true`) collective strategies are supported.
- Kimi K2.5 tool calling now supports interleaved thinking: a single assistant turn may interleave multiple `<think>...</think>` reasoning blocks with multiple tool-call sections and end with `<|im_end|>`. The constrained-decoding grammar (used for `tool_choice` and JSON `response_format`) admits up to eight tool-call sections with an optional reasoning block before each, and lets the model stop before the cap. This fixes a `tool_choice=auto` failure where a second tool-call section disabled grammar enforcement for the rest of the request.
MAX framework
Inference server
-
Chat completion responses now emit reasoning only under `reasoning`, aligning with OpenAI's Responses API naming. The `reasoning_content` alias (previously emitted alongside `reasoning` for compatibility with vLLM, SGLang, and the DeepSeek API) is no longer included in responses. vLLM has deprecated `reasoning_content` in favor of `reasoning`; see https://github.com/vllm-project/vllm/pull/33402. Clients should read chain-of-thought tokens from the `reasoning` field.
-
`response_format` JSON schemas whose root is not explicitly `object` are now accepted when the root `type` is missing (any) or is a type union that includes `object` (for example `{"type": ["object", "array", "string"]}`); these are valid JSON Schema and compile to a constraining grammar. A root pinned to a single non-object type (for example `{"type": "string"}`) is still rejected, matching OpenAI's structured-outputs contract.
-
Added a per-phase startup breakdown to the `maxserve.model_load_time` Prometheus histogram (milliseconds), previously only available in the server logs. In addition to the existing untagged model-load aggregate, the model worker now records each startup phase on the same metric split by a `component` tag (`build`, `compile`, `init`, `graph_capture`, `pinned_memory`, `spawn`, and `total`), so a single metric can be plotted broken down by startup phase to track pod startup time in production. This replaces the `maxserve.startup_time` histogram (seconds) added earlier in this nightly cycle.
-
Added a `maxserve.time_per_output_token` Prometheus histogram (milliseconds). Emitted once per request, it reports the mean decode-phase latency per generated token (`decode_time / (num_generated_tokens - 1)`), excluding the first token and prefill time. Because the denominator counts the tokens the model actually produced, the metric accounts for speculative decoding.
-
The `maxserve.batch_size` Prometheus histogram is now labeled by `batch_type` (`CE` for prefill, `TG` for decode), so the token-generation (decode) batch size can be observed separately from prefill. For the prefill token-count view, use `maxserve.batch_input_tokens` (also labeled by `batch_type`). Existing aggregate queries over `maxserve.batch_size` continue to work; selectors that pin a single series now gain the `batch_type` dimension.
-
Added Prometheus metrics for the API-server ingress backlog: requests accepted by the API server but not yet handed off to the model worker (still API-side, for example in tokenization). `maxserve.num_requests_awaiting_admission` is an up/down counter with the live value (incremented on arrival, decremented at handoff), and `maxserve.requests_awaiting_admission` is a companion histogram that captures the distribution / tail (p50/p99) over time. A persistently high value points at a backlog in the API server rather than in the scheduler queue (the latter is visible via `maxserve.num_requests_queued`).
-
Added Prometheus metrics for the egress (response) path, which show whether the API server is shipping tokens back to clients slower than the model produces them: `maxserve.num_responses_buffered` (a gauge sampling the total model-worker responses received but not yet streamed to clients) with a companion `maxserve.responses_buffered` distribution histogram, and `maxserve.response_queue_time` (a millisecond histogram of how long a response waits in the API server's per-request output queue before the streaming layer consumes it). Together they surface API-side egress bottlenecks (detokenization, serialization, slow clients) and the associated unbounded-output-queue memory growth.
-
MAX Serve now returns a clearer 400 Bad Request with the underlying message when a prompt is too long for the model, instead of a generic "Value error." response (or, for streaming completions, a 500 Internal Server Error). All architectures now raise a structured `PromptTooLongError` exposing `num_tokens` and `max_length` attributes so callers can handle the failure programmatically. The user-facing message identifies the relevant limit (LLM context window vs. diffusion text encoder sequence length): for example, "Prompt is too long: N tokens exceeds the configured maximum context length of M tokens. Please shorten your prompt."
-
Fixed an FP8 dynamic-quantization bug that mis-quantized near-zero groups on NVIDIA GPUs (writing NaN into FP8 activations and the FP8 KV cache, surfacing downstream as non-finite logits). When a quantization group was near zero, its dynamic scale `max_abs / fp8_max` underflowed to a tiny denormal whose reciprocal overflowed to infinity; multiplying lanes by that infinity produced `+inf` (and `0 * inf = NaN` on zero lanes) before the FP8 cast. This is upstream of, and not addressed by, the saturating FP8 cast: clamping the result would turn the near-zero group into `±max_finite` garbage rather than the correct zero. The reciprocal is now guarded to be finite, so a near-zero group quantizes to a clean FP8 zero. Fixes the shared dynamic-scale helper (used by FP8 quantization, fused RMSNorm, and the residual-add AllReduce RMSNorm) and the fused RoPE plus KV-store path.
-
Fixed a KV cache offloading correctness bug that corrupted output for multi-cache models (such as Gemma 4's interleaved sliding-window plus global attention) when the `local` or `tiered` KV connector was enabled. These models share one block pool across all of their caches, but the connector only offloaded and reloaded the primary cache, so a prefix-cache block served from host or disk restored only the primary cache's data and left the other caches' halves stale, degrading accuracy. The connector now offloads and restores every cache.
-
Fixed JSON `response_format` and tool-call grammars not being enforced for Kimi K2.5 vision-language checkpoints. The Kimi K2.5 tokenizer did not carry grammar enforcement state onto the request context, so constrained-decoding requests fell back to an unenforced state and decoded freely (e.g. a `response_format=json_schema` request returned prose instead of schema-conformant JSON). The tokenizer now derives enforcement state from the response format, matching the text tokenizers.
-
Fixed an intermittent constrained-decoding correctness bug under EAGLE speculative decoding. On the first decode step after a prefill (and after any batch that did not verify draft tokens), the speculative token bitmask was built from placeholder draft tokens instead of the real drafts being verified, leaving the bonus and later speculative slots unconstrained. A grammar-illegal token could then be sampled and committed, producing occasional JSON `response_format` or tool-call grammar violations. The bitmask is now built from the realized drafts.
-
MAX Serve now accepts `role: "developer"` on `/v1/chat/completions`, normalizing it to `system` at the OpenAI-compat route layer. The OpenAI o1/o3 chat-completion spec uses `developer` in place of `system`, and recent OpenAI SDKs emit it by default. The previous behavior rejected the request with a 422 (`literal_error` on the message role).
-
Fixed `CreateChatCompletionRequest` rejecting explicit `null` values for optional fields such as `tool_choice`, `tools`, and `response_format`. OpenAI-compatible clients (LangChain, JS SDKs, anything that serializes a dataclass with a `None` field) that emit `"tool_choice": null` instead of omitting the key are now accepted, matching the behavior of other OpenAI-compatible inference servers.
-
Added two opt-in server flags for accepting OpenAI-compatible requests that the strict default behavior would reject:
  - `--allow-unsupported-logprobs`: when a request asks for `logprobs` against a runtime that cannot honor them (today, the overlap scheduler), MAX Serve logs a warning and serves the request without logprobs instead of returning a `400`.
  - `--allow-extra-request-fields`: unknown top-level fields on `/v1/chat/completions` and `/v1/completions` request bodies are dropped (with a warning) before pydantic validation, instead of returning a `400`. Useful when an upstream proxy sends vendor-specific fields that MAX Serve does not need to honor.
Both flags default to `False`; the existing strict behavior is unchanged. The corresponding `400` error messages now reference the new flags. As a side effect, the legacy `/v1/completions` route now surfaces `InputError` detail strings to the client instead of the generic `"Value error."` message.
-
MAX Serve now emits the `maxserve.num_requests_queued` OTel/Prometheus metric (changed from an `UpDownCounter` to a synchronous `Gauge`). The gauge is sampled once per scheduler iteration from `BatchMetrics.publish_metrics` and reports the depth of the scheduler's CE / prefill queue (the same value as the `Pending: N reqs` line in scheduler logs). It is published by every text-path scheduler that drives `BatchMetrics`: `TokenGenerationScheduler` and `PrefillScheduler` (via `TextBatchConstructor`), and `DecodeScheduler` (via `len(pending_reqs) + len(prefill_reqs)`). Operators can use this metric to observe queue buildup during overload conditions.
-
Added a `"none"` option for `runtime.tool_parser` and `runtime.reasoning_parser` in `PipelineConfig` (CLI flags `--tool-parser` and `--reasoning-parser`). Pass `none` (case-insensitive) to explicitly disable the parser, overriding any architecture-declared default. Leaving the field unset still applies the architecture default as before.
-
Added the `nemotron-opencode` benchmark dataset backed by `nvidia/Nemotron-SFT-OpenCode-v1`. Each row is a full Qwen3-Coder OpenCode trace (system prompt, multi-turn user/assistant/tool messages, and tool schemas). Each subset is multiple GB, so the loader streams via `datasets.load_dataset(..., streaming=True)` and pulls only enough rows to satisfy `--num-prompts`. Tool definitions per row are surfaced on `NemotronOpenCodeBenchmarkDataset.last_loaded_tool_schemas` and (for single-turn) attached to `SampledRequest.tools`.
-
Benchmark request payloads now forward an OpenAI-style `tools=[...]` field on chat-completions requests. `SampledRequest` and `RequestFuncInput` gained a `tools: list[dict] | None = None` field; `OpenAIChatCompletionsRequestDriver` serializes it into the POST body when set. Datasets that supply per-row tool schemas (currently `nemotron-opencode`) now exercise the server's tool-call grammar / structured-output path end-to-end. Pass `enable_tool_calls=False` on Nemotron-OpenCode to suppress forwarding.
-
Removed multi-step decode from the text-generation pipelines. The `--max-num-steps` flag no longer works.
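The `response_format` root-type rule described in the list above can be sketched as a small predicate. This is an illustrative reading of the changelog entry, not MAX Serve's validation code: the function name `root_may_be_object` is hypothetical, and rejecting a union that omits `object` is an assumption, since the entry only spells out the single-type case.

```python
from typing import Any


def root_may_be_object(schema: dict[str, Any]) -> bool:
    """Return True if the schema root's `type` keyword permits an object.

    Hypothetical helper mirroring the rule in the changelog entry.
    """
    root_type = schema.get("type")
    if root_type is None:
        # No `type` keyword: the root accepts any JSON value.
        return True
    if isinstance(root_type, list):
        # Type union: accepted when "object" is one of the alternatives.
        # (Rejecting unions without "object" is an assumption.)
        return "object" in root_type
    # Single pinned type: only "object" passes.
    return root_type == "object"


accepted = [
    root_may_be_object({}),
    root_may_be_object({"type": "object"}),
    root_may_be_object({"type": ["object", "array", "string"]}),
]
rejected = root_may_be_object({"type": "string"})
```

Here every entry in `accepted` is `True`, while `rejected` is `False` because the root is pinned to a single non-object type.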
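As a worked example of the `maxserve.time_per_output_token` formula quoted above, the sketch below computes the per-request value from a decode time and a token count. The function name is hypothetical, and returning `None` for requests with fewer than two generated tokens is an assumption; the changelog does not say how such requests are handled.

```python
from typing import Optional


def time_per_output_token_ms(
    decode_time_ms: float, num_generated_tokens: int
) -> Optional[float]:
    """Mean decode latency per token, excluding the first token.

    Implements decode_time / (num_generated_tokens - 1) from the changelog.
    """
    if num_generated_tokens < 2:
        # Assumption: with zero or one token there is no decode interval
        # to average over, so no value is reported.
        return None
    return decode_time_ms / (num_generated_tokens - 1)


# 100 generated tokens with 990 ms spent in decode gives 99 intervals.
tpot = time_per_output_token_ms(990.0, 100)
```

Because the denominator is the number of tokens actually produced, a request that accepts several speculative tokens per step reports a proportionally lower value for the same decode time.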
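The FP8 dynamic-scale failure described above can be reproduced numerically. The sketch below emulates float32 arithmetic in pure Python and is not the MAX kernel code: the helper names are hypothetical, and replacing a non-finite reciprocal with zero is only one assumed way to "guard the reciprocal to be finite".

```python
import math
import struct

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def f32(x: float) -> float:
    """Round a Python float to float32, overflowing to infinity."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def scaled_lanes(group: list[float], guard: bool) -> list[float]:
    """Scale a group by the reciprocal of its dynamic scale (pre-FP8-cast)."""
    max_abs = max(abs(v) for v in group)
    scale = f32(max_abs / FP8_E4M3_MAX)  # may land in the denormal range
    recip = f32(1.0 / scale) if scale != 0.0 else math.inf
    if guard and not math.isfinite(recip):
        # Assumed guard: treat the whole near-zero group as zero.
        recip = 0.0
    return [f32(v * recip) for v in group]


# A near-zero group: one denormal lane and one exact zero.
group = [f32(1e-42), 0.0]
unguarded = scaled_lanes(group, guard=False)
guarded = scaled_lanes(group, guard=True)
```

Without the guard, the denormal lane becomes `+inf` and the zero lane becomes NaN; with it, both lanes stay at zero, which then casts to a clean FP8 zero.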
max CLI
- The serving benchmark now reports a per-turn KV cache retention percentile metric for multi-turn workloads. For each turn after the first, it compares the server-reported cached prefix against the block-aligned prefix carried over from the previous turn, surfacing when cached tokens are dropped between turns (distinct from the existing cached-token-rate metrics, whose denominator includes new and uncacheable tokens). The KV cache block size used to align the expected prefix is configurable via `--kv-block-size` (default `128`); match it to the server's `--kv-cache-page-size`.
- Added `--devices=gpu:all` to use every visible GPU (including MAX Serve).
- Removed the `default` value for `--devices`; omit `--devices` to use the model or config default.
- The serving benchmark entrypoint (`benchmark_serving`) now defaults `--seed` to a fixed value instead of drawing a fresh random seed on each run. The seed drives the workload generator (input/output lengths, session structure, content), so a fixed default makes repeated and scheduled runs reproducible and keeps run-to-run deltas reflecting the change under test rather than workload-draw variance. To opt back into a fresh seed, pass `--seed none` on the CLI (or `seed: null` in a workload/config YAML); the drawn seed is logged and recorded with the results so the run stays reproducible after the fact.
- Added `--profile` to `max pipelines generate` for rudimentary, one-command profiling. With Nsight Systems (`nsys`) on `PATH` and an NVIDIA GPU, the timed run is captured into an `.nsys-rep` file and a ranked top-N GPU kernel summary is printed. Without `nsys`, a Python/CPU profile is produced from `cProfile`. The capture window is bounded by `cudaProfilerStart`/`Stop` so warmup and graph-compile time are excluded. Use `--profile-output` to override the report path.
- Added `--profile` to `max pipelines benchmark` as a synonym for `--trace` that also prints a ranked top-N GPU kernel summary at the end of the run. The server still needs to be launched under `nsys launch` (matching the existing `--trace` requirement); `--profile` removes the "now run `nsys stats` by hand" step.
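The per-turn KV cache retention comparison in the first item above can be illustrated with a little arithmetic. The changelog does not give the benchmark's exact formula, so the sketch below is an assumption: it rounds the previous turn's token count down to a whole number of blocks and reports the fraction of that expected prefix the server says it served from cache. Both function names are hypothetical.

```python
from typing import Optional


def expected_cached_prefix(prev_turn_tokens: int, block_size: int = 128) -> int:
    """Largest block-aligned prefix that could carry over from the prior turn."""
    return (prev_turn_tokens // block_size) * block_size


def retention(
    server_cached_tokens: int, prev_turn_tokens: int, block_size: int = 128
) -> Optional[float]:
    """Fraction of the expected block-aligned prefix found in the cache."""
    expected = expected_cached_prefix(prev_turn_tokens, block_size)
    if expected == 0:
        # Assumption: less than one full block carried over, nothing to retain.
        return None
    return min(server_cached_tokens, expected) / expected


# A 1000-token previous turn aligns down to 7 blocks of 128 tokens.
expected = expected_cached_prefix(1000)
full = retention(896, 1000)
half = retention(448, 1000)
```

Under this reading, a mismatch between `--kv-block-size` and the server's `--kv-cache-page-size` would shift the expected prefix and skew the metric, which is why the two should agree.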
Python API
-
Reduced the default signal buffer size from 1025 MiB to 257 MiB per GPU and fixed a miscalculation of the required space in `MOGGKernelAPI.mojo`. The calculation was wrong by a factor of `1/num_devices`, since each device only needs scratch for its own portion of the collective problem. This reduces the footprint for the current heaviest workload (Kimi-K2.5 with `BlockCopyEngine`) from 16 GB to 4 GB.
-
Added `max.driver.CompletionFlag`, an 8-byte completion flag in pinned host memory mapped into a device's address space. Lets host code signal a GPU stream (or peer host observer) by writing a 64-bit value to a single location visible to both. Currently CUDA-only; constructing against any other backend raises `RuntimeError`.
-
Added `Device.__unsafe_enqueue_async_py_host_func(fn, flag, value, cpu)` and `DeviceStream.wait_for_host_value(flag, value)` for dispatching a Python callable onto an explicit AsyncRT worker pool from a host-function node and gating the GPU stream on its completion (via the `CompletionFlag`). The kickoff trampoline returns immediately, letting the GPU stream proceed concurrently with the worker; a downstream `wait_for_host_value` blocks the stream until the worker stores `value`. The `__unsafe_` prefix marks that the API has no safety net for callbacks that capture state outliving the compiled graph.
-
Added the `mo.wait_host_value` graph op and the `max.nn.kernels.wait_host_value()` Python helper that wraps it. Stalls the device stream until a 64-bit host-visible flag reaches a given value; lowers to CUDA's `cuStreamWaitValue64` and captures cleanly into a CUDA graph as a wait-value node. Lets a captured forward graph gate a downstream consumer kernel on CPU-produced data while the rest of the forward body runs concurrently. Pair with `mo.launch_host_func` or `Device.__unsafe_enqueue_async_py_host_func` to issue the host work whose completion the consumer waits on.
-
Added two new nanobind types to `max._core.engine` that split the compile-and-load pipeline at the type level:
  - `CompiledModels` represents the compile artifact returned by `compile_from_path` / `compile_from_object` on the `max._core.engine.InferenceSession` binding (these methods don't exist on the public `max.engine.InferenceSession` class). It holds the MEF bytes and one or more sub-models; it is not directly executable.
  - `ModelMetadata` exposes per-sub-model metadata (`name`, `input_metadata`, `output_metadata`) and is yielded by iterating a `CompiledModels` or indexing it with `[i]`.
`Model` continues to represent the runnable, post-init handle (still produced by `InferenceSession._load_all`). The high-level `max.engine.CompiledModel` wrapper now holds a `CompiledModels` instance internally.
-
Increased the default allreduce signal buffer size from 513 MiB to 1025 MiB per GPU (`max.nn.comm.allreduce.Signals.NUM_BYTES` and the matching constant in `max.experimental.realization_context`). The previous 512 MiB scratch could not hold the per-peer allgather intermediate for models with large hidden dimensions (for example, Kimi-K2.5 at `hidden_dim=20480` with `max-batch-input-tokens=16384` needs 640 MiB in bf16). This adds ~512 MiB of per-GPU memory use for any multi-GPU model.
-
Added `max.experimental.functional.ceil`, an element-wise unary op that rounds each element of a floating-point tensor up toward positive infinity. Complements the existing `floor`, `round`, and `trunc` ops.
-
`max.experimental.functional.while_loop` now passes `Tensor` (not `TensorValue`) into its `predicate` and `body` callbacks. Callbacks can use ordinary `Tensor` operations directly, without wrapping arguments via `Tensor.from_graph_value(...)` or reaching for the underscore-prefixed `_graph_value` attribute on returns.
-
`max.experimental.nn.Module.compile()` now emits the same `Building and compiling {ClassName}... / Still building... / Building {ClassName} graph took Ns / Compiling {ClassName} took Ms / Building and compiling {ClassName} took Ts` log sequence that pipeline-level `CompilationTimer` produces today, and wraps the compile body in `max.profiler.Tracer` spans (`Module.compile({ClassName})`, `Module.compile.trace`, `Module.compile.session_load`) so an `nsys` capture with `MODULAR_ENABLE_PROFILING=1` shows compilation as named ranges. Every ModuleV3 caller, including pixel-generation pipelines that previously compiled silently, now gets this observability for free. The outer `CompilationTimer("model")` wrappers in `*_modulev3` architectures have been removed to avoid nested timing logs.
-
`max.experimental.nn.Module.load_state_dict` and `Module.compile(weights=...)` now accept an `auto_cast` keyword (default `False`). The framework remains strict by default. When `auto_cast=True` is passed, loaded weights are automatically cast between `float32` and `bfloat16` when shapes match, logging a single summary message per load instead of raising. Other dtype mismatches (`float16`, `fp8`, `fp4`, integers, etc.) continue to raise as before. This removes the need for per-adapter `astype` shims when checkpoint dtypes differ from the module's declared parameter dtype. MAX pipelines opt in via the `MODULAR_AUTO_CAST_WEIGHTS` environment variable (default `true`, parsed by `max.pipelines.lib.weight_loading.auto_cast_weights_from_env`).
-
`CPUMetricsCollector` in `max.diagnostics.cpu` is now used as a context manager instead of `start`/`stop` and now exposes `get_stats()` instead of `dump_stats()`, matching the interface of `GPUDiagContext`.
-
`max.graph.Module` is now a public class for grouping multiple `Graph` instances into a single compilation unit, replacing the previous alias for the underlying MLIR module. Construct one with `Module()` and pass it as the `module=` argument to each `Graph`; the resulting `Module` is what you hand to `InferenceSession.load_all` to compile every graph together. `Graph.empty_module()` has been removed in favor of `Module()`, and `Graph` now exposes a `module` property returning the `Module` it belongs to.
-
`InferenceSession.load_all` now returns a `dict[str, Model]` keyed by each model's `sym_name` (the name of its `mo.graph` op), instead of a `list[Model]` ordered by MEF position. The accepted input type also gained `max.graph.Module`, so callers can compile a pre-built module containing multiple `mo.graph` ops directly. `Model` now exposes a `name` property.
Migrate positional unpacking call sites by indexing the returned dict:

    # Before
    module = Graph.empty_module()
    with Graph("vision", input_types=..., module=module):
        ...
    with Graph("language", input_types=..., module=module):
        ...
    vision_model, language_model = session.load_all(graph, ...)

    # After
    module = Module()
    with Graph("vision", input_types=..., module=module) as vision_graph:
        ...
    with Graph("language", input_types=..., module=module) as language_graph:
        ...
    models = session.load_all(module, ...)
    vision_model = models[vision_graph.name]
    language_model = models[language_graph.name]
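The 640 MiB figure quoted in the allreduce signal buffer entry above can be checked with a short calculation. The sketch assumes the per-peer allgather intermediate is one bf16 element per (token, hidden unit) pair; that reading reproduces the quoted number but is not confirmed by the changelog, and the function name is hypothetical.

```python
BYTES_PER_BF16 = 2
MIB = 1024 * 1024


def allgather_intermediate_mib(
    hidden_dim: int,
    max_batch_input_tokens: int,
    bytes_per_element: int = BYTES_PER_BF16,
) -> float:
    """Size of a tokens x hidden_dim activation buffer, in MiB."""
    return hidden_dim * max_batch_input_tokens * bytes_per_element / MIB


# Kimi-K2.5 example from the changelog entry.
required_mib = allgather_intermediate_mib(20480, 16384)

# Compare against the old and new scratch sizes quoted in the entry.
fits_old = required_mib <= 512
fits_new = required_mib <= 1024
```

The result exceeds the previous 512 MiB scratch and fits within the new 1024 MiB one, consistent with the motivation given for the increase.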
MAX kernels
- The `use_blocking_impl` parameter has been removed from the `foreach` custom op helper (and the underlying `elementwise` primitive), and the analogous `single_thread_blocking_override` parameter has been removed from the `concat` and `concat_shape` kernels and the reduction-based kernels. Work is always dispatched the same way, with a single worker used automatically when the problem size is small. The dedicated small-tensor `concat` fast path has been removed in favor of the existing serial/parallel dispatch.
- Updated `elementwise` call sites across MAX kernels and benchmarks to use `Coord`-native indexing, fixing compile failures caused by invalid `Coord`/`IndexList` conversions.
- Enabled Programmatic Dependent Launch (PDL) for the SM100 (Blackwell) FlashAttention-4 prefill kernel, letting back-to-back attention grids in a stream overlap launch and prologue latency. This reduces per-launch overhead most for shorter sequences (measured ~1.05x–1.5x faster on B200, bf16, head_dim=128 across seq lengths 128–2048). On by default; disable with `-D MHA_PDL=false`.
- Added a simdgroup-tiled matmul kernel for the Apple M5 GPU, bringing neural-accelerator-backed matmul to the MAX framework. In-range MAX matmuls (`m >= 64`, `n >= 64`, `k >= 16`; ragged K supported) now use it: fp16/bf16 always, and fp32 a/b by default (accepting the simdgroup MMA's fp19 truncation). Set `MODULAR_APPLE_M5_ALLOW_LOSSY_F32_MATMUL=0` for the precise naive fp32 path.
Breaking changes
-
KV cache management has moved from `max.kv_cache` to `max.pipelines.kv_cache`. Update imports accordingly:

    # Before
    from max.kv_cache import PagedKVCacheManager, DummyKVCache

    # After
    from max.pipelines.kv_cache import PagedKVCacheManager, DummyKVCache

Deprecation shims with `DeprecationWarning` remain at the old path.
-
Custom Mojo ops used through `max.experimental.torch.CustomOpLibrary` (and the rest of the graph-compiler custom-op path) must now declare their `ctx` parameter as `DeviceContext` instead of `DeviceContextPtr`. The `DeviceContextPtr` type has been removed from the Mojo standard library; see the Mojo nightly changelog entry under Removed for the full migration. Multi-device ops should declare their variadic context argument as `DeviceContextList[N]` (also new; see the Mojo changelog GPU programming section).
-
GPU and CPU diagnostic tooling has moved from `max.diagnostics` to `max.profiler`: `max.diagnostics.gpu` → `max.profiler.gpu` and `max.diagnostics.cpu` → `max.profiler.cpu`. Update imports accordingly. Deprecation shims with `DeprecationWarning` remain at the old paths.
-
`max/python/max/benchmark/benchmark_throughput.py`, deprecated in v0.26.3, has been removed.
Fixes
-
Fixed structured output (`response_format: json_schema` and grammar-guided tool calling) intermittently emitting raw control characters inside JSON string values on models that use a byte-level BPE (TikToken) tokenizer, producing invalid JSON. The constrained-decoding adapter fed llguidance the tokens' byte->unicode surface bytes (e.g. a raw newline rendered as `Ċ`) instead of their true bytes, so the grammar mask admitted control-char tokens as legal string content. Token bytes are now recovered via the tokenizer's `byte_decoder`, so raw control characters are correctly excluded. Fast-tokenizer checkpoints were unaffected.
-
Fixed an expert-parallelism dispatch assertion (`Cannot dispatch EP kernel with N input tokens when the maximum tokens per rank is N-1`) that fired whenever `--max-batch-input-tokens` was not evenly divisible by the tensor-parallel degree. The EP per-rank cap now uses ceiling division to match the ragged binning of `reducescatter` in TP-attention + EP-MoE mode, so the largest shard fits in the dispatch buffer. Affects DeepSeek-V3, Kimi-K2.5, MiniMax-M2, Qwen3, and Step3.5 deployments configured with non-divisible batch sizes.
-
`MODULAR_DEBUG=ir-output-dir=<dir>` (and the equivalent `[max-debug] ir-output-dir = <dir>` config-file entry and `InferenceSession.debug.ir_output_dir = <dir>` Python setter) now actually dumps per-stage MLIR files to the configured directory. The option was previously parsed but no compiler stage consulted it, so users had to fall back to the legacy `MODULAR_MAX_TEMPS_DIR` env var. Both spellings are now honored.
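The expert-parallelism fix above comes down to floor versus ceiling division. The sketch below shows why a floor-based per-rank cap is one token short when the batch does not divide evenly. It assumes a ragged split that gives the remainder to the first ranks, which may differ from the actual `reducescatter` binning; all function names are hypothetical.

```python
def floor_cap(max_tokens: int, tp_degree: int) -> int:
    """Per-rank cap using floor division (the pre-fix behavior, as described)."""
    return max_tokens // tp_degree


def ceil_cap(max_tokens: int, tp_degree: int) -> int:
    """Per-rank cap using ceiling division (the post-fix behavior)."""
    return -(-max_tokens // tp_degree)


def ragged_shards(total_tokens: int, tp_degree: int) -> list[int]:
    """Split tokens across ranks, giving the remainder to the first ranks.

    Assumed binning, used only to illustrate the uneven shard sizes.
    """
    base, rem = divmod(total_tokens, tp_degree)
    return [base + 1 if rank < rem else base for rank in range(tp_degree)]


# 8190 tokens over 4 ranks does not divide evenly.
shards = ragged_shards(8190, 4)
largest = max(shards)
old_cap = floor_cap(8190, 4)
new_cap = ceil_cap(8190, 4)
```

With these numbers the largest shard holds 2048 tokens while the floor-based cap is 2047, matching the "N input tokens when the maximum tokens per rank is N-1" shape of the assertion; the ceiling-based cap of 2048 accommodates it.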
Mojo language
For all the updates to the Mojo language, standard library, and tools, see the Mojo release notes.