v25.6 (2025-09-22)
Highlights
- MAX delivers state-of-the-art performance on NVIDIA Blackwell (B200)! We've been describing our Blackwell bring-up in a series of blog posts, and we recently published Part 4: Breaking SOTA, in which we share our latest matmul benchmarks compared against NVIDIA's cuBLAS library.
- MAX provides industry-leading performance on AMD MI355X! In a matter of weeks, we got MAX running on the brand-new MI355X system and have already produced early benchmarks that go head-to-head with Blackwell. If you have access to an MI355X, you can try it yourself today by following our quickstart guide.
- Benchmarking endpoints is easier than ever with the new `max benchmark` command, which accepts YAML configuration files so you can easily share and reproduce your benchmarks.
Documentation
- Our new quickstart guide lets you pick the model architecture and size you want, and then shows you how to deploy it and run our open-source benchmarking script, all from the `max` CLI.
- We updated and simplified the benchmarking tutorial to use the new `max benchmark` command.
MAX models
- Added the gpt-oss model architecture (GPU, bfloat16). Try GPT-OSS now.
MAX framework
- Added device-aware work scheduling for AsyncRT: work items can now specify a `deviceHint` to route execution to specific worker threads based on device affinity, improving multi-device performance.
- Improved code quality by enabling a large set of Ruff lints, including flake8-annotations (ANN), which now enforces Python type annotations for new contributions, as illustrated after this list.
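As a rough illustration (not taken from the MAX codebase), the flake8-annotations (ANN) rules flag function signatures that omit parameter or return annotations, so new contributions are expected to look like the second definition below:

```python
# Flagged by flake8-annotations (e.g. ANN001/ANN201): no parameter or return annotations.
def scale(values, factor):
    return [v * factor for v in values]


# Passes the ANN rules: every parameter and the return value are annotated.
def scale_annotated(values: list[float], factor: float) -> list[float]:
    return [v * factor for v in values]
```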
Inference server
- Added support for data parallelism in Llama models. To enable this feature, use the `--data-parallel-degree` option: `max serve --model $MODEL_ID --data-parallel-degree 2 --devices gpu:0,1`
- Metrics for each context encoding and token generation batch are now logged to the console periodically. You can override the default logging interval (3 seconds) by setting the `MAX_SERVE_SCHEDULER_STATS_LOG_INTERVAL_S` environment variable. For example, setting `MAX_SERVE_SCHEDULER_STATS_LOG_INTERVAL_S=0` logs metrics for every batch.
- Improved error messages when pulling a model that requires more RAM than is available, or when there won't be enough RAM left for the KV cache.
max CLI
- Added the `max benchmark` subcommand, which runs a suite of benchmarks and collects performance metrics on a model server. This command provides convenient packaging and installation for our open-source `benchmark_serving.py` script and accepts all the same options.
- Added `--chat-template` to the CLI for passing a custom chat template defined in a Jinja2 template file.
- Renamed the `--allow-safetensors-weights-float32-to-bfloat16-cast` flag to `--allow-safetensors-weights-fp32-bf16-bidirectional-cast`, which supports automatic bidirectional dtype casts when needed.
- The `max generate` command now supports the `--top-k`, `--temperature`, and `--seed` flags.
- Changed the `--num-warmups` behavior. Previously, it ran the model on the prompt N times, generating until reaching a stop condition each time. Now it runs the model for N steps, generating N new tokens as a warmup.
- Added the `--model` option as a preferred alternative to `--model-path`. They behave the same.
- Deprecated `--pad-to-multiple-of`.
- Removed the previously deprecated `--model-name`. Use `--served-model-name` instead.
Python API
- Removed the previously deprecated `KVCacheStrategy.CONTINUOUS` and all associated classes (including `ContinuousBatchingKVCacheManager`).
- Added `ops.fence`, a pure identity operation that prevents the async runtime from reordering operations across it. This operation is essential for implementing cross-device synchronization.
- Removed `PipelineConfig.max_new_tokens`. Use `SamplingParams.max_new_tokens` instead.
- Added `logits_processor` to `SamplingParams` for updating logits in-place during each step of token generation; a minimal sketch of such a processor follows this list.
- Added `generate()` to `TextGenerationPipeline` and `SpeculativeDecodingPipeline`, a convenience method for getting text generations. `generate_async()` is available for getting streamed outputs.
- Renamed the `target_num_new_tokens` configuration parameter to `prefill_chunk_size` in the `PipelineConfig` and `TTSConfig` classes to better reflect its role in chunked prefill operations.
- Fixed `ops.range` to respect the `dtype` parameter when using `Dim` objects as inputs. Previously, the dtype was ignored and defaulted to int64.
- Made the `devices` argument in `InferenceSession()` required. To maintain the previous default behavior, use `InferenceSession(devices=[CPU()])`.
- Added an optional `logging` argument to `InferenceSession()`. When set to `"op"`, this option enables operation launch output to stderr.
- Added `max.nn.lora`, providing Low-Rank Adaptation (LoRA) support for parameter-efficient fine-tuning of neural network models (the underlying LoRA update is sketched after this list).
- Added `max.nn.moe`, implementing Mixture of Experts (MoE) layers for scalable model architectures.
- Added `max.nn.sampling`, containing advanced sampling methods including MinP and rejection sampling techniques (MinP filtering is also sketched after this list).
- Added `max.nn.hooks`, providing debugging and inspection hooks for neural network layers.
- Added the attention submodules `max.nn.attention.mask_config`, `max.nn.attention.multihead_attention`, and `max.nn.attention.multi_latent_attention` for comprehensive attention mechanism configuration and implementation.
- Moved some Mojo-related functionality to a new top-level `mojo` Python namespace. Specifically, `max.mojo` (previously used for Mojo-Python interop), some of `max.support`, and `max.entrypoints.mojo` now live under the `mojo` namespace and are provided in the new `mojo` package.
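As a rough sketch of the kind of callable you could pass as `logits_processor`: the exact callable signature accepted by `SamplingParams` isn't spelled out here, so treat the `(token_ids, logits)` arguments, the in-place update, and the field wiring as assumptions. This example suppresses one banned token id at every generation step:

```python
import numpy as np

BANNED_TOKEN_ID = 42  # hypothetical token id to suppress

def ban_token_processor(token_ids: np.ndarray, logits: np.ndarray) -> None:
    """Push one token's logit to -inf so it can never be sampled.

    The changelog says the processor updates logits in-place at each step of
    token generation; the argument list used here is an assumption.
    """
    logits[..., BANNED_TOKEN_ID] = -np.inf

# Hypothetical wiring; check the SamplingParams docs for the exact field name and signature.
# sampling_params = SamplingParams(logits_processor=ban_token_processor)
```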
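For context on what `max.nn.lora` provides, here is the core LoRA idea in plain NumPy. This shows the standard technique, not MAX's implementation: the frozen weight `W` is augmented with a low-rank update `(alpha / r) * B @ A`, so only the small factors `A` and `B` are trained.

```python
import numpy as np

d_out, d_in, r, alpha = 64, 64, 8, 16

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # trainable, zero-initialized so the initial update is zero

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Standard LoRA: y = W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
print(lora_forward(x).shape)  # (64,)
```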
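Similarly, as background for `max.nn.sampling`, this is a minimal NumPy sketch of MinP filtering as the technique is usually described, not MAX's kernel: tokens whose probability falls below `min_p` times the most likely token's probability are masked out before sampling.

```python
import numpy as np

def min_p_filter(logits: np.ndarray, min_p: float = 0.1) -> np.ndarray:
    """Mask tokens whose probability is below min_p * (max probability)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    keep = probs >= min_p * probs.max()
    return np.where(keep, logits, -np.inf)

rng = np.random.default_rng(0)
logits = rng.normal(size=32)
filtered = min_p_filter(logits, min_p=0.2)
print(int(np.isfinite(filtered).sum()), "tokens survive the MinP filter")
```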
MAX kernels
- Added a leaky ReLU activation function kernel.
- Added a specialized RMS norm kernel for the common case of `cols=128`, `bfloat16`. Reference formulas for both kernels are sketched below.
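For reference, these are the standard definitions the two kernels implement, written as a plain NumPy sketch. The actual kernels are fused GPU implementations (the RMS norm one specialized for 128-column bfloat16 rows); the `negative_slope` and `eps` values here are illustrative defaults, not confirmed kernel parameters.

```python
import numpy as np

def leaky_relu(x: np.ndarray, negative_slope: float = 0.01) -> np.ndarray:
    # f(x) = x for x > 0, negative_slope * x otherwise
    return np.where(x > 0, x, negative_slope * x)

def rms_norm(x: np.ndarray, gamma: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Normalize by the root-mean-square over the last axis, then scale by gamma.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.random.randn(4, 128).astype(np.float32)   # cols=128, the specialized case
gamma = np.ones(128, dtype=np.float32)
print(leaky_relu(x).shape, rms_norm(x, gamma).shape)
```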
Mojo language
For all the updates to the Mojo language, standard library, and tools, including all GPU programming changes, see the Mojo changelog.