v25.6 (2025-09-22)

Highlights

  • MAX delivers state-of-the-art performance on NVIDIA Blackwell (B200)!

    We've been describing our Blackwell bring-up over a series of blog posts, and we recently published Part 4: Breaking SOTA, in which we share our latest matmul benchmarks compared to NVIDIA's cuBLAS library.

  • MAX provides industry-leading performance on AMD MI355X!

    In a matter of weeks, we got MAX running on the brand-new MI355X system and have already produced early benchmarks that go head-to-head with Blackwell. If you have access to an MI355X, you can try it yourself today by following our quickstart guide.

  • Benchmarking endpoints is easier than ever with the new max benchmark command, which accepts YAML configuration files so you can easily share and reproduce your benchmarks.

Documentation

  • Our new quickstart guide lets you pick the model architecture and size you want, and then shows you how to deploy it and run our open-source benchmarking script, all from the max CLI.

  • We updated and simplified the benchmarking tutorial to use the new max benchmark command.

MAX framework

  • Added device-aware work scheduling for AsyncRT: work items can now specify a deviceHint to route execution to specific worker threads based on device affinity, improving multi-device performance.

  • Improved code quality by enabling a large set of Ruff lints, including flake8-annotations (ANN), which now enforces Python type annotations for new contributions.

Inference server

  • Added support for data parallelism in Llama models. To enable this feature, use the --data-parallel-degree option:

    max serve --model $MODEL_ID --data-parallel-degree 2 --devices gpu:0,1

  • Metrics for each context-encoding and token-generation batch are now logged to the console periodically. You can override the default logging interval (3 seconds) by setting the MAX_SERVE_SCHEDULER_STATS_LOG_INTERVAL_S environment variable. For example, setting MAX_SERVE_SCHEDULER_STATS_LOG_INTERVAL_S=0 logs metrics for every batch.
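
    For instance, to raise the interval to 10 seconds (a sketch reusing the serve invocation from the data-parallelism example above):

    MAX_SERVE_SCHEDULER_STATS_LOG_INTERVAL_S=10 max serve --model $MODEL_ID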

  • Improved error messages when pulling a model that requires more RAM than what's available or when there won't be enough RAM left for the KV cache.

max CLI

  • Added the max benchmark subcommand, which runs a suite of benchmarks against a model server and collects performance metrics. The command is a convenient, pre-packaged way to run our open-source benchmark_serving.py script, and it accepts all the same options.
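
    A minimal sketch (the URL assumes a locally running server, and the option names below mirror common benchmark_serving.py options; treat them as assumptions rather than an exhaustive reference):

    max benchmark --base-url http://localhost:8000 --model $MODEL_ID --num-prompts 500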

  • Added --chat-template to the CLI for passing a custom chat template defined in a Jinja2 template file.
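
    For example, a sketch assuming use with max serve (the template path is illustrative):

    max serve --model $MODEL_ID --chat-template ./chat_template.jinja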

  • Renamed the --allow-safetensors-weights-float32-to-bfloat16-cast flag to --allow-safetensors-weights-fp32-bf16-bidirectional-cast, which supports automatic bidirectional dtype casts when needed.

  • The max generate command now supports --top-k, --temperature, and --seed flags.
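
    For example (the prompt text is illustrative):

    max generate --model $MODEL_ID --prompt "Why is the sky blue?" --top-k 40 --temperature 0.7 --seed 42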

  • Changed --num-warmups behavior. Previously, it ran the model on the prompt N times, generating until reaching a stop condition each time. Now it runs the model for N steps, generating N new tokens as a warmup.

  • Added the --model option as a preferred alternative to --model-path. They behave the same.

  • Deprecated --pad-to-multiple-of.

  • Removed the previously deprecated --model-name. Use --served-model-name instead.

Python API

  • Removed the previously deprecated KVCacheStrategy.CONTINUOUS and all associated classes (including ContinuousBatchingKVCacheManager).

  • Added ops.fence, a pure identity operation that prevents the async runtime from reordering operations across it. This operation is essential for implementing cross-device synchronization.
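
    A minimal sketch of the idea (the exact call pattern is inferred from the description above and is an assumption):

    # During graph construction: fencing a graph value returns it unchanged,
    # but blocks the async runtime from reordering operations across this point.
    x = ops.fence(x)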

  • Removed PipelineConfig.max_new_tokens. Use SamplingParams.max_new_tokens instead.

  • Added logits_processor to SamplingParams for updating logits in-place during each step of token generation.
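
    A rough sketch (the import path and callback signature are assumptions, not confirmed by this note):

    from max.pipelines import SamplingParams  # import path assumed

    # Hypothetical processor: forbid one token id by setting its logit to -inf.
    # Assumes the callback receives the logits tensor and mutates it in place.
    def ban_token_42(logits):
        logits[42] = float("-inf")  # 42: example token id

    params = SamplingParams(temperature=0.7, logits_processor=ban_token_42)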

  • Added generate() to TextGenerationPipeline and SpeculativeDecodingPipeline, a convenience method for getting text generations. generate_async() is available for getting streamed outputs.

  • Renamed the target_num_new_tokens configuration parameter to prefill_chunk_size in PipelineConfig and TTSConfig classes to better reflect its role in chunked prefill operations.

  • Fixed ops.range to respect the dtype parameter when using Dim objects as inputs. Previously, the dtype was ignored and defaulted to int64.

  • Made the devices argument in InferenceSession() required. To maintain the previous default behavior, use InferenceSession(devices=[CPU()]).

  • Added an optional logging argument to InferenceSession(). When set to "op", it prints each operation launch to stderr.
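
    Combining the two InferenceSession changes above into one sketch (the import paths follow MAX's existing max.driver and max.engine modules):

    from max.driver import CPU
    from max.engine import InferenceSession

    # devices is now required; passing CPU() preserves the old default behavior.
    session = InferenceSession(devices=[CPU()])

    # The new logging argument prints operation launches to stderr.
    debug_session = InferenceSession(devices=[CPU()], logging="op")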

  • Added max.nn.lora, providing Low-Rank Adaptation (LoRA) support for parameter-efficient fine-tuning of neural network models.

  • Added max.nn.moe, implementing Mixture of Experts (MoE) layers for scalable model architectures.

  • Added max.nn.sampling, containing advanced sampling methods including MinP and rejection sampling techniques.
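
    For background, the core of min-p sampling: tokens whose probability falls below min_p times the most likely token's probability are dropped before sampling. A generic sketch of the technique itself, not the max.nn.sampling API:

    import numpy as np

    def min_p_filter(probs: np.ndarray, min_p: float = 0.05) -> np.ndarray:
        """Keep tokens with probability >= min_p * max(probs), then renormalize."""
        kept = np.where(probs >= min_p * probs.max(), probs, 0.0)
        return kept / kept.sum()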

  • Added max.nn.hooks, providing debugging and inspection hooks for neural network layers.

  • Added attention submodules max.nn.attention.mask_config, max.nn.attention.multihead_attention, and max.nn.attention.multi_latent_attention for comprehensive attention mechanism configuration and implementation.

  • Moved some Mojo-related functionality to a new top-level mojo Python namespace. Specifically, max.mojo (previously used for Mojo-Python interop), some of max.support, and max.entrypoints.mojo now live under the mojo namespace and are provided in the new mojo package.

MAX kernels

  • Added a leaky ReLU activation function kernel.
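
    For reference, a scalar sketch of the function the kernel computes (alpha is the conventional negative-slope parameter; the kernel's actual parameterization isn't specified here):

    def leaky_relu(x: float, alpha: float = 0.01) -> float:
        return x if x > 0 else alpha * x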

  • Added a specialized RMS norm kernel for the common case of cols=128 with bfloat16.

Mojo language

For all the updates to the Mojo language, standard library, and tools, including all GPU programming changes, see the Mojo changelog.
