
v26.1 (2026-01-29)

Highlights

The eager-style Tensor and Module APIs are now the primary API for model development, providing a PyTorch-like development experience:

from max import functional as F
from max.tensor import Tensor
from max.dtype import DType

x = Tensor.constant([1.0, -2.0, 3.0, -4.0, 5.0], dtype=DType.float16)
y = F.relu(x)
print(y)
# Tensor([1 0 3 0 5], dtype=DType.float16, device=Device(type=gpu,id=0))

If you want explicit control over the graph structure, you can still build models with the Graph APIs.
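
For example, the same ReLU computation can be expressed as an explicit graph. This is a minimal sketch, not taken from the release notes: the graph name, input shape, and device below are illustrative.

from max.dtype import DType
from max.graph import DeviceRef, Graph, TensorType, ops

# Build a graph with one float32 input of shape [5] placed on the CPU.
with Graph(
    "relu_graph",
    input_types=[TensorType(DType.float32, [5], device=DeviceRef.CPU())],
) as graph:
    x = graph.inputs[0]
    graph.output(ops.relu(x))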

For more details, see the model developer guide.

Documentation

MAX models

  • Gemma3 now supports vision input (multimodal) in the 12B and 27B variants, including support for local file paths and structured output. Learn more in the image-to-text guide.

  • Added Qwen/Qwen3-VL-4B-Instruct and Qwen/Qwen3-VL-2B-Instruct model architectures.

  • Removed Llama 3.2 Vision (Llama-3.2-11B-Vision-Instruct) architecture support. Use other vision models such as Pixtral, InternVL, Qwen2.5-VL, and Gemma3.

MAX framework

  • All Python wheels are now hosted at https://whl.modular.com/nightly/simple/. If you use uv, change --index-url to --index; if you use pip, change --index-url to --extra-index-url. For the precise commands, see the install guide.

Inference server

  • Improved scheduling to achieve higher KVCache utilization and batch sizes. By default, MAX now schedules a context encoding (CE) request only if KVCache memory is less than 95% full after allocating blocks for that request or if no active requests exist. You can adjust this watermark value (0.95) with --kvcache-ce-watermark. Beware that increasing it causes more preemptions.

  • When running models with data parallelism (DP), the semantics of the max batch size setting have changed. Previously, specifying --data-parallel-degree 8 and --max-batch-size 32 meant that each data-parallel replica could have at most 4 requests, for an aggregate max batch size of 32. The CLI flag now specifies the max batch size per replica, so the same values yield an aggregate max batch size of 8*32=256 requests. This aligns with vLLM and other inference engines.
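
    A quick sketch of the arithmetic under both semantics, using the values from the example above:

    # Illustrative arithmetic only; the flag values come from the example above.
    data_parallel_degree = 8
    max_batch_size = 32                                        # per-replica cap under the new semantics
    aggregate = data_parallel_degree * max_batch_size          # 8 * 32 = 256 requests in total
    per_replica_old = max_batch_size // data_parallel_degree   # 4 requests per replica under the old semantics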

  • --max-ce-batch-size is now deprecated. The cap on batch size is now uniform between context encoding and token generation phases of text generation. Use --max-batch-size instead.

  • The API server now returns chunked tokens from the model worker, reducing overhead and significantly improving throughput for small models and decode-heavy workloads.

  • Server stats collection (collect_server_stats) is now enabled by default for serving benchmarks.

max CLI

  • The max generate command now applies the model's chat template internally when using --prompt. This more closely aligns with how users typically prompt a model for testing and ensures special tokens are properly filtered from output.

  • Added tracing flags to max benchmark for nsys profiling:

    • --trace: Enable tracing of the benchmark run (currently NVIDIA GPUs only)
    • --trace-file: Path to save the trace file
    • --trace-session: Optional session name for tracing

    Requires the server to be run under nsys launch. Using --gpu-profiling detailed is recommended.

Python API

  • The eager-style Tensor APIs are now the primary API for model development, providing a PyTorch-like development experience.

    We moved the eager-style tensor APIs out of experimental and reorganized the max.nn module to make the eager module system the primary API (nn.module_v3 is now nn.module).

    The previous max.nn components are still available for backward compatibility in max.nn.legacy.
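
    A sketch of the resulting import layout; re-exporting Module at the top of max.nn is an assumption here:

    from max.nn import Module    # eager module system, formerly max.nn.module_v3
    import max.nn.legacy         # previous max.nn components remain available here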

  • Renamed max.driver.Tensor to max.driver.Buffer to clarify that it represents a low-level memory buffer, not a tensor. The max.tensor.Tensor class remains the primary tensor type.
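
    In code, the rename looks like this (a minimal sketch):

    from max.driver import Buffer    # low-level memory buffer, formerly max.driver.Tensor
    from max.tensor import Tensor    # the primary tensor type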

  • Added a forward() method to Module that computes the output; it behaves the same as invoking the object as a callable (the __call__() method).
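
    A minimal sketch of the equivalence, assuming the eager Module base class is importable from max.nn and that subclasses define __call__:

    from max import functional as F
    from max.dtype import DType
    from max.nn import Module
    from max.tensor import Tensor

    class ReLUBlock(Module):
        def __call__(self, x: Tensor) -> Tensor:
            return F.relu(x)

    block = ReLUBlock()
    x = Tensor.constant([-1.0, 2.0, -3.0], dtype=DType.float32)
    y = block(x)            # callable form
    y = block.forward(x)    # equivalent: forward() mirrors __call__()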

  • accelerator_count() now returns a non-zero value when called on an Apple silicon system. This means you can use this code:

    from max.driver import CPU, Accelerator, accelerator_count
    device = CPU() if accelerator_count() == 0 else Accelerator()

    With this change, the code above defaults to using the available Apple silicon GPU. As a consequence, MAX graphs should in most cases be dispatched to run on Apple silicon GPUs. Note that most MAX models do not yet work on Apple silicon GPUs due to missing hardware-specific kernel pathways and other support, but this is an important step towards enabling MAX more broadly on Apple silicon GPUs.

  • Added max.nn.module.rope containing rotary embedding implementations, RotaryEmbedding and TransposedRotaryEmbedding.

  • Added ArchConfig and ArchConfigWithKVCache. Going forward, models that register with the MAX architecture registry must define a config that implements this protocol.

  • Added ops.complex.mul for multiplying complex-valued tensors.

  • Added calculate_virtual_device_count(), calculate_virtual_device_count_from_cli(), load_max_buffer() to max.driver.

  • Added TokenBuffer for token management.

  • Renamed prefill_chunk_size to max_batch_input_tokens and max_batch_context_length to max_batch_total_tokens in the PipelineConfig and TTSConfig classes to better reflect their purpose in batch memory management.

    The corresponding CLI flags have also been renamed: --prefill-chunk-size is now --max-batch-input-tokens and --max-batch-context-length is now --max-batch-total-tokens.
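
    A sketch using the new parameter names; the import path, model path, and values shown here are illustrative assumptions:

    from max.pipelines import PipelineConfig

    config = PipelineConfig(
        model_path="modularai/Llama-3.1-8B-Instruct-GGUF",  # illustrative model
        max_batch_input_tokens=8192,                        # formerly prefill_chunk_size
        max_batch_total_tokens=65536,                       # formerly max_batch_context_length
    )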

  • Fixed max.driver.Buffer.to(stream) so that it does not copy (it returns a reference to the same buffer) when the stream is on the same device, even for GPU-pinned host memory.

  • Removed deprecated max.nn convolution classes: Conv2dV1, Conv1DV1, Conv3DV1. Use Conv2d, Conv1D, Conv3D instead.

  • Removed deprecated max.nn layer classes: LinearV1, QLinearV1, GPTQLinearV1, MLPV1, EmbeddingV1, LayerNormV1, RMSNormV1. Use Linear, GPTQLinear, MLP, Embedding, LayerNorm, RMSNorm instead.

  • Removed max.engine.MojoValue.

  • Removed the deprecated custom_ops_path parameter from InferenceSession.load(). Instead use the custom_extensions parameter.
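
    A minimal sketch of the replacement parameter (the file paths are illustrative placeholders):

    from max.engine import InferenceSession

    session = InferenceSession()
    model = session.load(
        "path/to/model",                           # model or graph to load
        custom_extensions=["my_kernels.mojopkg"],  # replaces custom_ops_path
    )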

  • Added graph.ops.shard_and_stack().

  • Removed unused graph.weights.PytorchWeights.

MAX kernels

  • Improved performance for Hopper Matmul with skinny M shapes. In particular, when M is between 2 and 64, specific shapes see a performance boost of 10–40%.

  • Added the swapAB optimization to Hopper Matmul, which computes B x A and performs a transposed write to C. This helps when you need more granularity in the M dimension.

  • Refined the create_stream API: all streams are now non-blocking (the blocking argument has been removed). Use DeviceEvent and synchronize() explicitly wherever necessary.

Mojo language

For all the updates to the Mojo language, standard library, and tools, including all GPU programming and Layout/LayoutTensor changes, see the Mojo changelog.
