v26.1 (2026-01-29)
Highlights
The eager-style Tensor and Module APIs are now the primary API for model development, providing a PyTorch-like development experience:

```python
from max import functional as F
from max.tensor import Tensor
from max.dtype import DType

x = Tensor.constant([1.0, -2.0, 3.0, -4.0, 5.0], dtype=DType.float16)
y = F.relu(x)
print(y)
# Tensor([1 0 3 0 5], dtype=DType.float16, device=Device(type=gpu,id=0))
```

If you want explicit control over the graph structure, you can still build models with the Graph APIs.
For more details, see the model developer guide.
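For comparison, here is a minimal sketch of the same ReLU computation built explicitly with the Graph API; the exact `TensorType` and `DeviceRef` arguments shown are assumptions for illustration and may differ from the current signatures.

```python
from max.dtype import DType
from max.graph import DeviceRef, Graph, TensorType, ops

# Declare the input type explicitly (shape, dtype, and placement are illustrative).
input_type = TensorType(DType.float32, shape=[5], device=DeviceRef.CPU())

with Graph("relu_graph", input_types=[input_type]) as graph:
    (x,) = graph.inputs
    graph.output(ops.relu(x))
```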
Documentation
- The fully refactored MAX LLM book is now designed so the code you write in each exercise incrementally builds upon the last one, until you've built an executable GPT-2 model with the MAX Python API.
- New model developer guide introduces eager-style programming, tensor APIs, and data types. Much more is coming soon.
- New guide to profiling MAX on GPUs with `nsys`.
- Extended documentation for `kbench`, a Python tool to benchmark, autotune, and analyze MAX kernel performance.
MAX models
- Gemma3 now supports vision input (multimodal) in the 12B and 27B variants, including support for local file paths and structured output. Learn more in the image to text guide, and see the request sketch after this list.
- Added `Qwen/Qwen3-VL-4B-Instruct` and `Qwen/Qwen3-VL-2B-Instruct` model architectures.
- Removed Llama 3.2 Vision (`Llama-3.2-11B-Vision-Instruct`) architecture support. Use other vision models such as Pixtral, InternVL, Qwen2.5-VL, and Gemma3.
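As referenced in the Gemma3 item above, here is a minimal sketch of sending an image to a vision-capable model through MAX's OpenAI-compatible chat endpoint; the server URL, model name, and image URL are illustrative assumptions.

```python
from openai import OpenAI

# Assumes a MAX server is already running locally with a Gemma3 vision model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="google/gemma-3-12b-it",  # illustrative model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```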
MAX framework
- All Python wheels are now hosted at https://whl.modular.com/nightly/simple/. If using `uv`, change `--index-url` to `--index`, and if using `pip`, change to `--extra-index-url`. For precise commands, see the install guide.
Inference server
- Improved scheduling to achieve higher KVCache utilization and larger batch sizes. By default, MAX now schedules a context encoding (CE) request only if KVCache memory is less than 95% full after allocating blocks for that request, or if no active requests exist. You can adjust this watermark value (`0.95`) with `--kvcache-ce-watermark`. Be aware that increasing it causes more preemptions.
- When running models with data parallelism (DP), the semantics of max batch size have changed. Previously, specifying `--data-parallel-degree 8` and `--max-batch-size 32` meant that each data-parallel replica could have at most 4 requests, for an aggregate max batch size of 32. The CLI flag now specifies the max batch size per replica, so the aggregate max batch size for the same values is 8 * 32 = 256 requests. This aligns with vLLM and other inference engines.
- `--max-ce-batch-size` is now deprecated. The cap on batch size is now uniform between the context encoding and token generation phases of text generation. Use `--max-batch-size` instead.
- The API server now returns chunked tokens from the model worker, reducing overhead and significantly improving throughput for small models and decode-heavy workloads.
- Server stats collection (`collect_server_stats`) is now enabled by default for serving benchmarks.
max CLI
- The `max generate` command now applies the model's chat template internally when using `--prompt`. This more closely aligns with how users typically prompt a model for testing and ensures special tokens are properly filtered from the output.
- Added tracing flags to `max benchmark` for `nsys` profiling:
  - `--trace`: Enable tracing of the benchmark run (currently NVIDIA GPUs only)
  - `--trace-file`: Path to save the trace file
  - `--trace-session`: Optional session name for tracing

  This requires the server to be run under `nsys launch`. Using `--gpu-profiling detailed` is recommended.
Python API
- The eager-style `Tensor` APIs are now the primary API for model development, providing a PyTorch-like development experience. We moved the eager-style tensor APIs out of `experimental` and reorganized the `max.nn` module to make the eager module system the primary API (`nn.module_v3` is now `nn.module`). The previous `max.nn` components are still available for backward compatibility in `max.nn.legacy`.
- Renamed `max.driver.Tensor` to `max.driver.Buffer` to clarify that it represents a low-level memory buffer, not a tensor. The `max.tensor.Tensor` class remains the primary tensor type.
- Added a `forward()` method to `Module` to compute the output. It behaves the same as invoking the object as a callable (the `__call__()` method); see the sketch after this list.
- `accelerator_count()` now returns a non-zero value when called on an Apple silicon system. This means code such as `device = CPU() if accelerator_count() == 0 else Accelerator()` now defaults to the available Apple silicon GPU (see the sketch after this list), so MAX graphs should in most cases be dispatched to run on Apple silicon GPUs. Note that most MAX models do not yet work on Apple silicon GPUs due to missing hardware-specific kernel pathways and other support, but this is an important step toward enabling MAX more broadly on Apple silicon GPUs.
- Added `max.nn.module.rope` containing the rotary embedding implementations `RotaryEmbedding` and `TransposedRotaryEmbedding`.
- Added `ArchConfig` and `ArchConfigWithKVCache`. Going forward, models that register with the MAX architecture registry must define a config that implements this protocol.
- Added `ops.complex.mul` for multiplying complex-valued tensors.
- Added `calculate_virtual_device_count()`, `calculate_virtual_device_count_from_cli()`, and `load_max_buffer()` to `max.driver`.
- Added `TokenBuffer` for token management.
- Renamed `prefill_chunk_size` to `max_batch_input_tokens` and `max_batch_context_length` to `max_batch_total_tokens` in the `PipelineConfig` and `TTSConfig` classes to better reflect their purpose in batch memory management. The corresponding CLI flags have also been renamed: `--prefill-chunk-size` is now `--max-batch-input-tokens` and `--max-batch-context-length` is now `--max-batch-total-tokens`. See the config sketch after this list.
- Fixed `max.driver.Buffer.to(stream)` to not copy (it returns a reference to the same buffer) when the stream is on the same device, even for GPU-pinned host memory.
- Removed deprecated `max.nn` convolution classes: `Conv2dV1`, `Conv1DV1`, `Conv3DV1`. Use `Conv2d`, `Conv1D`, and `Conv3D` instead.
- Removed deprecated `max.nn` layer classes: `LinearV1`, `QLinearV1`, `GPTQLinearV1`, `MLPV1`, `EmbeddingV1`, `LayerNormV1`, `RMSNormV1`. Use `Linear`, `GPTQLinear`, `MLP`, `Embedding`, `LayerNorm`, and `RMSNorm` instead.
- Removed `max.engine.MojoValue`.
- Removed the deprecated `custom_ops_path` parameter from `InferenceSession.load()`. Use the `custom_extensions` parameter instead.
- Added `graph.ops.shard_and_stack()`.
- Removed unused `graph.weights.PytorchWeights`.
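As referenced in the `forward()` item above, here is a minimal sketch of the eager-style `Module` API. Anything beyond what the notes above state, in particular the `nn.Linear` constructor signature and the nested-list input, is an assumption for illustration and may differ from the actual API.

```python
from max import functional as F
from max import nn
from max.dtype import DType
from max.tensor import Tensor


class TinyMLP(nn.Module):
    """A two-layer MLP written against the eager-style Module API."""

    def __init__(self):
        super().__init__()
        # Layer sizes are arbitrary; Linear's signature is assumed here.
        self.hidden = nn.Linear(4, 8)
        self.out = nn.Linear(8, 2)

    def __call__(self, x: Tensor) -> Tensor:
        return self.out(F.relu(self.hidden(x)))


model = TinyMLP()
# Nested list for a [1, 4] input (assumed supported by Tensor.constant).
x = Tensor.constant([[0.5, -1.0, 2.0, 0.0]], dtype=DType.float32)

# Calling the module and calling forward() compute the same output.
y1 = model(x)
y2 = model.forward(x)
```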
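And here is the device-selection snippet from the `accelerator_count()` item as a self-contained example, using `CPU`, `Accelerator`, and `accelerator_count` from `max.driver`:

```python
from max.driver import CPU, Accelerator, accelerator_count

# On Apple silicon (and other supported GPUs), accelerator_count() is now
# non-zero, so this selects the GPU device instead of falling back to CPU.
device = CPU() if accelerator_count() == 0 else Accelerator()
print(device)
```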
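Finally, a hedged sketch of the renamed `PipelineConfig` fields referenced above; the import path, the model name, the token limits, and the idea that these fields can be passed directly as keyword arguments are all assumptions.

```python
from max.pipelines import PipelineConfig

config = PipelineConfig(
    model_path="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    max_batch_input_tokens=8192,    # previously prefill_chunk_size
    max_batch_total_tokens=32768,   # previously max_batch_context_length
)
```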
MAX kernels
- Improved performance for the Hopper matmul with skinny M shapes. In particular, when M is between 2 and 64, specific shapes see a significant performance boost of roughly 10-40%.
- Added a swapAB optimization to the Hopper matmul, which computes B x A and performs a transposed write to C. This helps when you need more granularity in the M dimension.
- Refined the `create_stream` API: all streams are now non-blocking (the `blocking` argument has been removed). Explicitly use `DeviceEvent` and `synchronize()` wherever necessary.
Mojo language
For all the updates to the Mojo language, standard library, and tools, including all GPU programming and Layout/LayoutTensor changes, see the Mojo changelog.