Nightly (v26.4)
This version is still a work in progress.
MAX models
- Added MXFP4 quantization support for MiniMax-M2.
MAX framework
Inference server
- MAX Serve now emits the maxserve.num_requests_queued OTel/Prometheus metric (changed from an UpDownCounter to a synchronous Gauge). The gauge is sampled once per scheduler iteration from BatchMetrics.publish_metrics and reports the depth of the scheduler's CE / prefill queue (the same value as the "Pending: N reqs" line in scheduler logs). It is published by every text-path scheduler that drives BatchMetrics: TokenGenerationScheduler and PrefillScheduler (via TextBatchConstructor), and DecodeScheduler (via len(pending_reqs) + len(prefill_reqs)). Operators can use this metric to observe queue buildup during overload conditions.
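As a rough illustration of how an operator might consume this gauge, the sketch below parses Prometheus text-exposition output and flags a sustained backlog. It uses only the Python standard library. The Prometheus-side name maxserve_num_requests_queued (dots replaced by underscores), the sample scrape text, and the helper names read_gauge and is_backing_up are assumptions made for illustration; check the actual /metrics output of your deployment.

```python
import re

# Assumed Prometheus-side name: exporters commonly replace dots with
# underscores, but verify this against your own /metrics output.
METRIC_NAME = "maxserve_num_requests_queued"

# Simplified sample matcher: metric name, optional {labels}, then the value.
# It does not handle "}" inside quoted label values.
_SAMPLE_RE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(?P<value>\S+)"
)


def read_gauge(exposition_text, metric_name=METRIC_NAME):
    """Return the sum of all samples for metric_name, or None if absent."""
    total = None
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match and match.group("name") == metric_name:
            total = (total or 0.0) + float(match.group("value"))
    return total


def is_backing_up(samples, threshold):
    """True if every sample in a non-empty window is at or above threshold."""
    return bool(samples) and all(s >= threshold for s in samples)


# Hypothetical scrape output, for illustration only.
SAMPLE = """\
# HELP maxserve_num_requests_queued Requests waiting for prefill
# TYPE maxserve_num_requests_queued gauge
maxserve_num_requests_queued 7
"""

print(read_gauge(SAMPLE))
```

Because the metric is a gauge, each scrape already holds the absolute queue depth, so a check like is_backing_up can compare raw samples directly without computing a rate.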
max CLI
- Added --devices=gpu:all to use every visible GPU (including MAX Serve).
Python API
- CPUMetricsCollector in max.diagnostics.cpu is now used as a context manager instead of start/stop and now exposes get_stats() instead of dump_stats(), matching the interface of GPUDiagContext.
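The calling pattern this change implies can be sketched with a stand-in class. StandInCollector below is hypothetical and is not the real CPUMetricsCollector: the real constructor arguments and the shape of the value returned by get_stats() are not described in these notes, so the stand-in simply records wall-clock and CPU time to show the with-block followed by a get_stats() call.

```python
import time


class StandInCollector:
    """Hypothetical stand-in; NOT the real CPUMetricsCollector."""

    def __init__(self):
        self._start = None
        self._stats = {}

    def __enter__(self):
        # Entering the block replaces the old explicit start() call.
        self._start = (time.perf_counter(), time.process_time())
        return self

    def __exit__(self, exc_type, exc, tb):
        # Leaving the block replaces the old explicit stop() call.
        wall0, cpu0 = self._start
        self._stats = {
            "wall_seconds": time.perf_counter() - wall0,
            "cpu_seconds": time.process_time() - cpu0,
        }
        return False  # do not swallow exceptions from the block

    def get_stats(self):
        # Returns the data instead of dumping it, like the new get_stats().
        return dict(self._stats)


with StandInCollector() as collector:
    sum(i * i for i in range(100_000))  # workload being measured

stats = collector.get_stats()
print(sorted(stats))
```

With the real class, the import would be from max.diagnostics.cpu import CPUMetricsCollector as named in the note above; code that previously called start(), stop(), and dump_stats() would presumably move to this with-block shape.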
Breaking changes
- max/python/max/benchmark/benchmark_throughput.py, deprecated in v0.26.3, has been removed.
Fixes
- MODULAR_DEBUG=ir-output-dir=<dir> (and the equivalent [max-debug] ir-output-dir = <dir> config-file entry and InferenceSession.debug.ir_output_dir = <dir> Python setter) now actually dumps per-stage MLIR files to the configured directory. The option was previously parsed but no compiler stage consulted it, so users had to fall back to the legacy MODULAR_MAX_TEMPS_DIR env var. Both spellings are now honored.
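One way to try the environment-variable form is to build a child-process environment with MODULAR_DEBUG set and then list whatever lands in the directory. This is a minimal sketch using only the standard library. The helper names env_with_ir_dump and list_dumped_files and the commented-out run_model.py entry point are hypothetical, and the sketch makes no assumption about the names or extensions of the dumped files.

```python
import os
import tempfile
from pathlib import Path


def env_with_ir_dump(dump_dir, base_env=None):
    """Return a copy of the environment with the IR dump directory set."""
    env = dict(os.environ if base_env is None else base_env)
    # Overwrites any existing MODULAR_DEBUG value; combining it with other
    # debug options is not covered by this sketch.
    env["MODULAR_DEBUG"] = f"ir-output-dir={dump_dir}"
    return env


def list_dumped_files(dump_dir):
    """List every file under dump_dir, relative to it, in sorted order."""
    root = Path(dump_dir)
    return sorted(
        str(p.relative_to(root)) for p in root.rglob("*") if p.is_file()
    )


dump_dir = tempfile.mkdtemp(prefix="max-ir-")
env = env_with_ir_dump(dump_dir)

# Hypothetical launch; replace with your own entry point:
# subprocess.run([sys.executable, "run_model.py"], env=env, check=True)

print(env["MODULAR_DEBUG"])
print(list_dumped_files(dump_dir))  # empty until a compile has run
```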
Mojo language
For all the updates to the Mojo language, standard library, and tools, see the Mojo release notes.