v25.1 (2025-02-13)

✨ Highlights

  • Custom ops for GPUs

    Our new custom op API allows you to extend MAX Engine with new graph operations written in Mojo that execute on either CPU or GPU, providing full composability and extensibility for your models. See more in the section about GPU programming.

  • Enhanced support for agentic workflows

    MAX Serve now supports function calling, which allows you to instruct your model to interact with other systems, such as retrieving data and executing external tasks. Learn more about function calling and tool use.

    MAX Serve now supports structured output (also known as constrained decoding) for MAX models on GPU. This allows you to enforce the output format from a model using an input schema that defines the output structure. Learn more about structured output.

  • Extended model architecture support

    This release adds support for more model architectures, including multimodal models (Pixtral and Llama 3.2 Vision), an embedding model (MPNet), and the Qwen2 architecture. See the MAX models section below for details.

  • New max-pipelines CLI tool

    Instead of cloning our GitHub repo to access our latest GenAI models, you can now install the max-pipelines CLI tool and quickly run inference or deploy an endpoint.

Documentation

This release also includes new tutorials and other documentation updates.

MAX Serve

  • The /v1/completions REST endpoint now supports:

    • Pre-tokenized prompts.
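
      For example, here's a minimal sketch that follows the OpenAI-style convention of passing the prompt as an array of token IDs (the model path and token IDs below are illustrative, not from the release notes):

      curl http://localhost:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "meta-llama/Llama-3.1-8B-Instruct",
          "prompt": [128000, 9906, 11, 1917, 0],
          "max_tokens": 32
        }'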

    • Image inputs for multimodal models such as Llama-3.2-11B-Vision-Instruct. For an example, see how to generate image descriptions with Llama 3.2 Vision.

      Known issue: You might receive faulty results because some parts of the text prompt get ignored for certain input combinations. We've identified the problem and will have a fix in a subsequent nightly release.

    • Function calling and tool use, which allows you to instruct your model to interact with other systems, such as retrieving data and executing external tasks. Learn more about function calling and tool use.

    • Structured output (also known as constrained decoding), which allows you to enforce the output format from a model using a JSON schema and the response_format field. To enable constrained decoding, pass --enable-structured-output when running the server. However, this feature currently works for MAX models on GPU only (support for PyTorch models and CPU is in progress). Learn more about structured output.
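
      For example, here's a minimal sketch of a request that constrains the model to emit a JSON object, assuming the server was started with --enable-structured-output and accepts an OpenAI-style response_format field (the model path and schema below are illustrative, and the exact request shape may differ):

      curl http://localhost:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "meta-llama/Llama-3.1-8B-Instruct",
          "prompt": "Name a city and its country as JSON.",
          "response_format": {
            "type": "json_schema",
            "json_schema": {
              "name": "city",
              "schema": {
                "type": "object",
                "properties": {
                  "city": {"type": "string"},
                  "country": {"type": "string"}
                },
                "required": ["city", "country"]
              }
            }
          }
        }'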

  • Added support for the /v1/embeddings API endpoint, allowing you to generate vector representations using embedding models. See how to deploy a text embedding model.
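
    For example, here's a quick sketch against a locally running server, using the MPNet model mentioned below (the model path and input text are illustrative):

    curl http://localhost:8000/v1/embeddings \
      -H "Content-Type: application/json" \
      -d '{
        "model": "sentence-transformers/all-mpnet-base-v2",
        "input": "The quick brown fox jumps over the lazy dog."
      }'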

  • MAX Serve can now evict requests when the number of available pages in the PagedAttention KVCache is limited. Previously, the KV cache manager would throw an OOM error when a batch that couldn't fit in the cache was scheduled.

MAX models

  • Added the max-pipelines CLI tool, which simplifies running inference with GenAI models (specified with a Hugging Face repo ID) and deploying them to a local endpoint with MAX Serve.

    Previously, running or serving these models required cloning the modular/max GitHub repo and then running commands such as magic run llama3.

    Model-specific commands such as llama3 and replit have been removed. They're now standardized and subsumed by flags like --model-path in the max-pipelines tool. Arguments such as --max-length and --weight-path are still supported by max-pipelines.

    To view a list of supported model architectures from Hugging Face, run max-pipelines list.
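
    For example, here's a minimal sketch that serves a model on a local endpoint (the model path is illustrative, and gated models also require the HF_TOKEN environment variable):

    max-pipelines serve \
      --model-path=meta-llama/Llama-3.1-8B-Instruct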

  • Added support for PagedAttention, which improves memory efficiency by partitioning the KV cache into smaller blocks, reducing fragmentation and enabling larger inference batches. You can enable it with --cache-strategy=paged and set --kv-cache-page-size to a value that's a multiple of 128 (see the example under the next bullet).

  • Added support for prefix caching in all cases where PagedAttention is supported. This allows for more efficient use of the KVCache and improves prefill performance for workloads with common prefixes. You can enable it by setting --enable-prefix-caching. For more information, see Prefix caching with PagedAttention.
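
    For example, here's a sketch that serves a model with PagedAttention and prefix caching enabled (the model path is illustrative):

    max-pipelines serve \
      --model-path=meta-llama/Llama-3.1-8B-Instruct \
      --cache-strategy=paged \
      --kv-cache-page-size=128 \
      --enable-prefix-caching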

  • Batch size and max length are now inferred automatically: batch size from available memory, and max length from the Hugging Face model's default value. If a configuration leads to an OOM error, we provide recommendations (to the best of our ability) to help the user fit the model into memory.

  • Added support for heterogeneous KV caches for multimodal models, such as Llama Vision, which cache different KV states for self-attention and cross-attention layers.

  • Added support for embedding models, starting with MPNet. For example:

    max-pipelines generate \
      --model-path=sentence-transformers/all-mpnet-base-v2 \
      --prompt="Encode this sentence."

    Also see how to deploy a text embedding model.

  • Added support for image and text multimodal models:

    • max-pipelines generate now accepts image input with --image_url.

    • Added an experimental Pixtral pipeline you can run as follows:

      max-pipelines generate \
        --model-path=mistral-community/pixtral-12b \
        --prompt="What is in this image? [IMG]" \
        --image_url=http://picsum.photos/1024/1024

      The pipeline is automatically used for all models implementing the LlavaForConditionalGeneration architecture.

      The implementation is currently limited to one image. We plan to support an arbitrary number of images of mixed sizes soon.

    • Added an experimental Llama Vision pipeline you can run as follows:

      max-pipelines generate \
        --model-path=meta-llama/Llama-3.2-11B-Vision-Instruct \
        --prompt="<|image|><|begin_of_text|>What is in this image?" \
        --image_url=http://picsum.photos/1024/1024

      The pipeline is automatically used for all models implementing the MllamaForConditionalGeneration architecture.

      Note: This model is gated and requires that you set the HF_TOKEN environment variable. See Llama-3.2-11B-Vision-Instruct.

    • See how to generate image descriptions with Llama 3.2 Vision.

  • Added support for the Qwen2ForCausalLM model architecture (such as Qwen/Qwen2.5-7B-Instruct). For example:

    max-pipelines generate \
      --model-path=Qwen/Qwen2.5-7B-Instruct \
      --prompt="Write bubble sort in python" \
      --quantization-encoding bfloat16

  • Added support for offline batched inference for text-based LLMs, allowing you to load a model and run inference with a batch of inputs directly from Python, instead of relying on an HTTP interface. For an example, see examples/offline-inference/basic.py.

  • The --max-cache-batch-size flag has been deprecated in favor of --max-batch-size. Using --max-cache-batch-size now emits a deprecation warning and will stop working in a future release.

  • The --use-gpu flag has been deprecated in favor of --devices=cpu, --devices=gpu, or --devices=gpu-0,gpu-1,.... If the device isn't specified, the model runs on the first available GPU, or CPU if no GPUs are available.
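
    For example, here's a sketch that runs the earlier Qwen2.5 command on two specific GPUs (it assumes at least two GPUs are available):

    max-pipelines generate \
      --model-path=Qwen/Qwen2.5-7B-Instruct \
      --prompt="Write bubble sort in python" \
      --quantization-encoding bfloat16 \
      --devices=gpu-0,gpu-1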

MAX Engine

  • Improved internal kernel compilation speed by 1.5x to 4x across different models.

    We've revamped our GPU compilation process so that all kernels in a program are compiled together into a single LLVM module, then split into separate kernels afterward. This ensures shared code between kernel entry points is only compiled once. For example, we observe a 3.7x speedup in GPU startup time for Llama3.1-8b.

  • Improved initial model execution speed on NVIDIA GPUs.

    Instead of compiling to PTX and performing just-in-time compilation during runtime, we now generate CUBIN binaries directly. While this increases initial compilation time, it significantly improves execution speed.

  • The kernels have been further tuned for performance on NVIDIA A100 GPUs.

Graph APIs

  • You can now write custom operations (ops) in Mojo, and add them to a graph constructed in Python, using custom() and inplace_custom().

    For more detail, see the section below about GPU programming.

  • Cached compiled MAX graphs that use custom operations are now invalidated when the implementation of those custom operations changes.

  • Graph.add_weight() now takes an explicit device argument. This enables explicitly passing GPU-resident weights to session.load() via the weights registry to initialize the model.

  • max.graph.Weight now inherits from TensorValue, allowing you to call weight.cast() or weight.T. As such, TensorValue no longer accepts a Weight for its value argument.

Pipeline APIs

  • TextTokenizer.new_context() now supports tool definitions passed through its request argument (via TokenGeneratorRequest.tools).

  • Removed the default num_steps value for TokenGenerator.next_token(), ensuring users pass a value, reducing the potential for silent errors.

  • KVCacheStrategy now defaults to MODEL_DEFAULT.

    Whereas the previous default always used the "continuous" caching strategy, the KV caching strategy is now chosen on a per-architecture basis to ensure the most optimized strategy is used for each model.

  • The Linear layer now has a create() class method that automatically creates specializations of Linear for non-quantized, k-quant, or GPTQ layers.

  • Added nn.Conv1D for audio models like Whisper.

GPU programming

This release includes all-new APIs for programming GPUs. The way to write code for GPUs is to create custom operations with GPU functions that you can load into a MAX graph. This foundational API includes a few key components:

  • Mojo APIs to write custom op functions:

    • The @compiler.register decorator is applied to a Mojo struct that implements a custom op in an execute() function (for either CPU or GPU) and a shape() function that defines the shape of the custom op's output tensor.

    • The max.tensor package adds essential Mojo APIs for writing custom ops, such as:

      • The foreach() function, which efficiently executes an element-wise computation in parallel on either a GPU or CPU.

      • The ManagedTensorSlice type defines the input and output tensors for the custom op.

  • Python APIs to load custom ops into a model:

    • The custom() and inplace_custom() functions allow you to add the previously-defined Mojo custom op to a MAX graph written in Python.

    • The InferenceSession constructor accepts the custom op implementation as a Mojo package in the custom_extensions argument.

For more detail, see the tutorial to build custom ops for GPUs.

Additionally, we've added a new gpu package to the Mojo standard library that provides low-level programming constructs for working with GPUs. These APIs let you do things that you can't currently do with the high-level foreach() abstraction above. The Mojo gpu APIs allow you to manually manage interaction between the CPU host and GPU device, manage memory between devices, synchronize threads, and more. For some examples, see vector_addition.mojo and top_k.mojo.

Mojo

Mojo is a crucial component of the MAX stack that enables all of MAX's performance-oriented code across hardware. For all the updates to the Mojo language, standard library, and tools, see the Mojo changelog.
