
Get started with MAX Engine

Welcome to the MAX Engine setup guide!

Within a matter of minutes, you’ll install the MAX SDK Preview and run inference with some of our code examples.

Preview release

We're excited to share this preview version of the MAX SDK! For details about what's included, see the MAX changelog, and for details about what's yet to come, see the roadmap and known issues.

Requirements

First, make sure your system meets these requirements:

  • Linux Ubuntu 20.04/22.04 LTS
  • x86-64 CPU (with SSE4.2 or newer) or AWS Graviton2/3 CPU
  • Minimum 8 GiB RAM
  • Python 3.8 - 3.11
  • g++/clang++ C++ compiler

We'll add support for macOS and Windows in future releases.
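
If you want to quickly check a machine against this list, the following purely illustrative Python snippet (not part of the MAX SDK) reads standard Linux interfaces to report the relevant details:

    # Illustrative requirements check for Linux; not part of the MAX SDK.
    import platform
    import sys

    print("Arch:", platform.machine())        # expect x86_64, or aarch64 on Graviton
    print("Python:", sys.version.split()[0])  # expect 3.8 - 3.11

    with open("/proc/cpuinfo") as f:
        cpu_flags = f.read()
    print("SSE4.2:", "yes" if "sse4_2" in cpu_flags else "no (expected on Graviton)")

    with open("/proc/meminfo") as f:
        mem_kib = int(f.readline().split()[1])      # first line is "MemTotal: <n> kB"
    print(f"RAM: {mem_kib / 1024 / 1024:.1f} GiB")  # minimum 8 GiB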

1. Install the MAX SDK

By downloading the MAX SDK, you understand and agree to the MAX software license.

Updating?

If you already installed MAX, instead see the update guide.

  1. Open a terminal and install the modular command line tool with this helper script:

    curl -s https://get.modular.com | sh -

    Or, run these manual install commands instead:

    apt-get install -y apt-transport-https &&
    keyring_location=/usr/share/keyrings/modular-installer-archive-keyring.gpg &&
    curl -1sLf 'https://dl.modular.com/bBNWiLZX5igwHXeu/installer/gpg.0E4925737A3895AD.key' | gpg --dearmor >> ${keyring_location} &&
    curl -1sLf 'https://dl.modular.com/bBNWiLZX5igwHXeu/installer/config.deb.txt?distro=debian&codename=wheezy' > /etc/apt/sources.list.d/modular-installer.list &&
    apt-get update &&
    apt-get install -y modular
  2. Sign in to your Modular account:

    modular auth
  3. Install the MAX SDK:

    modular install max
  4. Install the MAX Engine Python package:

    MAX_PATH=$(modular config max.path) \
    && python3 -m pip install --find-links $MAX_PATH/wheels max-engine
  5. Set environment variables so you can access the max and mojo CLIs:

    If you're using Bash, run this command:

    MAX_PATH=$(modular config max.path) \
    && BASHRC=$( [ -f "$HOME/.bash_profile" ] && echo "$HOME/.bash_profile" || echo "$HOME/.bashrc" ) \
    && echo 'export MODULAR_HOME="'$HOME'/.modular"' >> "$BASHRC" \
    && echo 'export PATH="'$MAX_PATH'/bin:$PATH"' >> "$BASHRC" \
    && source "$BASHRC"

Okay, the MAX SDK is now installed and configured!

The MAX SDK includes the MAX Engine runtime; the Python, C, and Mojo API bindings; the max CLI tool, which you can use to benchmark and visualize your models; and the complete Mojo SDK, including the mojo CLI tool.
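
For a quick sanity check that the Python package installed correctly, you can try importing the engine module (this is just an illustrative check, not a required step):

    # Verify that the MAX Engine Python package is importable.
    from max import engine

    print(engine.InferenceSession)  # should print the class, not raise ImportError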

note

To help us improve MAX, we collect some telemetry data and crash reports. Learn more.

2. Run your first model

Let's start with something boring, similar to a "Hello world," just to make sure MAX Engine is working.

First, clone the code examples:

git clone https://github.com/modularml/max.git

Now let's run inference using a TorchScript model and our Python API. We'll start with a version of BERT that's trained to predict the masked words in a sentence.

  1. Starting from where you cloned the repo, go into the example and install the Python requirements:

    cd max/examples/inference/bert-python-torchscript
    python3 -m pip install -r requirements.txt
  2. Download and run the model with this script:

    bash run.sh

    This script downloads the BERT model and runs it with some input text.

You should see results like this:

input text: Paris is the [MASK] of France.
filled mask: Paris is the capital of France.

Cool, it works! (If it didn't work, let us know.)

Compile time

The first time you run an example, it will take some time to compile the model. This might seem strange if you're used to "eager execution" in ML frameworks, but this is where MAX Engine optimizes the graph to deliver more performance. This happens only when you load the model, and it's an up-front cost that pays dividends with major latency savings at run time.

This wasn't meant to blow your mind with performance. It's just an API example that shows how to use our Python API to load and run a model, so there's no benchmark measurement.

Rest assured, MAX Engine does execute models very fast, without any changes to the models. To see how MAX Engine compares when executing different models on different CPU architectures, see our performance dashboard.

Figure 1. MAX Engine latency speed-up when running Mistral-7B vs PyTorch (MAX Engine is 2.5x faster).

But seeing is believing. So, we created a program that compares our performance to PyTorch.

3. Run the performance showcase

The premise for this program is simple: It runs the same model (downloaded from HuggingFace) in PyTorch and MAX Engine, and measures the average execution time over several inferences.
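
If it helps to see what that measurement means concretely, here's a minimal, framework-agnostic sketch of how queries per second (QPS) is computed; the run_inference placeholder and the run count are hypothetical, and the real logic lives in the showcase's run.py:

    import time

    NUM_RUNS = 100  # hypothetical run count; run.py picks its own

    def run_inference():
        # Placeholder for one forward pass through the model.
        time.sleep(0.01)

    start = time.perf_counter()
    for _ in range(NUM_RUNS):
        run_inference()
    elapsed = time.perf_counter() - start

    print(f"QPS: {NUM_RUNS / elapsed:.2f}")  # queries per second; higher is better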

Let's go!

  1. Starting again from where you cloned the repo, change directories and install the requirements:

    cd max/examples/performance-showcase
    python3 -m pip install -r requirements.txt
  2. Now start the showcase by specifying the model to run:

    python3 run.py -m roberta

This might take a few minutes the first time you run it.

When it's done, you'll see the inference queries per second (QPS; higher is better) listed for each runtime, like this (results vary based on hardware):

Running with PyTorch
.............................................................. QPS: 18.41

Running with MAX Engine
Compiling model.
Done!
.............................................................. QPS: 33.11

MAX Performance

There are no tricks here! (See the code for yourself.) MAX Engine wins because our compiler uses next-generation technology to optimize the graph and extract more performance, without any accuracy loss. And our performance will only get faster and faster in future versions! If you got slow results, see this answer.

To start using MAX Engine in your own project, just drop in the MAX Engine API and start calling it for each inference request. For details, see how to run inference with Python or with C.
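
For a rough idea of what that looks like in Python, here's a minimal sketch; the model path and input names are placeholders, a TorchScript model may need extra load options such as input specs, and the Python inference guide is the source of truth for the details:

    import numpy as np
    from max import engine

    # Load (and compile) the model once at startup; this is the up-front cost
    # described above.
    session = engine.InferenceSession()
    model = session.load("path/to/model.onnx")  # placeholder; a TorchScript model may also need input specs

    # Then call execute() for each inference request, with whatever inputs
    # your model expects (these names and shapes are placeholders).
    outputs = model.execute(
        input_ids=np.zeros((1, 128), dtype=np.int64),
        attention_mask=np.ones((1, 128), dtype=np.int64),
    )
    print(outputs)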

But, maybe you're thinking we're showing only the models that make us look good here. Well, see for yourself by benchmarking any model!

4. Benchmark any model

With the benchmark tool, you can benchmark any compatible model with an MLPerf scenario. It runs the model several times with generated inputs (or inputs you provide), and prints the performance results.

For example, here’s how to benchmark an example model from HuggingFace:

  1. Download the model with this script in our GitHub repo:

    cd max/examples/tools/common/resnet50-pytorch
    bash download-model.sh --output resnet50.torchscript
  2. Then benchmark the model:

    max benchmark resnet50.torchscript --input-data-schema=input-spec.yaml

This compiles the model, runs it several times, and prints the benchmark results. (Again, it might take a few minutes to compile the model before benchmarking it.)

The output is rather long, so this is just part of what you should see (your results will differ based on hardware):

================================================
Additional Stats
================================================
QPS w/ loadgen overhead : 44.024
QPS w/o loadgen overhead : 44.048

Min latency (ns) : 21909338
Max latency (ns) : 24319980
Mean latency (ns) : 22702682
50.00 percentile latency (ns) : 22698762
90.00 percentile latency (ns) : 23095239
95.00 percentile latency (ns) : 23212431
97.00 percentile latency (ns) : 23325674
99.00 percentile latency (ns) : 23489326
99.90 percentile latency (ns) : 24319980

Now try benchmarking your own model! Just be sure it's in one of our supported model formats.

Also be aware that the benchmark tool needs to know the model's input shapes so it can generate inputs, and not all models provide input shape metadata. If your model doesn't include that metadata, then you need to specify the input shapes. Or, you can provide your own input data in a NumPy file. Learn more in the benchmark guide.
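
For example, if you go the NumPy route, creating an input file is just a matter of saving an array with the shape and dtype your model expects; the shape below matches the ResNet-50 example above, and the exact way to pass the file to max benchmark is described in the benchmark guide:

    import numpy as np

    # One example input for ResNet-50: batch of 1, 3 channels, 224x224 pixels.
    pixel_values = np.random.rand(1, 3, 224, 224).astype(np.float32)
    np.save("input.npy", pixel_values)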

Share Feedback

We’d love to hear about your experience benchmarking other models. If you have any issues, let us know.

Next steps

That's not all you can do. There's plenty more documentation to explore in the MAX docs.

And this is just the beginning!

In the coming months, we'll add support for GPU hardware, more quantized models, MAX SDK for macOS and Windows, and solutions for production deployment with MAX.

Also, we're aware that MAX has some sharp edges, some features aren't quite done, and others don't exist yet. For details about the known issues and features we're working on, please see the roadmap and known issues.

Join the discussion

Get in touch with other MAX developers, ask questions, and share feedback on Discord and GitHub.