
Get started with MAX Engine

Welcome to the MAX Engine setup guide!

Within a matter of minutes, you’ll install the MAX SDK Preview and run inference with some of our code examples.

Preview release

We're excited to share this preview version of the MAX SDK! For details about what's included, see the MAX changelog, and for details about what's yet to come, see the roadmap and known issues.

Requirements

First, make sure your system meets these requirements:

  • Linux Ubuntu 20.04/22.04 LTS
  • x86-64 CPU (with SSE4.2 or newer) or AWS Graviton2/3 CPU
  • Minimum 8 GiB RAM
  • Python 3.8 - 3.11
  • g++/clang++ C++ compiler

We'll add support for macOS and Windows in future releases.

1. Install the MAX SDK

By downloading the MAX SDK, you understand and agree to the MAX software license.

Updating?

If you already installed MAX, see the update guide instead.

  1. Open a terminal and install the modular command line tool with this helper script:

    curl -s https://get.modular.com | sh -
    Or, run these manual install commands instead:
    apt-get install -y apt-transport-https &&
    keyring_location=/usr/share/keyrings/modular-installer-archive-keyring.gpg &&
    curl -1sLf 'https://dl.modular.com/bBNWiLZX5igwHXeu/installer/gpg.0E4925737A3895AD.key' | gpg --dearmor >> ${keyring_location} &&
    curl -1sLf 'https://dl.modular.com/bBNWiLZX5igwHXeu/installer/config.deb.txt?distro=debian&codename=wheezy' > /etc/apt/sources.list.d/modular-installer.list &&
    apt-get update &&
    apt-get install -y modular
  2. Sign into your Modular account:

    modular auth
  3. Install the MAX SDK:

    modular install max
  4. Install the MAX Engine Python package:

    MAX_PATH=$(modular config max.path) \
    && python3 -m pip install --find-links $MAX_PATH/wheels max-engine
  5. Set environment variables so you can access the max and mojo CLIs:

    If you're using Bash, run this command:

    MAX_PATH=$(modular config max.path) \
    && BASHRC=$( [ -f "$HOME/.bash_profile" ] && echo "$HOME/.bash_profile" || echo "$HOME/.bashrc" ) \
    && echo 'export MODULAR_HOME="'$HOME'/.modular"' >> "$BASHRC" \
    && echo 'export PATH="'$MAX_PATH'/bin:$PATH"' >> "$BASHRC" \
    && source "$BASHRC"

Okay, the MAX SDK is now installed and configured!

The MAX SDK includes:

  • The MAX Engine runtime
  • The Python, C, and Mojo API bindings
  • The max CLI tool, which you can use to benchmark and visualize your models
  • The complete Mojo SDK, including the mojo CLI tool
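
To quickly confirm that the Python package installed correctly, try importing it. This is a minimal sanity check, and it assumes the max-engine wheel exposes the max.engine module described in our Python API documentation:

    # sanity_check.py -- minimal check that the MAX Engine Python package imports.
    # Assumes the max.engine module and InferenceSession class from the Python API docs.
    from max import engine

    session = engine.InferenceSession()
    print("MAX Engine Python API is ready:", session)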

2. Run your first model

Let's start with something boring, similar to a "Hello world," just to make sure MAX Engine is working.

First, clone the code examples:

git clone https://github.com/modularml/max.git

Now run an example that performs inference with our Python API, using a model from PyTorch or TensorFlow. The one below uses PyTorch (TorchScript):

This example uses a version of BERT that's trained to predict the masked words in a sentence.

  1. Starting from where you cloned the repo, go into the example and install the Python requirements:

    cd max/examples/inference/bert-python-torchscript
    python3 -m pip install -r requirements.txt
  2. Download and run the model with this script:

    bash run.sh

    This script downloads the BERT model and runs it with some input text.

You should see results like this:

input text: Paris is the [MASK] of France.
filled mask: Paris is the capital of France.

Cool, it works! (If it didn't work, let us know.)

Compile time

The first time you run an example, it will take some time to compile the model. This might seem strange if you're used to "eager execution" in PyTorch or TensorFlow, but this is where MAX Engine optimizes the graph to deliver more performance. This happens only when you load the model, and it's an up-front cost that pays dividends with major latency savings at run time.
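
If you're curious, you can see this cost yourself by timing the load step separately from inference. Here's a minimal sketch, assuming the InferenceSession and load names from our Python API guide, with a placeholder model path:

    import time
    from max import engine

    session = engine.InferenceSession()

    start = time.perf_counter()
    model = session.load("path/to/model")  # graph compilation happens here, once per load
    print(f"Load + compile: {time.perf_counter() - start:.1f} s")
    # Subsequent model.execute() calls run on the already-compiled graph.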

This wasn't meant to blow your mind with performance. It's just an API example that shows how to use our Python API to load and run a PyTorch (TorchScript) or TensorFlow (SavedModel) model, so there's no benchmark measurement.

Rest assured, MAX Engine does execute models very fast: sometimes more than 3x faster than the stock frameworks, without any model changes. For example, check out our performance with Mistral 7B in figure 1.

Figure 1. MAX Engine latency with Mistral-7B vs PyTorch (lower is better). P50, P90, P95, and P99 latencies are the average latency for the 50th, 90th, 95th, and 99th percentile of inferences across a fixed period of time. For more charts like this, check out performance.modular.com.

But seeing is believing. So, we created a program that runs MAX Engine head-to-head with TensorFlow and PyTorch.

3. Run the performance showcase

The premise for this program is simple: It runs the same model (downloaded from HuggingFace) in TensorFlow, PyTorch, and MAX Engine, and measures the average execution time over several inferences.
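
Under the hood, each measurement boils down to timing a fixed number of inference calls and dividing by the elapsed time. Here's a rough sketch of that idea (run_inference is a hypothetical stand-in for any one framework's predict call, not a function from the showcase code):

    import time

    def measure_qps(run_inference, warmup=10, iterations=100):
        # Warm up first so one-time costs (caching, lazy init) don't skew the result.
        for _ in range(warmup):
            run_inference()
        start = time.perf_counter()
        for _ in range(iterations):
            run_inference()
        elapsed = time.perf_counter() - start
        return iterations / elapsed  # queries per second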

Let's go!

  1. Starting again from where you cloned the repo, change directories and install the requirements:

    cd max/examples/performance-showcase
    python3 -m pip install -r requirements.txt
  2. Now start the showcase by specifying the model to run:

    python3 run.py -m roberta

This might take a few minutes the first time you run it.

When it's done, you'll see the inference queries per second (QPS; higher is better) listed for each runtime, like this (results vary based on hardware):

Running with TensorFlow
.............................................................. QPS: 15.07

Running with PyTorch
.............................................................. QPS: 18.41

Running with MAX Engine
Compiling model.
Done!
.............................................................. QPS: 33.11

MAX Performance

There are no tricks here! (See the code for yourself.) MAX Engine wins because our compiler uses next-generation technology to optimize the graph and extract more performance, without any accuracy loss. And our performance will only get faster and faster in future versions! If you got slow results, see this answer.

To start using MAX Engine in your own project, just drop in the MAX Engine API and start calling it for each inference request. For details, see how to run inference with Python or with C.
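
As a rough illustration, here's what that drop-in pattern can look like with our Python API. This is a sketch rather than code from the repo: it assumes the InferenceSession, load, and execute names from the Python inference guide, and the model path and input tensor name are placeholders you'd replace with your own:

    import numpy as np
    from max import engine

    session = engine.InferenceSession()

    # Load (and compile) the model once at startup.
    # TorchScript models may also need input specs -- see the Python API guide.
    model = session.load("path/to/your_model")

    def handle_request(input_array: np.ndarray):
        # Run one inference per request; the keyword must match your
        # model's input tensor name ("input" here is just a placeholder).
        return model.execute(input=input_array)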

But, maybe you're thinking we're showing only the models that make us look good here. Well, see for yourself by benchmarking any model!

4. Benchmark any model

With the benchmark tool, you can benchmark any compatible model with an MLPerf scenario. Just pass it a TensorFlow SavedModel, PyTorch TorchScript, or ONNX model; the tool runs the model several times with generated inputs (or inputs you provide) and prints the results.

note

TensorFlow models must be in SavedModel format and PyTorch models must be in TorchScript format.

For example, here’s how to benchmark an example model from HuggingFace:

  1. Download the model with this script in our GitHub repo:

    cd max/examples/tools/common/resnet50-tensorflow
    bash download-model.sh --output resnet50
  2. Then benchmark the model:

    max benchmark resnet50

This compiles the model, runs it several times, and prints the benchmark results. (Again, it might take a few minutes to compile the model before benchmarking it.)

The results are a bit long, so this is just part of what you should see (results vary based on hardware):

================================================
Additional Stats
================================================
QPS w/ loadgen overhead : 79.653
QPS w/o loadgen overhead : 79.718

Min latency (ns) : 12261815
Max latency (ns) : 17839301
Mean latency (ns) : 12544217
50.00 percentile latency (ns) : 12502976
90.00 percentile latency (ns) : 12726310
95.00 percentile latency (ns) : 12824830
97.00 percentile latency (ns) : 12919271
99.00 percentile latency (ns) : 13486430
99.90 percentile latency (ns) : 17839301

Now try benchmarking your own model!

Just be aware that the benchmark tool needs to know the model's input shapes so it can generate inputs, and not all models provide input shape metadata. If your model doesn't include that metadata, then you need to specify the input shapes. Or, you can provide your own input data in a NumPy file. Learn more in the benchmark guide.
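
For example, here's one way you might generate an input file with NumPy (the shape, dtype, and file name below are placeholders for illustration, not values the benchmark tool requires):

    import numpy as np

    # One ResNet-50-style image batch: adjust the shape/dtype to match your model's input spec.
    batch = np.random.rand(1, 224, 224, 3).astype(np.float32)
    np.save("input_batch.npy", batch)
    # Then point the benchmark tool at this file, as described in the benchmark guide.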

Share Feedback

We’d love to hear about your experience benchmarking other models. If you have any issues, let us know.

Next steps

That's not all you can do. There's plenty more documentation to explore.

And this is just the beginning!

In the coming months, we'll add support for GPU hardware, more quantized models, MAX SDK for macOS and Windows, and more production-ready solutions in the Enterprise Edition.

Also, we're aware that MAX has some sharp edges, some features aren't quite done, and others don't exist yet. For details about the known issues and features we're working on, please see the roadmap and known issues.

Join the discussion

Get in touch with other MAX developers, ask questions, and share feedback on Discord and GitHub.