Run inference with Mojo

The Mojo API for MAX Engine brings the performance gains of the MAX Engine runtime into your application code by enabling you to build your entire inference pipeline in high-performance Mojo.

With any of the MAX Engine API libraries (also available in Python and C), you can run inference with models from TensorFlow, PyTorch, and ONNX at incredible speeds on a wide range of hardware.

As we'll show in more detail below, there are 3 essential lines of code that you need to execute your model:

from max import engine

fn main() raises:
    # Load your model:
    var session = engine.InferenceSession()
    var model = session.load_model(model_path)

    # Get the inputs, then run an inference:
    var outputs = model.execute(inputs)
    # Process the output here.

There's certainly more ceremony required to support these calls, but these are the basic APIs you need to know.

Now, let's walk through each step required to load and run the RoBERTa model from TensorFlow.

This is a preview

The Mojo API for MAX Engine is still in development and subject to change. Additionally, the Mojo language itself is still evolving, and some features you might expect from Python are not fully implemented yet (such as keyword-argument passing). Please share any issues you discover on GitHub.

Install the MAX SDK

Naturally, you need to install the MAX SDK, which includes the MAX Engine Mojo API and the Mojo SDK.

For instructions, see Get started with MAX Engine.

Import Mojo modules

Everything we need to run an inference with MAX Engine comes from the max.engine package. The rest are supporting APIs from the Mojo standard library.

from max import engine
from pathlib import Path
from python import Python
from tensor import Tensor, TensorShape, TensorSpec
from algorithm import argmax

Just in case you're new to Mojo, it's important to know that Mojo requires a main() function as the program entry point.

So, from here on out, imagine that all the code on this page goes inside this function:

fn main() raises:
    # The rest of the code goes here
note

We declare that the function raises because we're going to call other functions that may raise exceptions.

Load the model

First, let's load and compile the model in MAX Engine using an InferenceSession.

note

TensorFlow models must be in SavedModel format and PyTorch models must be in TorchScript format. Read more.

Define input specs (TorchScript only)

If you're using a PyTorch model (it must be in TorchScript format), you need to specify the input specifications for each of the model inputs before you can compile the model.

For each input, you need to create a TensorSpec, and pass it to LoadOptions.add_input_spec().

note

Although you need to specify all input shapes, the shapes can be dynamic: simply specify None for any dimension size that's dynamic.

Here's how you can specify the input specs for the RoBERTa model:

var batch = 1
var seqlen = 128
var input_ids_spec = TensorSpec(DType.int64, batch, seqlen)
var attention_mask_spec = TensorSpec(DType.int64, batch, seqlen)

var options = engine.LoadOptions()
options.add_input_spec(input_ids_spec)
options.add_input_spec(attention_mask_spec)

Then pass options to load_model() along with the model path, as shown below.

Load and compile the model

Now we instantiate an InferenceSession and then load-and-compile the model by passing the model path to load_model() (if you're loading a TorchScript model, also pass in options):

var model_path = "roberta"
var session = engine.InferenceSession()
var model = session.load_model(model_path)
compile time

Some models might take a few minutes to compile the first time you call load_model(), but this up-front cost pays dividends in the latency savings provided by our next-generation graph compiler.
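
If you're loading a TorchScript model, pass the LoadOptions you built earlier along with the model path. Here's a minimal sketch, assuming a hypothetical roberta.torchscript file:

# Hypothetical TorchScript path; the LoadOptions built above supply the
# input shapes that can't be inferred from the TorchScript file.
var ts_model_path = "roberta.torchscript"
var ts_model = session.load_model(ts_model_path, options)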

Prepare the input

This is your usual pre-processing step to prepare input for the model. For the RoBERTa model, we need to process the text input into a sequence of tokens.

Because Mojo is designed as a superset of Python, we can leverage all of the world's amazing Python libraries in our Mojo project. For this task, we need a string tokenizer, so let's import the 🤗 Transformers Python library, which includes transformers.AutoTokenizer:

prerequisite

Make sure you have this package installed:

python3 -m pip install transformers

# This is equivalent to `import transformers` in Python
var transformers = Python.import_module("transformers")

# Tokenize the input string
var INPUT = "There are many exciting developments in the field of AI Infrastructure!"
var HF_MODEL_NAME = "cardiffnlp/twitter-roberta-base-emotion-multilabel-latest"
var tokenizer = transformers.AutoTokenizer.from_pretrained(HF_MODEL_NAME)

# The tokenizer arguments are passed positionally because Mojo doesn't yet
# support keyword arguments when calling Python functions; they correspond to
# options such as padding='max_length', truncation=True, max_length=seqlen,
# and return_tensors='np'.
var inputs = tokenizer(INPUT, None, None, None, True, 'max_length', True,
                       seqlen, 0, False, None, 'np', True, None, False, False, False, False, True)
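
If you want to sanity-check the tokenized input before running the model, each value in the result is a NumPy array wrapped as a PythonObject, so you can inspect it through Mojo's Python interop. This check is purely illustrative and not part of the walkthrough:

# Illustrative only: attribute access like `.shape` passes through to Python.
print(inputs["input_ids"].shape)       # Expect (1, 128) with max_length padding
print(inputs["attention_mask"].shape)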

We're almost ready to run an inference.

Known issue

Depending on the model you use, importing the Python torch or tensorflow packages (or libraries that transitively import them, such as transformers) alongside the Mojo max.engine API might fail to compile. See here for details.

Run an inference

The tokenized input we get from Transformers is a dictionary in which each input name (each key) maps to a NumPy array (the input tensor). Unfortunately, Mojo currently doesn't have complete support for keyword arguments in functions, so we need to manually unpack this dictionary and pass each input to execute().

In this case, we're calling the overloaded version of execute() that accepts each input by name and PythonObject (each NumPy array):

var input_ids = inputs["input_ids"]
var token_type_ids = inputs["token_type_ids"]
var attention_mask = inputs["attention_mask"]
var outputs = model.execute("input_ids", input_ids,
                            "token_type_ids", token_type_ids,
                            "attention_mask", attention_mask)

The output from execute() is a TensorMap of named output tensors, which we'll now process to get our results.

Process the outputs

The Tensor type doesn't offer all the conveniences you might be used to with NumPy, so we currently need to write our own argmax function to get the top classification result from the output tensor. (The argmax functionality is coming soon to the Tensor type.)

Here's our version of an argmax function for a Tensor (this, of course, does not go inside the main() function):

def argmax_tensor(
    borrowed input: Tensor[DType.float32]
) -> Scalar[DType.float32]:
    # algorithm.argmax writes its result into a pre-allocated output buffer,
    # so we create a 1x1 tensor to receive it and return its single element.
    var output = Tensor[DType.float32](TensorShape(1, 1))

    argmax(input._to_ndbuffer[2](), -1, output._to_ndbuffer[2]())

    return output[0]

Then, back in our main() function, we call argmax_tensor(), passing it the "logits" tensor from the model outputs:

var logits = outputs.get[DType.float32]("logits")
var predicted_class_id = argmax_tensor(logits)

And, finally, to map the top classification ID to a label (the sentiment name), we'll again get some help from 🤗 Transformers TFAutoModelForSequenceClassification:

# Load the Hugging Face model so we can use its config to map IDs to labels
var hf_model = transformers.TFAutoModelForSequenceClassification.from_pretrained(HF_MODEL_NAME)
var classification = hf_model.config.id2label[predicted_class_id]
print("The sentiment is:", classification)
The sentiment is: joy

Boom! You're now executing models with Mojo! 🔥

caution

The Mojo API for MAX Engine is still in development and subject to change. Please report any issues on GitHub.

For a more stable experience, check out the Python API for MAX Engine.

For more details about the API, see the Mojo MAX Engine reference.