Run inference with Mojo

The Mojo API for MAX Engine helps extend the performance gains of the MAX Engine runtime into the rest of your application by enabling you to build your entire inference application in high-performance Mojo code.

With any of the MAX Engine API libraries (also available in Python and C), you can run inference with models from PyTorch and ONNX at incredible speeds on a wide range of hardware.

As we'll show in more detail below, there are three essential lines of code needed to execute your model:

from max import engine

def main():
    # Load your model:
    session = engine.InferenceSession()
    model = session.load(model_path)

    # Get the inputs, then run an inference:
    outputs = model.execute(inputs)
    # Process the output here.

There's certainly more ceremony required to support these calls, but these are the basic APIs you need to know.

Now, let's walk through each step required to load and run a BERT model from PyTorch.

Preview

The Mojo API for MAX Engine is still in development and subject to change. Additionally, the Mojo language itself is still evolving, and some features you might expect from Python are not fully implemented yet (such as keyword-argument passing). Please share any issues you discover on GitHub.

Prerequisites

  • You need the MAX SDK, which includes the MAX Engine API and the Mojo SDK.

    See the MAX install guide.

  • For this example, we also use the Python transformers library to get the BERT model and tokenize/decode the text.

    You can install the library with this command:

    python3 -m pip install transformers

Import Mojo modules

Everything we need to run an inference with MAX Engine comes from the max.engine package. The rest are supporting APIs from the Mojo standard library.

from max.engine import InputSpec, InferenceSession
from python import Python
from tensor import TensorSpec

In case you're new to Mojo, it's important to know that Mojo requires a main() function as the program entry point.

So, from here on out, imagine that all the code runs inside this function:

def main():
    # The rest of the code goes here

The first thing we want to do inside the main() function is load any Python modules we plan to use. In this case, we're going to use HuggingFace Transformers to encode/decode our text strings, so let's load that Python module:

# This is equivalent to `import transformers` in Python
transformers = Python.import_module("transformers")

From now on, the transformers variable behaves like the Python module of the same name, but it's still just a variable, scoped to the current function. If you want to use transformers in a function other than main(), you need to put this import line inside that function instead of main(), as shown in the sketch below.
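For example, here's a minimal sketch of a hypothetical helper function that performs its own import so it can use the library outside of main() (the function name and signature are our own, not part of this tutorial):

# Depending on your Mojo version, PythonObject might instead be imported
# from `python.object`.
from python import Python, PythonObject

def tokenize_text(text: String) -> PythonObject:
    # Import the module locally; the `transformers` variable created in
    # `main()` is not visible inside this function.
    transformers = Python.import_module("transformers")
    tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
    # Return the tokenized result as a Python object (a dict of NumPy arrays).
    return tokenizer(text, return_tensors="np")

This tutorial keeps everything in main(), so the single import above is all we need.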

Known issue

Importing the Python torch package, or other libraries that transitively import it (such as transformers), alongside the Mojo max.engine API might cause your model to fail compilation. See the known issues for details.

Load the model

First, let's load and compile the model in MAX Engine using an InferenceSession.

note

PyTorch models must be in TorchScript format.

Download the TorchScript model

You can download the BERT TorchScript model used in this tutorial from our GitHub repo with the following commands, which save the model to your current directory:

git clone https://github.com/modularml/max.git
python3 max/examples/inference/common/bert-torchscript/download-model.py \
    -o bert-mlm.torchscript --mlm

Define input specs (TorchScript only)

When you're using a PyTorch model (which must be in TorchScript format), you need to provide an input specification for each of the model inputs before you can compile the model.

To do so, create a list of TensorSpec values, one for each input, and pass the list to InferenceSession.load().

note

Although you need to specify all input shapes, the shapes can be dynamic: simply specify None for any dimension size that's dynamic.

Here's how you can specify the input specs for the BERT model:

batch = 1
seqlen = 128

input_ids_spec = TensorSpec(DType.int64, batch, seqlen)
token_type_ids_spec = TensorSpec(DType.int64, batch, seqlen)
attention_mask_spec = TensorSpec(DType.int64, batch, seqlen)
input_specs = List[InputSpec]()

input_specs.append(input_ids_spec)
input_specs.append(attention_mask_spec)
input_specs.append(token_type_ids_spec)

Next, you'll load the model with these input specs.

Load and compile the model

Now we instantiate an InferenceSession and then load and compile the model by passing the model path to load() (if you're loading a TorchScript model, also pass in input_specs):

model_path = "bert-mlm.torchscript"
session = InferenceSession()
model = session.load(model_path, input_specs=input_specs)
compile time

Some models might take a few minutes to compile the first time you call load(), but this up-front cost pays dividends later, thanks to the latency savings provided by our next-generation graph compiler.

Prepare the input

This is your usual pre-processing step to prepare input for the model. For the BERT model, we need to process the text input into a sequence of tokens.

Because Mojo provides full interoperability with Python, we can leverage the world's existing Python libraries in our Mojo projects. For this task, we need a string tokenizer, so we're using transformers.AutoTokenizer from the 🤗 Transformers API:

INPUT = String("Paris is the [MASK] of France.")

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "bert-base-uncased"
)

# Get the maximum sequence length from the model's output metadata
output_spec = model.get_model_output_metadata()[0]
max_seqlen = output_spec[1].value()[]

# Tokenize the input text
inputs = tokenizer(
    text=INPUT,
    add_special_tokens=True,
    padding="max_length",
    truncation=True,
    max_length=max_seqlen,
    return_tensors="np",
)

Run an inference

The tokenized input we get from Transformers is a dictionary that maps each input name (the key) to a NumPy array (the input tensor). Currently, Mojo doesn't have complete support for keyword arguments in functions, so we need to manually unpack this dictionary and pass each input to execute().

In this case, we're calling the overloaded version of execute() that accepts each input as a name and a PythonObject value (the NumPy array):

input_ids = inputs["input_ids"]
token_type_ids = inputs["token_type_ids"]
attention_mask = inputs["attention_mask"]

# Now we can run inference
outputs = model.execute("input_ids", input_ids,
    "token_type_ids", token_type_ids,
    "attention_mask", attention_mask)

The output from execute() is a TensorMap, which we'll now process to get our results.

Process the outputs

We'll again leverage the Transformers API to decode the predicted token:

logits = outputs.get[DType.float32]("result0")

# Find the index of the mask token
mask_idx = -1
for i in range(len(input_ids[0])):
    if input_ids[0][i] == tokenizer.mask_token_id:
        mask_idx = i

# Decode the predicted token
predicted_token_id = logits.argmax()[mask_idx]
decoded_result = tokenizer.decode(
    predicted_token_id,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

print("input text: ", INPUT)
print("filled mask: ", INPUT.replace("[MASK]", decoded_result))
input text:  Paris is the [MASK] of France.
filled mask:  Paris is the capital of France.
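If you've been following along, all of this code (apart from the imports) lives inside main() in a single Mojo source file. Assuming you saved it as bert.mojo (a placeholder name for this walkthrough), you can run the program with the Mojo CLI:

mojo bert.mojo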

Now you're running models with Mojo! 🔥

caution

The Mojo API for MAX Engine is still in development and subject to change. Please report any issues on GitHub.

For a more stable experience, check out the Python API for MAX Engine.

For more details about the API, see the Mojo MAX Engine reference.