Run inference with Mojo
The Mojo API for MAX Engine helps extend performance gains from the MAX Engine runtime into your application, by enabling you to build your entire inference application in high-performance Mojo code.
With any of the MAX Engine API libraries (also available in Python and C), you can run inference with models from TensorFlow, PyTorch, and ONNX at incredible speeds on a wide range of hardware.
As we'll show in more detail below, there are 3 essential lines of code that you need to execute your model:
from max import engine
fn main() raises:
    # Load your model:
    var session = engine.InferenceSession()
    var model = session.load_model(model_path)
    # Get the inputs, then run an inference:
    var outputs = model.execute(inputs)
    # Process the output here.
There's certainly more ceremony required to support these calls, but these are the basic APIs you need to know.
Now, let's walk through each step required to load and run the RoBERTa model from TensorFlow.
The Mojo API for MAX Engine is still in development and subject to change. Additionally, the Mojo language itself is still evolving, and some features you might expect from Python are not fully implemented yet (such as keyword-argument passing). Please share any issues you discover on GitHub.
Install the MAX SDK
Naturally, you need to install the MAX SDK, which includes the MAX Engine Mojo API and the Mojo SDK.
For instructions, see Get started with MAX Engine.
Import Mojo modules
Everything we need to run an inference with MAX Engine comes from the max.engine package. The rest are supporting APIs from the Mojo standard library.
from max import engine
from pathlib import Path
from python import Python
from tensor import Tensor, TensorShape, TensorSpec
from algorithm import argmax
Just in case you're new to Mojo, it's important to know that Mojo requires a main() function as the program entry point. So, from here on out, imagine that all the code on this page goes inside this function:
fn main() raises:
    # The rest of the code goes here
We declare that the function raises because we're going to call other functions that may raise exceptions.
Load the model
First, let's load and compile the model in MAX Engine using an InferenceSession.
TensorFlow models must be in SavedModel format and PyTorch models must be in TorchScript format. Read more.
Define input specs (TorchScript only)
If you're using a PyTorch model (in TorchScript format), you must specify the input spec for each model input before you can compile the model.
For each input, create a TensorSpec and pass it to LoadOptions.add_input_spec().
Although you need to specify all input shapes, the shapes can be dynamic: simply specify None for any dimension size that's dynamic.
Here's how you can specify the input specs for the RoBERTa model:
var batch = 1
var seqlen = 128
var input_ids_spec = TensorSpec(DType.int64, batch, seqlen)
var attention_mask_spec = TensorSpec(DType.int64, batch, seqlen)
var options = engine.LoadOptions()
options.add_input_spec(input_ids_spec)
options.add_input_spec(attention_mask_spec)
Then pass options to load_model() along with the model path, below.
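To make the idea of a dynamic dimension concrete, here is a small sketch in plain Python (not Mojo, and not the MAX Engine API) of how a spec shape with None entries can be read: None accepts any size, while fixed sizes must match exactly. The helper name shape_matches is hypothetical and exists only for illustration.

```python
from typing import Optional, Sequence


def shape_matches(spec: Sequence[Optional[int]], actual: Sequence[int]) -> bool:
    """Return True if `actual` is compatible with `spec`.

    A `None` entry in `spec` stands for a dynamic dimension and accepts any
    size. This is an illustrative helper, not part of MAX Engine.
    """
    if len(spec) != len(actual):
        return False  # The rank must match even when dimensions are dynamic.
    return all(s is None or s == a for s, a in zip(spec, actual))


# Fixed batch size of 1, dynamic sequence length.
spec_shape = (1, None)
print(shape_matches(spec_shape, (1, 128)))  # sequence length 128 is accepted
print(shape_matches(spec_shape, (1, 64)))   # so is sequence length 64
print(shape_matches(spec_shape, (2, 128)))  # batch size 2 is rejected
```

The trade-off is the usual one: a fully static shape gives a compiler the most information, while a dynamic dimension lets one compiled model accept inputs of different sizes.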
Load and compile the model
Now we instantiate an InferenceSession and then load and compile the model by passing the model path to load_model() (if you're loading a TorchScript model, also pass in options):
var model_path = "roberta"
var session = engine.InferenceSession()
var model = session.load_model(model_path)
Some models might take a few minutes to compile the first time you call load_model(), but this up-front cost pays dividends in the latency savings provided by our next-generation graph compiler.
Prepare the input
This is your usual pre-processing step to prepare input for the model. For the RoBERTa model, we need to process the text input into a sequence of tokens.
Because Mojo is designed as a superset of Python, we can leverage all of the world's amazing Python libraries in our Mojo project.
For this task, we need a string tokenizer, so let's use the 🤗 Transformers Python library, which includes transformers.AutoTokenizer.
Make sure you have this package installed:
python3 -m pip install transformers
With the package installed, import it and tokenize the input string:
# This is equivalent to `import transformers` in Python
var transformers = Python.import_module("transformers")
# Tokenize the input string
var INPUT = "There are many exciting developments in the field of AI Infrastructure!"
var HF_MODEL_NAME = "cardiffnlp/twitter-roberta-base-emotion-multilabel-latest"
var tokenizer = transformers.AutoTokenizer.from_pretrained(HF_MODEL_NAME)
# Arguments are passed by position because keyword arguments aren't fully supported yet.
# `seqlen` is the padded sequence length (128) defined with the input specs above.
var inputs = tokenizer(INPUT, None, None, None, True, 'max_length', True,
    seqlen, 0, False, None, 'np', True, None, False, False, False, False, True)
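The long positional call is easier to read next to the parameter names it stands in for. The sketch below, in plain Python, pairs each positional value with a tokenizer call parameter name. The name list reflects our understanding of the Transformers signature in releases from around the time this page was written; treat it as an assumption and check it against your installed version, because newer releases may add or reorder parameters.

```python
# Parameter names of the tokenizer call, in positional order.
# Assumption: this order matches the Transformers release you have installed.
PARAM_NAMES = [
    "text", "text_pair", "text_target", "text_pair_target",
    "add_special_tokens", "padding", "truncation", "max_length",
    "stride", "is_split_into_words", "pad_to_multiple_of",
    "return_tensors", "return_token_type_ids", "return_attention_mask",
    "return_overflowing_tokens", "return_special_tokens_mask",
    "return_offsets_mapping", "return_length", "verbose",
]

seqlen = 128
INPUT = "There are many exciting developments in the field of AI Infrastructure!"

# The same values that the Mojo call passes by position.
positional_args = [INPUT, None, None, None, True, "max_length", True,
                   seqlen, 0, False, None, "np", True, None,
                   False, False, False, False, True]

# Pair each value with its name to get the keyword-argument form.
tokenizer_kwargs = dict(zip(PARAM_NAMES, positional_args))

# Show the arguments that actually change the tokenizer's behavior here.
for name in ("padding", "truncation", "max_length", "return_tensors"):
    print(f"{name} = {tokenizer_kwargs[name]!r}")
```

If the name list is right for your version, the equivalent Python call would be tokenizer(**tokenizer_kwargs). In other words, the call pads or truncates the text to seqlen tokens and returns NumPy arrays.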
We're almost ready to run an inference.
Depending on the model you use, importing the Python torch and tensorflow packages, or libraries that transitively import them (such as transformers), along with the Mojo max.engine API might fail to compile.
See here for details.
Run an inference
The tokenized inputs we get from Transformers are returned as a dictionary in which each input name (each key) is mapped to a NumPy array (the input tensor).
Unfortunately, Mojo currently doesn't have complete support for keyword arguments in functions, so we need to manually unpack this dictionary and pass each input to execute().
In this case, we're calling the overloaded version of execute() that accepts each input by name and PythonObject (each NumPy array):
var input_ids = inputs["input_ids"]
var token_type_ids = inputs["token_type_ids"]
var attention_mask = inputs["attention_mask"]
var outputs = model.execute("input_ids", input_ids,
                            "token_type_ids", token_type_ids,
                            "attention_mask", attention_mask)
The output from execute() is a Mojo Tensor, which we'll now process to get our results.
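The manual unpacking used for the execute() call amounts to flattening a dictionary into alternating name and value arguments. Here is a minimal sketch of that pattern in plain Python. The helper flatten_named_inputs is hypothetical, and the token values are placeholders rather than real tokenizer output.

```python
def flatten_named_inputs(inputs, names):
    """Return [name1, value1, name2, value2, ...] for the requested names."""
    flat = []
    for name in names:
        flat.append(name)
        flat.append(inputs[name])  # Raises KeyError if an input is missing.
    return flat


# Stand-in for the tokenizer output; real values would be NumPy arrays.
inputs = {
    "input_ids": [[0, 1, 2]],       # placeholder token IDs
    "token_type_ids": [[0, 0, 0]],
    "attention_mask": [[1, 1, 1]],
}

args = flatten_named_inputs(
    inputs, ["input_ids", "token_type_ids", "attention_mask"]
)
print(args[0::2])  # the input names, in the order they would be passed
```

Listing the names explicitly, instead of iterating over the dictionary, keeps the argument order fixed and fails early if the tokenizer does not return an expected input.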
Process the outputs
The Tensor type doesn't offer all the same conveniences you might be used to with NumPy, so we currently need to create our own version of argmax so we can get the top classification result from the output tensor. (The argmax functionality is coming soon to the Tensor type.)
Here's our version of an argmax function for a Tensor (this, of course, does not go inside the main() function):
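def argmax_tensor(
    borrowed input: Tensor[DType.float32]
) -> Scalar[DType.float32]:
    # A 1x1 tensor to receive the result
    var output = Tensor[DType.float32](TensorShape(1, 1))
    # Write the index of the largest value along the last axis into `output`
    argmax(input._to_ndbuffer[2](), -1, output._to_ndbuffer[2]())
    return output[0]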
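For readers who want to see what this computes, here is the same idea in plain Python: for each row of logits, find the index of the largest value. This is only an illustration of the argmax operation, not the Mojo implementation, and the logits are placeholder numbers.

```python
def argmax_last_axis(rows):
    """Return, for each row, the index of its largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


# Placeholder logits for a batch of one input and four classes.
logits = [[-1.2, 0.3, 2.5, 0.1]]
print(argmax_last_axis(logits))  # prints [2]
```

The index returned is the class ID, which still has to be mapped to a human-readable label.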
Then, back in our main() function, we call argmax_tensor(), passing it the "logits" tensor from the model outputs:
var logits = outputs.get[DType.float32]("logits")
var predicted_class_id = argmax_tensor(logits)
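And, finally, to map the top classification ID to a label (the sentiment name), we'll again get some help from 🤗 Transformers, this time from TFAutoModelForSequenceClassification:
# Load the Hugging Face model to read its id2label mapping (assumed setup for `hf_model`)
var hf_model = transformers.TFAutoModelForSequenceClassification.from_pretrained(HF_MODEL_NAME)
var classification = hf_model.config.id2label[predicted_class_id]
print("The sentiment is:", classification)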
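The label lookup itself is an ordinary dictionary access. The plain-Python sketch below shows the step in isolation, including the conversion from a floating-point class ID (which is what an argmax over float32 values can produce) to an integer key. The labels are hypothetical placeholders, not the real labels of the RoBERTa model.

```python
# Hypothetical ID-to-label mapping; a real one comes from the model config.
id2label = {0: "label_a", 1: "label_b", 2: "label_c"}

# A class ID as it might come back from a float32 argmax result.
predicted_class_id = 2.0

# Convert to int so the key type matches the mapping's integer keys.
label = id2label[int(predicted_class_id)]
print("The sentiment is:", label)
```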
Boom! You're now executing models with Mojo! 🔥
The Mojo API for MAX Engine is still in development and subject to change. Please report any issues on GitHub.
For a more stable experience, check out the Python API for MAX Engine.
For more details about the API, see the Mojo MAX Engine reference.