
Run inference with Python

The Python API for MAX Engine lets you boost the runtime performance of PyTorch, TensorFlow, and ONNX models on a wide range of hardware, with just three lines of code (not counting the import):

from max import engine

# Load your model:
session = engine.InferenceSession()
model = session.load(model_path)

# Prepare the inputs, then run an inference:
outputs = model.execute(**inputs)

# Process the output here.

That's all you need! Everything else is the usual code to prepare your inputs and process the outputs.

But, it's always nice to see a fully working example. So the rest of this page shows how to run an inference using a version of RoBERTa from Cardiff NLP, which is a language model trained on tweets to perform sentiment analysis.

This example uses a PyTorch model (converted to TorchScript format), but it's just as easy to load a model from ONNX or TensorFlow (in SavedModel format).
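
For example, loading a model from an ONNX file looks like this (a minimal sketch; my_model.onnx is a placeholder path, and ONNX and TensorFlow models don't need any load options):

from max import engine

session = engine.InferenceSession()
onnx_model = session.load("my_model.onnx")  # hypothetical ONNX file path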

Try it

You can try this code yourself using the notebook in our GitHub repo.

Install the MAX Engine Python package

Naturally, you first need to install the max Python package. This package is not hosted on PyPI; it can only be installed with the modular CLI tool.

For instructions, see Get started with MAX Engine.

Import Python modules

To start coding, we need some libraries that help us get the model and process the input/output data.

prerequisite

Make sure you have these packages installed:

python3 -m pip install torch transformers

from pathlib import Path

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

from max import engine

Download the model

Now we download the RoBERTa model from HuggingFace and save it in the PyTorch TorchScript format.

HF_MODEL_NAME = "cardiffnlp/twitter-roberta-base-emotion-multilabel-latest"
hf_model = AutoModelForSequenceClassification.from_pretrained(HF_MODEL_NAME)

# Converting model to TorchScript
model_path = Path("roberta.torchscript")
batch = 1
seqlen = 128
inputs = {
    "input_ids": torch.zeros((batch, seqlen), dtype=torch.int64),
    "attention_mask": torch.zeros((batch, seqlen), dtype=torch.int64),
}
with torch.no_grad():
    traced_model = torch.jit.trace(
        hf_model, example_kwarg_inputs=dict(inputs), strict=False
    )

torch.jit.save(traced_model, model_path)
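
Optionally, you can sanity-check the saved file by loading it back with torch.jit.load and running it on the same zero-filled example inputs. The outputs aren't meaningful for all-zero tokens; this just confirms the trace executes, and it assumes the traced model returns a dictionary with a "logits" entry, as this one does:

# Optional sanity check: reload the TorchScript file and run it once
loaded_model = torch.jit.load(model_path)
with torch.no_grad():
    check_output = loaded_model(**inputs)
print(check_output["logits"].shape)  # expect (1, num_labels)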

Load the model

Then, we load and compile the model in MAX Engine using an InferenceSession.

note

TensorFlow models must be in SavedModel format and PyTorch models must be in TorchScript format. Read more.

Define input specs (TorchScript only)

If you're using a PyTorch model (it must be in TorchScript format), you need to declare an input specification for each of the model inputs before you can compile the model.

To define the input specs, you need to create a list of TorchInputSpec objects (one for each input tensor), and pass the list to TorchLoadOptions.

note

Although you must specify all input shapes, the shapes can be dynamic: simply specify None for any dimension size that's dynamic.

For example, here's how to declare the input specs for the RoBERTa TorchScript model:

# We use the same `inputs` that we used above to trace the model
input_spec_list = [
    engine.TorchInputSpec(shape=tensor.size(), dtype=engine.DType.int64)
    for tensor in inputs.values()
]
options = engine.TorchLoadOptions(input_spec_list)
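
And if, say, the batch size were dynamic, you could declare that dimension as None instead of a fixed size (a hypothetical variant for illustration only; this tutorial keeps the static options defined above):

# Hypothetical variant: dynamic batch dimension, fixed sequence length
dynamic_specs = [
    engine.TorchInputSpec(shape=[None, seqlen], dtype=engine.DType.int64)
    for _ in inputs  # one spec per input tensor
]
dynamic_options = engine.TorchLoadOptions(dynamic_specs)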

Then pass options to load() along with the model path, below.

Load and compile the model

Now we instantiate an InferenceSession and load the model (if you're loading a TensorFlow or ONNX model, you don't need the options argument):

session = engine.InferenceSession()
model = session.load(model_path, options)

That's two lines down, just one to go.

compile time

The first time you load a model, it might take a few minutes to compile it, but this up-front cost will pay dividends in latency savings provided by our next-generation graph compiler.

Prepare the input

This part is your usual pre-processing. For the RoBERTa model, we need to process the text input into a sequence of tokens, so we'll do that with transformers.AutoTokenizer.

First, let's take a look at the model's inputs:

for tensor in model.input_metadata:
    print(f'name: {tensor.name}, shape: {tensor.shape}, dtype: {tensor.dtype}')
name: input_ids, shape: [1, 128], dtype: DType.int64
name: attention_mask, shape: [1, 128], dtype: DType.int64

This tells us the model needs two inputs. (If your model shows a dimension size of None, that dimension is dynamic.)

INPUT="There are many exciting developments in the field of AI Infrastructure!"

tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
inputs = tokenizer(INPUT, return_tensors="pt", padding='max_length', truncation=True, max_length=seqlen)
print(inputs)
{'input_ids': tensor([[ 0, 970, 32, 171, 3571, 5126, 11, 5, 882, 9, 4687, 13469, 328, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}

Run an inference

Now for that third line of code, we pass the inputs to execute(). This function requires all inputs as keyword arguments, so we'll unpack the inputs dictionary as we pass it through:

outputs = model.execute(**inputs)
print(outputs)
{'result0': {'logits': array([[-3.7987795 , 0.49929366, -4.2877274 , -2.586396 , 2.9503963 , -2.112092 , 2.507424 , -4.4121118 , -4.9013515 , -2.147359 , -0.5741746 ]], dtype=float32)}}

That's it!

The output from execute() is a dictionary of output tensors, each as a NumPy ndarray. Let's now figure out what they say.
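
For this model there's a single result entry whose value is the logits array, so you can check its shape and dtype with plain NumPy attributes before decoding it (nothing MAX-specific here):

logits = outputs["result0"]["logits"]  # shape (1, 11): one logit per emotion label
print(logits.shape, logits.dtype)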

Process the outputs

Again, we'll use some help from the transformers library, this time to convert the predicted class ID into a human-readable label:

# Extract class prediction from output
predicted_class_id = outputs["result0"]["logits"].argmax(axis=-1)[0]
classification = hf_model.config.id2label[predicted_class_id]

print(f"The sentiment is: {classification}")
The sentiment is: joy

Ta-da! 🎉
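
If you also want a confidence score for that label, one option is a softmax over the logits (a quick sketch in plain NumPy; since this is a multi-label model, its model card may recommend a sigmoid instead, so treat this as an illustration):

import numpy as np

logits = outputs["result0"]["logits"][0]
probs = np.exp(logits - logits.max())  # numerically stable softmax
probs = probs / probs.sum()
print(f"{classification}: {probs[predicted_class_id]:.2%}")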

If you're running this notebook yourself, be aware that it doesn't illustrate MAX Engine's runtime performance. For actual benchmark results, try our benchmark tool or check out our performance dashboard.

For more details about the inferencing API, see the Python API reference.