Run inference with Python
The Python API for MAX Engine enables you to upgrade your runtime performance for PyTorch, TensorFlow, and ONNX models, on a wide range of hardware, with just three lines of code (not counting the import):
from max import engine
# Load your model:
session = engine.InferenceSession()
model = session.load(model_path)
# Prepare the inputs, then run an inference:
outputs = model.execute(**inputs)
# Process the output here.
That's all you need! Everything else is the usual code to prepare your inputs and process the outputs.
But, it's always nice to see a fully working example. So the rest of this page shows how to run an inference using a version of RoBERTa from Cardiff NLP, which is a language model trained on tweets to perform sentiment analysis.
This example uses a PyTorch model (converted to TorchScript format), but it's just as easy to load a model from ONNX or TensorFlow (in SavedModel format).
You can try this code yourself using the notebook in our GitHub repo.
Install the MAX Engine Python package
Naturally, you first need to install the max Python package.
This package is not hosted in a package repository (PyPI), and can only be installed with the modular CLI tool.
For instructions, see Get started with MAX Engine.
Import Python modules
To start coding, we need some libraries that help us get the model and process the input/output data.
Make sure you have these packages installed:
python3 -m pip install torch transformers
from pathlib import Path
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from max import engine
Download the model
Now we download the RoBERTa model from Hugging Face and save it in the PyTorch TorchScript format.
HF_MODEL_NAME = "cardiffnlp/twitter-roberta-base-emotion-multilabel-latest"
hf_model = AutoModelForSequenceClassification.from_pretrained(HF_MODEL_NAME)
# Converting model to TorchScript
model_path = Path("roberta.torchscript")
batch = 1
seqlen = 128
inputs = {
    "input_ids": torch.zeros((batch, seqlen), dtype=torch.int64),
    "attention_mask": torch.zeros((batch, seqlen), dtype=torch.int64),
}
with torch.no_grad():
    traced_model = torch.jit.trace(
        hf_model, example_kwarg_inputs=dict(inputs), strict=False
    )
torch.jit.save(traced_model, model_path)
Load the model
Then, we load and compile the model in MAX Engine using an InferenceSession.
TensorFlow models must be in SavedModel format and PyTorch models must be in TorchScript format.
Define input specs (TorchScript only)
If you're using a PyTorch model (it must be in TorchScript format), you need to specify the input specifications for each of the model inputs before you can compile the model.
To define the input specs, you need to create a list of TorchInputSpec objects (one for each input tensor), and pass the list to TorchLoadOptions.
Although you must specify all input shapes, the shapes can be dynamic: simply specify None for any dimension size that's dynamic.
For example, here's how to declare the input specs for the RoBERTa TorchScript model:
# We use the same `inputs` that we used above to trace the model
input_spec_list = [
    engine.TorchInputSpec(shape=tensor.size(), dtype=engine.DType.int64)
    for tensor in inputs.values()
]
options = engine.TorchLoadOptions(input_spec_list)
Then pass options to load() along with the model path, below.
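To make the dynamic-shape rule above concrete, here's a minimal sketch of how you might turn a static shape into one with None in the dynamic positions. The helper name with_dynamic_dims is hypothetical and not part of the MAX Engine API, and the sketch deliberately stops short of calling the engine: it only builds the shape list that you would then pass as the shape of an input spec.

```python
from typing import List, Optional, Sequence


def with_dynamic_dims(
    shape: Sequence[int], dynamic_axes: Sequence[int]
) -> List[Optional[int]]:
    """Return a copy of `shape` with the given axes replaced by None.

    Hypothetical helper for illustration only; negative axes count
    from the end, as with Python list indexing.
    """
    ndim = len(shape)
    normalized = set()
    for axis in dynamic_axes:
        if not -ndim <= axis < ndim:
            raise IndexError(f"axis {axis} is out of range for {ndim} dimensions")
        normalized.add(axis % ndim)
    return [None if i in normalized else dim for i, dim in enumerate(shape)]


# The traced example above used shape (batch, seqlen) = (1, 128).
# Marking axis 0 as dynamic leaves the batch size open.
static_shape = (1, 128)
dynamic_shape = with_dynamic_dims(static_shape, dynamic_axes=[0])
print(dynamic_shape)
```

Whether you make a dimension dynamic is a trade-off you decide per model: this sketch only shows the bookkeeping, not the effect on compilation.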
Load and compile the model
Now we instantiate an InferenceSession and load the model (if you're loading a TensorFlow or ONNX model, you don't need the options argument):
session = engine.InferenceSession()
model = session.load(model_path, options)
That's two lines down, just one to go.
The first time you load a model, it might take a few minutes to compile it, but this up-front cost will pay dividends in latency savings provided by our next-generation graph compiler.
Prepare the input
This part is your usual pre-processing.
For the RoBERTa model, we need to process the text input into a sequence of tokens, so we'll do that with transformers.AutoTokenizer.
First, let's take a look at the model's inputs:
for tensor in model.input_metadata:
    print(f'name: {tensor.name}, shape: {tensor.shape}, dtype: {tensor.dtype}')
This tells us the model needs two inputs. (If a dimension size is shown as None, that means the dimension is dynamic.)
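If you want to check your own input shapes against what the model reports, a small comparison like the following can help. This is a sketch under assumptions: shape_matches is a hypothetical helper, and it works on plain lists and tuples of dimension sizes rather than on MAX Engine objects. It treats None as a wildcard that accepts any size.

```python
from typing import Optional, Sequence


def shape_matches(spec: Sequence[Optional[int]], actual: Sequence[int]) -> bool:
    """Return True if `actual` is compatible with `spec`.

    A None entry in `spec` stands for a dynamic dimension and matches
    any size; every other entry must match exactly.
    """
    if len(spec) != len(actual):
        return False
    return all(s is None or s == a for s, a in zip(spec, actual))


print(shape_matches([None, 128], (4, 128)))  # dynamic batch accepts 4
print(shape_matches([1, 128], (4, 128)))     # static batch of 1 rejects 4
```

With the expected shapes known, we can tokenize the text to match them: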
INPUT = "There are many exciting developments in the field of AI Infrastructure!"
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
inputs = tokenizer(INPUT, return_tensors="pt", padding='max_length', truncation=True, max_length=seqlen)
print(inputs)
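If you're curious what padding to max_length and truncation do to a sequence, here's a simplified sketch in plain Python. It is not the tokenizer's implementation: pad_or_truncate is a hypothetical helper, the token ids and pad_id value are made up for illustration, and a real tokenizer also takes care of special tokens when it truncates.

```python
from typing import List, Tuple


def pad_or_truncate(
    token_ids: List[int], max_length: int, pad_id: int
) -> Tuple[List[int], List[int]]:
    """Force `token_ids` to exactly `max_length` entries.

    Returns (input_ids, attention_mask), where the mask is 1 for real
    tokens and 0 for padding.
    """
    kept = token_ids[:max_length]      # truncate if too long
    n_pad = max_length - len(kept)     # 0 if nothing to pad
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Illustrative ids only; pad_id is an assumed value for this sketch.
ids, mask = pad_or_truncate([0, 100, 200, 2], max_length=6, pad_id=1)
print(ids)
print(mask)
```

This is why both tensors printed above have the same fixed length: the attention mask tells the model which positions hold real tokens and which are padding.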
Run an inference
Now for that third line of code, we pass the inputs to execute(). This function requires all inputs as keyword arguments, so we'll unpack the inputs dictionary as we pass it through:
outputs = model.execute(**inputs)
print(outputs)
That's it!
The output from execute() is a dictionary of output tensors, each in an ndarray. Let's now figure out what they say.
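As an aside on the ** unpacking used in that call, the following sketch shows the mechanism with a stand-in function. fake_execute is hypothetical and has nothing to do with MAX Engine; it only demonstrates that unpacking a dictionary turns each key into a keyword argument name.

```python
def fake_execute(**kwargs):
    """Hypothetical stand-in that accepts keyword arguments only."""
    # Return the argument names it received, sorted for a stable result.
    return sorted(kwargs)


example_inputs = {
    "input_ids": [[0, 100, 2]],
    "attention_mask": [[1, 1, 1]],
}

# These two calls are equivalent:
names_unpacked = fake_execute(**example_inputs)
names_explicit = fake_execute(
    input_ids=example_inputs["input_ids"],
    attention_mask=example_inputs["attention_mask"],
)
print(names_unpacked == names_explicit)
```

The practical consequence is that the dictionary keys must match the input names the model expects.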
Process the outputs
Again, we'll use some help from the transformers library to convert the output ids to labels:
# Extract class prediction from output
predicted_class_id = outputs["result0"]["logits"].argmax(axis=-1)[0]
classification = hf_model.config.id2label[predicted_class_id]
print(f"The sentiment is: {classification}")
Ta-da! 🎉
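If you also want a confidence score next to the label, one common approach is to run the logits through a softmax before taking the argmax. Here's a minimal sketch of that idea in plain Python, so it runs without extra dependencies. The helper top_label, the logits, and the label names are all made up for illustration, and it assumes a single-label classifier; a multi-label model would typically be scored differently.

```python
import math
from typing import Dict, List, Tuple


def top_label(logits: List[float], id2label: Dict[int, str]) -> Tuple[str, float]:
    """Return the most likely label and its softmax probability."""
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical logits and labels, not output from the RoBERTa model.
label, prob = top_label(
    [0.1, 2.3, -1.0], {0: "negative", 1: "neutral", 2: "positive"}
)
print(f"{label} ({prob:.2f})")
```

The argmax picks the same class with or without the softmax; the softmax only adds a normalized score that is easier to read than a raw logit.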
If you're running this notebook yourself, beware that this notebook does not illustrate MAX Engine's runtime performance. For actual benchmark results, try our benchmark tool or check out our performance dashboard.
For more details about the inferencing API, see the Python API reference.