Run inference with Python
The Python API for MAX Engine enables you to upgrade your runtime performance for PyTorch and ONNX models, on a wide range of hardware, with just three lines of code (not counting the import):
from max import engine
# Load your model:
session = engine.InferenceSession()
model = session.load(model_path)
# Prepare the inputs, then run an inference:
outputs = model.execute(**inputs)
# Process the output here.
That's all you need! Everything else is the usual code to prepare your inputs and process the outputs.
But, it's always nice to see a fully working example. So the rest of this page shows how to run an inference using a version of RoBERTa from Cardiff NLP, which is a language model trained on tweets to perform sentiment analysis.
This example uses a PyTorch model (converted to TorchScript format), but it's just as easy to load a model from ONNX.
You can try this code yourself using the notebook in our GitHub repo.
Set up the project environment
After you install Magic, create a new Python project and install the dependencies:
magic init roberta-project && cd roberta-project
Add MAX and NumPy from conda:
magic add max "numpy<2.0"
Add PyTorch and Transformers from PyPI:
magic add --pypi "torch==2.2.2" "transformers==4.40.1"
Now you can start a shell in the environment and see your MAX version:
magic shell
python3 -c 'from max import engine; print(engine.__version__)'
Import Python modules
To start coding, we need the libraries that help us get the model and process the input/output data.
from pathlib import Path
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from max import engine
Download the model
Now we download the RoBERTa model from HuggingFace and save it in the PyTorch TorchScript format.
HF_MODEL_NAME = "cardiffnlp/twitter-roberta-base-emotion-multilabel-latest"
hf_model = AutoModelForSequenceClassification.from_pretrained(HF_MODEL_NAME)
# Converting model to TorchScript
model_path = Path("roberta.torchscript")
batch = 1
seqlen = 128
inputs = {
    "input_ids": torch.zeros((batch, seqlen), dtype=torch.int64),
    "attention_mask": torch.zeros((batch, seqlen), dtype=torch.int64),
}
with torch.no_grad():
    traced_model = torch.jit.trace(
        hf_model, example_kwarg_inputs=dict(inputs), strict=False
    )
torch.jit.save(traced_model, model_path)
Load the model
Then, we load and compile the model in MAX Engine using an InferenceSession.
PyTorch models must be in TorchScript format. Read more.
Define input specs (TorchScript only)
If you're using a PyTorch model (it must be in TorchScript format), you need to specify the input specifications for each of the model inputs before you can compile the model.
To define the input specs, create a list of TorchInputSpec objects (one for each input tensor), and pass the list to InferenceSession.load().
Although you must specify all input shapes, the shapes can be dynamic: simply specify None for any dimension size that's dynamic.
For example, here's how to declare the input specs for the RoBERTa TorchScript model:
# We use the same `inputs` that we used above to trace the model
input_spec_list = [
    engine.TorchInputSpec(shape=tensor.size(), dtype=engine.DType.int64)
    for tensor in inputs.values()
]
Then pass this list as the input_specs argument to load(), along with the model path, as shown in the next section.
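To make the meaning of a None dimension concrete, here is a small pure-Python sketch. The shape_matches helper is hypothetical and not part of the MAX API; it only models the idea that None accepts any size while an integer must match exactly.

```python
from typing import Optional, Sequence


def shape_matches(spec: Sequence[Optional[int]], actual: Sequence[int]) -> bool:
    """Return True if the concrete shape `actual` is compatible with `spec`.

    A None entry in `spec` stands for a dynamic dimension and accepts any
    size; an integer entry must match exactly.
    """
    if len(spec) != len(actual):
        return False
    return all(s is None or s == a for s, a in zip(spec, actual))


# Hypothetical spec: dynamic batch size, fixed sequence length of 128.
spec_shape = (None, 128)

print(shape_matches(spec_shape, (1, 128)))  # True
print(shape_matches(spec_shape, (8, 128)))  # True
print(shape_matches(spec_shape, (1, 64)))   # False
```

In this sketch only the batch dimension is dynamic, so any batch size is accepted but the sequence length must stay at 128.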
Load and compile the model
Now we instantiate an InferenceSession and load the model (if you're loading an ONNX model, you don't need the input_specs argument):
session = engine.InferenceSession()
model = session.load(model_path, input_specs=input_spec_list)
That's two lines down, just one to go.
The first time you load a model, it might take a few minutes to compile it, but this up-front cost will pay dividends in latency savings provided by our next-generation graph compiler.
Prepare the input
This part is your usual pre-processing.
For the RoBERTa model, we need to process the text input into a sequence of tokens, so we'll do that with transformers.AutoTokenizer.
First, let's take a look at the model's inputs:
for tensor in model.input_metadata:
    print(f'name: {tensor.name}, shape: {tensor.shape}, dtype: {tensor.dtype}')
This tells us the model needs 2 inputs. (If a dimension size is shown as None, that means it's dynamic.)
INPUT = "There are many exciting developments in the field of AI Infrastructure!"
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
inputs = tokenizer(INPUT, return_tensors="pt", padding='max_length', truncation=True, max_length=seqlen)
print(inputs)
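The padding='max_length' and truncation=True arguments above make every encoded input exactly seqlen tokens long, which matches the fixed shape used when tracing the model. As a rough illustration of that behavior, here is a simplified pure-Python sketch. The pad_or_truncate helper, the token IDs, and the pad ID are all made up for this example, and a real tokenizer also takes care of special tokens when truncating, which this sketch ignores.

```python
from typing import Dict, List


def pad_or_truncate(
    token_ids: List[int], max_length: int, pad_id: int
) -> Dict[str, List[int]]:
    """Cut or pad `token_ids` so the result has exactly `max_length` entries."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        # Real tokens first, then padding up to max_length.
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# Made-up token IDs and pad ID, for illustration only.
encoded = pad_or_truncate([0, 345, 678, 2], max_length=8, pad_id=1)
print(encoded["input_ids"])       # [0, 345, 678, 2, 1, 1, 1, 1]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
```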
Run an inference
Now for that third line of code, we pass the inputs to execute(). This function requires all inputs as keyword arguments, so we'll unpack the inputs dictionary as we pass it through:
outputs = model.execute(**inputs)
print(outputs)
That's it!
The output from execute() is a dictionary of output tensors, each in an ndarray. Let's now figure out what they say.
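If the ** unpacking in the execute() call is unfamiliar, the following sketch shows what it does in plain Python. The execute function here is a stand-in written for this example, not the MAX API, and the input values are made up.

```python
def execute(**kwargs):
    """Stand-in for a function that only accepts keyword arguments."""
    # Return the argument names it received, sorted for a stable result.
    return sorted(kwargs)


# Made-up inputs keyed by the names the function expects.
inputs = {"input_ids": [[0, 345, 2]], "attention_mask": [[1, 1, 1]]}

# `**inputs` expands to execute(input_ids=..., attention_mask=...).
names = execute(**inputs)
print(names)  # ['attention_mask', 'input_ids']
```

Each dictionary key becomes a keyword argument name, so the keys must match the input names that the model reports.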
Process the outputs
Again, we'll use some help from the transformers library to convert the output ids to labels:
# Extract class prediction from output
predicted_class_id = outputs["result0"]["logits"].argmax(axis=-1)[0]
classification = hf_model.config.id2label[predicted_class_id]
print(f"The sentiment is: {classification}")
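The code above picks the single highest-scoring class. Because the model name indicates a multilabel classifier, another common way to read such logits is to apply a sigmoid to each one and keep every label above a threshold. The sketch below shows both approaches in pure Python. The label names, logit values, helper functions, and the 0.5 threshold are assumptions for illustration, not values taken from the model.

```python
import math
from typing import Dict, List


def argmax_label(logits: List[float], id2label: Dict[int, str]) -> str:
    """Return the label with the largest logit."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best]


def labels_above_threshold(
    logits: List[float], id2label: Dict[int, str], threshold: float = 0.5
) -> List[str]:
    """Return every label whose sigmoid score reaches `threshold`."""
    return [
        id2label[i]
        for i, x in enumerate(logits)
        if 1.0 / (1.0 + math.exp(-x)) >= threshold
    ]


# Hypothetical label map and logits, for illustration only.
id2label = {0: "anger", 1: "joy", 2: "optimism"}
logits = [-2.0, 1.5, 0.3]

print(argmax_label(logits, id2label))            # joy
print(labels_above_threshold(logits, id2label))  # ['joy', 'optimism']
```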
Ta-da! 🎉
If you're running this notebook yourself, note that it does not illustrate MAX Engine's runtime performance. For actual benchmark results, try our benchmark tool or check out our performance dashboard.
For more details about the inferencing API, see the Python API reference.