Run a TorchScript model with Python
One of the key features of MAX is its unified graph compiler that accelerates existing PyTorch and ONNX models on a wide range of hardware. In this tutorial, we'll show you how to convert any PyTorch model into the TorchScript format and run it with our Python API for immediate performance gains.
For this project, we'll use a version of RoBERTa from Hugging Face that's trained to perform sentiment analysis. But you can use the same procedure with almost any other PyTorch model.
If you instead want to run an ONNX model, see the tutorial to Run an ONNX model with Python.
Create a virtual environment
It's important to create your project in a virtual environment so that your Python version and packages are compatible with this code. We'll use the Magic CLI to create the environment and install the required packages.
- If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:
    curl -ssL https://magic.modular.com/ | bash
  Then run the source command that's printed in your terminal.
- Create and enter the Python project:
    magic init torchscript-tutorial --format pyproject && cd torchscript-tutorial
- Add pytorch to your package channels:
    magic project channel add pytorch --prepend
  The --prepend option is necessary to put pytorch before conda-forge, as per channel priority. This ensures that you install the official PyTorch package instead of the version from conda-forge.
- Install MAX and other conda packages:
    magic add "max~=24.6" "pytorch==2.4.0" "transformers==4.40.1"
- Now you can start a shell in the environment and see your MAX version:
    magic shell
    python3 -c 'from max import engine; print(engine.__version__)'
    24.6.0
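If you'd like to confirm the other pinned packages from inside the environment, a small script along these lines may help. It is only a sketch using the standard library's importlib.metadata; the installed_version() helper is a hypothetical name, and the distribution names "torch" and "transformers" are assumptions about how the packages register themselves.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names are assumptions; adjust them if your environment differs.
for name in ("torch", "transformers"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If a package shows up as not installed, re-run the magic add command above from the project directory.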
Now you're ready to start coding.
Download the model
Let's start by downloading the RoBERTa model from Hugging Face and saving it in the PyTorch TorchScript format (read more about supported model formats).
To save the model from Hugging Face as a TorchScript file, we'll use torch.jit.trace(), which traces the execution of the model using some dummy input.
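To make the idea of tracing concrete, here is a toy tracer in plain Python. It is not how torch.jit.trace() is implemented; the Traced class and the trace() helper are hypothetical names invented for this sketch. It illustrates only the general principle: the function runs once on an example input, and just the operations that actually execute are recorded.

```python
class Traced:
    """Wraps a number and appends every arithmetic op applied to it to a shared tape."""

    def __init__(self, name, value, tape):
        self.name = name
        self.value = value
        self.tape = tape

    def _record(self, op, other):
        # Compute the real result, then log (output, op, input, constant) on the tape.
        result = self.value + other if op == "add" else self.value * other
        out = Traced(f"t{len(self.tape)}", result, self.tape)
        self.tape.append((out.name, op, self.name, repr(other)))
        return out

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)


def trace(fn, example_input):
    """Run fn once on a wrapped example input and return the recorded operations."""
    tape = []
    fn(Traced("x", example_input, tape))
    return tape


def model(x):
    # A Python-level branch is decided once, by the example input, during tracing.
    if x.value > 0:
        return x * 2 + 1
    return x * -1


print(trace(model, 3))   # only the "positive" branch is recorded
print(trace(model, -3))  # a different example input yields a different trace
```

The two traces differ because each records a single path through the function. The same reasoning is why the example input passed to a tracer should be representative of the inputs you expect at inference time.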
- Create a file named download-model.py and paste this code:
    from pathlib import Path

    import torch
    from transformers import AutoModelForSequenceClassification

    # The Hugging Face model name and exported file name
    HF_MODEL_NAME = "cardiffnlp/twitter-roberta-base-emotion-multilabel-latest"
    MODEL_PATH = Path("roberta.torchscript")


    def main():
        # Load the RoBERTa model from Hugging Face in evaluation mode
        hf_model = AutoModelForSequenceClassification.from_pretrained(HF_MODEL_NAME)

        # Convert the model to TorchScript by tracing it with dummy inputs
        batch = 1
        seqlen = 128
        input_spec = {
            "input_ids": torch.zeros((batch, seqlen), dtype=torch.int64),
            "attention_mask": torch.zeros((batch, seqlen), dtype=torch.int64),
        }
        with torch.no_grad():
            traced_model = torch.jit.trace(
                hf_model, example_kwarg_inputs=dict(input_spec), strict=False
            )

        # Save the traced model to disk
        torch.jit.save(traced_model, MODEL_PATH)


    if __name__ == "__main__":
        main()
- Now run the file (you must be inside the Magic environment already, via magic shell):
    python3 download-model.py
You should now see the model saved as roberta.torchscript.
Load the model
Now we can load and compile the TorchScript model in MAX.
Start by creating the executable file called run.py with the required imports and a main() function:
    from pathlib import Path

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    from max import engine
    from max.dtype import DType

    # The Hugging Face model name and TorchScript file name
    HF_MODEL_NAME = "cardiffnlp/twitter-roberta-base-emotion-multilabel-latest"
    MODEL_PATH = Path("roberta.torchscript")


    def main():
        # This is where we'll add our code
        pass  # placeholder so the file runs; replace it with the code below


    if __name__ == "__main__":
        main()
In the following sections, we'll add our code to the main() function.
Define TorchScript input specs
Before you can compile a TorchScript model, you need to specify the input names and shapes for each of the model inputs as a list of TorchInputSpec objects. TorchScript files don't include this information as metadata, but MAX requires it to compile the model.
To declare the input specs for the RoBERTa model, add this code to your main() function in run.py (notice that input_spec is the same specification we used to trace the model in download-model.py):
    batch = 1
    seqlen = 128
    input_spec = {
        "input_ids": torch.zeros((batch, seqlen), dtype=torch.int64),
        "attention_mask": torch.zeros((batch, seqlen), dtype=torch.int64),
    }
    input_spec_list = [
        engine.TorchInputSpec(shape=tensor.size(), dtype=DType.int64)
        for tensor in input_spec.values()
    ]
We'll use this input_spec_list when we load the model next.
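Because the specs above declare a fixed shape of (1, 128) for both inputs, it can be handy to check your own inputs against that shape before running the model. The sketch below does this in plain Python on nested lists. The shape_of() and find_shape_mismatches() helpers and the EXPECTED_SHAPES table are hypothetical names for illustration and are not part of the MAX API.

```python
def shape_of(nested):
    """Return the shape of a rectangular nested list, e.g. [[1, 2, 3]] gives (1, 3)."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(shape)


# Mirrors the batch and seqlen values used in the input specs above.
EXPECTED_SHAPES = {"input_ids": (1, 128), "attention_mask": (1, 128)}


def find_shape_mismatches(inputs, expected=EXPECTED_SHAPES):
    """Return a list of human-readable problems; an empty list means all shapes match."""
    problems = []
    for name, want in expected.items():
        if name not in inputs:
            problems.append(f"{name}: missing")
            continue
        got = shape_of(inputs[name])
        if got != want:
            problems.append(f"{name}: expected {want}, got {got}")
    return problems


# An unpadded 14-token sentence does not match the declared shape.
print(find_shape_mismatches({
    "input_ids": [[0] * 14],
    "attention_mask": [[1] * 128],
}))
```

A mismatch like the one printed here is the reason the tokenizer call later in this tutorial pads every sentence to the same length.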
Load and compile the model
Now you can instantiate an InferenceSession and load the model:
    session = engine.InferenceSession()
    model = session.load(MODEL_PATH, input_specs=input_spec_list)
Prepare the input
This part is normal input pre-processing. For the RoBERTa model, we need to process the text input into a sequence of tokens, so we'll do that with transformers.AutoTokenizer.
First, let's print the loaded model's inputs:
    for tensor in model.input_metadata:
        print(
            f"name: {tensor.name}, shape: {tensor.shape}, dtype: {tensor.dtype}"
        )
You can now run the code to see the printed metadata (you must be inside the Magic environment already, via magic shell):
    python3 run.py
    name: input_ids, shape: [1, 128], dtype: DType.int64
    name: attention_mask, shape: [1, 128], dtype: DType.int64
This output confirms what we already knew when we specified the input_spec.
The model takes two inputs: one tensor of input tokens, and an attention mask.
To create the tokenized sentence and a corresponding mask, we'll use the Transformers tokenizer:
    text_input = "There are many exciting developments in the field of AI Infrastructure!"
    tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
    inputs = tokenizer(
        text_input,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=seqlen,
    )
    print(inputs)
Run the script again to see the tokenized inputs:
python3 run.py
{'input_ids': tensor([[ 0, 970, 32, 171, 3571, 5126, 11, 5, 882, 9,
4687, 13469, 328, 2, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0]])}
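The printed tensors show what padding="max_length" does: the 14 real token IDs are followed by padding IDs up to 128 entries, and the attention mask is 1 for real tokens and 0 for padding. The sketch below reproduces that layout in plain Python. The pad_and_mask() helper is a hypothetical name, the pad ID of 1 is read off the output above, and the truncation is a plain slice, which is simpler than what the real tokenizer does.

```python
PAD_ID = 1  # assumed from the trailing 1s in the printed input_ids


def pad_and_mask(token_ids, max_length, pad_id=PAD_ID):
    """Pad (or naively truncate) token IDs to max_length and build the matching mask."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)  # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


# The 14 token IDs from the tokenizer output above.
token_ids = [0, 970, 32, 171, 3571, 5126, 11, 5, 882, 9, 4687, 13469, 328, 2]
input_ids, attention_mask = pad_and_mask(token_ids, max_length=128)

print(len(input_ids), sum(attention_mask))
```

Both lists come out 128 entries long, which matches the [1, 128] shape reported by the model's input metadata once a batch dimension is added.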
Run an inference
To run an inference, pass the inputs to execute_legacy(). This function requires all inputs as keyword arguments, so we'll unpack the inputs we got from the tokenizer as we pass them in:
    outputs = model.execute_legacy(**inputs)
The output from execute_legacy() is a dictionary of output tensors, each in an ndarray.
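As a rough idea of what can be done with such output, the sketch below turns a row of raw scores (logits) into per-label probabilities. It assumes the model is multi-label, as its name suggests, so each logit is passed through a sigmoid independently. The logit values and label names are made up for illustration, and the label_scores() helper is hypothetical; the real labels come from the model's configuration.

```python
import math


def sigmoid(x):
    """Map a raw logit to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def label_scores(logits, labels, threshold=0.5):
    """Return the labels whose sigmoid score meets the threshold, with their scores."""
    scores = {label: sigmoid(logit) for label, logit in zip(labels, logits)}
    return {label: score for label, score in scores.items() if score >= threshold}


# Made-up values for illustration only; not real model output.
labels = ["joy", "optimism", "anger"]
logits = [2.0, 0.3, -3.1]

for label, score in label_scores(logits, labels).items():
    print(f"{label}: {score:.2f}")
```

With a multi-label model, several labels can pass the threshold for the same sentence, which is why each score is computed independently rather than with a softmax across labels.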