
Run a TorchScript model with Python

Scott Main

One of the key features of MAX is its unified graph compiler that accelerates existing PyTorch and ONNX models on a wide range of hardware. In this tutorial, we'll show you how to convert any PyTorch model into the TorchScript format and run it with our Python API for immediate performance gains.

For this project, we'll use a version of RoBERTa from Hugging Face that's trained to perform sentiment analysis. But you can use the same procedure with almost any other PyTorch model.

If you instead want to run an ONNX model, see the tutorial to Run an ONNX model with Python.

Create a virtual environment

It's important to create your project in a virtual environment so that your Python version and packages are compatible with this code. We'll use the Magic CLI to create the environment and install the required packages.

  1. Create and enter the Python project:

    magic init torchscript-tutorial --format pyproject && cd torchscript-tutorial
  2. Add pytorch to your package channels:

    magic project channel add pytorch --prepend

    The --prepend option is necessary to put pytorch before conda-forge, as per channel priority. This ensures that you install the official PyTorch package, instead of the version from conda-forge.

  3. Install MAX and other conda packages:

    magic add "max~=24.5" "pytorch==2.4.0" "transformers==4.40.1"
  4. Now you can start a shell in the environment and see your MAX version:

    magic shell
    python3 -c 'from max import engine; print(engine.__version__)'
    24.5.0

Now you're ready to start coding.

Download the model

Let's start by downloading the RoBERTa model from Hugging Face and saving it in the PyTorch TorchScript format (read more about supported model formats).

To save the model from Hugging Face as a TorchScript file, we'll use torch.jit.trace(), which traces the execution of the model using some dummy input.

  1. Create a file named download-model.py and paste this code:

    download-model.py
    from pathlib import Path
    import torch
    from transformers import AutoModelForSequenceClassification

    # The Hugging Face model name and exported file name
    HF_MODEL_NAME = "cardiffnlp/twitter-roberta-base-emotion-multilabel-latest"
    MODEL_PATH = Path("roberta.torchscript")


    def main():
        # Load the RoBERTa model from Hugging Face in evaluation mode
        hf_model = AutoModelForSequenceClassification.from_pretrained(HF_MODEL_NAME)

        # Convert the model to TorchScript
        batch = 1
        seqlen = 128
        input_spec = {
            "input_ids": torch.zeros((batch, seqlen), dtype=torch.int64),
            "attention_mask": torch.zeros((batch, seqlen), dtype=torch.int64),
        }
        with torch.no_grad():
            traced_model = torch.jit.trace(
                hf_model, example_kwarg_inputs=dict(input_spec), strict=False
            )
        torch.jit.save(traced_model, MODEL_PATH)


    if __name__ == "__main__":
        main()
  2. Now run the file (you must be inside the Magic environment already, via magic shell):

    python3 download-model.py

You should now see the model saved as roberta.torchscript.
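
If you'd like to sanity-check the exported file before moving on, you can load it back with torch.jit.load() and call it with dummy inputs shaped like the ones used for tracing. This optional snippet is just a sketch based on the file name and input shapes from download-model.py above:

    import torch

    # Load the traced model from disk and run it on dummy (all-zero) inputs
    loaded = torch.jit.load("roberta.torchscript")
    dummy = {
        "input_ids": torch.zeros((1, 128), dtype=torch.int64),
        "attention_mask": torch.zeros((1, 128), dtype=torch.int64),
    }
    with torch.no_grad():
        out = loaded(**dummy)

    # The inputs are all zeros, so the values aren't meaningful; this only
    # confirms that the traced model loads and produces an output (e.g. logits).
    print(out)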

Load the model

Now we can load and compile the TorchScript model in MAX.

Start by creating a file named run.py with the required imports and a main() function:

run.py
from pathlib import Path
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from max import engine
from max.dtype import DType

# The Hugging Face model name and TorchScript file name
HF_MODEL_NAME = "cardiffnlp/twitter-roberta-base-emotion-multilabel-latest"
MODEL_PATH = Path("roberta.torchscript")

def main():
    # This is where we'll add our code
    ...

if __name__ == "__main__":
    main()

In the following sections, we'll add our code to the main() function.

Define TorchScript input specs

Before you can compile a TorchScript model, you need to specify the input names and shapes for each of the model inputs as a list of TorchInputSpec objects. TorchScript files don't provide this information as metadata, which MAX requires when compiling the model.

To declare the input specs for the RoBERTa model, add this code to your main() function in run.py (notice that input_spec is the same specification we used to trace the model in download-model.py):

batch = 1
seqlen = 128
input_spec = {
    "input_ids": torch.zeros((batch, seqlen), dtype=torch.int64),
    "attention_mask": torch.zeros((batch, seqlen), dtype=torch.int64),
}

input_spec_list = [
    engine.TorchInputSpec(shape=tensor.size(), dtype=DType.int64)
    for tensor in input_spec.values()
]

We'll use this input_spec_list when we load the model next.

Load and compile the model

Now you can instantiate an InferenceSession and load the model:

session = engine.InferenceSession()
model = session.load(MODEL_PATH, input_specs=input_spec_list)

Prepare the input

This part is normal input pre-processing. For the RoBERTa model, we need to process the text input into a sequence of tokens, so we'll do that with transformers.AutoTokenizer.

First, let's print the loaded model's inputs:

for tensor in model.input_metadata:
    print(
        f"name: {tensor.name}, shape: {tensor.shape}, dtype: {tensor.dtype}"
    )

You can now run the code to see the printed metadata (you must be inside the Magic environment already, via magic shell):

python3 run.py
name: input_ids, shape: [1, 128], dtype: DType.int64
name: attention_mask, shape: [1, 128], dtype: DType.int64

This output confirms what we already knew when we specified the input_spec. The model takes two inputs: one tensor of input tokens, and an attention mask.

To create the tokenized sentence and a corresponding mask, we'll use the Transformers tokenizer:

text_input = "There are many exciting developments in the field of AI Infrastructure!"

tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
inputs = tokenizer(
    text_input,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=seqlen,
)
print(inputs)

Run the script again to see the tokenized inputs:

python3 run.py
{'input_ids': tensor([[    0,   970,    32,   171,  3571,  5126,    11,     5,   882,     9,
4687, 13469, 328, 2, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0]])}
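
If you want to confirm that these token ids map back to the original sentence, you can reverse the mapping with the tokenizer's decode() method. This is an optional check, not part of the tutorial code:

    # Decode the first (and only) sequence, dropping special and padding tokens
    print(tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True))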

Run an inference

To run an inference, pass the inputs to execute_legacy(). This function requires all inputs as keyword arguments, so we'll unpack the inputs we got from the tokenizer as we pass them in:

    outputs = model.execute_legacy(**inputs)

The output from execute_legacy() is a dictionary of output tensors, each as an ndarray.
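
If you're curious about the structure of this dictionary, you can loop over it and print each entry. This is just an optional sketch; the exact key names (such as result0, used below) depend on the model:

    # Print each output name and the type of its value
    for name, value in outputs.items():
        print(name, type(value))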

Process the outputs

To convert the output logits into a human-readable label, we'll again use the transformers library:

hf_model = AutoModelForSequenceClassification.from_pretrained(HF_MODEL_NAME)
predicted_class_id = outputs["result0"]["logits"].argmax(axis=-1)[0]
classification = hf_model.config.id2label[predicted_class_id]

print(f"The sentiment is: {classification}")

Now run it again to see the result:

python3 run.py
The sentiment is: joy
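
The code above picks only the highest-scoring class, but because this model is trained for multi-label emotion classification, you can also look at a score for every label. Here's an optional sketch that builds on the outputs and hf_model objects from run.py, assuming the logits array has shape [1, num_labels]:

    import torch

    # Convert the logits ndarray to a tensor and apply a per-label sigmoid
    logits = torch.from_numpy(outputs["result0"]["logits"])[0]
    scores = torch.sigmoid(logits)
    for class_id, score in enumerate(scores.tolist()):
        print(f"{hf_model.config.id2label[class_id]}: {score:.3f}")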

That's it!

To see this finished project, get the code on GitHub.

For more API docs, see the Python API reference.
