Inference Engine Python demo
This is a preview of the Modular Inference Engine. It is not publicly available yet and APIs are subject to change.
If you’re interested, please sign up for early access.
The Modular Inference Engine is the world’s fastest unified inference engine, designed to run any TensorFlow or PyTorch model on any hardware backend.
This page is built from the same Jupyter notebook that Nick Kreeger presented in our launch keynote video, in which he shows how easy it is to load trained models from TensorFlow and PyTorch and run them with our Python API on a variety of CPU backends. We’re sharing this executed version of the notebook so you can look closely at the code from the video.
Below, you can see how we load one TensorFlow model (a BERT model) and one PyTorch model (a DLRM model) into the Modular Inference Engine, and then print some model metadata and execute each one.
Notebook code
import numpy as np
from pathlib import Path
from bert_utils import convert_to_tokens, convert_to_string
TensorFlow BERT-Base Model
= Path("models/tensorflow/bert") tf_bert_model
PyTorch DLRM Recommender Model
= Path("models/pytorch/dlrm.pt") pt_dlrm_model
Virtual Machine Information in AWS
Check the exact machine configuration of the instance running this notebook.
print("="*40, "Processor Information", "="*40, "\n")
!lscpu | grep "Model name"
!lscpu | grep Architecture
======================================== Processor Information ========================================
Model name: Neoverse-N1
Architecture: aarch64
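If you want the same machine details from plain Python rather than Jupyter shell magics, the standard library covers most of it. This small portability sketch is not part of the original notebook:
# Portability sketch (not in the original notebook): report CPU details
# without relying on Jupyter's `!` shell magic.
import os
import platform

print("Architecture:", platform.machine())  # e.g. "aarch64" on a Neoverse-N1 instance
print("CPU count:", os.cpu_count())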
Import the Modular Python API
By default, the inference engine is small and has very few dependencies. It automatically loads the TensorFlow and PyTorch dependencies only when they are needed.
from modular import engine
session = engine.InferenceSession()
Load and initialize both the TensorFlow and PyTorch models
This process handles loading all framework dependencies for you; the models are ready for inference once loaded.
tf_bert_session = session.load(tf_bert_model)
pt_dlrm_session = session.load(pt_dlrm_model)
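Because both models go through the same session.load() call, it is easy to manage several models at once. Here is a minimal sketch (not from the original notebook) that reuses only the calls shown above; the dict of paths is just an illustration:
# Sketch: load several models with one InferenceSession, keyed by name.
model_paths = {
    "bert": tf_bert_model,
    "dlrm": pt_dlrm_model,
}
loaded_sessions = {name: session.load(path) for name, path in model_paths.items()}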
Run inference on both the TensorFlow BERT and PyTorch DLRM models
The Modular Python API works well alongside libraries such as NumPy, making it easy to construct model inputs.
# Run BERT TensorFlow model with a given question.
question = "When did Copenhagen become the capital of Denmark?"
attention_mask, input_ids, token_type_ids = convert_to_tokens(question)
bert_outputs = tf_bert_session.execute(attention_mask, input_ids, token_type_ids)
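convert_to_tokens comes from the notebook's bert_utils helper module, whose source isn't shown here. As a purely hypothetical sketch of what such a helper might do, assuming a HuggingFace bert-base tokenizer and a fixed sequence length of 192 (inferred from the output shapes below; the real helper presumably also supplies the context passage about Copenhagen):
# Hypothetical sketch of a convert_to_tokens-style helper; the real bert_utils
# implementation may differ. Assumes HuggingFace `transformers` is installed.
from transformers import BertTokenizer

def convert_to_tokens_sketch(question, context="", seq_len=192):
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    encoded = tokenizer(question, context, padding="max_length",
                        truncation=True, max_length=seq_len, return_tensors="np")
    # Return the three arrays in the order the BERT session expects above.
    return (encoded["attention_mask"],
            encoded["input_ids"],
            encoded["token_type_ids"])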
# Run the DLRM PyTorch model with random one-hot rows of suggested items and dense features.
recommended_items = np.random.rand(4, 8, 100).astype(np.int32)
dense_features = np.random.rand(4, 256).astype(np.float32)
dlrm_outputs = pt_dlrm_session.execute(dense_features, recommended_items)
Inspecting the output of BERT
The Modular Python API provides access to shapes, dtypes, and tensor output values. This example takes the outputs from BERT and converts the output tokens to strings.
print("Number of output tensors:", len(bert_outputs))
print(bert_outputs[0].shape, bert_outputs[0].dtype)
print(bert_outputs[1].shape, bert_outputs[1].dtype)
print("Answer:", convert_to_string(input_ids, bert_outputs))
Number of output tensors: 2
(1, 192) float32
(1, 192) float32
Answer: Copenhagen became the capital of Denmark in the early 15th century
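convert_to_string is also part of bert_utils. The two (1, 192) output tensors look like per-token start and end scores for an answer span, so a hypothetical decode step might look like the following; the actual helper may differ, and the tokenizer argument here is borrowed from the convert_to_tokens sketch above:
# Hypothetical sketch of a convert_to_string-style helper; the real bert_utils
# implementation may differ. Treats the outputs as NumPy-compatible arrays and
# assumes outputs[0]/outputs[1] are start/end scores over the 192 positions.
def convert_to_string_sketch(input_ids, outputs, tokenizer):
    start = int(np.argmax(np.asarray(outputs[0])[0]))
    end = int(np.argmax(np.asarray(outputs[1])[0]))
    answer_ids = input_ids[0][start:end + 1]
    # Decode the selected token span back to text.
    return tokenizer.decode(answer_ids, skip_special_tokens=True)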
Inspecting the output of DLRM
The PyTorch DLRM model output exposes the same API for accessing inference results as the BERT example above.
print("Number of output tensors:", len(dlrm_outputs))
print(dlrm_outputs[0].shape, dlrm_outputs[0].dtype)
= ["dog", "cat", "rabbit", "snake"]
dlrm_suggested_items = dlrm_outputs[0].argmax()
dlrm_recommended_index print("Recommend item index:", dlrm_outputs[0].argmax())
print("Recommend item:", dlrm_suggested_items[dlrm_recommended_index])
Number of output tensors: 1
(4, 1) float32
Recommend item index: 1
Recommend item: cat
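As a small follow-on (not in the original notebook), you could turn the four raw DLRM scores into per-item probabilities with a softmax before picking the top item:
# Follow-on sketch: softmax the four raw scores into per-item probabilities.
# Assumes the output tensor converts to a NumPy array via np.asarray().
scores = np.asarray(dlrm_outputs[0]).reshape(-1)  # shape (4,), one score per candidate
probs = np.exp(scores - scores.max())
probs /= probs.sum()
for item, p in zip(dlrm_suggested_items, probs):
    print(f"{item}: {p:.2%}")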