The point of a trained model is to put it to use, to connect its inferencing power to the rest of your application and put its capabilities in the hands of your users.
To help you achieve that goal, we built MAX. MAX includes a state-of-the-art graph compiler and runtime library that executes models from PyTorch with incredible inference speed on a wide range of hardware.
In this tutorial, you'll explore firsthand how to combine MAX Engine with AWS SageMaker. You'll use MAX Engine to handle inference requests using a previously trained BERT model, and you'll use AWS SageMaker to deploy the model. You'll then test the deployment by sending an inference request to an AWS SageMaker endpoint.
About Amazon SageMaker
Amazon SageMaker is a fully managed service provided by Amazon Web Services (AWS) that enables you to build, train, and deploy machine learning models at scale. It simplifies the machine learning workflow by offering integrated tools for every step of the process, from data preparation and model building to training and deployment. SageMaker supports a variety of algorithms and frameworks, making it versatile for different use cases. Additionally, it provides features for model monitoring and automatic scaling, ensuring robust and efficient operations.
About AWS CloudFormation
AWS CloudFormation is a service from AWS that enables you to model, provision, and manage AWS and third-party resources by treating infrastructure as code. It allows you to create and update a collection of related AWS resources in a predictable and orderly fashion through templates. This approach simplifies the orchestration of complex environments, ensuring consistent configuration and deployment. CloudFormation also supports automated rollbacks and dependency management, enhancing reliability and ease of use.
Prerequisites
Before you get started with this tutorial, you should make sure you have the appropriate credentials to log into your AWS account. In addition, you need to have an Identity and Access Management (IAM) role and policy that allows you to create and deploy resources using AWS SageMaker.
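One quick way to confirm which identity your credentials resolve to is to ask AWS STS. The following is a minimal sketch, not part of the tutorial's own code: the helper name describe_caller is hypothetical, and it assumes you pass in a boto3 STS client. It only reports the account and ARN; it does not verify that the identity has the SageMaker or CloudFormation permissions this tutorial needs.

```python
def describe_caller(sts_client):
    """Return a one-line summary of the identity behind the current credentials.

    sts_client is expected to behave like boto3.client("sts").
    """
    identity = sts_client.get_caller_identity()
    return f"account={identity['Account']} arn={identity['Arn']}"


# Usage (requires boto3 and valid credentials, so it is left commented out):
#   import boto3
#   print(describe_caller(boto3.client("sts")))
```

If the call raises a credentials error, sign in again before continuing.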
Build the AWS CloudFormation stack
Your first step is to use a previously created AWS CloudFormation template to define the AWS resources you need. A set of resources built from a CloudFormation template is referred to as a stack.
- Sign in to the AWS Console.
- In a separate browser tab, open this link to create a stack using our example template.
- Check the "I acknowledge that AWS CloudFormation might create IAM resources" checkbox.
- Click Create stack.
  AWS builds out the resources defined in the CloudFormation template. This process takes up to 10 minutes to complete.
  When AWS finishes building the stack, it displays an event in the Events tab that says CREATE_COMPLETE.
- Click the Outputs tab and copy the EndpointName value.
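If you prefer to read the EndpointName output from a script rather than the console, the lookup can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, the stack name is whatever you chose when creating the stack, and the cloudformation argument is expected to behave like boto3.client("cloudformation").

```python
def get_stack_output(describe_stacks_response, output_key):
    """Return the value of one stack output, or None if it is absent."""
    for stack in describe_stacks_response.get("Stacks", []):
        for output in stack.get("Outputs", []):
            if output.get("OutputKey") == output_key:
                return output.get("OutputValue")
    return None


def wait_for_endpoint_name(cloudformation, stack_name):
    """Block until stack creation completes, then return its EndpointName output."""
    # The waiter polls the stack status until it reaches CREATE_COMPLETE.
    cloudformation.get_waiter("stack_create_complete").wait(StackName=stack_name)
    response = cloudformation.describe_stacks(StackName=stack_name)
    return get_stack_output(response, "EndpointName")


# Usage (requires boto3 and valid credentials, so it is left commented out):
#   import boto3
#   print(wait_for_endpoint_name(boto3.client("cloudformation"), "YOUR-STACK-NAME"))
```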
Test the deployment endpoint
You have now created a deployment that connects a model served by MAX Engine to AWS SageMaker. This deployment includes a number of AWS compute and network resources that AWS SageMaker creates automatically to handle inference requests. To test this deployment, you'll create a small Python application that sends an inference request to an AWS SageMaker endpoint, then processes and displays the response.
- Open a terminal.
- Sign in to AWS.
    aws sso login
- Create a Python virtual environment and install the required dependencies.
    python3 -m venv max-aws-deploy && source max-aws-deploy/bin/activate
    pip install boto3 transformers
    pip install torch
- Create a file called client.py and paste in the following code.
  Caution: If you didn't write down the endpoint_name, you can find it by opening your AWS Console and selecting CloudFormation, then clicking the Outputs tab.
    # suppress extraneous logging (must be set before importing transformers)
    import os
    os.environ["TRANSFORMERS_VERBOSITY"] = "critical"

    import json

    import boto3
    import numpy as np
    import transformers
    from botocore.config import Config

    config = Config(region_name="us-east-1")
    client = boto3.client("sagemaker-runtime", config=config)

    # NOTE: Paste your endpoint here
    endpoint_name = "YOUR-ENDPOINT-GOES-HERE"

    text = "The quick brown fox jumped over the lazy dog."

    # Tokenize the text and pad it to the fixed sequence length of 128
    tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
    inputs = tokenizer(text, padding="max_length", max_length=128, return_tensors="pt")

    # Convert tensor inputs to list for payload
    input_ids = inputs["input_ids"].tolist()[0]
    attention_mask = inputs["attention_mask"].tolist()[0]
    token_type_ids = inputs["token_type_ids"].tolist()[0]

    payload = {
        "inputs": [
            {
                "name": "input_ids",
                "shape": [1, 128],
                "datatype": "INT32",
                "data": input_ids,
            },
            {
                "name": "attention_mask",
                "shape": [1, 128],
                "datatype": "INT32",
                "data": attention_mask,
            },
            {
                "name": "token_type_ids",
                "shape": [1, 128],
                "datatype": "INT32",
                "data": token_type_ids,
            },
        ]
    }

    # Send the inference request to the SageMaker endpoint
    http_response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
    )
    response = json.loads(http_response["Body"].read().decode("utf8"))
    outputs = response["outputs"]


    def softmax(logits):
        # Subtract the maximum before exponentiating for numerical stability
        exp_logits = np.exp(logits - np.max(logits))
        return exp_logits / exp_logits.sum(axis=-1, keepdims=True)


    # Process the output
    for output in outputs:
        logits = output["data"]
        logits = np.array(logits).reshape(output["shape"])
        print(f"Logits shape: {logits.shape}")

        if len(logits.shape) == 3:  # Shape [batch_size, sequence_length, num_classes]
            token_probabilities = softmax(logits)
            predicted_classes = np.argmax(token_probabilities, axis=-1)
            print(f"Predicted classes shape: {predicted_classes.shape}")
            print(f"Predicted class indices range: {np.min(predicted_classes)}, {np.max(predicted_classes)}")

            # Map predicted indices to tokens
            predicted_tokens = tokenizer.convert_ids_to_tokens(predicted_classes[0])

            # Pair each input token with its predicted token
            input_tokens = tokenizer.convert_ids_to_tokens(input_ids)
            token_pairs = list(zip(input_tokens, predicted_tokens))

            print("Predicted Token Pairs:")
            print("-" * 45)
            print("| {:<20} | {:<18} |".format("Input Token", "Predicted Token"))
            print("-" * 45)
            for input_token, predicted_token in token_pairs:
                if input_token != "[PAD]":  # Exclude padding tokens
                    print("| {:<20} | {:<18} |".format(input_token, predicted_token))
            print("-" * 45)
- Run the script.
    python client.py
You should see output similar to the following.
Logits shape: (1, 128, 30522)
Predicted classes shape: (1, 128)
Predicted class indices range: 1010, 13971
Predicted Token Pairs:
---------------------------------------------
| Input Token          | Predicted Token    |
---------------------------------------------
| [CLS]                | .                  |
| the                  | the                |
| quick                | quick              |
| brown                | brown              |
| fox                  | fox                |
| jumped               | jumped             |
| over                 | over               |
| the                  | the                |
| lazy                 | lazy               |
| dog                  | dog                |
| .                    | .                  |
| [SEP]                | .                  |
---------------------------------------------
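The request body that client.py sends is a list of named tensors, each with a shape, a datatype, and flat data. The following sketch shows one way to build that structure without repeating the three dictionaries by hand. The helper names make_tensor and build_payload are hypothetical, and the token values in the usage lines are placeholders rather than real tokenizer output.

```python
import json


def make_tensor(name, values, datatype="INT32"):
    """Describe one [1, N] input tensor in the layout used by client.py."""
    return {
        "name": name,
        "shape": [1, len(values)],
        "datatype": datatype,
        "data": list(values),
    }


def build_payload(input_ids, attention_mask, token_type_ids):
    """Build the request body, checking that all inputs have the same length."""
    tensors = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }
    lengths = {len(values) for values in tensors.values()}
    if len(lengths) != 1:
        raise ValueError("all input tensors must have the same length")
    return {"inputs": [make_tensor(name, values) for name, values in tensors.items()]}


# Placeholder values standing in for tokenizer output
payload = build_payload([5, 6, 7, 0], [1, 1, 1, 0], [0, 0, 0, 0])
body = json.dumps(payload)
```

Deriving the shape from the data keeps the two from drifting apart if you change max_length in the tokenizer call.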
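To see how the script turns logits into the predicted tokens in the table, here is a small worked sketch in plain Python. It assumes a made-up four-word vocabulary and invented logit values, so it illustrates the softmax and argmax steps only and does not reproduce the model's real output.

```python
import math


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def predict_ids(logits):
    """Map logits of shape [sequence_length][vocab_size] to one id per position."""
    predicted = []
    for row in logits:
        probabilities = softmax(row)
        best = max(range(len(probabilities)), key=probabilities.__getitem__)
        predicted.append(best)
    return predicted


# Hypothetical vocabulary and logits, one row per input position
vocab = ["the", "fox", "dog", "."]
logits = [
    [4.0, 0.5, 0.1, 0.2],
    [0.3, 3.2, 1.1, 0.0],
    [0.1, 0.2, 0.3, 2.5],
]
tokens = [vocab[i] for i in predict_ids(logits)]
print(tokens)  # ['the', 'fox', '.']
```

Because softmax preserves the ordering of values within a row, taking the argmax of the raw logits would select the same ids; the softmax step matters only if you also want the probabilities.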
Clean up
That's it! You've now deployed a model using MAX Engine, AWS CloudFormation, and Amazon SageMaker! To avoid incurring additional costs for AWS resources, we recommend you delete the resources you’ve built.
To delete tutorial resources:
- From the CloudFormation console, select Stacks.
- Select the stack that you created for this tutorial.
- Click Delete.
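The same cleanup can be scripted. The sketch below is one possible approach, assuming the cloudformation argument behaves like boto3.client("cloudformation"); the function name delete_tutorial_stack and the stack name in the usage lines are hypothetical.

```python
def delete_tutorial_stack(cloudformation, stack_name):
    """Request deletion of the stack and block until the deletion completes."""
    cloudformation.delete_stack(StackName=stack_name)
    # The waiter polls the stack status until it reaches DELETE_COMPLETE.
    waiter = cloudformation.get_waiter("stack_delete_complete")
    waiter.wait(StackName=stack_name)


# Usage (requires boto3 and valid credentials, so it is left commented out):
#   import boto3
#   delete_tutorial_stack(boto3.client("cloudformation"), "YOUR-STACK-NAME")
```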
Next steps
In this tutorial, you've used an AWS CloudFormation template to build out a complete AWS SageMaker deployment. This deployment used MAX Engine to handle inference requests for a BERT model. The deployment took a text input, analyzed each token in the input, and returned the token the model predicted for each position.
We encourage you to use what you learned here to deploy other models, and extend this tutorial as needed to explore other MAX features.
Here are some other topics to explore next: