The point of a trained model is to put it to use, to connect its inferencing power to the rest of your application and put its capabilities in the hands of your users.
To help you achieve that goal, we built MAX. MAX includes a state-of-the-art graph compiler and runtime library that executes models from PyTorch with incredible inference speed on a wide range of hardware.
In this tutorial, you'll explore firsthand how to combine MAX Engine with AWS SageMaker. You'll use MAX Engine to handle inference requests using a previously trained BERT model, and you'll use AWS SageMaker to deploy the model. You'll then test the deployment by sending an inference request to an AWS SageMaker endpoint.
If you experience any issues in this tutorial, please let us know on GitHub.
About Amazon SageMaker
Amazon SageMaker is a fully managed service provided by Amazon Web Services (AWS) that enables you to build, train, and deploy machine learning models at scale. It simplifies the machine learning workflow by offering integrated tools for every step of the process, from data preparation and model building to training and deployment. SageMaker supports a variety of algorithms and frameworks, making it versatile for different use cases. Additionally, it provides features for model monitoring and automatic scaling, ensuring robust and efficient operations.
About AWS CloudFormation
AWS CloudFormation is a service from AWS that enables you to model, provision, and manage AWS and third-party resources by treating infrastructure as code. It allows you to create and update a collection of related AWS resources in a predictable and orderly fashion through templates. This approach simplifies the orchestration of complex environments, ensuring consistent configuration and deployment. CloudFormation also supports automated rollbacks and dependency management, enhancing reliability and ease of use.
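To make the template idea concrete, here is a minimal, hypothetical CloudFormation fragment in YAML. It is not the template this tutorial uses (that one defines a full SageMaker deployment); it simply declares a single S3 bucket and exposes its name as a stack output, the same mechanism the tutorial's stack uses to expose its endpoint name.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Minimal illustrative template (not the tutorial's template)
Resources:
  ExampleBucket:
    Type: AWS::S3::Bucket
Outputs:
  BucketName:
    Description: Name of the bucket this stack created
    Value: !Ref ExampleBucket
```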
Prerequisites
Before you get started with this tutorial, you should make sure you have the appropriate credentials to log into your AWS account. In addition, you need to have an Identity and Access Management (IAM) role and policy that allows you to create and deploy resources using AWS SageMaker.
Build the AWS CloudFormation stack
Your first step is to use a previously-created AWS CloudFormation template to define the various AWS resources you need. A set of resources built from CloudFormation is referred to as a stack.
- Sign in to the AWS Console.

- In a separate browser tab, open this link to create a stack using our example template.

- Check the I acknowledge that AWS CloudFormation might create IAM resources checkbox.

- Click Create stack.

  AWS builds out the resources defined in the CloudFormation template. This process takes up to 10 minutes to complete. When AWS finishes building the stack, it displays a CREATE_COMPLETE event in the Events tab.

- Click the Outputs tab and copy the EndpointName value.
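If you'd rather fetch the endpoint name programmatically than copy it from the console, stack outputs are also available through the CloudFormation API: in practice the response comes from `boto3.client("cloudformation").describe_stacks(StackName=...)`. The sketch below uses a hand-written sample response with that same shape (the stack and endpoint names are placeholders), so you can see how to pick the value out without touching AWS:

```python
# Pick a named output out of a CloudFormation stack description.
# In practice, stack_description comes from:
#   boto3.client("cloudformation").describe_stacks(StackName="...")["Stacks"][0]
# Here we use a hand-written sample with the same shape.
def get_stack_output(stack_description, key):
    for output in stack_description.get("Outputs", []):
        if output["OutputKey"] == key:
            return output["OutputValue"]
    return None

# Placeholder values, standing in for a real describe_stacks response
sample_stack = {
    "Outputs": [
        {"OutputKey": "EndpointName", "OutputValue": "max-bert-endpoint-example"},
    ]
}

print(get_stack_output(sample_stack, "EndpointName"))  # max-bert-endpoint-example
```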
Test the deployment endpoint
At this point, you have created a deployment that connects a model using MAX Engine to AWS SageMaker. This deployment includes a number of AWS compute and network resources that AWS SageMaker creates automatically to handle inference requests. To test this deployment, you'll create a small Python application that sends an inference request to an AWS SageMaker endpoint, then processes and displays the response.
In this tutorial you need to sign in to AWS using the `aws` CLI tool. This step is necessary because the AWS SageMaker configuration you've created does not expose the endpoint to the internet.

To sign in to AWS from the command line, we recommend you use the AWS SSO token provider configuration. You can create this configuration by running `aws configure sso`. This command requires an SSO Start URL and an SSO Region. The values for these parameters depend on your AWS configuration. To learn more, see Configure the AWS CLI to use AWS IAM Identity Center.
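Before walking through the steps, it helps to see the shape of the request body you'll send. It is a JSON document containing a list of named tensors, each with a shape, datatype, and flattened data. A minimal sketch of that structure, using placeholder values rather than real BERT token IDs (the `build_payload` helper is ours, for illustration only):

```python
import json

# Build an inference payload in the named-tensor JSON format this tutorial
# sends to the endpoint. The tensor names, shapes, and datatypes mirror the
# BERT inputs used below; the data values here are placeholders.
def build_payload(input_ids, attention_mask, token_type_ids, seq_len=128):
    def tensor(name, data):
        return {"name": name, "shape": [1, seq_len], "datatype": "INT32", "data": data}

    return json.dumps(
        {
            "inputs": [
                tensor("input_ids", input_ids),
                tensor("attention_mask", attention_mask),
                tensor("token_type_ids", token_type_ids),
            ]
        }
    )

body = build_payload([101] + [0] * 127, [1] + [0] * 127, [0] * 128)
print(json.loads(body)["inputs"][0]["name"])  # input_ids
```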
- Open a terminal.

- Sign in to AWS.

  ```sh
  aws sso login
  ```

- Create a Python virtual environment and install the required dependencies.

  ```sh
  python3 -m venv max-aws-deploy && source max-aws-deploy/bin/activate
  pip install boto3 transformers
  pip install torch
  ```
- Create a file called `client.py` and paste in the following code.

  caution: Make sure to update the `endpoint_name` variable with the name of your actual endpoint. If you didn't write down the `endpoint_name`, you can find it by opening your AWS Console and selecting CloudFormation, then clicking the Outputs tab.

  ```python
  # suppress extraneous logging
  import os
  os.environ["TRANSFORMERS_VERBOSITY"] = "critical"

  import json

  import boto3
  import numpy as np
  import transformers
  from botocore.config import Config

  config = Config(region_name="us-east-1")
  client = boto3.client("sagemaker-runtime", config=config)

  # NOTE: Paste your endpoint here
  endpoint_name = "YOUR-ENDPOINT-GOES-HERE"

  text = "The quick brown fox jumped over the lazy dog."
  tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
  inputs = tokenizer(text, padding="max_length", max_length=128, return_tensors="pt")

  # Convert tensor inputs to lists for the payload
  input_ids = inputs["input_ids"].tolist()[0]
  attention_mask = inputs["attention_mask"].tolist()[0]
  token_type_ids = inputs["token_type_ids"].tolist()[0]

  payload = {
      "inputs": [
          {
              "name": "input_ids",
              "shape": [1, 128],
              "datatype": "INT32",
              "data": input_ids,
          },
          {
              "name": "attention_mask",
              "shape": [1, 128],
              "datatype": "INT32",
              "data": attention_mask,
          },
          {
              "name": "token_type_ids",
              "shape": [1, 128],
              "datatype": "INT32",
              "data": token_type_ids,
          },
      ]
  }

  http_response = client.invoke_endpoint(
      EndpointName=endpoint_name,
      ContentType="application/octet-stream",
      Body=json.dumps(payload),
  )

  response = json.loads(http_response["Body"].read().decode("utf8"))
  outputs = response["outputs"]


  def softmax(logits):
      exp_logits = np.exp(logits - np.max(logits))
      return exp_logits / exp_logits.sum(axis=-1, keepdims=True)


  # Process the output
  for output in outputs:
      logits = np.array(output["data"]).reshape(output["shape"])
      print(f"Logits shape: {logits.shape}")
      if len(logits.shape) == 3:  # Shape [batch_size, sequence_length, num_classes]
          token_probabilities = softmax(logits)
          predicted_classes = np.argmax(token_probabilities, axis=-1)
          print(f"Predicted classes shape: {predicted_classes.shape}")
          print(f"Predicted class indices range: {np.min(predicted_classes)}, {np.max(predicted_classes)}")

          # Map predicted indices to tokens
          predicted_tokens = tokenizer.convert_ids_to_tokens(predicted_classes[0])

          # Pair each input token with its predicted token
          input_tokens = tokenizer.convert_ids_to_tokens(input_ids)
          token_pairs = list(zip(input_tokens, predicted_tokens))

          print("Predicted Token Pairs:")
          print("-" * 45)
          print("| {:<20} | {:<18} |".format("Input Token", "Predicted Token"))
          print("-" * 45)
          for input_token, predicted_token in token_pairs:
              if input_token != "[PAD]":  # Exclude padding tokens
                  print("| {:<20} | {:<18} |".format(input_token, predicted_token))
          print("-" * 45)
  ```

- Run the script.

  ```sh
  python client.py
  ```

  You should see output similar to the following.

  ```
  Logits shape: (1, 128, 30522)
  Predicted classes shape: (1, 128)
  Predicted class indices range: 1010, 13971
  Predicted Token Pairs:
  ---------------------------------------------
  | Input Token          | Predicted Token    |
  ---------------------------------------------
  | [CLS]                | .                  |
  | the                  | the                |
  | quick                | quick              |
  | brown                | brown              |
  | fox                  | fox                |
  | jumped               | jumped             |
  | over                 | over               |
  | the                  | the                |
  | lazy                 | lazy               |
  | dog                  | dog                |
  | .                    | .                  |
  | [SEP]                | .                  |
  ---------------------------------------------
  ```
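The post-processing in `client.py` is worth unpacking: the endpoint returns raw logits, softmax turns each token's logits into a probability distribution over the vocabulary, and argmax picks the most likely vocabulary index for each token. A toy sketch of that pipeline with made-up logits and a three-entry vocabulary (instead of BERT's 30,522):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum(axis=-1, keepdims=True)

# Toy logits: batch of 1, sequence of 2 tokens, vocabulary of 3 entries
logits = np.array([[[0.1, 2.0, 0.3],
                    [1.5, 0.2, 0.1]]])
probs = softmax(logits)
predicted = np.argmax(probs, axis=-1)

print(predicted)           # [[1 0]] -- most likely vocabulary index per token
print(probs.sum(axis=-1))  # each token's probabilities sum to 1
```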
Clean up
That's it! You've now deployed a model using MAX Engine, AWS CloudFormation, and Amazon SageMaker! To avoid incurring additional costs for AWS resources, we recommend you delete the resources you've built.
To delete tutorial resources:
- From the CloudFormation console, select Stacks.
- Select the stack that you created for this tutorial.
- Click Delete.
Next steps
In this tutorial, you've leveraged an AWS CloudFormation template to build out a complete AWS SageMaker deployment. This deployment used MAX Engine to handle inference requests for a BERT model. The deployment took a text input, analyzed each token in the input, and returned the model's prediction for each token.
We encourage you to use what you learned here to deploy other models, and extend this tutorial as needed to explore other MAX features.
Here are some other topics to explore next:
Deploy a model with Kubernetes and Helm
Learn how to deploy a model using MAX Engine and Kubernetes.
Modular pricing
Learn about the licensing and support options for developers and enterprises.