
Deploy a model with Amazon SageMaker and AWS CloudFormation

Dave Shevitz

The point of a trained model is to put it to use, to connect its inferencing power to the rest of your application and put its capabilities in the hands of your users.

To help you achieve that goal, we built MAX. MAX includes a state-of-the-art graph compiler and runtime library that executes models from PyTorch with incredible inference speed on a wide range of hardware.

In this tutorial, you'll explore firsthand how to combine MAX Engine with AWS SageMaker. You'll use MAX Engine to handle inference requests using a previously trained BERT model, and you'll use AWS SageMaker to deploy the model. You'll then test the deployment by sending an inference request to an AWS SageMaker endpoint.

About Amazon SageMaker

Amazon SageMaker is a fully managed service provided by Amazon Web Services (AWS) that enables you to build, train, and deploy machine learning models at scale. It simplifies the machine learning workflow by offering integrated tools for every step of the process, from data preparation and model building to training and deployment. SageMaker supports a variety of algorithms and frameworks, making it versatile for different use cases. Additionally, it provides features for model monitoring and automatic scaling, ensuring robust and efficient operations.

About AWS CloudFormation

AWS CloudFormation is a service from AWS that enables you to model, provision, and manage AWS and third-party resources by treating infrastructure as code. It allows you to create and update a collection of related AWS resources in a predictable and orderly fashion through templates. This approach simplifies the orchestration of complex environments, ensuring consistent configuration and deployment. CloudFormation also supports automated rollbacks and dependency management, enhancing reliability and ease of use.
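
To make the infrastructure-as-code idea concrete, here is an illustrative sketch only: a tiny CloudFormation template expressed as a Python dictionary and serialized to JSON. The tutorial's real template (linked below) defines the SageMaker resources for you; the bucket resource here is just a placeholder example.

```python
import json

# Illustrative only: a minimal CloudFormation template as infrastructure-as-code.
# The tutorial's actual template defines SageMaker resources, not this bucket.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Minimal example template",
    "Resources": {
        # Every resource has a logical name and a Type; properties are optional.
        "ExampleBucket": {
            "Type": "AWS::S3::Bucket",
        }
    },
}
print(json.dumps(template, indent=2))
```

Templates like this can be written in JSON or YAML; CloudFormation provisions every resource listed under `Resources` and rolls back automatically if any of them fails to create.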

Prerequisites

Before you get started with this tutorial, you should make sure you have the appropriate credentials to log into your AWS account. In addition, you need to have an Identity and Access Management (IAM) role and policy that allows you to create and deploy resources using AWS SageMaker.
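
For context, the SageMaker execution role is an IAM role that SageMaker assumes on your behalf. The CloudFormation template in this tutorial creates the role for you, so the following is illustrative only; the exact permissions your account requires may differ. The sketch shows the kind of trust policy such a role carries:

```python
import json

# Illustrative only: a trust policy allowing the SageMaker service to assume a
# role. The tutorial's CloudFormation template creates the real role for you.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}
print(json.dumps(trust_policy, indent=2))
```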

Build the AWS CloudFormation stack

Your first step is to use a previously created AWS CloudFormation template to define the various AWS resources you need. A set of resources built from CloudFormation is referred to as a stack.

  1. Sign in to the AWS Console.

  2. In a separate browser tab, open this link to create a stack using our example template.

  3. Check the I acknowledge that AWS CloudFormation might create IAM resources checkbox.

  4. Click Create stack.

    AWS builds out the resources defined in the CloudFormation template. This process takes up to 10 minutes to complete.

    When AWS finishes building the stack, it displays an event in the Events tab that says CREATE_COMPLETE.

  5. Click the Outputs tab and copy the EndpointName value.
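
You can also read stack outputs programmatically. The helper below pulls a named output from a DescribeStacks-style response; the stack name and endpoint value shown are assumptions for illustration. The commented-out boto3 lines show how you might run it against your own account.

```python
# A sketch of reading the EndpointName output programmatically. The stack
# name and endpoint value below are hypothetical examples.
def find_output(stack_description, key):
    """Return the value of a named output from a DescribeStacks stack entry."""
    for output in stack_description.get("Outputs", []):
        if output["OutputKey"] == key:
            return output["OutputValue"]
    raise KeyError(key)

# Example DescribeStacks-shaped data for illustration:
stack = {
    "StackName": "max-sagemaker-tutorial",
    "Outputs": [{"OutputKey": "EndpointName", "OutputValue": "max-bert-endpoint"}],
}
print(find_output(stack, "EndpointName"))  # -> max-bert-endpoint

# Against your account (uncomment, substituting your stack name):
# import boto3
# cf = boto3.client("cloudformation", region_name="us-east-1")
# stack = cf.describe_stacks(StackName="your-stack-name")["Stacks"][0]
# print(find_output(stack, "EndpointName"))
```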

Test the deployment endpoint

You have now created a deployment that connects a model using MAX Engine to AWS SageMaker. This deployment includes a number of AWS compute and network resources that AWS SageMaker creates automatically to handle inference requests. To test this deployment, you'll create a small Python application that sends an inference request to an AWS SageMaker endpoint, then processes and displays the response.

  1. Open a terminal.

  2. Sign in to AWS.

    aws sso login
  3. Create a Python virtual environment and install the required dependencies.

    python3 -m venv max-aws-deploy && source max-aws-deploy/bin/activate
    pip install boto3 transformers
    pip install torch
  4. Create a file called client.py and paste in the following code.

    If you didn't copy the EndpointName value earlier, you can find it by opening your AWS Console and selecting CloudFormation, then clicking the Outputs tab.

    # suppress extraneous logging
    import os
    os.environ["TRANSFORMERS_VERBOSITY"] = "critical"

    import json
    import boto3
    import transformers
    from botocore.config import Config
    import numpy as np

    config = Config(region_name="us-east-1")
    client = boto3.client("sagemaker-runtime", config=config)

    # NOTE: Paste your endpoint here
    endpoint_name = "YOUR-ENDPOINT-GOES-HERE"

    text = "The quick brown fox jumped over the lazy dog."

    tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
    inputs = tokenizer(text, padding="max_length", max_length=128, return_tensors="pt")

    # Convert tensor inputs to list for payload
    input_ids = inputs["input_ids"].tolist()[0]
    attention_mask = inputs["attention_mask"].tolist()[0]
    token_type_ids = inputs["token_type_ids"].tolist()[0]

    payload = {
        "inputs": [
            {
                "name": "input_ids",
                "shape": [1, 128],
                "datatype": "INT32",
                "data": input_ids,
            },
            {
                "name": "attention_mask",
                "shape": [1, 128],
                "datatype": "INT32",
                "data": attention_mask,
            },
            {
                "name": "token_type_ids",
                "shape": [1, 128],
                "datatype": "INT32",
                "data": token_type_ids,
            },
        ]
    }

    http_response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
    )
    response = json.loads(http_response["Body"].read().decode("utf8"))
    outputs = response["outputs"]

    def softmax(logits):
        exp_logits = np.exp(logits - np.max(logits))
        return exp_logits / exp_logits.sum(axis=-1, keepdims=True)

    # Process the output
    for output in outputs:
        logits = output['data']
        logits = np.array(logits).reshape(output['shape'])

        print(f"Logits shape: {logits.shape}")

        if len(logits.shape) == 3:  # Shape [batch_size, sequence_length, num_classes]
            token_probabilities = softmax(logits)
            predicted_classes = np.argmax(token_probabilities, axis=-1)

            print(f"Predicted classes shape: {predicted_classes.shape}")
            print(f"Predicted class indices range: {np.min(predicted_classes)}, {np.max(predicted_classes)}")

            # Map predicted indices to tokens
            predicted_tokens = tokenizer.convert_ids_to_tokens(predicted_classes[0])

            # Pair each input token with its predicted token
            input_tokens = tokenizer.convert_ids_to_tokens(input_ids)
            token_pairs = list(zip(input_tokens, predicted_tokens))

            print("Predicted Token Pairs:")
            print("-" * 45)
            print("| {:<20} | {:<18} |".format("Input Token", "Predicted Token"))
            print("-" * 45)
            for input_token, predicted_token in token_pairs:
                if input_token != '[PAD]':  # Exclude padding tokens
                    print("| {:<20} | {:<18} |".format(input_token, predicted_token))
            print("-" * 45)
  5. Run the script.

    python client.py

You should see output similar to the following.

Logits shape: (1, 128, 30522)
Predicted classes shape: (1, 128)
Predicted class indices range: 1010, 13971
Predicted Token Pairs:
---------------------------------------------
| Input Token | Predicted Token |
---------------------------------------------
| [CLS] | . |
| the | the |
| quick | quick |
| brown | brown |
| fox | fox |
| jumped | jumped |
| over | over |
| the | the |
| lazy | lazy |
| dog | dog |
| . | . |
| [SEP] | . |
---------------------------------------------

Clean up

That's it! You've now deployed a model using MAX Engine, AWS CloudFormation, and Amazon SageMaker! To avoid incurring additional costs for AWS resources, we recommend you delete the resources you've built.

To delete tutorial resources:

  1. From the CloudFormation console, select Stacks.
  2. Select the stack that you created for this tutorial.
  3. Click Delete.
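
If you prefer to clean up from code, the console steps above can also be sketched with boto3. The stack name below is a hypothetical example; substitute the name you chose when creating the stack.

```python
# A sketch of deleting the tutorial stack programmatically. The stack name
# "max-sagemaker-tutorial" is a hypothetical example.
def delete_stack_request(stack_name):
    """Build the parameters for CloudFormation's DeleteStack API call."""
    return {"StackName": stack_name}

def delete_tutorial_stack(stack_name, region="us-east-1"):
    import boto3  # imported here so the sketch loads without boto3 installed
    cf = boto3.client("cloudformation", region_name=region)
    cf.delete_stack(**delete_stack_request(stack_name))
    # Block until deletion finishes; raises if the delete fails.
    cf.get_waiter("stack_delete_complete").wait(StackName=stack_name)

if __name__ == "__main__":
    delete_tutorial_stack("max-sagemaker-tutorial")
```

Deleting the stack removes every resource CloudFormation created for it, including the SageMaker endpoint, so you stop incurring charges.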

Next steps

In this tutorial, you've leveraged an AWS CloudFormation template to build out a complete AWS SageMaker deployment. This deployment used MAX Engine to handle inference requests for a BERT model. The deployment took a text input, analyzed each token in the input, and returned the model's predicted token for each position.
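
The post-processing step the client performs, converting each position's logits into a predicted token index, can be illustrated with a tiny stdlib-only version of the same softmax-plus-argmax logic:

```python
import math

# Illustrative only: the softmax + argmax post-processing the client script
# performs with NumPy, shown for a single position's logits.
def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 3.0, 2.0]
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted)  # -> 1 (index of the highest-probability class)
```

In the real script, this index is then mapped back to a vocabulary token with `tokenizer.convert_ids_to_tokens`.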

We encourage you to use what you learned here to deploy other models, and extend this tutorial as needed to explore other MAX features.
