Deploy a PyTorch model from Hugging Face

Ehsan M. Kermani
Dave Shevitz

We designed MAX to simplify the entire AI development workflow, and that includes deploying PyTorch models with a high-performance serving endpoint. As we'll show you in this tutorial, deploying an endpoint with MAX is as simple as deploying a Docker container—you don't have to write any new code to use MAX.

Currently, the MAX container includes a REST API that supports large language models (LLMs) only, so that's what we'll deploy. Specifically, we'll deploy the Qwen2.5 model, but you can select a different PyTorch LLM from Hugging Face. (See our README for a list of model architectures we currently support.) We've also included instructions to deploy to the cloud provider of your choice: AWS, GCP, or Azure.

If you instead want to deploy a highly optimized LLM built with MAX, see Deploy Llama 3 with MAX Serve on GPU.

Deploy to a local endpoint

In this section, you'll download the MAX repository, then use MAX to serve the Qwen2.5 model.

Prerequisites

Before following the steps in this topic, make sure you've downloaded and installed the Hugging Face CLI.

Use the huggingface-cli login command to authenticate.

huggingface-cli login

This command requests an authentication token. If you don't have one already, Hugging Face's User access tokens topic explains how to create one.

Clone the repository

To start, clone the max repository.

git clone git@github.com:modularml/max.git

After you clone the repository, navigate to the max/pipelines/python directory.

cd max/pipelines/python/

Serve the endpoint

Let's get our endpoint up and running!

We'll use the magic CLI to create a virtual environment and install the required packages.

If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:

curl -ssL https://magic.modular.com/ | bash

Then run the source command that's printed in your terminal.

The command we'll run is magic run serve, which supports a huggingface-repo-id parameter. This parameter allows you to specify an LLM hosted on Hugging Face.

magic run serve --huggingface-repo-id=Qwen/Qwen2.5-1.5B-Instruct

This command downloads the model and sets up MAX Serve to host a local endpoint.

The endpoint is ready when you see output similar to the following in your terminal:

uvicorn running on http://0.0.0.0:8000
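
If you'd rather not watch the terminal, you can poll until the server answers. The following is a minimal sketch, not part of MAX: `wait_until` is a hypothetical helper name, and the suggested probe assumes that any HTTP response on port 8000 means the server has started.

```shell
# wait_until TIMEOUT_S INTERVAL_S CMD [ARGS...]
# Re-runs CMD until it succeeds or roughly TIMEOUT_S seconds of sleeping
# have passed (time spent inside CMD itself is not counted).
# Returns 0 on success and 1 on timeout.
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  return 1
}

# Hypothetical usage: poll the local endpoint for up to 10 minutes.
# curl exits non-zero while the connection is refused and exits 0 once
# the server returns any HTTP response.
# wait_until 600 5 curl -s -o /dev/null http://localhost:8000/ && echo "ready"
```

Because the probe is an ordinary command, you can swap in a stricter check later without changing the loop.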

Test the endpoint

Let's test the endpoint! In a new terminal, run the following curl command:

curl -N http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"stream": true,
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of Mongolia?"}
]
}' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'

You should see a response in your command line similar to the following:

The capital city of Mongolia is Ulaanbaatar.

That's it! In just a few steps, you've connected a Hugging Face LLM to an endpoint so it can receive and respond to inference requests.
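
The text-processing pipeline at the end of the curl command is easier to follow when it's wrapped in a function. The sketch below reuses the same stages; `extract_content` is a hypothetical name, and the sample input in the usage comment assumes OpenAI-style streaming chunks in which each chunk carries a `"content":"..."` field.

```shell
# extract_content: read a streamed chat response on stdin and print only
# the generated text. Same stages as the pipeline in the curl command.
extract_content() {
  grep -o '"content":"[^"]*"' |   # keep only the "content":"..." fragments
    sed 's/"content":"//g' |      # drop the key and the opening quote
    sed 's/"//g' |                # drop the closing quote
    tr -d '\n' |                  # join the fragments into a single line
    sed 's/\\n/\n/g'              # turn literal \n sequences into line breaks
}

# Hypothetical usage with two hand-written chunks:
# printf '%s\n' \
#   'data: {"choices":[{"delta":{"content":"The capital"}}]}' \
#   'data: {"choices":[{"delta":{"content":" is Ulaanbaatar."}}]}' \
#   | extract_content
```

Keep in mind that the `[^"]*` pattern stops at the first double quote, so a reply that contains an escaped quote (`\"`) is cut short. For anything beyond a quick smoke test, a real JSON parser is the safer choice.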

Deploy to a cloud provider

In the first part of this tutorial, you used MAX to deploy a Hugging Face model to a local endpoint. In this next part, you'll use a prebuilt Docker container to deploy a model to a cloud provider.

Prerequisites

This tutorial shows you how to deploy a model to one of three cloud providers:

  • AWS
  • GCP
  • Azure

To complete this tutorial, you should:

  • Be familiar with the basics of at least one of these cloud providers
  • Have the appropriate CLI tools installed for your provider (for AWS, the AWS CLI).
  • Have a project set up that you can use to deploy the Docker container.
  • Verify that you have access to the Qwen2.5 model.
  • Enable any billing permissions so you can install the appropriate APIs and launch the designated GPU instances.

Initialize CLI tools

If you haven't already done so, make sure that you've initialized your CLI tools and logged in to your account.

Configure the AWS CLI:

aws configure

Log in to your AWS account:

aws sso login

Check the credentials via cat ~/.aws/credentials to make sure they are set up correctly. You can also provide the credentials as environment variables:

export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"
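
Before moving on, it can help to confirm that both variables are actually set in the current shell. Here's a small sketch; `require_env` is a hypothetical helper, not an AWS CLI feature.

```shell
# require_env NAME...
# Returns 1 and names the first variable that is unset or empty.
require_env() {
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
  return 0
}

# Hypothetical usage: check the variables, then ask AWS who you are.
# require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY && aws sts get-caller-identity
```

The `aws sts get-caller-identity` call in the usage comment prints the account and identity the CLI resolves to, which is a quick way to catch a wrong profile or expired credentials.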

Create your deployment

In this section, you'll go through the steps needed to create a deployment. These steps vary depending on the cloud provider you prefer to use.

For AWS, we'll create an AWS CloudFormation template to define and configure our deployment.

  1. Create a working directory for the Infrastructure as Code files.

    mkdir aws

    Then, navigate to that directory.

    cd aws
  2. Set the AWS region. In this case, we'll use us-east-1, but you can use whatever region you prefer.

    export REGION="us-east-1"
  3. Create an AWS CloudFormation file, max-serve-aws.yaml.

    touch max-serve-aws.yaml

    Then, using the editor of your choice, paste the following:

  4. Create the stack.

    aws cloudformation create-stack --stack-name max-serve-stack \
    --template-body file://max-serve-aws.yaml \
    --parameters ParameterKey=InstanceType,ParameterValue=p4d.24xlarge \
    ParameterKey=HuggingFaceHubToken,ParameterValue=<YOUR_HUGGING_FACE_HUB_TOKEN> \
    ParameterKey=HuggingFaceRepoId,ParameterValue=Qwen/Qwen2.5-1.5b-instruct \
    --capabilities CAPABILITY_IAM \
    --region $REGION

    Note that you must replace <YOUR_HUGGING_FACE_HUB_TOKEN> with your actual token.

    In addition, this command defines the model that we want to deploy. For this tutorial, we'll use the Qwen2.5 model.

    This deployment can take a few minutes to complete. Track the status of the deployment by running the following command:

    aws cloudformation describe-stacks --stack-name max-serve-stack \
    --region $REGION --query 'Stacks[0].StackStatus' --output text

    When the CloudFormation stack is deployed, you should see a status of CREATE_COMPLETE. Type q to exit this prompt in your CLI.
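
If you check the status from a script, it helps to tell "still working" apart from "failed". The sketch below maps the status string to a simple verdict; `classify_stack_status` is a hypothetical helper, and it only covers the statuses you'd expect while creating a new stack.

```shell
# classify_stack_status STATUS
# Prints done, pending, failed, or unknown for a stack-creation status.
classify_stack_status() {
  case "$1" in
    CREATE_COMPLETE)          echo "done" ;;
    CREATE_IN_PROGRESS)       echo "pending" ;;
    # A failed creation rolls back, so any ROLLBACK_* status means failure.
    CREATE_FAILED|ROLLBACK_*) echo "failed" ;;
    *)                        echo "unknown" ;;
  esac
}

# Hypothetical usage:
# status=$(aws cloudformation describe-stacks --stack-name max-serve-stack \
#   --region "$REGION" --query 'Stacks[0].StackStatus' --output text)
# classify_stack_status "$status"
```

If you only need to block until the stack is ready, the AWS CLI also provides `aws cloudformation wait stack-create-complete --stack-name max-serve-stack --region $REGION`, which returns a non-zero exit code if creation fails.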

Retrieve instance information

At this point, you should have confirmation that your instance is up and running! Let's get some of the information we need to test the deployment.

Let's get the instance ID and public IP address and assign them to environment variables:

INSTANCE_ID=$(aws cloudformation describe-stacks --stack-name max-serve-stack --query "Stacks[0].Outputs[?OutputKey=='InstanceId'].OutputValue" --output text --region $REGION)
PUBLIC_IP=$(aws ec2 describe-instances --instance-ids $INSTANCE_ID --query 'Reservations[0].Instances[0].PublicIpAddress' --output text --region $REGION)
echo "Instance ID: $INSTANCE_ID"
echo "Public IP: $PUBLIC_IP"
aws ec2 wait instance-running --instance-ids $INSTANCE_ID --region $REGION
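
With `--output text`, the AWS CLI prints `None` when a queried field is empty, so `PUBLIC_IP` can end up holding the word `None` instead of an address. The following sketch guards against that; `is_usable_ip` is a hypothetical helper and performs only a loose IPv4 shape check.

```shell
# is_usable_ip VALUE
# Returns 0 if VALUE looks like an IPv4 address, 1 if it is empty,
# the literal string None, or anything else.
is_usable_ip() {
  case "$1" in
    ""|None) return 1 ;;
  esac
  # Four dot-separated groups of 1-3 digits; octet ranges are not checked.
  printf '%s\n' "$1" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
}

# Hypothetical usage:
# is_usable_ip "$PUBLIC_IP" || echo "No public IP yet; re-run the describe-instances command." >&2
```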

Test the endpoint

We've confirmed that the instance is available. However, it can still take a few minutes to pull the MAX Docker image and start it. In this section, you'll learn how to check whether the service is ready to receive inference requests, then run a curl command to send a request to the container and receive a response.

To track when the instance is ready, you can use the AWS CloudWatch console to view the log group /aws/ec2/max-serve-stack-logs and find the logs for instance-logs. Alternatively, you can use the following bash script:
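
As a rough sketch of what a command-line check could look like (an illustration only, assuming AWS CLI v2, whose `aws logs tail` subcommand prints recent events from a log group), the helper below scans log text for the startup message. `log_shows_ready` is a hypothetical name.

```shell
# log_shows_ready: read log text on stdin and return 0 if it contains
# the server's startup message (matched case-insensitively).
log_shows_ready() {
  grep -qi 'uvicorn running on'
}

# Hypothetical usage: re-check the last 15 minutes of logs every 15 seconds.
# until aws logs tail /aws/ec2/max-serve-stack-logs --since 15m \
#     --region "$REGION" | log_shows_ready; do
#   sleep 15
# done
# echo "MAX Serve is ready"
```

The loop in the usage comment has no timeout, so interrupt it with Ctrl+C if the message never appears.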

The instance is ready when you can see a log entry similar to the following:

Uvicorn running on http://0.0.0.0:8000

After you see this log entry, you can test the endpoint by running the following curl command:

curl -N http://$PUBLIC_IP/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5b-instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of Mongolia?"}
]
}' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'

You should see a response in your command line similar to the following:

The capital city of Mongolia is Ulaanbaatar.
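
To try other prompts without hand-editing JSON, you can generate the request body. This is a minimal sketch: `json_escape` and `build_chat_payload` are hypothetical helpers, and the escaping covers only backslashes and double quotes, not newlines or other control characters.

```shell
# json_escape TEXT: escape backslashes first, then double quotes.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# build_chat_payload MODEL QUESTION
# Prints a chat-completions request body with the same system message
# used in the curl command.
build_chat_payload() {
  model=$(json_escape "$1")
  question=$(json_escape "$2")
  printf '{"model": "%s", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "%s"}]}' "$model" "$question"
}

# Hypothetical usage:
# curl -s http://$PUBLIC_IP/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$(build_chat_payload "Qwen/Qwen2.5-1.5b-instruct" "What is the capital of Mongolia?")"
```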

Delete the cloud resources

Take a few minutes to explore your deployment. When you're finished, be sure to delete the resources created in this tutorial so you don't incur any unnecessary charges.

  1. Delete the stack.

    aws cloudformation delete-stack --stack-name max-serve-stack --region $REGION
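  2. Verify that the stack was deleted. While deletion is in progress, the following command returns DELETE_IN_PROGRESS. After the stack is gone, the command returns an error stating that the stack does not exist.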

    aws cloudformation describe-stacks --stack-name max-serve-stack \
    --region $REGION --query 'Stacks[0].StackStatus' --output text
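
Because a successful deletion shows up as an error from `describe-stacks`, a script needs to treat that error as success. The sketch below does so; `stack_is_gone` is a hypothetical helper, and the matched text assumes the CLI's usual "does not exist" wording.

```shell
# stack_is_gone OUTPUT
# Returns 0 if OUTPUT (stdout and stderr of describe-stacks) indicates
# that the stack no longer exists, 1 otherwise.
stack_is_gone() {
  case "$1" in
    *"does not exist"*|DELETE_COMPLETE) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical usage:
# out=$(aws cloudformation describe-stacks --stack-name max-serve-stack \
#   --region "$REGION" --query 'Stacks[0].StackStatus' --output text 2>&1)
# stack_is_gone "$out" && echo "Stack deleted"
```

Alternatively, `aws cloudformation wait stack-delete-complete --stack-name max-serve-stack --region $REGION` blocks until the deletion finishes.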

Next steps

In this tutorial, you've deployed a Hugging Face PyTorch model to the cloud using a MAX Docker container.

Keep in mind that this is just a preview of MAX Serve for PyTorch models and it's currently compatible with LLMs only. We're working on support for more models and more model optimizations with the MAX graph compiler.

To stay up to date with new releases, sign up for our newsletter and join our community. And if you're interested in becoming a design partner to get early access and give us feedback, please contact us.