
Generate image descriptions with Llama 3.2 Vision
MAX (Modular Accelerated Xecution) now supports multimodal models, simplifying the deployment of AI systems that handle both text and images. You can now serve models like Llama 3.2 11B Vision Instruct, which excels at tasks such as image captioning and visual question answering. This guide walks you through installing the necessary tools, configuring access, and serving the model with MAX.
Install max-pipelines
We'll use the max-pipelines CLI tool to create a local endpoint.
- If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:

  curl -ssL https://magic.modular.com/ | bash

  Then run the source command that's printed in your terminal.

- Install max-pipelines:

  magic global install max-pipelines
Configure Hugging Face access
To download and use Llama 3.2 11B Vision Instruct from Hugging Face, you must have a Hugging Face account, a Hugging Face user access token, and access to the Llama 3.2 11B Vision Instruct Hugging Face gated repository.
To create a Hugging Face user access token, see Access Tokens. Within your local environment, save your access token as an environment variable.
export HF_TOKEN="hf_..."
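If you want to confirm that your token and gated-repository access work before downloading any weights, here is an optional sketch using the huggingface_hub Python package (installing it is an assumption for this check, not a tutorial requirement):

# Optional sanity check; assumes `pip install huggingface_hub`.
from huggingface_hub import whoami, model_info

# whoami() reads the HF_TOKEN environment variable and fails if the token is invalid.
print(whoami()["name"])

# model_info() succeeds only if your account has been granted access to the gated repository.
print(model_info("meta-llama/Llama-3.2-11B-Vision-Instruct").id)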
Generate a sample description
You can generate an image description using the max-pipelines generate command. Downloading the Llama 3.2 11B Vision Instruct model weights takes some time.
max-pipelines generate \
--model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
--prompt "<|image|><|begin_of_text|>What is in this image?" \
--image_url="https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg" \
--max-new-tokens 100 \
--max-batch-size 1 \
--max-length 108172
When using the max-pipelines CLI tool with multimodal input, you must provide both a --prompt and an --image_url. Additionally, the prompt must follow a valid format for the model you're using. For Llama 3.2 11B Vision Instruct, include the <|image|> tag in the prompt whenever the input includes an image to reason about. For more information about Llama 3.2 Vision prompt templates, see Vision Model Inputs and Outputs.
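As a minimal illustration of that prompt format, you could assemble the string passed to --prompt like this (the helper name is hypothetical and used only for this sketch):

# Hypothetical helper: prepends the tags Llama 3.2 Vision expects when an image accompanies the text.
def build_vision_prompt(question: str) -> str:
    return f"<|image|><|begin_of_text|>{question}"

print(build_vision_prompt("What is in this image?"))
# -> <|image|><|begin_of_text|>What is in this image?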
Serve the Llama 3.2 Vision model
You can alternatively serve the Llama 3.2 Vision model and make multiple requests to a local endpoint. If you already tested the model with the max-pipelines generate command, you don't have to wait for the model to download again.
Serve the model with the max-pipelines serve command:
max-pipelines serve \
--model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
--max-length 108172 \
--max-batch-size 1
The endpoint is ready when you see this message printed in your terminal:
Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
Test the endpoint
After the server is running, you can test it by opening a new terminal window and sending a curl request.
The following request includes an image URL and a question to answer about the provided image:
curl -N http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
}
}
]
}
],
"max_tokens": 300
}' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
This sends an image along with a text prompt to the model, and you should receive a response describing the image. You can test the endpoint with any local base64-encoded image or any image URL.
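Because the endpoint follows the OpenAI chat completions API, you can also call it from Python. The sketch below sends a local image as a base64 data URL using the openai client package; installing that package, the rabbit.jpg file path, and the placeholder API key are assumptions for this example rather than tutorial requirements.

# Minimal sketch: send a local, base64-encoded image to the local MAX endpoint.
# Assumes `pip install openai` and a local file named rabbit.jpg (hypothetical path).
import base64
from openai import OpenAI

# The local server does not validate the key, so a placeholder value is assumed here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("rabbit.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)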
Next steps
Now that you have successfully deployed Llama 3.2 Vision, you can:
- Experiment with different images and prompts
- Explore deployment configurations and additional features, such as function calling, prefix caching, and structured output
- Deploy the model to a containerized cloud environment for scalable serving