
Generate image descriptions with Llama 3.2 Vision

Judy Heflin

MAX (Modular Accelerated Xecution) now supports multimodal models, simplifying the deployment of AI systems that handle both text and images. You can now serve models like Llama 3.2 11B Vision Instruct, which excels at tasks such as image captioning and visual question answering. This guide walks you through installing the necessary tools, configuring access, and serving the model with MAX.

Set up your environment

Create a Python project to install our APIs and CLI tools.

  1. Create a project folder:
    mkdir modular && cd modular
  2. Create and activate a virtual environment:
    python3 -m venv .venv/modular \
    && source .venv/modular/bin/activate
  3. Install the modular Python package (an optional check to verify the install follows this list):
    pip install modular \
    --index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://dl.modular.com/public/nightly/python/simple/
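
With the environment set up, you can optionally confirm that the modular package resolved from the nightly index before moving on. The following is a minimal sketch that uses only the Python standard library; the exact version string it prints depends on which nightly you installed.

# check_install.py -- quick sanity check that the modular package is installed
from importlib.metadata import PackageNotFoundError, version

try:
    # "modular" is the distribution name used in the pip install step above
    print("modular", version("modular"))
except PackageNotFoundError:
    print("modular is not installed in this environment")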

Configure Hugging Face access

To download and use Llama 3.2 11B Vision Instruct from Hugging Face, you must have a Hugging Face account, a Hugging Face user access token, and access to the Llama 3.2 11B Vision Instruct Hugging Face gated repository.

To create a Hugging Face user access token, see Access Tokens. Then, within your local environment, save the token as an environment variable:

export HF_TOKEN="hf_..."
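
Before downloading any weights, you can optionally confirm that the token works and that your access request for the gated repository has been approved. This is a sketch that assumes the huggingface_hub Python package is available in your environment; it is not installed by the steps above, so add it separately if you want to run the check.

# hf_access_check.py -- optional check of the token and gated-repo access
import os

from huggingface_hub import HfApi
from huggingface_hub.utils import GatedRepoError, HfHubHTTPError

REPO_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"

api = HfApi(token=os.environ["HF_TOKEN"])
print("Authenticated as:", api.whoami()["name"])

try:
    # model_info on a gated repo fails unless access has been granted
    api.model_info(REPO_ID)
    print("Access to", REPO_ID, "is granted")
except GatedRepoError:
    print("Token is valid, but access to the gated repo has not been granted yet")
except HfHubHTTPError as err:
    print("Could not query the repo:", err)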

Generate a sample description

You can generate an image description using the max generate command. Downloading the Llama 3.2 11B Vision Instruct model weights takes some time.

max generate \
--model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
--prompt "<|image|><|begin_of_text|>What is in this image?" \
--image_url="https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg" \
--max-new-tokens 100 \
--max-batch-size 1 \
--max-length 108172

When using the max CLI tool with multimodal input, you must provide both a --prompt and an --image_url. The prompt must also follow the format the model expects. For Llama 3.2 11B Vision Instruct, include the <|image|> tag in the prompt whenever the input contains an image to reason about. For more information about Llama 3.2 Vision prompt templates, see Vision Model Inputs and Outputs.
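
If you prefer to build a correctly formatted prompt programmatically instead of writing the special tokens by hand, the model's chat template on Hugging Face is one way to do it. The sketch below is optional and not part of the MAX workflow; it assumes the transformers library is installed and that your account has access to the gated repository.

# build_prompt.py -- render the instruct model's chat template (optional)
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},  # placeholder for the image input
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]

# Produces a prompt string that includes the <|image|> tag along with the
# begin-of-text and header tokens the instruct model expects.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
print(prompt)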

Serve the Llama 3.2 Vision model

You can alternatively serve the Llama 3.2 Vision model and make multiple requests to a local endpoint. If you already tested the model with the max generate command, you do not have to wait for the model to download again.

Serve the model with the max serve command:

max serve \
--model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
--max-length 108172 \
--max-batch-size 1

The endpoint is ready when you see this message printed in your terminal:

Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
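
If you script around max serve, you can wait for the endpoint to come up before sending requests. The sketch below simply polls the port from the "Server ready" message until it accepts connections; the 60-second timeout is an arbitrary choice.

# wait_for_server.py -- poll localhost:8000 until the MAX server accepts connections
import socket
import time

HOST, PORT = "localhost", 8000  # from the "Server ready on http://0.0.0.0:8000" message
DEADLINE = time.monotonic() + 60  # arbitrary 60-second timeout

while True:
    try:
        with socket.create_connection((HOST, PORT), timeout=2):
            print(f"Server is accepting connections on {HOST}:{PORT}")
            break
    except OSError:
        if time.monotonic() > DEADLINE:
            raise SystemExit("Timed out waiting for the server to start")
        time.sleep(1)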

Test the endpoint

After the server is running, you can test it by opening a new terminal window and sending a curl request.

The following request includes an image URL and a question to answer about the provided image:

curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'

This sends an image along with a text prompt to the model, and you should receive a response describing the image. You can test the endpoint with any local base64-encoded image or any image URL.
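
Because the server exposes an OpenAI-compatible /v1/chat/completions route, as the curl request above shows, you can also call it from Python instead of parsing the stream with grep and sed. The sketch below assumes the openai client package is installed separately; the api_key value is a placeholder, since the local server is not expected to validate it.

# query_vision_endpoint.py -- call the local MAX endpoint with the openai client
from openai import OpenAI

# The api_key is a placeholder; only the base_url matters for the local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

image_url = (
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/"
    "0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
)

# To send a local file instead, base64-encode it into a data URL, for example:
#   import base64, pathlib
#   data = base64.b64encode(pathlib.Path("photo.jpg").read_bytes()).decode()
#   image_url = f"data:image/jpeg;base64,{data}"

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)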

Next steps

Now that you have successfully deployed Llama 3.2 Vision, you can:

  • Experiment with different images and prompts
  • Explore deployment configurations and additional features, such as function calling, prefix caching, and structured output
  • Deploy the model to a containerized cloud environment for scalable serving
