
Image to text

Multimodal large language models are capable of processing images and text together in a single request. They can describe visual content, answer questions about images, and support tasks such as image captioning, document analysis, chart interpretation, optical character recognition (OCR), and content moderation.

Endpoint

You can interact with a multimodal LLM through the v1/chat/completions endpoint by including image inputs alongside text in the request. This allows you to provide an image URL or base64-encoded image as part of the conversation, enabling use cases such as image captioning, asking questions about a photo, requesting a chart summary, or combining text prompts with visual context.

URL input

Within the v1/chat/completions request body, the "messages" array accepts inline image URLs. For example:

"messages": [
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "What is in this image?"
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "https://example.com/path/to/image.jpg"
        }
      }
    ]
  }
]
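
For example, assuming a server like the one started in the quickstart below is running locally, a complete request with curl could look like the following sketch (the model name and image URL are placeholders to replace with your own):

# Sketch: send a text prompt plus an image URL to the chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-27b-it",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "https://example.com/path/to/image.jpg"}}
        ]
      }
    ],
    "max_tokens": 300
  }'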

Local file input

To use local images, you must configure allowed directories before starting the server. This prevents unauthorized file access by restricting which paths the server can read from.

Set the MAX_SERVE_ALLOWED_IMAGE_ROOTS environment variable to a JSON-formatted list of allowed directories:

export MAX_SERVE_ALLOWED_IMAGE_ROOTS='["/path/to/images"]'
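
The list can contain more than one directory. For example (the paths shown here are hypothetical):

export MAX_SERVE_ALLOWED_IMAGE_ROOTS='["/path/to/images", "/path/to/screenshots"]'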

Then reference files with an absolute path:

"messages": [
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "What is in this image?"
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "file:///path/to/images/image.jpg"
        }
      }
    ]
  }
]

The file path must be within a directory listed in MAX_SERVE_ALLOWED_IMAGE_ROOTS. If no allowed roots are configured, all file:/// requests return a 400 error.

The maximum file size is 20 MiB by default, which you can adjust by setting the MAX_SERVE_MAX_LOCAL_IMAGE_BYTES environment variable to a value in bytes.
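
For example, to raise the limit to 50 MiB (50 × 1024 × 1024 = 52,428,800 bytes):

export MAX_SERVE_MAX_LOCAL_IMAGE_BYTES=52428800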

Quickstart

In this quickstart, you'll learn how to set up and run Gemma 3 27B Instruct, a model that excels at tasks such as image captioning and visual question answering.

Set up your environment

Create a Python project to install our APIs and CLI tools:

  1. If you don't have it, install pixi:
    curl -fsSL https://pixi.sh/install.sh | sh

    Then restart your terminal for the changes to take effect.

  2. Create a project:
    pixi init vision-quickstart \
      -c https://conda.modular.com/max-nightly/ -c conda-forge \
      && cd vision-quickstart
  3. Install the modular conda package:
    pixi add modular
  4. Start the virtual environment:
    pixi shell

Serve your model

Agree to the Gemma 3 license and make your Hugging Face access token available in your environment:

export HF_TOKEN="hf_..."

Then, use the max serve command to start a local model server with the Gemma 3 27B Instruct model:

max serve \
  --model google/gemma-3-27b-it

This starts a server running the google/gemma-3-27b-it multimodal model at http://localhost:8000/v1/chat/completions, an OpenAI-compatible endpoint.

While this example uses the Gemma 3 27B Instruct model, you can replace it with any of the vision models listed in our model repository.

The endpoint is ready when you see this message printed in your terminal:

Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)

For a complete list of max CLI commands and options, refer to the MAX CLI reference.

Interact with your model

Open a new terminal window, navigate to your project directory, and activate your virtual environment.

MAX supports OpenAI's REST APIs, so you can interact with the model using either the OpenAI Python SDK or curl:

You can use OpenAI's Python client to interact with the vision model. First, install the OpenAI Python package:

pixi add openai

Then, create a client and make a request to the model:

generate-image-description.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="google/gemma-3-27b-it",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
                    }
                }
            ]
        }
    ],
    max_tokens=300
)

print(response.choices[0].message.content)

In this example, you're using the OpenAI Python client to interact with the MAX endpoint running on localhost port 8000. The client object is initialized with the base URL http://localhost:8000/v1, and the API key is ignored, so any placeholder value such as "EMPTY" works.

When you run this code, the model should respond with information about the image:

python generate-image-description.py
Here's a breakdown of what's in the image:

*   **Peter Rabbit:** The main focus is a realistic-looking depiction of Peter
Rabbit, the character from Beatrix Potter's stories...
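
If you prefer curl over the Python client, you can send the same request directly to the endpoint. The following sketch mirrors the Python example above (same model, prompt, and image URL):

# Sketch: the curl equivalent of generate-image-description.py
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-27b-it",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"}}
        ]
      }
    ],
    "max_tokens": 300
  }'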

For complete details on all available API endpoints and options, see the MAX Serve API documentation.

Next steps

Now that you can analyze images, try adding structured output to get consistent, formatted responses. You can also explore other endpoints and deployment options.