Image and video to text
Multimodal large language models are capable of processing images, video, and text together in a single request. They can describe visual content, answer questions about images or video, and support tasks such as image captioning, document analysis, chart interpretation, optical character recognition (OCR), video summarization, and content moderation.
Explore our supported models to select the best model for your use case.
Endpoint
You can interact with a multimodal LLM through the
v1/chat/completions endpoint
by including image or video inputs alongside text in the request. This allows
you to provide an image URL, video URL, or base64-encoded data as part of the
conversation.
URL input
Within the v1/chat/completions request body, the "messages" array accepts
inline image or video URLs.
- Image input
- Video input
Use image_url to pass an image:
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/path/to/image.jpg"
}
}
]
}
]
Use video_url to pass a video:
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is happening in this video?"
},
{
"type": "video_url",
"video_url": {
"url": "https://example.com/path/to/video.mp4"
}
}
]
}
]
Both image_url and video_url also accept base64-encoded data URIs
(such as data:image/jpeg;base64,... or data:video/mp4;base64,...).
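For example, you can build a base64 data URI in Python before placing it in the request. This is a minimal sketch: the bytes below are placeholders, and in practice you would read them from a real file with open("image.jpg", "rb").read().

```python
import base64

# Placeholder bytes standing in for real JPEG data read from a file.
image_bytes = b"\xff\xd8\xff\xe0fake-jpeg-bytes"

# Base64-encode the bytes and wrap them in a data URI.
encoded = base64.b64encode(image_bytes).decode("utf-8")
data_uri = f"data:image/jpeg;base64,{encoded}"

# The data URI goes wherever a plain URL would, e.g.:
# {"type": "image_url", "image_url": {"url": data_uri}}
print(data_uri[:23])  # data:image/jpeg;base64,
```

The same pattern applies to video, with a data:video/mp4;base64, prefix.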
Local file input
To use local images or videos, you must configure allowed directories before starting the server. This prevents unauthorized file access by restricting which paths the server can read from.
Set the MAX_SERVE_ALLOWED_IMAGE_ROOTS environment variable to a JSON-formatted
list of allowed directories:
export MAX_SERVE_ALLOWED_IMAGE_ROOTS='["/path/to/files"]'
Then reference files with an absolute file:// path:
- Image input
- Video input
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "file:///path/to/files/image.jpg"
}
}
]
}
]"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is happening in this video?"
},
{
"type": "video_url",
"video_url": {
"url": "file:///path/to/files/video.mp4"
}
}
]
}
]
The file path must be within a directory listed in
MAX_SERVE_ALLOWED_IMAGE_ROOTS. If no allowed roots are configured, all
file:/// requests return a 400 error.
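The effect of the allowed-roots restriction can be illustrated with a short Python sketch. This is not MAX's actual implementation, only the idea: resolve the requested path and require it to fall under one of the configured roots.

```python
import json
import os

# Mirrors the JSON list format of MAX_SERVE_ALLOWED_IMAGE_ROOTS.
allowed_roots = json.loads('["/path/to/files"]')

def is_allowed(path: str) -> bool:
    """Return True if `path` resolves inside one of the allowed roots."""
    real = os.path.realpath(path)
    for root in allowed_roots:
        root_real = os.path.realpath(root)
        # The path is allowed only if the root is a prefix of it.
        if os.path.commonpath([real, root_real]) == root_real:
            return True
    return False

print(is_allowed("/path/to/files/image.jpg"))  # True: inside an allowed root
print(is_allowed("/etc/passwd"))               # False: would be rejected
```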
The maximum file size is 20 MiB by default, which you can adjust by setting the
MAX_SERVE_MAX_LOCAL_IMAGE_BYTES environment variable to a value in bytes.
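For example, to raise the limit to 50 MiB (a value chosen here purely for illustration), you can compute the byte count in the shell:

```shell
# 50 MiB expressed in bytes: 50 * 1024 * 1024 = 52428800
export MAX_SERVE_MAX_LOCAL_IMAGE_BYTES=$((50 * 1024 * 1024))
echo "$MAX_SERVE_MAX_LOCAL_IMAGE_BYTES"
```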
Quickstart
In this quickstart, learn how to set up and run Gemma 3 27B Instruct, which excels at tasks such as image captioning, visual question answering, and video summarization.
System requirements:
- Mac
- Linux
- WSL
- GPU
Set up your environment
Create a Python project to install our APIs and CLI tools:
- pixi
- uv
- If you don't have it, install pixi:
curl -fsSL https://pixi.sh/install.sh | sh
Then restart your terminal for the changes to take effect.
- Create a project:
pixi init vision-quickstart \
  -c https://conda.modular.com/max-nightly/ -c conda-forge \
  && cd vision-quickstart
- Install modular (nightly):
pixi add modular
- Start the virtual environment:
pixi shell
- If you don't have it, install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
Then restart your terminal to make uv accessible.
- Create a project:
uv init vision-quickstart && cd vision-quickstart
- Create and start a virtual environment:
uv venv && source .venv/bin/activate
- Install modular (nightly):
uv pip install modular \
  --index https://whl.modular.com/nightly/simple/ \
  --prerelease allow
Serve your model
Agree to the Gemma license and make your Hugging Face access token available in your environment:
export HF_TOKEN="hf_..."
Then, use the max serve command to start a
local model server with the Gemma 3 27B Instruct model:
max serve \
  --model google/gemma-3-27b-it
This will create a server running the google/gemma-3-27b-it
multimodal model on http://localhost:8000/v1/chat/completions, an
OpenAI-compatible endpoint.
While this example uses the Gemma 3 27B Instruct model, you can replace it with any image-to-text or video-to-text model listed in our supported models.
The endpoint is ready when you see this message printed in your terminal:
Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
For a complete list of max CLI commands and options, refer to the
MAX CLI reference.
Describe an image
Open a new terminal window, navigate to your project directory, and activate your virtual environment.
MAX supports OpenAI's REST APIs and you can interact with the model using either the OpenAI Python SDK or curl:
- Python
- curl
You can use OpenAI's Python client to interact with the vision model. First, install the OpenAI Python package:
- pixi
- uv
- pip
- conda
pixi add openai
uv add openai
pip install openai
conda install openai
Then, create a client and make a request to the model:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="google/gemma-4-31B-it",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
}
}
]
}
],
max_tokens=300
)
print(response.choices[0].message.content)
In this example, you're using the OpenAI Python client to interact with the MAX
endpoint running on localhost port 8000. The client object is initialized with
the base URL http://localhost:8000/v1, and the API key is ignored.
When you run this code, the model should respond with information about the image:
python generate-image-description.py
Here's a breakdown of what's in the image:
* **Peter Rabbit:** The main focus is a realistic-looking depiction of Peter
Rabbit, the character from Beatrix Potter's stories...
You can send requests to the local endpoint using curl.
The following request includes an image URL and a question to answer about the
provided image:
curl -N http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-4-31B-it",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
}
}
]
}
],
"max_tokens": 300
}' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
This sends the image URL along with a text prompt to the model. You should receive a response similar to this:
Here's a breakdown of what's in the image:
* **Peter Rabbit:** The main focus is a realistic, anthropomorphic
(human-like) rabbit character...
Describe a video
- Python
- curl
Create a new file and make a request to the model with a video URL:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
model="google/gemma-4-31B-it",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe what is happening in this video"
},
{
"type": "video_url",
"video_url": {
"url": "https://avtshare01.rz.tu-ilmenau.de/avt-vqdb-uhd-1/test_1/segments/bigbuck_bunny_8bit_15000kbps_1080p_60.0fps_h264.mp4"
}
}
]
}
],
max_tokens=300
)
print(completion.choices[0].message.content)
Run the script to get a description of the video:
python generate-video-description.py
The video is an animated short film featuring a large, fluffy rabbit in a
colorful meadow. The rabbit wanders through the environment, encountering
butterflies and small birds. The animation has a warm, lighthearted tone with
vibrant natural scenery...
You can send requests to the local endpoint using curl.
The following request includes a video URL and a prompt to describe it:
curl -N http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-4-31B-it",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe what is happening in this video"
},
{
"type": "video_url",
"video_url": {
"url": "https://avtshare01.rz.tu-ilmenau.de/avt-vqdb-uhd-1/test_1/segments/bigbuck_bunny_8bit_15000kbps_1080p_60.0fps_h264.mp4"
}
}
]
}
],
"max_tokens": 300
}' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
You should receive a response similar to this:
The video is an animated short film featuring a large, fluffy rabbit in a
colorful meadow. The rabbit wanders through the environment, encountering
butterflies and small birds. The animation has a warm, lighthearted tone with
vibrant natural scenery...
For complete details on all available API endpoints and options, see the MAX Serve API documentation.
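The grep/sed pipelines in the curl examples above are a quick way to pull text out of the stream, but they can be brittle. The same idea can be sketched in Python by parsing one server-sent-event line of the OpenAI-compatible streaming format offline; the sample line below is fabricated for illustration.

```python
import json

def extract_content(line: str) -> str:
    """Pull the text delta out of one `data: {...}` streaming line."""
    prefix = "data: "
    if not line.startswith(prefix) or line.strip() == "data: [DONE]":
        return ""
    chunk = json.loads(line[len(prefix):])
    return chunk["choices"][0]["delta"].get("content", "")

# A fabricated sample chunk in the OpenAI-compatible streaming shape.
sample = 'data: {"choices":[{"delta":{"content":"The video is"}}]}'
print(extract_content(sample))  # The video is
```

In a real streaming request, you would apply extract_content to each line of the response body and concatenate the results.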
Next steps
Now that you can analyze images and video, try adding structured output to get consistent, formatted responses. You can also explore other endpoints and deployment options.