
Video generation

With MAX, you can deploy open-source video generation models on your local system or in the cloud and send inference requests with our REST API. This page explains how to use the v1/responses endpoint to generate videos from text prompts or animate existing images.

Endpoint

The MAX v1/responses endpoint provides a unified interface for diverse AI tasks including video generation. It's built on Open Responses, an open-source initiative to create a standardized, provider-agnostic API specification. The examples below show the request and response format. To try it yourself, see the quickstart.

Text input

For text-to-video generation, set input to a plain string describing the video you want. The model returns the generated video as base64-encoded mp4 data in output[0].content[0].video_data when response_format is b64_json:

response = client.responses.create(
    model="Wan-AI/Wan2.2-T2V-A14B-Diffusers",
    input="A campfire crackles in a forest clearing at night, sparks spiraling upward into a star-filled sky",
    extra_body={
        "provider_options": {
            "video": {
                "height": 512,
                "width": 512,
                "steps": 28,
                "num_frames": 81,
                "frames_per_second": 16,
                "response_format": "b64_json"
            }
        }
    }
)

video_data = response.output[0].content[0].video_data
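
To save the clip, decode the base64 string and write the bytes to a file (the filename here is just an example):

import base64

with open("campfire.mp4", "wb") as f:
    f.write(base64.b64decode(video_data))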

Response format

Video output supports two delivery formats, set via provider_options.video.response_format:

| Value | Description |
|---|---|
| url | (Default) The server saves the video to a temporary file and returns a URL at /v1/videos/{video_id}/content. Download with a second GET request. |
| b64_json | The server encodes the video as base64 mp4 and returns it inline in output[0].content[0].video_data. No second request required. |
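
Because url is the default, you can omit response_format entirely, or set it explicitly:

"provider_options": {"video": {"response_format": "url"}}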

To download a URL-format response:

import urllib.request

# The response contains a relative path, for example /v1/videos/{video_id}/content
video_url = response.output[0].content[0].video_url

# Fetch the file from the local server and save it to disk
urllib.request.urlretrieve(f"http://localhost:8000{video_url}", "output.mp4")

Provider options

The provider_options argument is an extension point in the Open Responses spec that lets each API provider expose parameters beyond the standard request fields. MAX uses it to surface video generation controls such as dimensions, frame count, and denoising steps.

The following are some commonly used parameters under provider_options.video. This is not an exhaustive list. For the complete reference, see provider_options.

| Parameter | Type | Default | Description |
|---|---|---|---|
| height | integer | model default | Output height in pixels. |
| width | integer | model default | Output width in pixels. |
| num_frames | integer | model default | Number of frames to generate. Total duration equals num_frames / frames_per_second. |
| frames_per_second | integer | 16 | Frame rate for the output video. |
| steps | integer | model default | Number of denoising steps. More steps generally produce higher quality but take longer. |
| guidance_scale | number | 3.5 | How closely the output follows the prompt. Higher values (7–10) increase prompt adherence; lower values (1–3) allow more creative variation. |
| negative_prompt | string | null | Content to avoid in the output. |
| response_format | string | url | Output delivery format: url returns a download link; b64_json returns the video inline in output[0].content[0].video_data. |

Duration: total video length is num_frames / frames_per_second. At the default 16 fps, 81 frames yields approximately 5 seconds of video.
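
As a quick check of that formula with the quickstart values:

num_frames = 81
frames_per_second = 16
print(num_frames / frames_per_second)  # 5.0625, about 5 seconds of video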

Negative prompts: use negative_prompt to steer the model away from unwanted content, for example "blurry, low quality, static, no motion". Keep the description of what you don't want in this field rather than embedding it in the main prompt.
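
For example, a provider_options block that tightens prompt adherence while steering away from static, low-quality output might look like this (the values are illustrative, not tuned recommendations):

"provider_options": {
    "video": {
        "guidance_scale": 7.5,
        "negative_prompt": "blurry, low quality, static, no motion"
    }
}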

If you encounter memory errors, try reducing output dimensions or the number of frames:

"provider_options": {"video": {"height": 480, "width": 480, "num_frames": 49}}

For advanced parameters including guidance_scale_2, true_cfg_scale, cfg_normalization, cfg_truncation, and residual_threshold, see the /v1/responses API reference.

Quickstart

In this quickstart, learn how to set up and run Wan2.2-T2V-A14B-Diffusers for text-to-video generation.

System requirements:

Set up your environment

Create a Python project to install our APIs and CLI tools:

  1. If you don't have it, install pixi:
    curl -fsSL https://pixi.sh/install.sh | sh

    Then restart your terminal for the changes to take effect.

  2. Create a project:
    pixi init video-generation-quickstart \
      -c https://conda.modular.com/max-nightly/ -c conda-forge \
      && cd video-generation-quickstart
  3. Install modular (this installs the nightly build; to get the stable build, change the version in the website header):
    pixi add modular
  4. Start the virtual environment:
    pixi shell

Serve your model

First, enable the v1/responses endpoint by setting the MAX_SERVE_API_TYPES environment variable:

export MAX_SERVE_API_TYPES='["responses"]'

Then, use the max serve command to start a local model server:

max serve \
  --model Wan-AI/Wan2.2-T2V-A14B-Diffusers

The endpoint is ready when you see this message printed in your terminal:

Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)

For a complete list of max CLI commands and options, refer to the MAX CLI reference.

Generate a video from text

Send a request to Wan-AI/Wan2.2-T2V-A14B-Diffusers and retrieve a base64-encoded mp4 in response.

You can use OpenAI's Python client to interact with the video generation model. First, install the OpenAI SDK:

pixi add openai

Then, create a client and make a request to the model:

generate-video.py
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.responses.create(
    model="Wan-AI/Wan2.2-T2V-A14B-Diffusers",
    input="A campfire crackles in a forest clearing at night, sparks spiraling upward into a star-filled sky",
    extra_body={
        "provider_options": {
            "video": {
                "height": 512,
                "width": 512,
                "steps": 28,
                "num_frames": 81,
                "frames_per_second": 16,
                "response_format": "b64_json"
            }
        }
    }
)

video_data = response.output[0].content[0].video_data
with open("output-text-to-video.mp4", "wb") as f:
    f.write(base64.b64decode(video_data))

Run the script to generate the video:

python generate-video.py

The model saves the generated video to output-text-to-video.mp4 in your current directory.

Your output should look similar to the following:

Figure 1. Text-to-video output: a campfire crackling in a forest clearing at night.

Model modalities

MAX supports the Wan 2.2 Diffusers model family for video generation. Models in this family differ in what input they accept. Choosing the wrong model for your input type causes a runtime error, so it's worth understanding the three modalities before serving a model:

  • T2V (text-to-video): accepts a text prompt only and generates video from scratch. Examples: Wan2.2-T2V-*.
  • I2V (image-to-video): requires a still image plus a text prompt describing the desired motion. Sending a text-only request to an I2V model causes a tensor shape mismatch error at runtime. Examples: Wan2.2-I2V-*.
  • TI2V (text-and-image-to-video): accepts either a text prompt alone or a text prompt with an image. Use this modality when you want a single deployment that handles both workflows. Examples: Wan2.2-TI2V-*.
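
For I2V and TI2V models, the request carries the source image alongside the text prompt. The sketch below assumes an Open Responses-style message list with input_text and input_image content parts and a data-URL image, and uses an illustrative Wan2.2-I2V-* model identifier; check Supported models for exact model names and the /v1/responses API reference for the exact image input schema:

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Base64-encode a local still image as a data URL (assumption: the endpoint
# accepts data URLs for input_image, as in the Open Responses spec).
with open("still.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.responses.create(
    model="Wan-AI/Wan2.2-I2V-A14B-Diffusers",  # illustrative I2V model name
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "The camera slowly pushes in as sparks drift across the frame"},
                {"type": "input_image", "image_url": f"data:image/png;base64,{image_b64}"},
            ],
        }
    ],
    extra_body={
        "provider_options": {
            "video": {"response_format": "b64_json"}
        }
    },
)

video_data = response.output[0].content[0].video_data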

For the full list of supported video generation models, see Supported models.

Next steps

Now that you can generate videos, explore other inference capabilities and deployment options.
