
Video generation

With MAX, you can deploy open-source video generation models on your local system or in the cloud and send inference requests with our REST API. This page explains how to use the v1/responses endpoint to generate videos from text prompts or animate existing images.

Endpoint

The MAX v1/responses endpoint provides a unified interface for diverse AI tasks including video generation. It's built on Open Responses, an open-source initiative to create a standardized, provider-agnostic API specification. The examples below show the request and response format. To try it yourself, see the quickstart.

Text input

For text-to-video generation, set input to a plain string describing the video you want. The model returns the generated video as base64-encoded mp4 data in output[0].content[0].video_data when response_format is b64_json:

response = client.responses.create(
    model="Wan-AI/Wan2.2-T2V-A14B-Diffusers",
    input="A campfire crackles in a forest clearing at night, sparks spiraling upward into a star-filled sky",
    extra_body={
        "provider_options": {
            "video": {
                "height": 512,
                "width": 512,
                "steps": 28,
                "num_frames": 81,
                "frames_per_second": 16,
                "response_format": "b64_json"
            }
        }
    }
)

video_data = response.output[0].content[0].video_data
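
To save the clip, decode the base64 string and write the bytes to a file (the filename here is just an example):

import base64

with open("campfire.mp4", "wb") as f:
    f.write(base64.b64decode(video_data))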

Response format

Video output supports two delivery formats, set via provider_options.video.response_format:

| Value | Description |
|---|---|
| url | (Default) The server saves the video to a temporary file and returns a URL at /v1/videos/{video_id}/content. Download with a second GET request. |
| b64_json | The server encodes the video as base64 mp4 and returns it inline in output[0].content[0].video_data. No second request required. |
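
Because url is the default, you can omit response_format entirely, or set it explicitly:

"provider_options": {"video": {"response_format": "url"}}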

To download a URL-format response:

import urllib.request

# The response contains a relative path, for example /v1/videos/{video_id}/content
video_url = response.output[0].content[0].video_url

# Fetch the file from the local server and save it to disk
urllib.request.urlretrieve(f"http://localhost:8000{video_url}", "output.mp4")

Provider options

The provider_options argument is an extension point in the Open Responses spec that lets each API provider expose parameters beyond the standard request fields. MAX uses it to surface video generation controls such as dimensions, frame count, and denoising steps.

The following are some commonly used parameters under provider_options.video. This is not an exhaustive list. For the complete reference, see provider_options.

| Parameter | Type | Default | Description |
|---|---|---|---|
| height | integer | model default | Output height in pixels. |
| width | integer | model default | Output width in pixels. |
| num_frames | integer | model default | Number of frames to generate. Total duration equals num_frames / frames_per_second. |
| frames_per_second | integer | 16 | Frame rate for the output video. |
| steps | integer | model default | Number of denoising steps. More steps generally produce higher quality but take longer. |
| guidance_scale | number | 3.5 | How closely the output follows the prompt. Higher values (7–10) increase prompt adherence; lower values (1–3) allow more creative variation. |
| negative_prompt | string | null | Content to avoid in the output. |
| response_format | string | url | Output delivery format: url returns a download link; b64_json returns the video inline in output[0].content[0].video_data. |

Duration: total video length is num_frames / frames_per_second. At the default 16 fps, 81 frames yields approximately 5 seconds of video.
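
As a quick check of that formula with the quickstart values:

num_frames = 81
frames_per_second = 16
print(num_frames / frames_per_second)  # 5.0625, about 5 seconds of video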

Negative prompts: use negative_prompt to steer the model away from unwanted content, for example "blurry, low quality, static, no motion". Keep the description of what you don't want in this field rather than embedding it in the main prompt.
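
For example, a provider_options block that tightens prompt adherence while steering away from static, low-quality output might look like this (the values are illustrative, not tuned recommendations):

"provider_options": {
    "video": {
        "guidance_scale": 7.5,
        "negative_prompt": "blurry, low quality, static, no motion"
    }
}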

If you encounter memory errors, try reducing output dimensions or the number of frames:

"provider_options": {"video": {"height": 480, "width": 480, "num_frames": 49}}

For advanced parameters including guidance_scale_2, true_cfg_scale, cfg_normalization, cfg_truncation, and residual_threshold, see the /v1/responses API reference.

Quickstart

In this quickstart, learn how to set up and run Wan2.2-T2V-A14B-Diffusers for text-to-video generation.

System requirements:

Set up your environment

Create a Python project to install our APIs and CLI tools:

  1. If you don't have it, install pixi:
    curl -fsSL https://pixi.sh/install.sh | sh

    Then restart your terminal for the changes to take effect.

  2. Create a project:
    pixi init video-generation-quickstart \
      -c https://conda.modular.com/max-nightly/ -c conda-forge \
      && cd video-generation-quickstart
  3. Install modular (this installs the nightly build; to get the stable build, change the version in the website header):
    pixi add modular
  4. Start the virtual environment:
    pixi shell

Serve your model

First, enable the v1/responses endpoint by setting the MAX_SERVE_API_TYPES environment variable:

export MAX_SERVE_API_TYPES='["responses"]'

Then, use the max serve command to start a local model server:

max serve \
  --model Wan-AI/Wan2.2-T2V-A14B-Diffusers

The endpoint is ready when you see this message printed in your terminal:

Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)

For a complete list of max CLI commands and options, refer to the MAX CLI reference.

Generate a video from text

Send a request to Wan-AI/Wan2.2-T2V-A14B-Diffusers and retrieve a base64-encoded mp4 in response.

You can use OpenAI's Python client to interact with the video generation model. First, install the OpenAI SDK:

pixi add openai

Then, create a client and make a request to the model:

generate-video.py
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.responses.create(
    model="Wan-AI/Wan2.2-T2V-A14B-Diffusers",
    input="A campfire crackles in a forest clearing at night, sparks spiraling upward into a star-filled sky",
    extra_body={
        "provider_options": {
            "video": {
                "height": 512,
                "width": 512,
                "steps": 28,
                "num_frames": 81,
                "frames_per_second": 16,
                "response_format": "b64_json"
            }
        }
    }
)

video_data = response.output[0].content[0].video_data
with open("output-text-to-video.mp4", "wb") as f:
    f.write(base64.b64decode(video_data))

Run the script to generate the video:

python generate-video.py

The model saves the generated video to output-text-to-video.mp4 in your current directory.

Your output should look similar to the following:

Figure 1. Text-to-video output: a campfire crackling in a forest clearing at night.

Model modalities

MAX supports the Wan 2.2 Diffusers model family for video generation. Models in this family differ in what input they accept. Choosing the wrong model for your input type causes a runtime error, so it's worth understanding the three modalities before serving a model:

  • T2V (text-to-video): accepts a text prompt only and generates video from scratch. Examples: Wan2.2-T2V-*.
  • I2V (image-to-video): requires a still image plus a text prompt describing the desired motion. Sending a text-only request to an I2V model causes a tensor shape mismatch error at runtime. Examples: Wan2.2-I2V-*.
  • TI2V (text-and-image-to-video): accepts either a text prompt alone or a text prompt with an image. Use this modality when you want a single deployment that handles both workflows. Examples: Wan2.2-TI2V-*.
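
For I2V and TI2V models, the request carries the source image alongside the text prompt. The sketch below assumes an Open Responses-style message list with input_text and input_image content parts and a data-URL image, and uses an illustrative Wan2.2-I2V-* model identifier; check Supported models for exact model names and the /v1/responses API reference for the exact image input schema:

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Base64-encode a local still image as a data URL (assumption: the endpoint
# accepts data URLs for input_image, as in the Open Responses spec).
with open("still.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.responses.create(
    model="Wan-AI/Wan2.2-I2V-A14B-Diffusers",  # illustrative I2V model name
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "The camera slowly pushes in as sparks drift across the frame"},
                {"type": "input_image", "image_url": f"data:image/png;base64,{image_b64}"},
            ],
        }
    ],
    extra_body={
        "provider_options": {
            "video": {"response_format": "b64_json"}
        }
    },
)

video_data = response.output[0].content[0].video_data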

For the full list of supported video generation models, see Supported models.

Next steps

Now that you can generate videos, explore other inference capabilities and deployment options.
