
# Quickstart


A major component of the Modular Platform is MAX, our developer framework that
abstracts away the complexity of building and serving high-performance GenAI
models on a wide range of hardware, including NVIDIA and AMD GPUs.

Modular supports multiple deployment options including a
[managed cloud](https://docs.modular.com/max/deploy/cloud.md) solution, Modular's infrastructure in your
own VPC, or self-hosted endpoints with MAX. This quickstart shows you how to
self-host.

In this quickstart, you'll create an endpoint for an open-source LLM using MAX,
run an inference from a Python client, and then benchmark the endpoint.

If you'd rather create a self-hosted endpoint with Docker, see our
[tutorial to benchmark MAX](https://docs.modular.com/max/deploy/benchmark.md).

:::caution GPU strongly recommended

To see MAX perform as intended, we strongly recommend running on a
datacenter-grade GPU, such as **NVIDIA B200 / H200 / H100** or
**AMD MI355X / MI325X / MI300X**. You do have the option to follow along on
consumer-grade systems (including Macs)—just expect fewer compatible models and
slower performance.

:::

System requirements:

[Read the requirements](https://docs.modular.com/max/packages.md#system-requirements)

## Set up your project

First, install the `modular` package, which includes the `max` CLI you'll
use to start the model endpoint.

:::tip

For the most reliable experience, we recommend installing with `pixi`.

:::

## Start a model endpoint

Now you'll serve an LLM from a local endpoint using
[`max serve`](https://docs.modular.com/max/cli/serve.md). Choose whether you want **text-to-text**,
**image-to-text**, or **video-to-text** inference by switching tabs below, then
use the dropdown to select a model that fits your hardware's memory constraints.

**Text to text:**

Choose a model from the options below, then replace `{text}` in the code that
follows with its **value**:

**textModelOptions:**

- **label:** gemma-4-31B-it, **value:** google/gemma-4-31B-it, **description:** Requires >96 GiB of GPU RAM — we suggest an NVIDIA B200 or AMD MI355X
- **label:** gemma-3-4b-it, **value:** google/gemma-3-4b-it, **description:** Requires >8 GiB of GPU RAM — works on most compatible GPUs
- **label:** gemma-3-12b-it, **value:** google/gemma-3-12b-it, **description:** Requires >24 GiB of GPU RAM — we suggest an A100, MI300, or better
- **label:** gemma-3-27b-it, **value:** google/gemma-3-27b-it, **description:** Requires >60 GiB of GPU RAM — we suggest an H100, MI300, or better
- **label:** Llama-3.1-8B-Instruct, **value:** meta-llama/Llama-3.1-8B-Instruct, **description:** Requires >15 GiB of RAM (GPU or CPU) — use this if you're on a Mac

**imageModelOptions:**

- **label:** gemma-4-31B-it, **value:** google/gemma-4-31B-it, **description:** Requires >96 GiB of GPU RAM — we suggest an NVIDIA B200 or AMD MI355X
- **label:** gemma-3-12b-it, **value:** google/gemma-3-12b-it, **description:** Requires >24 GiB of GPU RAM — we suggest an A100, MI300, or better
- **label:** gemma-3-27b-it, **value:** google/gemma-3-27b-it, **description:** Requires >60 GiB of GPU RAM — we suggest an H100, MI300, or better
- **label:** InternVL3-8B-Instruct, **value:** OpenGVLab/InternVL3-8B-Instruct, **description:** Requires >22 GiB of GPU RAM — we suggest an A100, MI300, or better
- **label:** InternVL3-14B-Instruct, **value:** OpenGVLab/InternVL3-14B-Instruct, **description:** Requires >36 GiB of GPU RAM — we suggest an H100, MI300, or better

**videoModelOptions:**

- **label:** gemma-4-31B-it, **value:** google/gemma-4-31B-it, **description:** Requires >96 GiB of GPU RAM — we suggest an NVIDIA B200 or AMD MI355X

**gemma-3 model:**

Google's [Gemma 3](https://huggingface.co/collections/google/gemma-3-release)
models are multimodal. MAX supports text input for all available sizes and image
input for the [12B](https://huggingface.co/google/gemma-3-12b-it) and
[27B](https://huggingface.co/google/gemma-3-27b-it) models. All sizes require a
[compatible GPU](https://docs.modular.com/max/packages.md#gpu-compatibility).

**gemma-4 model:**

MAX supports text, image, and video input for the developer-focused
[Gemma 4 31B](https://huggingface.co/google/gemma-4-31B-it) model. A
[compatible GPU](https://docs.modular.com/max/packages.md#gpu-compatibility) is required.

**Llama model:**

Meta's
[Llama 3.1 models](https://huggingface.co/collections/meta-llama/llama-31) are
text-only LLMs. You can pick any model in the family, but we suggest the smaller
[8B](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model because it
works on a wide range of CPUs, including on Macs.

Start the endpoint with the `max` CLI:

1. Add your [HF Access Token](https://huggingface.co/settings/tokens) as an
  environment variable:

    ```sh
    export HF_TOKEN="hf_..."
    ```

**gemma-4 model:**

2. Agree to the [Gemma 4 license](https://huggingface.co/google/gemma-4-31B-it).

**gemma-3 model:**

2. Agree to the [Gemma 3 license](https://huggingface.co/google/gemma-3-27b-it).

**Llama model:**

2. Agree to the [Llama 3.1 license](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

3. Start the endpoint:

```sh
max serve --model {text}
```

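---

**Image to text:**

Choose a model from the image options listed above, then replace `{image}` in
the code that follows with its **value**: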

**gemma-3 model:**

Google's [Gemma 3](https://huggingface.co/collections/google/gemma-3-release)
models are multimodal. MAX supports text input for all available sizes and image
input for the [12B](https://huggingface.co/google/gemma-3-12b-it) and
[27B](https://huggingface.co/google/gemma-3-27b-it) models. All sizes require a
[compatible GPU](https://docs.modular.com/max/packages.md#gpu-compatibility).

**gemma-4 model:**

MAX supports text, image, and video input for the developer-focused
[Gemma 4 31B](https://huggingface.co/google/gemma-4-31B-it) model. A
[compatible GPU](https://docs.modular.com/max/packages.md#gpu-compatibility) is required.

**InternVL3 model:**

OpenGVLab's multimodal
[InternVL3](https://huggingface.co/collections/OpenGVLab/internvl3) models come
in many sizes, but they all require a
[compatible GPU](https://docs.modular.com/max/packages.md#gpu-compatibility). They aren't gated on
Hugging Face, so you don't need to provide a Hugging Face access token to start
the endpoint.

**gemma-3 model:**

Agree to the [Gemma 3 license](https://huggingface.co/google/gemma-3-27b-it) and
add your [HF Access Token](https://huggingface.co/settings/tokens) as an
environment variable:

```bash
export HF_TOKEN="hf_..."
```

**gemma-4 model:**

Agree to the [Gemma 4 license](https://huggingface.co/google/gemma-4-31B-it) and
add your [HF Access Token](https://huggingface.co/settings/tokens) as an
environment variable:

```bash
export HF_TOKEN="hf_..."
```

Start the endpoint with the `max` CLI:

```sh
max serve --model {image} --trust-remote-code
```

---

**Video to text:**

Replace `{video}` in the code that follows with the **value** of the video
model listed above:

MAX supports video-to-text inference for the developer-focused
[Gemma 4 31B](https://huggingface.co/google/gemma-4-31B-it) model. A
[compatible GPU](https://docs.modular.com/max/packages.md#gpu-compatibility) is required.

Agree to the [Gemma 4 license](https://huggingface.co/google/gemma-4-31B-it) and
add your [HF Access Token](https://huggingface.co/settings/tokens) as an
environment variable:

```bash
export HF_TOKEN="hf_..."
```

Start the endpoint with the `max` CLI:

```sh
max serve --model {video}
```

It will take some time to download the model, compile it, and start the
server. While that's working, you can get started on the next step.
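If you would rather have a script wait for the server than watch the terminal, the following is a minimal sketch that polls the endpoint until it responds. It uses only the Python standard library. The `/v1/models` path is an assumption based on the OpenAI-compatible API convention, and `wait_until_ready` and `http_ok` are hypothetical helper names, not MAX functions.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses.
        return False


def wait_until_ready(
    url: str = "http://localhost:8000/v1/models",  # assumed path
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll `url` until `probe` succeeds or `timeout_s` has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` from your second terminal returns `True` once the server answers, or `False` if the timeout passes first. The `probe` parameter exists so the loop can be exercised without a running server.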

## Run inference with the endpoint

Open a new terminal and send an inference request using the `openai` Python
API:

**Text to text:**

1. Navigate to the project you created above and then install the `openai`
   package:

    **pixi:**

    ```bash
    pixi add openai
    ```

    ---

    **uv:**

    ```bash
    uv add openai
    ```

    ---

    **pip:**

    ```bash
    pip install openai
    ```

    ---

    **conda:**

    ```bash
    conda install -c conda-forge openai
    ```

2. Activate the virtual environment:

    **pixi:**

    ```bash
    pixi shell
    ```

    ---

    **uv:**

    ```bash
    source .venv/bin/activate
    ```

    ---

    **pip:**

    ```bash
    source .venv/quickstart/bin/activate
    ```

    ---

    **conda:**

    ```bash
    conda init
    ```

    Or if you're on a Mac, use:

    ```bash
    conda init zsh
    ```

3. Create a new file named `generate-text.py` that sends an inference request:

    ```python
    from openai import OpenAI

    # Point the client at the local MAX endpoint instead of the OpenAI service.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    completion = client.chat.completions.create(
        model="{text}",
        messages=[
            {
                "role": "user",
                "content": "Who won the world series in 2020?"
            },
        ],
    )

    # Print only the generated text from the first choice.
    print(completion.choices[0].message.content)
    ```

    The `OpenAI` client requires an `api_key` argument, but MAX doesn't need
    one, so a placeholder value such as `"EMPTY"` is enough.

4. Wait until the model server is ready—when it is, you'll see this message in
your first terminal:

    ```output
    🚀 Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
    ```

    Then run the Python script from your second terminal, and you should see
    results like this (your results may vary, especially for different model
    sizes):

    ```sh
    python generate-text.py
    ```

    ```output
    The **Los Angeles Dodgers** won the World Series in 2020!

    They defeated the Tampa Bay Rays 4 games to 2. It was their first World Series title since 1988.
    ```
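The `openai` client above sends an HTTP POST to the endpoint's `/v1/chat/completions` path. As a rough illustration of what goes over the wire, here is a minimal sketch of the same request using only the Python standard library. The response shape (`choices[0].message.content`) follows the OpenAI chat completions format, and `build_chat_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(
    model: str,
    prompt: str,
    base_url: str = "http://localhost:8000/v1",
) -> urllib.request.Request:
    """Build the POST request for a single-turn chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str) -> str:
    """Send the request and return the generated text (needs a live server)."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("<model value>", "Who won the world series in 2020?")` should return text similar to the output above, where `<model value>` is the same model you passed to `max serve`.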

---

**Image to text:**

1. Navigate to the project you created above and then install the `openai`
   package:

    **pixi:**

    ```bash
    pixi add openai
    ```

    ---

    **uv:**

    ```bash
    uv add openai
    ```

    ---

    **pip:**

    ```bash
    pip install openai
    ```

    ---

    **conda:**

    ```bash
    conda install -c conda-forge openai
    ```

2. Activate the virtual environment:

    **pixi:**

    ```bash
    pixi shell
    ```

    ---

    **uv:**

    ```bash
    source .venv/bin/activate
    ```

    ---

    **pip:**

    ```bash
    source .venv/quickstart/bin/activate
    ```

    ---

    **conda:**

    ```bash
    conda init
    ```

    Or if you're on a Mac, use:

    ```bash
    conda init zsh
    ```

3. Create a new file named `generate-image-caption.py` that sends an inference
   request:

    ```python
    from openai import OpenAI

    # Point the client at the local MAX endpoint instead of the OpenAI service.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    completion = client.chat.completions.create(
        model="{image}",
        messages=[
            {
                "role": "user",
                # A multimodal message is a list of typed content parts.
                "content": [
                    {
                        "type": "text",
                        "text": "Write a caption for this image"
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
                        }
                    }
                ]
            }
        ],
        max_tokens=300
    )

    # Print only the generated text from the first choice.
    print(completion.choices[0].message.content)
    ```

    The `OpenAI` client requires an `api_key` argument, but MAX doesn't need
    one, so a placeholder value such as `"EMPTY"` is enough.

4. Wait until the model server is ready—when it is, you'll see this message in
your first terminal:

    ```output
    🚀 Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
    ```

    Then run the Python script from your second terminal, and you should see
    results like this (the exact wording will vary from run to run):

    ```sh
    python generate-image-caption.py
    ```

    ```output
    In a charming English countryside setting, Mr. Bun, dressed elegantly in a tweed outfit, stands proudly on a dirt path, surrounded by lush greenery and blooming wildflowers.
    ```
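The request above points at an image hosted on the web. To caption a file on your own disk, one common approach with OpenAI-style APIs is to embed the image as a base64 `data:` URL in the same `image_url` field. The sketch below shows the encoding step only. Whether your MAX version accepts `data:` URLs is an assumption to verify, and `image_data_url` and `image_message` are hypothetical helper names.

```python
import base64
import mimetypes
from pathlib import Path


def image_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(prompt: str, path: str) -> dict:
    """Build a user message with a text part and a local image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_data_url(path)}},
        ],
    }
```

You would pass `[image_message("Write a caption for this image", "rabbit.jpg")]` as the `messages` argument in place of the hard-coded list, where `rabbit.jpg` stands for any local image file.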

---

**Video to text:**

1. Navigate to the project you created above and then install the `openai`
   package:

    **pixi:**

    ```bash
    pixi add openai
    ```

    ---

    **uv:**

    ```bash
    uv add openai
    ```

    ---

    **pip:**

    ```bash
    pip install openai
    ```

    ---

    **conda:**

    ```bash
    conda install -c conda-forge openai
    ```

2. Activate the virtual environment:

    **pixi:**

    ```bash
    pixi shell
    ```

    ---

    **uv:**

    ```bash
    source .venv/bin/activate
    ```

    ---

    **pip:**

    ```bash
    source .venv/quickstart/bin/activate
    ```

    ---

    **conda:**

    ```bash
    conda init
    ```

    Or if you're on a Mac, use:

    ```bash
    conda init zsh
    ```

3. Create a new file named `generate-video-description.py` that sends an
   inference request:

    ```python
    from openai import OpenAI

    # Point the client at the local MAX endpoint instead of the OpenAI service.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    completion = client.chat.completions.create(
        model="{video}",
        messages=[
            {
                "role": "user",
                # A multimodal message is a list of typed content parts.
                "content": [
                    {
                        "type": "text",
                        "text": "Describe what is happening in this video"
                    },
                    {
                        "type": "video_url",
                        "video_url": {
                            "url": "https://avtshare01.rz.tu-ilmenau.de/avt-vqdb-uhd-1/test_1/segments/bigbuck_bunny_8bit_15000kbps_1080p_60.0fps_h264.mp4"
                        }
                    }
                ]
            }
        ],
        max_tokens=300
    )

    # Print only the generated text from the first choice.
    print(completion.choices[0].message.content)
    ```

    The `OpenAI` client requires an `api_key` argument, but MAX doesn't need
    one, so a placeholder value such as `"EMPTY"` is enough.

4. Wait until the model server is ready—when it is, you'll see this message in
your first terminal:

    ```output
    🚀 Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
    ```

    Then run the Python script from your second terminal with a real video URL:

    ```sh
    python generate-video-description.py
    ```

## Benchmark the endpoint

While still in your second terminal, run the following command to benchmark
your endpoint:

**Text to text:**

```sh
max benchmark \
    --model {text} \
    --backend modular \
    --endpoint /v1/chat/completions \
    --dataset-name sonnet \
    --num-prompts 500 \
    --sonnet-input-len 550 \
    --output-lengths 256 \
    --sonnet-prefix-len 200 \
    --max-concurrency 32
```

---

**Image to text:**

```sh
max benchmark \
    --model {image} \
    --backend modular \
    --endpoint /v1/chat/completions \
    --dataset-name random \
    --num-prompts 500 \
    --random-input-len 40 \
    --random-output-len 150 \
    --random-image-size 512,512 \
    --random-image-count 1 \
    --max-concurrency 32
```

---

**Video to text:**

:::note

Video input isn't supported in `max benchmark` yet, but you can benchmark
text and image input today using the endpoint you set up.

:::

```sh
max benchmark \
    --model {image} \
    --backend modular \
    --endpoint /v1/chat/completions \
    --dataset-name random \
    --num-prompts 500 \
    --random-input-len 40 \
    --random-output-len 150 \
    --random-image-size 512,512 \
    --random-image-count 1 \
    --max-concurrency 32
```

When it's done, you'll see the results printed to the terminal.
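If you want to repeat the text-to-text benchmark at several concurrency levels, you could generate the commands from a script. The following is a minimal sketch that uses only the flags shown in the text-to-text command above. The helper name `benchmark_command`, the model value, and the chosen concurrency levels are illustrative assumptions.

```python
import shlex
from typing import List


def benchmark_command(
    model: str, max_concurrency: int, num_prompts: int = 500
) -> List[str]:
    """Build the text-to-text `max benchmark` command as an argument list."""
    return [
        "max", "benchmark",
        "--model", model,
        "--backend", "modular",
        "--endpoint", "/v1/chat/completions",
        "--dataset-name", "sonnet",
        "--num-prompts", str(num_prompts),
        "--sonnet-input-len", "550",
        "--output-lengths", "256",
        "--sonnet-prefix-len", "200",
        "--max-concurrency", str(max_concurrency),
    ]


# Print one shell-ready command per concurrency level.
for concurrency in (8, 16, 32):
    print(shlex.join(benchmark_command("google/gemma-3-4b-it", concurrency)))
```

To run a command instead of printing it, pass the list to `subprocess.run(cmd, check=True)` from the environment where the `max` CLI is installed.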

If you want to save the results, pass a filename with `--result-filename` and
it'll save a JSON file at that path. The path can include a directory prefix.
For example:

```sh
max benchmark \
  ...
  --result-filename "results/quickstart-benchmark.json"
```
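Once you have a saved results file, you can inspect it from Python. The sketch below makes no assumption about which metric names `max benchmark` writes. It only assumes that the file holds a JSON object at the top level, and it returns the fields with simple scalar values. The helper name `scalar_fields` is hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Union

Scalar = Union[str, int, float, bool]


def scalar_fields(path: str) -> Dict[str, Scalar]:
    """Return the top-level string, number, and boolean fields of a JSON file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    # Skip nested lists and objects (for example, per-request measurements).
    return {
        key: value
        for key, value in data.items()
        if isinstance(value, (str, int, float, bool))
    }
```

For example, `scalar_fields("results/quickstart-benchmark.json")` returns a dictionary you can print or compare across runs.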

The benchmark options above are just a starting point. When you want to save
your own benchmark configurations, you can define them in a YAML file and pass
it to the `--config-file` option. For example configurations, see our
[benchmark config files on GitHub](https://github.com/modular/modular/tree/main/max/python/max/benchmark/configs).

For more details about the tool, including other datasets and configuration
options, see the [`max benchmark` documentation](https://docs.modular.com/max/cli/benchmark.md).

:::caution GPU ran out of memory

If the server log says `GPU ran out of memory during model execution`, try
reducing the benchmark input length with the option corresponding to your
dataset (`--sonnet-input-len` or `--random-input-len`). Also consider
restarting `max serve` and adding `--device-memory-utilization` with a value as
low as `0.5` (the default is `0.9`).

:::

## Next steps

Now that you've completed this quickstart, try serving and benchmarking a
different model from our
[supported models](https://docs.modular.com/max/models.md)!

Then, you can try to connect to your endpoint with our
[Agentic Cookbook](https://modul.ar/cookbook), an open-source project for
building React-based interfaces for any model endpoint. Just clone the repo, run
it with npm, and pick a recipe, such as a chat interface or a drag-and-drop
image caption tool, or build your own.

To get started, see the [project README](https://modul.ar/cookbook).

<img
  src={require("./images/cookbook-captioning.png").default}
  alt=""
  width="704"
/>

## Stay in touch