Offline inference
Offline inference with MAX allows you to run large language models directly in Python without relying on external API endpoints. This is in contrast to online inference, where you would send requests to a remote service.
When to use offline inference
Use offline inference when you want to perform model inference without running a separate inference server. This typically includes cases where you need to process a batch of inputs concurrently.
This approach benefits tasks that require high throughput and can run in a controlled environment, such as data preprocessing, model evaluation, or batch processing of large datasets.
How offline inference works
The core of offline inference is the `LLM` class, which provides a Python interface to load and run language models. Specify the model from a Hugging Face repository or a local path, and MAX handles downloading it. You can pass model settings such as `max_length` as kwargs to `MAXModelConfig`, and runtime settings such as `max_num_steps` to `PipelineRuntimeConfig`, then pass these sub-configs to `PipelineConfig`. Finally, call `generate()` to produce text from the model.
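As a minimal sketch of that flow, assuming the `max_length` kwarg described above (the value `2048` is illustrative, and the keyword for attaching `PipelineRuntimeConfig` to `PipelineConfig` is deliberately omitted; check the API reference for exact signatures):

```python
from max.entrypoints.llm import LLM
from max.pipelines import MAXModelConfig, PipelineConfig

# Model settings (such as max_length) are passed as kwargs to
# MAXModelConfig; max_length=2048 is an illustrative value, not a
# requirement of this model.
model_config = MAXModelConfig(
    model_path="modularai/Llama-3.1-8B-Instruct-GGUF",
    max_length=2048,
)

# Runtime settings (such as max_num_steps) belong on
# PipelineRuntimeConfig, which is passed to PipelineConfig alongside
# the model config; that wiring is omitted here rather than assumed.
pipeline_config = PipelineConfig(model=model_config)

llm = LLM(pipeline_config)
print(llm.generate(["The capital of France is"], max_new_tokens=20)[0])
```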
Quickstart
This quickstart demonstrates how to run offline inference with a Hugging Face model using MAX in Python.
1. Set up your project, using either `pixi` or `uv`.

   Using pixi:

   - If you don't have it, install `pixi`:

     ```sh
     curl -fsSL https://pixi.sh/install.sh | sh
     ```

     Then restart your terminal for the changes to take effect.

   - Create a project:

     ```sh
     pixi init offline-quickstart \
       -c https://conda.modular.com/max-nightly/ -c conda-forge \
       && cd offline-quickstart
     ```

   - Install `modular` (nightly):

     ```sh
     pixi add modular
     ```

   - Start the virtual environment:

     ```sh
     pixi shell
     ```

   Using uv:

   - If you don't have it, install `uv`:

     ```sh
     curl -LsSf https://astral.sh/uv/install.sh | sh
     ```

     Then restart your terminal to make `uv` accessible.

   - Create a project:

     ```sh
     uv init offline-quickstart && cd offline-quickstart
     ```

   - Create and start a virtual environment:

     ```sh
     uv venv && source .venv/bin/activate
     ```

   - Install `modular` (nightly):

     ```sh
     uv pip install modular \
       --index https://whl.modular.com/nightly/simple/ \
       --prerelease allow
     ```
2. Create a file named `main.py` with the following code:

   ```python
   from max.entrypoints.llm import LLM
   from max.pipelines import MAXModelConfig, PipelineConfig


   def main():
       model_path = "modularai/Llama-3.1-8B-Instruct-GGUF"
       pipeline_config = PipelineConfig(
           model=MAXModelConfig(model_path=model_path)
       )
       llm = LLM(pipeline_config)

       prompts = [
           "In the beginning, there was",
           "I believe the meaning of life is",
           "The fastest way to learn python is",
       ]

       print("Generating responses...")
       responses = llm.generate(prompts, max_new_tokens=50)

       for i, (prompt, response) in enumerate(
           zip(prompts, responses, strict=True)
       ):
           print(f"========== Response {i} ==========")
           print(prompt + response)
           print()


   if __name__ == "__main__":
       main()
   ```

   For offline inference, specific configuration parameters might vary between models. Always refer to the model's documentation for compatibility details and optimal configuration settings.
3. Run the script:

   ```sh
   python main.py
   ```

   You should see a response similar to the following:

   ```
   Generating responses...
   ========== Response 0 ==========
   In the beginning, there was Andromeda. The Andromeda galaxy, that is. It's the closest major galaxy to our own Milky Way, and it's been a source of fascination for astronomers and space enthusiasts for centuries. But what if I told you that there's
   ========== Response 1 ==========
   I believe the meaning of life is to find your gift. The purpose of life is to give it away to others. I believe that the meaning of life is to find your gift. The purpose of life is to give it away to others. I believe that the meaning of life is
   ========== Response 2 ==========
   The fastest way to learn python is to practice with real-world projects. Here are some ideas for projects that you can use to learn Python:

   1. **Command Line Calculator**: Create a command line calculator that can perform basic arithmetic operations like addition, subtraction, multiplication, and division.
   ```
This code downloads the `modularai/Llama-3.1-8B-Instruct-GGUF` model (if it's not already downloaded) and runs inference locally. This example uses Llama-3.1-8B-Instruct-GGUF because it's not gated, meaning it doesn't require authentication with Hugging Face. If you'd like to use a different model, see our supported models.
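Because `generate()` accepts a list of prompts, you can extend the quickstart to the batch-processing use case described earlier. The following is a sketch under stated assumptions: the `batched()` helper and the chunk size of 16 are illustrative choices, not part of the MAX API.

```python
from max.entrypoints.llm import LLM
from max.pipelines import MAXModelConfig, PipelineConfig


def batched(items, size):
    """Yield successive chunks of `size` items from a list."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


def main():
    pipeline_config = PipelineConfig(
        model=MAXModelConfig(
            model_path="modularai/Llama-3.1-8B-Instruct-GGUF"
        )
    )
    llm = LLM(pipeline_config)

    # Stand-in for a real dataset of prompts.
    prompts = [f"Write a one-line summary of topic {n}." for n in range(100)]

    # Feed the model one chunk at a time instead of all prompts at once.
    for chunk in batched(prompts, 16):
        responses = llm.generate(chunk, max_new_tokens=50)
        for prompt, response in zip(chunk, responses, strict=True):
            print(f"{prompt} -> {response}")


if __name__ == "__main__":
    main()
```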
Next steps