Offline inference
Offline inference with MAX allows you to run large language models directly in Python without relying on external API endpoints. This is in contrast to online inference, where you would send requests to a remote service.
When to use offline inference
Use offline inference when you want to perform model inference without running a separate inference server. This typically includes workloads where you process a batch of inputs concurrently.
This approach suits tasks that require high throughput and run in a controlled environment, such as data preprocessing, model evaluation, or batch processing of large datasets.
How offline inference works
The core of offline inference is the `LLM` class, which provides a Python interface to load and run language models. Specify a model from a Hugging Face repository or a local path, and MAX handles downloading it. The `PipelineConfig` class lets you set parameters for the inference pipeline, such as `max_length` and `max_num_steps`, and the `generate()` method produces text from the model.
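To see how these pieces fit together, here's a minimal sketch (the parameter values are illustrative rather than tuned recommendations, and the comments paraphrase what each knob controls; see the API reference for exact semantics):

```python
from max.entrypoints.llm import LLM
from max.pipelines import PipelineConfig

# Configure the pipeline. max_length bounds the maximum sequence length,
# and max_num_steps limits how many generation steps run per scheduling
# iteration. The values below are illustrative, not recommendations.
pipeline_config = PipelineConfig(
    model_path="modularai/Llama-3.1-8B-Instruct-GGUF",
    max_length=512,
    max_num_steps=10,
)

llm = LLM(pipeline_config)

# generate() accepts a list of prompts and returns the completions in
# the same order.
responses = llm.generate(["The capital of France is"], max_new_tokens=20)
print(responses[0])
```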
Quickstart
This quickstart demonstrates offline inference with a Hugging Face model using MAX in Python.
- Set up your project with your preferred package manager:

  **pip**

  - Create a project folder:

    ```sh
    mkdir example-project && cd example-project
    ```

  - Create and activate a virtual environment:

    ```sh
    python3 -m venv .venv/example-project \
      && source .venv/example-project/bin/activate
    ```

  - Install the `modular` Python package.

    Nightly:

    ```sh
    pip install modular \
      --extra-index-url https://download.pytorch.org/whl/cpu \
      --index-url https://dl.modular.com/public/nightly/python/simple/
    ```

    Stable:

    ```sh
    pip install modular \
      --extra-index-url https://download.pytorch.org/whl/cpu \
      --extra-index-url https://modular.gateway.scarf.sh/simple/
    ```
  **uv**

  - If you don't have it, install `uv`:

    ```sh
    curl -LsSf https://astral.sh/uv/install.sh | sh
    ```

    Then restart your terminal to make `uv` accessible.

  - Create a project:

    ```sh
    uv init example-project && cd example-project
    ```

  - Create and start a virtual environment:

    ```sh
    uv venv && source .venv/bin/activate
    ```

  - Install the `modular` Python package.

    Nightly:

    ```sh
    uv pip install modular \
      --extra-index-url https://download.pytorch.org/whl/cpu \
      --index-url https://dl.modular.com/public/nightly/python/simple/ \
      --index-strategy unsafe-best-match
    ```

    Stable:

    ```sh
    uv pip install modular \
      --extra-index-url https://download.pytorch.org/whl/cpu \
      --extra-index-url https://modular.gateway.scarf.sh/simple/ \
      --index-strategy unsafe-best-match
    ```
  **conda**

  - If you don't have it, install conda. A common choice is with `brew`:

    ```sh
    brew install miniconda
    ```

  - Initialize `conda` for shell interaction:

    ```sh
    conda init
    ```

    If you're on a Mac, instead use:

    ```sh
    conda init zsh
    ```

    Then restart your terminal for the changes to take effect.

  - Create a project:

    ```sh
    conda create -n example-project
    ```

  - Start the virtual environment:

    ```sh
    conda activate example-project
    ```

  - Install the `modular` conda package.

    Nightly:

    ```sh
    conda install -c conda-forge -c https://conda.modular.com/max-nightly/ modular
    ```

    Stable:

    ```sh
    conda install -c conda-forge -c https://conda.modular.com/max/ modular
    ```
  **pixi**

  - If you don't have it, install `pixi`:

    ```sh
    curl -fsSL https://pixi.sh/install.sh | sh
    ```

    Then restart your terminal for the changes to take effect.

  - Create a project:

    ```sh
    pixi init example-project \
      -c https://conda.modular.com/max-nightly/ -c conda-forge \
      && cd example-project
    ```

  - Install the `modular` conda package.

    Nightly:

    ```sh
    pixi add modular
    ```

    Stable:

    ```sh
    pixi add "modular==25.3"
    ```

  - Start the virtual environment:

    ```sh
    pixi shell
    ```
- Create a file named `main.py` with the following code:

  ```python
  from max.entrypoints.llm import LLM
  from max.pipelines import PipelineConfig


  def main():
      model_path = "modularai/Llama-3.1-8B-Instruct-GGUF"
      pipeline_config = PipelineConfig(model_path=model_path)
      llm = LLM(pipeline_config)

      prompts = [
          "In the beginning, there was",
          "I believe the meaning of life is",
          "The fastest way to learn python is",
      ]

      print("Generating responses...")
      responses = llm.generate(prompts, max_new_tokens=50)

      for i, (prompt, response) in enumerate(zip(prompts, responses)):
          print(f"========== Response {i} ==========")
          print(prompt + response)
          print()


  if __name__ == "__main__":
      main()
  ```

  This script downloads the `modularai/Llama-3.1-8B-Instruct-GGUF` model (if not already downloaded) and then runs inference locally. While the initial model download requires internet access, the inference itself is self-contained and does not send requests to a remote service.

  You can update the script to use a different model or modify the prompts to generate different responses. For a list of available models, see our Model repository. We chose the Llama-3.1-8B-Instruct-GGUF model for this example because it's not gated, meaning it's freely available without requiring special access permissions or authentication.
  For offline inference, MAX supports models in GGUF format. This includes most generative LLMs with the "Chat" modality, but the specific configuration parameters might vary between models, so always refer to the model's documentation for compatibility details and optimal configuration settings. You can also load a GGUF model from a local path, as shown in the sketch after these steps.
- Run the script:

  ```sh
  python main.py
  ```

  This command will download the model and generate responses for the prompts. You should see output like the following:
  ```
  Generating responses...
  ========== Response 0 ==========
  In the beginning, there was Andromeda. The Andromeda galaxy, that is. It's the
  closest major galaxy to our own Milky Way, and it's been a source of fascination
  for astronomers and space enthusiasts for centuries. But what if I told you that
  there's
  ========== Response 1 ==========
  I believe the meaning of life is to find your gift. The purpose of life is to give it away to others.
  I believe that the meaning of life is to find your gift. The purpose of life is to give it away to others.
  I believe that the meaning of life is
  ========== Response 2 ==========
  The fastest way to learn python is to practice with real-world projects. Here are
  some ideas for projects that you can use to learn Python:
  1. **Command Line Calculator**: Create a command line calculator that can perform
  basic arithmetic operations like addition, subtraction, multiplication, and
  division.
  ```
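Because `model_path` also accepts a local path, you can point the script at a GGUF file you've already downloaded instead of a Hugging Face repository. Here's a minimal sketch (the file path below is a hypothetical placeholder):

```python
from max.entrypoints.llm import LLM
from max.pipelines import PipelineConfig

# Hypothetical placeholder path; substitute the actual location of your
# GGUF file on disk.
local_model = "/path/to/llama-3.1-8b-instruct.gguf"

llm = LLM(PipelineConfig(model_path=local_model))
responses = llm.generate(["Offline inference lets you"], max_new_tokens=30)
print(responses[0])
```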
Next steps
For more information on offline inference, see the `LLM` and `PipelineConfig` API references.