Offline inference
Offline inference with MAX allows you to run large language models directly in Python without relying on external API endpoints. This is in contrast to online inference, where you would send requests to a remote service.
When to use offline inference
Use offline inference when you want to perform model inference without running a separate inference server. This typically includes cases where you need to process a batch of inputs concurrently.
This approach benefits tasks that require high throughput and can run in a controlled environment, such as data preprocessing, model evaluation, or batch processing of large datasets.
How offline inference works
The core of offline inference is the `LLM` class, which provides a Python interface to load and run language models. Specify the model from a Hugging Face repository or a local path, and MAX handles downloading it. You can pass model settings such as `max_length` as kwargs to `MAXModelConfig`, and runtime settings such as `max_num_steps` to `PipelineRuntimeConfig`, then pass these sub-configs to `PipelineConfig`. Finally, call `generate()` to produce text from the model.
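As a minimal sketch of that flow, assuming the `max_length` kwarg described above (the value `2048` is illustrative, and the keyword for attaching `PipelineRuntimeConfig` to `PipelineConfig` is deliberately omitted; check the API reference for exact signatures):

```python
from max.entrypoints.llm import LLM
from max.pipelines import MAXModelConfig, PipelineConfig

# Model settings (such as max_length) are passed as kwargs to
# MAXModelConfig; max_length=2048 is an illustrative value, not a
# requirement of this model.
model_config = MAXModelConfig(
    model_path="modularai/Llama-3.1-8B-Instruct-GGUF",
    max_length=2048,
)

# Runtime settings (such as max_num_steps) belong on
# PipelineRuntimeConfig, which is passed to PipelineConfig alongside
# the model config; that wiring is omitted here rather than assumed.
pipeline_config = PipelineConfig(model=model_config)

llm = LLM(pipeline_config)
print(llm.generate(["The capital of France is"], max_new_tokens=20)[0])
```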
Quickstart
This quickstart demonstrates how to run offline inference with a Hugging Face model using MAX in Python.
1. Set up your project, using either `pixi` or `uv`.

   Using pixi:

   - If you don't have it, install `pixi`:

     ```sh
     curl -fsSL https://pixi.sh/install.sh | sh
     ```

     Then restart your terminal for the changes to take effect.

   - Create a project:

     ```sh
     pixi init offline-quickstart \
       -c https://conda.modular.com/max-nightly/ -c conda-forge \
       && cd offline-quickstart
     ```

   - Install `modular` (nightly):

     ```sh
     pixi add modular
     ```

   - Start the virtual environment:

     ```sh
     pixi shell
     ```

   Using uv:

   - If you don't have it, install `uv`:

     ```sh
     curl -LsSf https://astral.sh/uv/install.sh | sh
     ```

     Then restart your terminal to make `uv` accessible.

   - Create a project:

     ```sh
     uv init offline-quickstart && cd offline-quickstart
     ```

   - Create and start a virtual environment:

     ```sh
     uv venv && source .venv/bin/activate
     ```

   - Install `modular` (nightly):

     ```sh
     uv pip install modular \
       --index https://whl.modular.com/nightly/simple/ \
       --prerelease allow
     ```
2. Create a file named `main.py` with the following code:

   ```python
   from max.entrypoints.llm import LLM
   from max.pipelines import MAXModelConfig, PipelineConfig


   def main():
       model_path = "modularai/Llama-3.1-8B-Instruct-GGUF"
       pipeline_config = PipelineConfig(
           model=MAXModelConfig(model_path=model_path)
       )
       llm = LLM(pipeline_config)

       prompts = [
           "In the beginning, there was",
           "I believe the meaning of life is",
           "The fastest way to learn python is",
       ]

       print("Generating responses...")
       responses = llm.generate(prompts, max_new_tokens=50)

       for i, (prompt, response) in enumerate(
           zip(prompts, responses, strict=True)
       ):
           print(f"========== Response {i} ==========")
           print(prompt + response)
           print()


   if __name__ == "__main__":
       main()
   ```

   For offline inference, specific configuration parameters might vary between models. Always refer to the model's documentation for compatibility details and optimal configuration settings.
3. Run the script:

   ```sh
   python main.py
   ```

   You should see a response similar to the following:

   ```
   Generating responses...
   ========== Response 0 ==========
   In the beginning, there was Andromeda. The Andromeda galaxy, that is. It's the closest major galaxy to our own Milky Way, and it's been a source of fascination for astronomers and space enthusiasts for centuries. But what if I told you that there's
   ========== Response 1 ==========
   I believe the meaning of life is to find your gift. The purpose of life is to give it away to others. I believe that the meaning of life is to find your gift. The purpose of life is to give it away to others. I believe that the meaning of life is
   ========== Response 2 ==========
   The fastest way to learn python is to practice with real-world projects. Here are some ideas for projects that you can use to learn Python:

   1. **Command Line Calculator**: Create a command line calculator that can perform basic arithmetic operations like addition, subtraction, multiplication, and division.
   ```
This code downloads the `modularai/Llama-3.1-8B-Instruct-GGUF` model (if it's not already downloaded) and runs inference locally. This example uses Llama-3.1-8B-Instruct-GGUF because it's not gated, meaning it doesn't require authentication with Hugging Face. If you'd like to use a different model, see our supported models.
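Because `generate()` accepts a list of prompts, you can extend the quickstart to the batch-processing use case described earlier. The following is a sketch under stated assumptions: the `batched()` helper and the chunk size of 16 are illustrative choices, not part of the MAX API.

```python
from max.entrypoints.llm import LLM
from max.pipelines import MAXModelConfig, PipelineConfig


def batched(items, size):
    """Yield successive chunks of `size` items from a list."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


def main():
    pipeline_config = PipelineConfig(
        model=MAXModelConfig(
            model_path="modularai/Llama-3.1-8B-Instruct-GGUF"
        )
    )
    llm = LLM(pipeline_config)

    # Stand-in for a real dataset of prompts.
    prompts = [f"Write a one-line summary of topic {n}." for n in range(100)]

    # Feed the model one chunk at a time instead of all prompts at once.
    for chunk in batched(prompts, 16):
        responses = llm.generate(chunk, max_new_tokens=50)
        for prompt, response in zip(chunk, responses, strict=True):
            print(f"{prompt} -> {response}")


if __name__ == "__main__":
    main()
```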
Next steps