Quickstart
In this quickstart guide, you'll learn how to install Modular in a Python environment and run inference with a GenAI model. We'll first use our Python API to run offline inference, then start a local endpoint and send inference requests to it with the OpenAI Python client.
System requirements:

- Mac
- Linux
- WSL
- Docker
Set up your project
First, install the `max` CLI and Python library:
Choose whichever package manager you prefer (pip, uv, conda, or pixi) and follow the matching steps below.

Using pip:

- Create a project folder:

  ```bash
  mkdir example-project && cd example-project
  ```

- Create and activate a virtual environment:

  ```bash
  python3 -m venv .venv/example-project \
    && source .venv/example-project/bin/activate
  ```

- Install the `modular` Python package.

  Nightly:

  ```bash
  pip install modular \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://dl.modular.com/public/nightly/python/simple/
  ```

  Stable:

  ```bash
  pip install modular \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://modular.gateway.scarf.sh/simple/
  ```
Using uv:

- If you don't have it, install uv:

  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

  Then restart your terminal to make uv accessible.

- Create a project:

  ```bash
  uv init example-project && cd example-project
  ```

- Create and start a virtual environment:

  ```bash
  uv venv && source .venv/bin/activate
  ```

- Install the `modular` Python package.

  Nightly:

  ```bash
  uv pip install modular \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://dl.modular.com/public/nightly/python/simple/ \
    --index-strategy unsafe-best-match
  ```

  Stable:

  ```bash
  uv pip install modular \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://modular.gateway.scarf.sh/simple/ \
    --index-strategy unsafe-best-match
  ```
Using conda:

- If you don't have it, install conda. A common choice is with brew:

  ```bash
  brew install miniconda
  ```

- Initialize conda for shell interaction:

  ```bash
  conda init
  ```

  If you're on a Mac, instead use:

  ```bash
  conda init zsh
  ```

  Then restart your terminal for the changes to take effect.

- Create a project:

  ```bash
  conda create -n example-project
  ```

- Start the virtual environment:

  ```bash
  conda activate example-project
  ```

- Install the `modular` conda package.

  Nightly:

  ```bash
  conda install -c conda-forge -c https://conda.modular.com/max-nightly/ modular
  ```

  Stable:

  ```bash
  conda install -c conda-forge -c https://conda.modular.com/max/ modular
  ```
Using pixi:

- If you don't have it, install pixi:

  ```bash
  curl -fsSL https://pixi.sh/install.sh | sh
  ```

  Then restart your terminal for the changes to take effect.

- Create a project:

  ```bash
  pixi init example-project \
    -c https://conda.modular.com/max-nightly/ -c conda-forge \
    && cd example-project
  ```

- Install the `modular` conda package.

  Nightly:

  ```bash
  pixi add modular
  ```

  Stable:

  ```bash
  pixi add "modular==25.3"
  ```

- Start the virtual environment:

  ```bash
  pixi shell
  ```
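Regardless of which package manager you chose, you can optionally confirm that the package installed correctly. This is just a quick sanity check that imports the same module used in the example below; it should print a message and exit without errors:

```bash
# Optional check: succeeds only if the modular package is importable
# in the currently active environment.
python -c "from max.entrypoints.llm import LLM; print('MAX is installed')"
```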
Run offline inference
You can run inference locally with the `max` Python API. Just specify the Hugging Face model you want and then generate results with one or more prompts.

In this example, we use a Llama 3.1 model that's not gated on Hugging Face, so you don't need an access token. Save the following script as offline-inference.py:
```python
from max.entrypoints.llm import LLM
from max.pipelines import PipelineConfig
from max.serve.config import Settings


def main():
    model_path = "modularai/Llama-3.1-8B-Instruct-GGUF"
    pipeline_config = PipelineConfig(model_path=model_path)
    settings = Settings()
    llm = LLM(settings, pipeline_config)

    prompts = [
        "In the beginning, there was",
        "I believe the meaning of life is",
        "The fastest way to learn python is",
    ]

    print("Generating responses...")
    responses = llm.generate(prompts, max_new_tokens=50)

    for i, (prompt, response) in enumerate(zip(prompts, responses)):
        print(f"========== Response {i} ==========")
        print(prompt + response)
        print()


if __name__ == "__main__":
    main()
```
Run it and you should see a response similar to this:
```bash
python offline-inference.py
```
```
========== Response 0 ==========
In the beginning, there was Andromeda. The Andromeda galaxy, that is. It's the closest major galaxy to our own Milky Way, and it's been a source of fascination for astronomers and space enthusiasts for centuries. But what if I told you that there's

========== Response 1 ==========
I believe the meaning of life is to find your gift. The purpose of life is to give it away to others.
I believe that the meaning of life is to find your gift. The purpose of life is to give it away to others.
I believe that the meaning of life is

========== Response 2 ==========
The fastest way to learn python is to practice with real-world projects. Here are some ideas for projects that you can use to learn Python:
1. **Command Line Calculator**: Create a command line calculator that can perform basic arithmetic operations like addition, subtraction, multiplication, and division.
```
More information about this API is available in the offline inference guide.
Run inference with an endpoint
Now let's start a local server that runs the model using an OpenAI-compatible endpoint:
- Install the `openai` client library, using the same package manager you set up earlier.

  pip:

  ```bash
  pip install openai
  ```

  uv:

  ```bash
  uv add openai
  ```

  pixi:

  ```bash
  pixi add openai
  ```
- Start the endpoint with the `max` CLI:

  ```bash
  max serve --model-path=modularai/Llama-3.1-8B-Instruct-GGUF
  ```
- Create a new file named generate-text.py that sends an inference request:

  ```python
  from openai import OpenAI

  client = OpenAI(
      base_url="http://0.0.0.0:8000/v1",
      api_key="EMPTY",
  )

  completion = client.chat.completions.create(
      model="modularai/Llama-3.1-8B-Instruct-GGUF",
      messages=[
          {
              "role": "user",
              "content": "Who won the world series in 2020?"
          },
      ],
  )

  print(completion.choices[0].message.content)
  ```

  Notice that the OpenAI API requires the `api_key` argument, but our endpoint doesn't use it.

- Run it and you should see results like this:

  ```bash
  python generate-text.py
  ```

  ```
  The Los Angeles Dodgers won the 2020 World Series. They defeated the Tampa Bay Rays in the series 4 games to 2. This was the Dodgers' first World Series title since 1988.
  ```
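Because the server exposes a standard OpenAI-compatible chat completions route, you can also skip the Python client and query it directly. Here's a rough curl equivalent of the script above, assuming the `max serve` endpoint from the previous steps is still running on port 8000:

```bash
# Same request as generate-text.py, sent directly to the
# OpenAI-compatible /v1/chat/completions route.
curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }'
```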
That's it. You just served Llama 3.1 on your local CPU and ran inference using our OpenAI-compatible Serve API.
You can also deploy the same endpoint to a cloud GPU using our Docker container.
To run a different model, change the `--model-path` to something else from our model repository; see the sketch below.
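For example, swapping in another model is just a change to the path. The model name below is purely hypothetical for illustration; check the model repository for models that are actually listed and supported:

```bash
# Hypothetical example: replace the path with any model
# listed in the model repository.
max serve --model-path=Qwen/Qwen2.5-0.5B-Instruct
```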
Stay in touch
Get the latest updates
Stay up to date with announcements and releases. We're moving fast over here.
Talk to an AI Expert
Connect with our product experts to explore how we can help you deploy and serve AI models with high performance, scalability, and cost-efficiency.
Try a tutorial
For a more detailed walkthrough of how to build and deploy with MAX, check out these tutorials.