
Start a chat endpoint
The MAX framework simplifies the process to serve open source models with the same API interface as OpenAI. This allows you to replace commercial models with alternatives from the MAX Builds site with minimal code changes.
This tutorial shows you how to serve Llama 3.1 locally with the max
CLI and interact with it through REST and Python APIs. You'll learn to configure
the server and make requests using the OpenAI client libraries as a drop-in
replacement.
System requirements:
Mac
Linux
WSL
Set up your environment
Create a Python project to install our APIs and CLI tools:
- pip
- uv
- magic
- Create a project folder:
mkdir chat-tutorial && cd chat-tutorial
mkdir chat-tutorial && cd chat-tutorial
- Create and activate a virtual environment:
python3 -m venv .venv/chat-tutorial \
&& source .venv/chat-tutorial/bin/activatepython3 -m venv .venv/chat-tutorial \
&& source .venv/chat-tutorial/bin/activate - Install the
modular
Python package:- Nightly
- Stable
pip install modular \
--extra-index-url https://download.pytorch.org/whl/cpu \
--extra-index-url https://dl.modular.com/public/nightly/python/simple/pip install modular \
--extra-index-url https://download.pytorch.org/whl/cpu \
--extra-index-url https://dl.modular.com/public/nightly/python/simple/pip install modular \
--extra-index-url https://download.pytorch.org/whl/cpupip install modular \
--extra-index-url https://download.pytorch.org/whl/cpu
- Install
uv
:curl -LsSf https://astral.sh/uv/install.sh | sh
curl -LsSf https://astral.sh/uv/install.sh | sh
Then restart your terminal to make
uv
accessible. - Create a project:
uv init chat-tutorial && cd chat-tutorial
uv init chat-tutorial && cd chat-tutorial
- Create and start a virtual environment:
uv venv && source .venv/bin/activate
uv venv && source .venv/bin/activate
- Install the
modular
Python package:- Nightly
- Stable
uv pip install modular \
--extra-index-url https://download.pytorch.org/whl/cpu \
--extra-index-url https://dl.modular.com/public/nightly/python/simple/ \
--index-strategy unsafe-best-matchuv pip install modular \
--extra-index-url https://download.pytorch.org/whl/cpu \
--extra-index-url https://dl.modular.com/public/nightly/python/simple/ \
--index-strategy unsafe-best-matchuv pip install modular \
--extra-index-url https://download.pytorch.org/whl/cpu \
--index-strategy unsafe-best-matchuv pip install modular \
--extra-index-url https://download.pytorch.org/whl/cpu \
--index-strategy unsafe-best-match
- Install
magic
:curl -ssL https://magic.modular.com/ | bash
curl -ssL https://magic.modular.com/ | bash
Then run the
source
command that's printed in your terminal. - Create a project:
magic init chat-tutorial --format pyproject && cd chat-tutorial
magic init chat-tutorial --format pyproject && cd chat-tutorial
- Install the
max-pipelines
conda package:- Nightly
- Stable
magic add max-pipelines
magic add max-pipelines
magic add "max-pipelines==25.3"
magic add "max-pipelines==25.3"
- Start the virtual environment:
magic shell
magic shell
Serve your model
Use the max serve
command to start a
local model server with the Llama 3.1 model:
max serve \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF
max serve \
--model-path modularai/Llama-3.1-8B-Instruct-GGUF
While this example uses the Llama 3.1 model, you can replace it with any of the models listed in the MAX Builds site.
The server is ready when you see a message indicating it's running on http://0.0.0.0:8000:
Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
For a complete list of max
CLI commands and options, refer to the
MAX CLI reference.
Interact with the model
After the server is running, you can interact with the model using different
methods. The MAX endpoint supports OpenAI REST APIs, so you can
send requests from your client using the openai
Python API.
- Python
- cURL
You can use OpenAI's Python client to interact with the model.
To get started, install the OpenAI Python client:
pip install openai
pip install openai
Then, create a client and make a request to the model:
from openai import OpenAI
client = OpenAI(
base_url = 'http://0.0.0.0:8000/v1',
api_key='EMPTY', # required by the API, but not used by MAX
)
response = client.chat.completions.create(
model="modularai/Llama-3.1-8B-Instruct-GGUF",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"},
{"role": "assistant", "content": "The LA Dodgers won in 2020."},
{"role": "user", "content": "Where was it played?"}
]
)
print(response.choices[0].message.content)
from openai import OpenAI
client = OpenAI(
base_url = 'http://0.0.0.0:8000/v1',
api_key='EMPTY', # required by the API, but not used by MAX
)
response = client.chat.completions.create(
model="modularai/Llama-3.1-8B-Instruct-GGUF",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"},
{"role": "assistant", "content": "The LA Dodgers won in 2020."},
{"role": "user", "content": "Where was it played?"}
]
)
print(response.choices[0].message.content)
In this example, you're using the OpenAI Python client to interact with the MAX
endpoint running on local host 8000
. The client
object is initialized with
the base URL http://0.0.0.0:8000/v1
and the API key is ignored.
When you run this code, the model should respond with information about the 2020 World Series location:
python generate-text.py
python generate-text.py
The 2020 World Series was played at Globe Life Field in Arlington, Texas. It was a neutral site due to the COVID-19 pandemic.
The 2020 World Series was played at Globe Life Field in Arlington, Texas. It was a neutral site due to the COVID-19 pandemic.
The following curl
command sends a simple chat request to the model's chat
completions endpoint:
curl http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "modularai/Llama-3.1-8B-Instruct-GGUF",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello, how are you?"
}
],
"max_tokens": 100
}'
curl http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "modularai/Llama-3.1-8B-Instruct-GGUF",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello, how are you?"
}
],
"max_tokens": 100
}'
You should receive a response similar to this:
{
"id": "18b0abd2d2fd463ea43efe2c147bcac0",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": " I'm doing well, thank you for asking. How can I assist you today?",
"refusal": "",
"tool_calls": null,
"role": "assistant",
"function_call": null
},
"logprobs": {
"content": [],
"refusal": []
}
}
],
"created": 1743543698,
"model": "modularai/Llama-3.1-8B-Instruct-GGUF",
"service_tier": null,
"system_fingerprint": null,
"object": "chat.completion",
"usage": {
"completion_tokens": 17,
"prompt_tokens": null,
"total_tokens": 17
}
}
{
"id": "18b0abd2d2fd463ea43efe2c147bcac0",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": " I'm doing well, thank you for asking. How can I assist you today?",
"refusal": "",
"tool_calls": null,
"role": "assistant",
"function_call": null
},
"logprobs": {
"content": [],
"refusal": []
}
}
],
"created": 1743543698,
"model": "modularai/Llama-3.1-8B-Instruct-GGUF",
"service_tier": null,
"system_fingerprint": null,
"object": "chat.completion",
"usage": {
"completion_tokens": 17,
"prompt_tokens": null,
"total_tokens": 17
}
}
For complete details on all available API endpoints and options, see the MAX Serve API documentation.
Next steps
Now that you have successfully set up MAX with OpenAI-compatible endpoints, checkout out these other tutorials:
Did this tutorial work for you?
Thank you! We'll create more content like this.
Thank you for helping us improve!