MAX Serve API reference
MAX Serve is a high-performance, Python-based inference server for deploying large language models (LLMs) locally or in the cloud. It provides efficient request handling through advanced batching and scheduling, and an OpenAI-compatible REST endpoint.
With just a few commands, you can start a local endpoint with the GenAI model of your choice using our max-pipelines CLI and start sending requests (see our quickstart guide). When you want to deploy your model to a cloud-hosted endpoint, you can use our MAX container (see our tutorial to deploy an LLM on a GPU).
MAX Serve is compatible with a subset of the OpenAI REST API. This allows you to use many existing OpenAI client applications with MAX Serve endpoints.
Supported endpoints:

- Chat completions (`/v1/chat/completions`)
- Completions (`/v1/completions`)
- Embeddings (`/v1/embeddings`)
- List models (`/v1/models`)
Parameter handling: While MAX Serve aims for high compatibility, not all OpenAI body parameters are implemented. Some may be accepted as no-ops (ignored without raising errors) to maintain client compatibility. The endpoint documentation below describes only the parameters that actively affect behavior in MAX Serve.
In addition to the OpenAI APIs, MAX Serve provides a Prometheus-formatted metrics endpoint to help track your model's performance.
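Because the endpoints are OpenAI-compatible, you can point the official OpenAI Python client at a running MAX Serve instance instead of api.openai.com. Here's a minimal sketch; the base URL (localhost, port 8000) and the placeholder API key are assumptions, so substitute the host and port your server actually reports at startup.

```python
from openai import OpenAI

# Point the client at a local MAX Serve endpoint instead of api.openai.com.
# The host and port are assumptions; use whatever your server prints at startup.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # placeholder; a local endpoint typically needs no real key
)

# Any OpenAI-compatible call now goes to the MAX Serve endpoint.
models = client.models.list()
print(models.data[0].id)
```

The same client setup is assumed in the per-endpoint examples below.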
Create chat completion
Creates a completion for the chat message. MAX Serve supports a subset of OpenAI's chat completion parameters. Streaming is supported, but chat-object and chat-streaming response formats have limitations.
Request body schema: application/json (required)

| Parameter | Type | Description |
| --- | --- | --- |
| model (required) | string | ID of the model to use. Supported by MAX Serve but handled by the frontend load balancer to route traffic to nodes. |
| messages (required) | Array of objects (ChatCompletionMessage) | A list of messages comprising the conversation so far. MAX Serve supports the content and role parameters. |
| max_tokens | integer >= 1 | The maximum number of tokens to generate in the chat completion. |
| logprobs | boolean | Whether to return log probabilities of the output tokens or not. |
| response_format | object | An object specifying the format that the model must output. |
| stream | boolean, default: false | If set, partial message deltas will be sent. |
| tools | Array of objects (ChatCompletionTool) | A list of tools the model may call. |
| tool_choice | string or object | Controls which (if any) tool is called by the model. |
Responses
Request samples
```json
{
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello, how are you?"
    }
  ],
  "max_tokens": 100
}
```
Response samples
200:

```json
{
  "id": "bb41aa81eac24f4b9ed7ecd0ea593815",
  "object": "chat.completion",
  "created": 1743183374,
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " I'm doing well, thank you for asking. How can I assist you today?",
        "refusal": "",
        "tool_calls": null,
        "function_call": null
      },
      "finish_reason": "stop",
      "logprobs": {
        "content": [],
        "refusal": []
      }
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 17,
    "prompt_tokens": null,
    "total_tokens": 17
  }
}
```
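As a sketch of the request above using the OpenAI Python client (the localhost base URL is an assumption, not part of this reference):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Non-streaming request mirroring the payload sample above.
response = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"},
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)

# With stream=True, the server sends partial message deltas instead.
stream = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```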
Create completion
Creates a completion for the provided prompt. MAX Serve supports a subset of OpenAI's completion parameters.
Request body schema: application/json (required)

| Parameter | Type | Description |
| --- | --- | --- |
| model (required) | string | ID of the model to use. |
| prompt (required) | string or Array of strings | The prompt to generate completions for. |
| echo | boolean, default: false | Echo back the prompt in addition to the completion. |
| max_tokens | integer >= 1 | The maximum number of tokens to generate in the completion. |
| stream | boolean, default: false | Whether to stream back partial progress. |
| logprobs | integer >= 0 | Include the log probabilities on the logprobs most likely tokens. |
Responses
Request samples
```json
{
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "prompt": "Once upon a time",
  "max_tokens": 50
}
```
Response samples
200:

```json
{
  "id": "2dd7bad9bd1b4caea0114cb292f5b23a",
  "object": "text_completion",
  "created": 1743183506,
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "choices": [
    {
      "index": 0,
      "text": " The story of a young girl who was born with a rare condition that made her skin extremely sensitive to the sun. She had to stay indoors during the day and only go out at night. She was a bit of an outcast among her peers,",
      "finish_reason": "stop",
      "logprobs": {
        "text_offset": null,
        "token_logprobs": [],
        "tokens": null,
        "top_logprobs": []
      }
    }
  ],
  "system_fingerprint": null,
  "usage": null
}
```
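The equivalent call with the OpenAI Python client might look like the following sketch (again assuming a local server at http://localhost:8000):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Plain text completion mirroring the payload sample above.
completion = client.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    prompt="Once upon a time",
    max_tokens=50,
)
print(completion.choices[0].text)
```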
Create embeddings
Creates an embedding vector representing the input text. MAX Serve supports the sentence-transformers/all-mpnet-base-v2 model. Only strings and lists of strings are supported for the input parameter.
Request body schema: application/json (required)

| Parameter | Type | Description |
| --- | --- | --- |
| model (required) | string | ID of the model to use. MAX Serve supports sentence-transformers/all-mpnet-base-v2. |
| input (required) | string or Array of strings | The text to embed. MAX Serve supports strings and lists of strings. |
| encoding_format | string, default: "float", enum: "float", "base64" | The format to return the embeddings in. |
| dimensions | integer | The number of dimensions the resulting output embeddings should have. |
| user | string | A unique identifier representing your end-user. |
Responses
Request samples
```json
{
  "model": "sentence-transformers/all-mpnet-base-v2",
  "input": "The food was delicious and the service was excellent."
}
```
Response samples
200:

```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        0.0023064255,
        -0.009327292,
        0.01842359,
        -0.0028842522,
        0.0044457657
      ],
      "index": 0
    }
  ],
  "model": "sentence-transformers/all-mpnet-base-v2"
}
```
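A sketch of the same request with the OpenAI Python client (localhost URL assumed):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Embed a single string; a list of strings is also accepted for input.
result = client.embeddings.create(
    model="sentence-transformers/all-mpnet-base-v2",
    input="The food was delicious and the service was excellent.",
)
embedding = result.data[0].embedding
print(len(embedding))  # vector dimensionality (768 for all-mpnet-base-v2)
```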
List models
Lists the currently available models. MAX Serve returns only the model that it is currently serving.
Responses
Response samples
200:

```json
{
  "object": "list",
  "data": [
    {
      "id": "modularai/Llama-3.1-8B-Instruct-GGUF",
      "object": "model",
      "created": null,
      "owned_by": ""
    }
  ]
}
```
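Since /v1/models takes no request body, a plain HTTP GET is enough to verify that a server is up. Here's a sketch with the requests library (the localhost URL is an assumption):

```python
import requests

# Plain GET against the list-models endpoint; no request body is needed.
resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])
```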