
MAX Serve API reference

MAX Serve is a high-performance, Python-based inference server for deploying large language models (LLMs) locally or in the cloud. It provides efficient request handling through advanced batching and scheduling, and an OpenAI-compatible REST endpoint.

Getting started

With just a few commands, you can start a local endpoint with the GenAI model of your choice using our max-pipelines CLI and start sending requests—see our quickstart guide.

When you want to deploy your model to a cloud-hosted endpoint, you can use our MAX container—see our tutorial to deploy an LLM on a GPU.

OpenAI API compatibility

MAX Serve is compatible with a subset of the OpenAI REST API. This allows you to use many existing OpenAI client applications with MAX Serve endpoints.

  • Supported endpoints: chat completions, completions, embeddings, and model listing (see the sections below).

  • Parameter handling: While MAX Serve aims for high compatibility, not all OpenAI body parameters are implemented. Some are accepted as no-ops (ignored without raising errors) to maintain client compatibility. The endpoint documentation below lists only the parameters that actively affect behavior in MAX Serve.

In addition to the OpenAI APIs, MAX Serve provides a Prometheus-formatted metrics endpoint to help track your model's performance.
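Because the endpoints follow the OpenAI wire format, any HTTP client can talk to them. As a minimal sketch, assuming the server exposes the standard OpenAI path /v1/chat/completions and listens at a hypothetical http://localhost:8000, the following builds a request using only the Python standard library:

```python
import json
import urllib.request

def build_chat_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for a MAX Serve
    endpoint. base_url is wherever the server is listening (the
    localhost address below is a hypothetical example)."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://localhost:8000",
    {
        "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100,
    },
)
# urllib.request.urlopen(req) would send it once the server is running.
```

Existing OpenAI client libraries typically work the same way: point their base URL at the MAX Serve endpoint instead of api.openai.com.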

Create chat completion

Creates a completion for the given chat conversation. MAX Serve supports a subset of OpenAI's chat completion parameters. Streaming is supported, but the chat-object and chat-streaming response formats have some limitations.

Request Body schema: application/json (required)

model (string, required): ID of the model to use. Accepted by MAX Serve, and used by a frontend load balancer to route traffic to nodes.

messages (array of ChatCompletionMessage objects, required): A list of messages comprising the conversation so far. MAX Serve supports the content and role parameters.

max_tokens (integer >= 1): The maximum number of tokens to generate in the chat completion.

logprobs (boolean): Whether to return log probabilities of the output tokens.

response_format (object): An object specifying the format that the model must output.

stream (boolean, default false): If set, partial message deltas are sent as they become available.

tools (array of ChatCompletionTool objects): A list of tools the model may call.

tool_choice (string or object): Controls which tool, if any, is called by the model.

Responses

Request samples

Content type
application/json
Example
{
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "messages": [],
  "max_tokens": 100
}

Response samples

Content type
application/json
Example
{
  "id": "bb41aa81eac24f4b9ed7ecd0ea593815",
  "object": "chat.completion",
  "created": 1743183374,
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "choices": [],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {}
}
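When stream is true, the response arrives as server-sent events: one data: line per chunk, terminated by a data: [DONE] sentinel, which is the wire format OpenAI-compatible servers use. A sketch of consuming such a stream, with field names assumed to follow that format:

```python
import json

def iter_stream_deltas(lines):
    """Yield content deltas from server-sent-event lines of a
    stream=true chat completion response. Assumes the standard
    OpenAI streaming shape: data: {...} lines whose chunks carry
    choices[].delta.content, ended by data: [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if "content" in delta:
                yield delta["content"]

# Synthetic lines shaped like a streaming response:
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_deltas(sample)))  # prints "Hello"
```

In a real client, the lines would come from iterating over the HTTP response body rather than a list.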

Create completion

Creates a completion for the provided prompt. MAX Serve supports a subset of OpenAI's completion parameters.

Request Body schema: application/json (required)

model (string, required): ID of the model to use.

prompt (string or array of strings, required): The prompt to generate completions for.

echo (boolean, default false): Echo back the prompt in addition to the completion.

max_tokens (integer >= 1): The maximum number of tokens to generate in the completion.

stream (boolean, default false): Whether to stream back partial progress.

logprobs (integer >= 0): Include the log probabilities for the logprobs most likely tokens.

Responses

Request samples

Content type
application/json
Example
{
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "prompt": "Once upon a time",
  "max_tokens": 50
}

Response samples

Content type
application/json
Example
{
  "id": "2dd7bad9bd1b4caea0114cb292f5b23a",
  "object": "text_completion",
  "created": 1743183506,
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "choices": [],
  "system_fingerprint": null,
  "usage": null
}
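The generated text lives in the choices array, which is collapsed in the sample above. A small helper for pulling it out, assuming each choice follows the standard text_completion shape with a text field:

```python
def completion_texts(response: dict) -> list[str]:
    """Collect the generated text from each choice in a
    text_completion response body."""
    return [choice.get("text", "") for choice in response.get("choices", [])]

# Synthetic response shaped like the sample above; the choice
# contents are illustrative, not real server output.
resp = {
    "object": "text_completion",
    "choices": [
        {"index": 0, "text": ", there was a dragon.", "finish_reason": "length"}
    ],
}
print(completion_texts(resp))
```

With echo set to true, the prompt itself would appear at the start of each returned text.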

Create embeddings

Creates an embedding vector representing the input text. MAX Serve supports the sentence-transformers/all-mpnet-base-v2 model. Only strings and lists of strings are supported for the input parameter.

Request Body schema: application/json (required)

model (string, required): ID of the model to use. MAX Serve supports sentence-transformers/all-mpnet-base-v2.

input (string or array of strings, required): The text to embed. MAX Serve supports strings and lists of strings.

encoding_format (string, default "float", one of "float" or "base64"): The format to return the embeddings in.

dimensions (integer): The number of dimensions the resulting output embeddings should have.

user (string): A unique identifier representing your end user.

Responses

Request samples

Content type
application/json
Example
{
  "model": "sentence-transformers/all-mpnet-base-v2",
  "input": "The food was delicious and the service was excellent."
}

Response samples

Content type
application/json
Example
{
  "object": "list",
  "data": [],
  "model": "sentence-transformers/all-mpnet-base-v2"
}
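With encoding_format set to "base64", each embedding in the data array arrives as a base64 string rather than a list of floats. In the OpenAI convention this endpoint mirrors, that string packs the vector as little-endian float32 values. A decoding sketch:

```python
import base64
import struct

def decode_embedding(value):
    """Decode one embedding value from an embeddings response.
    With encoding_format "float" the value is already a list of
    floats; with "base64" it is float32 bytes, base64-encoded."""
    if isinstance(value, list):
        return value
    raw = base64.b64decode(value)
    # Each float32 is 4 bytes, little-endian.
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

# Round-trip a synthetic 3-dimensional vector:
packed = base64.b64encode(struct.pack("<3f", 0.25, -0.5, 1.0)).decode()
print(decode_embedding(packed))  # [0.25, -0.5, 1.0]
```

The base64 form is more compact on the wire, which matters when embedding large batches of text.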

List models

Lists the currently available models. MAX Serve returns only the one model that it is currently serving.

Responses

Response samples

Content type
application/json
{
  "object": "list",
  "data": []
}
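Since the data array (collapsed in the sample above) holds exactly one entry, a client can discover the served model's ID before making requests. A sketch assuming entries follow the OpenAI model-object shape with an id field:

```python
def served_model(models_response: dict):
    """Return the ID of the single model MAX Serve reports,
    or None if the list is empty."""
    data = models_response.get("data", [])
    return data[0]["id"] if data else None

# Synthetic response shaped like the sample above:
resp = {
    "object": "list",
    "data": [{"id": "modularai/Llama-3.1-8B-Instruct-GGUF", "object": "model"}],
}
print(served_model(resp))  # modularai/Llama-3.1-8B-Instruct-GGUF
```

This is handy for clients that must pass a model ID but should not hard-code one.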