
MAX Serve API reference

MAX Serve is a high-performance, Python-based inference server for deploying large language models (LLMs) locally or in the cloud. It provides efficient request handling through advanced batching and scheduling, and an OpenAI-compatible REST endpoint.

Getting started

With just a few commands, you can start a local endpoint with the GenAI model of your choice using our max-pipelines CLI and start sending requests—see our quickstart guide.

When you want to deploy your model to a cloud-hosted endpoint, you can use our MAX container—see our tutorial to deploy an LLM on a GPU.

OpenAI API compatibility

MAX Serve is compatible with a subset of the OpenAI REST API. This allows you to use many existing OpenAI client applications with MAX Serve endpoints.

  • Supported endpoints: chat completions, completions, embeddings, and model listing (see the sections below).

  • Parameter handling: While MAX Serve aims for high compatibility, not all OpenAI body parameters are implemented. Some are accepted as no-ops (ignored without raising errors) to maintain client compatibility. The endpoint documentation below lists only the parameters that actively affect behavior in MAX Serve.

In addition to the OpenAI APIs, MAX Serve provides a Prometheus-formatted metrics endpoint to help track your model's performance.
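Because the endpoints follow the OpenAI wire format, any HTTP client can talk to them. As a minimal sketch, assuming the server exposes the standard OpenAI path /v1/chat/completions and listens at a hypothetical http://localhost:8000, the following builds a request using only the Python standard library:

```python
import json
import urllib.request

def build_chat_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for a MAX Serve
    endpoint. base_url is wherever the server is listening (the
    localhost address below is a hypothetical example)."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://localhost:8000",
    {
        "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100,
    },
)
# urllib.request.urlopen(req) would send it once the server is running.
```

Existing OpenAI client libraries typically work the same way: point their base URL at the MAX Serve endpoint instead of api.openai.com.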

Create chat completion

Creates a completion for the given chat conversation. MAX Serve supports a subset of OpenAI's chat completion parameters. Streaming is supported, but the chat-object and chat-streaming response formats have some limitations.

Request Body schema: application/json (required)

model (string, required): ID of the model to use. Accepted by MAX Serve, and used by a frontend load balancer to route traffic to nodes.

messages (array of ChatCompletionMessage objects, required): A list of messages comprising the conversation so far. MAX Serve supports the content and role parameters.

max_tokens (integer >= 1): The maximum number of tokens to generate in the chat completion.

logprobs (boolean): Whether to return log probabilities of the output tokens.

response_format (object): An object specifying the format that the model must output.

stream (boolean, default false): If set, partial message deltas are sent as they become available.

tools (array of ChatCompletionTool objects): A list of tools the model may call.

tool_choice (string or object): Controls which tool, if any, is called by the model.

Responses

Request samples

Content type
application/json
Example
{
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "messages": [],
  "max_tokens": 100
}

Response samples

Content type
application/json
Example
{
  "id": "bb41aa81eac24f4b9ed7ecd0ea593815",
  "object": "chat.completion",
  "created": 1743183374,
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "choices": [],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {}
}
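When stream is true, the response arrives as server-sent events: one data: line per chunk, terminated by a data: [DONE] sentinel, which is the wire format OpenAI-compatible servers use. A sketch of consuming such a stream, with field names assumed to follow that format:

```python
import json

def iter_stream_deltas(lines):
    """Yield content deltas from server-sent-event lines of a
    stream=true chat completion response. Assumes the standard
    OpenAI streaming shape: data: {...} lines whose chunks carry
    choices[].delta.content, ended by data: [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if "content" in delta:
                yield delta["content"]

# Synthetic lines shaped like a streaming response:
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_deltas(sample)))  # prints "Hello"
```

In a real client, the lines would come from iterating over the HTTP response body rather than a list.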

Create completion

Creates a completion for the provided prompt. MAX Serve supports a subset of OpenAI's completion parameters.

Request Body schema: application/json (required)

model (string, required): ID of the model to use.

prompt (string or array of strings, required): The prompt to generate completions for.

echo (boolean, default false): Echo back the prompt in addition to the completion.

max_tokens (integer >= 1): The maximum number of tokens to generate in the completion.

stream (boolean, default false): Whether to stream back partial progress.

logprobs (integer >= 0): Include the log probabilities for the logprobs most likely tokens.

Responses

Request samples

Content type
application/json
Example
{
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "prompt": "Once upon a time",
  "max_tokens": 50
}

Response samples

Content type
application/json
Example
{
  "id": "2dd7bad9bd1b4caea0114cb292f5b23a",
  "object": "text_completion",
  "created": 1743183506,
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "choices": [],
  "system_fingerprint": null,
  "usage": null
}
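The generated text lives in the choices array, which is collapsed in the sample above. A small helper for pulling it out, assuming each choice follows the standard text_completion shape with a text field:

```python
def completion_texts(response: dict) -> list[str]:
    """Collect the generated text from each choice in a
    text_completion response body."""
    return [choice.get("text", "") for choice in response.get("choices", [])]

# Synthetic response shaped like the sample above; the choice
# contents are illustrative, not real server output.
resp = {
    "object": "text_completion",
    "choices": [
        {"index": 0, "text": ", there was a dragon.", "finish_reason": "length"}
    ],
}
print(completion_texts(resp))
```

With echo set to true, the prompt itself would appear at the start of each returned text.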

Create embeddings

Creates an embedding vector representing the input text. MAX Serve supports the sentence-transformers/all-mpnet-base-v2 model. Only strings and lists of strings are supported for the input parameter.

Request Body schema: application/json (required)

model (string, required): ID of the model to use. MAX Serve supports sentence-transformers/all-mpnet-base-v2.

input (string or array of strings, required): The text to embed. MAX Serve supports strings and lists of strings.

encoding_format (string, default "float", one of "float" or "base64"): The format to return the embeddings in.

dimensions (integer): The number of dimensions the resulting output embeddings should have.

user (string): A unique identifier representing your end user.

Responses

Request samples

Content type
application/json
Example
{
  "model": "sentence-transformers/all-mpnet-base-v2",
  "input": "The food was delicious and the service was excellent."
}

Response samples

Content type
application/json
Example
{
  "object": "list",
  "data": [],
  "model": "sentence-transformers/all-mpnet-base-v2"
}
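With encoding_format set to "base64", each embedding in the data array arrives as a base64 string rather than a list of floats. In the OpenAI convention this endpoint mirrors, that string packs the vector as little-endian float32 values. A decoding sketch:

```python
import base64
import struct

def decode_embedding(value):
    """Decode one embedding value from an embeddings response.
    With encoding_format "float" the value is already a list of
    floats; with "base64" it is float32 bytes, base64-encoded."""
    if isinstance(value, list):
        return value
    raw = base64.b64decode(value)
    # Each float32 is 4 bytes, little-endian.
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

# Round-trip a synthetic 3-dimensional vector:
packed = base64.b64encode(struct.pack("<3f", 0.25, -0.5, 1.0)).decode()
print(decode_embedding(packed))  # [0.25, -0.5, 1.0]
```

The base64 form is more compact on the wire, which matters when embedding large batches of text.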

List models

Lists the currently available models. MAX Serve returns only the one model that it is currently serving.

Responses

Response samples

Content type
application/json
{
  "object": "list",
  "data": []
}
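Since the data array (collapsed in the sample above) holds exactly one entry, a client can discover the served model's ID before making requests. A sketch assuming entries follow the OpenAI model-object shape with an id field:

```python
def served_model(models_response: dict):
    """Return the ID of the single model MAX Serve reports,
    or None if the list is empty."""
    data = models_response.get("data", [])
    return data[0]["id"] if data else None

# Synthetic response shaped like the sample above:
resp = {
    "object": "list",
    "data": [{"id": "modularai/Llama-3.1-8B-Instruct-GGUF", "object": "model"}],
}
print(served_model(resp))  # modularai/Llama-3.1-8B-Instruct-GGUF
```

This is handy for clients that must pass a model ID but should not hard-code one.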