MAX REST API reference
MAX is a high-performance, Python-based inference server for deploying large language models (LLMs) locally or in the cloud. It provides efficient request handling through advanced batching and scheduling, and an OpenAI-compatible REST endpoint.
With just a few commands, you can start a local endpoint serving the GenAI model of your choice using the max CLI and begin sending requests. When you're ready to deploy your model to a cloud-hosted endpoint, you can use the MAX container. See Get started with MAX, then follow the tutorials on using MAX.
The MAX REST API is compatible with a subset of the OpenAI REST API. This allows you to use many existing OpenAI client applications with MAX endpoints.
Supported endpoints:

- Chat completions (`/v1/chat/completions`)
- Completions (`/v1/completions`)
- Embeddings (`/v1/embeddings`)
- Batches (`/v1/batches`). The `/v1/batches` API is only available to users in the Mammoth public preview.
- List models (`/v1/models`)
Parameter handling: While aiming for high compatibility, not all OpenAI body parameters are implemented. Some may be accepted as no-ops (ignored but won't cause errors) to maintain client compatibility. The specific endpoint documentation below details only the parameters that actively affect behavior in MAX.
In addition to the OpenAI APIs, MAX provides a Prometheus-formatted metrics endpoint to help track your model's performance.
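As a minimal sketch, you can scrape the metrics endpoint with the standard library alone. The `http://0.0.0.0:8000` address is taken from the local server shown in the samples below, and the `/metrics` path is the conventional Prometheus scrape path; both are assumptions to verify against your deployment.

```python
from urllib import request

# Assumed default local server address; /metrics is the conventional
# Prometheus scrape path (verify against your MAX deployment).
METRICS_URL = "http://0.0.0.0:8000/metrics"

req = request.Request(METRICS_URL, method="GET")

# Uncomment to fetch and print metric samples (requires a running server):
# with request.urlopen(req) as resp:
#     for line in resp.read().decode("utf-8").splitlines():
#         if line and not line.startswith("#"):  # skip HELP/TYPE comment lines
#             print(line)
```

Prometheus text output interleaves `# HELP` and `# TYPE` comment lines with the metric samples themselves, which is why the loop filters lines starting with `#`.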
Create chat completion
Creates a model response for the given chat conversation. MAX supports a subset of OpenAI's chat completion parameters. Streaming is supported, but the chat-object and chat-streaming response formats have some limitations.
Request body schema: application/json (required)
- `model` *(string, required)*: ID of the model to use. Supported by MAX, but handled by the frontend load balancer to route traffic to nodes.
- `messages` *(array of ChatCompletionMessage objects, required)*: A list of messages comprising the conversation so far. MAX supports the `content` and `role` parameters.
- `max_tokens` *(integer >= 1)*: The maximum number of tokens to generate in the chat completion.
- `logprobs` *(boolean)*: Whether to return log probabilities of the output tokens.
- `response_format` *(object)*: An object specifying the format that the model must output.
- `stream` *(boolean, default: false)*: If set, partial message deltas are sent as they become available.
- `tools` *(array of ChatCompletionTool objects)*: A list of tools the model may call.
- `tool_choice` *(string or object)*: Controls which (if any) tool is called by the model.
Responses
Request samples
- Payload
- Python
- curl
{
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "messages": [
    {
      "role": "user",
      "content": "Who won the world series in 2020?"
    }
  ],
  "max_tokens": 100
}
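The payload above can be POSTed with only the Python standard library. This is a sketch, not the only client option; the `http://0.0.0.0:8000` base address is an assumption taken from the local server used in the List models sample, so adjust it for your deployment.

```python
import json
from urllib import request

# Payload mirrors the request sample above.
payload = {
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [
        {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    "max_tokens": 100,
}

req = request.Request(
    "http://0.0.0.0:8000/v1/chat/completions",  # assumed local server address
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment to send the request (requires a running server):
# with request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

Because `data` is set, `urllib` issues a POST automatically; any OpenAI-compatible client pointed at the same base URL works equally well.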
Response samples
- 200
- 400
{
  "id": "bb41aa81eac24f4b9ed7ecd0ea593815",
  "object": "chat.completion",
  "created": 1743183374,
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The Los Angeles Dodgers won the 2020 World Series. They defeated the Tampa Bay Rays in the series 4 games to 2. This was the Dodgers' first World Series title since 1988.",
        "refusal": "",
        "tool_calls": null,
        "function_call": null
      },
      "finish_reason": "stop",
      "logprobs": {
        "content": [],
        "refusal": []
      }
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 17,
    "prompt_tokens": null,
    "total_tokens": 17
  }
}
Create completion
Creates a completion for the provided prompt. MAX supports a subset of OpenAI's completion parameters.
Request body schema: application/json (required)
- `model` *(string, required)*: ID of the model to use.
- `prompt` *(string or array of strings, required)*: The prompt to generate completions for.
- `echo` *(boolean, default: false)*: Echo back the prompt in addition to the completion.
- `max_tokens` *(integer >= 1)*: The maximum number of tokens to generate in the completion.
- `stream` *(boolean, default: false)*: Whether to stream back partial progress.
- `logprobs` *(integer >= 0)*: Include the log probabilities of the `logprobs` most likely tokens.
Responses
Request samples
- Payload
- Python
- curl
{
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "prompt": "Once upon a time",
  "max_tokens": 50
}
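The same standard-library pattern works for the completions endpoint; as before, the `http://0.0.0.0:8000` base address is an assumption drawn from the local server used elsewhere in this reference.

```python
import json
from urllib import request

# Payload mirrors the request sample above.
payload = {
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "prompt": "Once upon a time",
    "max_tokens": 50,
}

req = request.Request(
    "http://0.0.0.0:8000/v1/completions",  # assumed local server address
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment to send the request (requires a running server):
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```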
Response samples
- 200
- 400
{
  "id": "2dd7bad9bd1b4caea0114cb292f5b23a",
  "object": "text_completion",
  "created": 1743183506,
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "choices": [
    {
      "index": 0,
      "text": ", there was a brave knight who lived in a small village at the edge of a great forest. The knight was known throughout the land for his kindness and courage.",
      "finish_reason": "stop",
      "logprobs": {
        "text_offset": null,
        "token_logprobs": [],
        "tokens": null,
        "top_logprobs": []
      }
    }
  ],
  "system_fingerprint": null,
  "usage": null
}
Create embeddings
Creates an embedding vector representing the input text. MAX supports the sentence-transformers/all-mpnet-base-v2 model. Only strings and lists of strings are supported for the input parameter.
Request body schema: application/json (required)
- `model` *(string, required)*: ID of the model to use. MAX supports sentence-transformers/all-mpnet-base-v2.
- `input` *(string or array of strings, required)*: The text to embed. MAX supports strings and lists of strings.
- `encoding_format` *(string, default: "float"; one of "float", "base64")*: The format to return the embeddings in.
- `dimensions` *(integer)*: The number of dimensions the resulting output embeddings should have.
- `user` *(string)*: A unique identifier representing your end user.
Responses
Request samples
- Payload
- Python
- curl
{
  "model": "sentence-transformers/all-mpnet-base-v2",
  "input": "The food was delicious and the service was excellent."
}
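A standard-library sketch of the same request; the `http://0.0.0.0:8000` base address is again an assumption based on the local server used in the other samples.

```python
import json
from urllib import request

# Payload mirrors the request sample above.
payload = {
    "model": "sentence-transformers/all-mpnet-base-v2",
    "input": "The food was delicious and the service was excellent.",
}

req = request.Request(
    "http://0.0.0.0:8000/v1/embeddings",  # assumed local server address
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment to send the request (requires a running server):
# with request.urlopen(req) as resp:
#     vector = json.load(resp)["data"][0]["embedding"]
#     print(len(vector))  # embedding dimensionality
```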
Response samples
- 200
- 400
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        0.0023064255,
        -0.009327292,
        0.01842359,
        -0.0028842522,
        0.0044457657
      ],
      "index": 0
    }
  ],
  "model": "sentence-transformers/all-mpnet-base-v2"
}
Create batch
Creates and executes a batch from an uploaded file of requests.
Note: The `/v1/batches` API is only available to users in the Mammoth public preview. If you create a batch and get the response `{"detail": "Not Found"}`, you don't have access. Get in touch to learn about early access for enterprise teams.
Request body schema: application/json (required)
- `completion_window` *(string, required)*: The time frame within which the batch should be processed.
- `endpoint` *(string, required)*: The endpoint to be used for all requests in the batch.
- `input_file_id` *(string, required)*: The path to an uploaded file that contains requests for the new batch.
- `output_file_id` *(string, required)*: The path to the batch output file.
- `metadata` *(map, required)*: A set of key-value pairs that can be attached to an object.
Responses
Request samples
- Payload
- Python
- curl
{
  "completion_window": "24h",
  "endpoint": "/v1/chat/completions",
  "input_file_id": "input_path",
  "metadata": {
    "model": "model_provider/model_ID",
    "metadata_key": "metadata_value"
  },
  "output_file_id": "output_path"
}
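The batch payload above can be sent the same way. This is a sketch with an assumed local server address, and note from the section above that a `{"detail": "Not Found"}` response means your account lacks Mammoth preview access.

```python
import json
from urllib import request

# Payload mirrors the request sample above; file IDs are placeholders.
payload = {
    "completion_window": "24h",
    "endpoint": "/v1/chat/completions",
    "input_file_id": "input_path",
    "output_file_id": "output_path",
    "metadata": {
        "model": "model_provider/model_ID",
        "metadata_key": "metadata_value",
    },
}

req = request.Request(
    "http://0.0.0.0:8000/v1/batches",  # assumed local server address
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment to send (requires a server and Mammoth preview access):
# with request.urlopen(req) as resp:
#     batch = json.load(resp)
#     print(batch["id"], batch["status"])  # e.g. a new batch in "validating"
```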
Response samples
- 200
- 400
{
  "id": "batch_ID",
  "object": "batch",
  "endpoint": "/v1/chat/completions",
  "errors": null,
  "input_file_id": "input_path",
  "completion_window": "24h",
  "status": "validating",
  "output_file_id": "output_path",
  "error_file_id": null,
  "created_at": 1711471533,
  "in_progress_at": null,
  "expires_at": null,
  "finalizing_at": null,
  "completed_at": null,
  "failed_at": null,
  "expired_at": null,
  "cancelling_at": null,
  "cancelled_at": null,
  "request_counts": {
    "total": 0,
    "completed": 0,
    "failed": 0
  },
  "metadata": {
    "model": "model_provider/model_ID",
    "metadata_key": "metadata_value"
  }
}
List models
Lists the currently available models. MAX returns only the one model that it is currently serving.
Responses
Request samples
- Python
- curl
from openai import OpenAI

client = OpenAI(
    base_url="http://0.0.0.0:8000/v1",
    api_key="EMPTY",
)

models = client.models.list()
for model in models.data:
    print(f"Model ID: {model.id}")
    print(f"Object Type: {model.object}")
    print(f"Owned By: {model.owned_by}")
Response samples
- 200
- 400
{
  "object": "list",
  "data": [
    {
      "id": "modularai/Llama-3.1-8B-Instruct-GGUF",
      "object": "model",
      "created": null,
      "owned_by": ""
    }
  ]
}