max serve

Launches a model server with an OpenAI-compatible endpoint. Just specify the model as a Hugging Face model ID or a local path.

For example:

max serve \
  --model google/gemma-3-12b-it \
  --devices gpu:0 \
  --max-batch-size 8 \
  --device-memory-utilization 0.9

For details about the endpoint APIs provided by the server, see the MAX REST API reference.

The max CLI also supports loading custom model architectures through the --custom-architectures flag. This allows you to extend MAX’s capabilities with your own model implementations:

max serve \
  --model google/gemma-3-12b-it \
  --custom-architectures path/to/module1:module1 \
  --custom-architectures path/to/module2:module2

Custom architectures

The --custom-architectures flag allows you to load custom pipeline architectures from your own Python modules. You can set the ARCHITECTURES variable containing the architecture definitions. Each entry in --custom-architectures can be specified in two formats:

A raw module name; for example: my_module.
An import path followed by a colon and the module name; for example: folder/path/to/import:my_module.

The ARCHITECTURES variable in your module should be a list of implementations that conform to the SupportedArchitecture interface. These will be registered with the MAX pipeline registry automatically.

Quantization encoding

When using GGUF models, quantization encoding formats are automatically detected. If no --quantization-encoding is specified, MAX Serve automatically detects and uses the first encoding option from the repository. If quantization encoding is provided, it must align with the available encoding options in the repository.

If the repository contains multiple quantization formats, specify which encoding type you want to use with the --quantization-encoding parameter.

Usage

max serve [OPTIONS]

Options

--allow-safetensors-weights-fp32-bf6-bidirectional-cast, --no-allow-safetensors-weights-fp32-bf6-bidirectional-cast

Whether to allow automatic float32 to/from bfloat16 safetensors weight type casting, if needed. Currently only supported in Llama3 models.

--cache-strategy <cache_strategy>

The cache strategy to use. This defaults to model_default, which selects the default strategy for the requested architecture. You can also force a specific strategy: continuous or paged.

Options:

model_default | paged

--ce-delay-ms <ce_delay_ms>

Duration of scheduler sleep prior to starting a prefill batch. Experimental for the TTS scheduler.

--chat-template <chat_template>

Optional custom chat template to override the one shipped with the Hugging Face model config. If a path is provided, the file is read during config resolution and the content stored as a string. If None, the model’s default chat template is used.

--config-file <config_file>

--config-file <config_file>

--config-file <config_file>

--config-file <config_file>

--config-file <config_file>

--config-file <config_file>

--config-file <config_file>

--custom-architectures <custom_architectures>

Custom architecture implementations to register. Each input can either be a raw module name or an import path followed by a colon and the module name. Each module must expose an ARCHITECTURES list of architectures to register.

--data-parallel-degree <data_parallel_degree>

Data-parallelism parameter. The degree to which the model is replicated is dependent on the model type.

--debug-verify-replay, --no-debug-verify-replay

When device_graph_capture is enabled, execute eager launch-trace verification before replay. Intended for debugging only.

--defer-resolve, --no-defer-resolve

Whether to defer resolving the pipeline config.

--device-graph-capture, --no-device-graph-capture

Enable device graph capture/replay for graph execution.

--device-memory-utilization <device_memory_utilization>

The fraction of available device memory that the process should consume. This informs the KVCache workspace size: kv_cache_workspace = (total_free_memory * device_memory_utilization) - model_weights_size.

--devices <devices>

Whether to run the model on CPU (–devices=cpu), GPU (–devices=gpu) or a list of GPUs (–devices=gpu:0,1) etc. An ID value can be provided optionally to indicate the device ID to target. If not provided, the model will run on the first available GPU (–devices=gpu), or CPU if no GPUs are available (–devices=cpu).

--disk-offload-dir <disk_offload_dir>

Directory for disk-based KV cache offloading. When set (together with kvcache_swapping_to_host), blocks are written through from CPU to disk for persistence across restarts.

--disk-offload-direct-io, --no-disk-offload-direct-io

Use O_DIRECT for disk I/O (bypasses OS page cache). Requires block sizes aligned to the filesystem block size. Falls back to buffered I/O if alignment is not met.

--disk-offload-max-gb <disk_offload_max_gb>

Maximum disk space (GB) for KV cache offloading.

--draft-allow-safetensors-weights-fp32-bf6-bidirectional-cast, --no-draft-allow-safetensors-weights-fp32-bf6-bidirectional-cast

Whether to allow automatic float32 to/from bfloat16 safetensors weight type casting, if needed. Currently only supported in Llama3 models.

--draft-chat-template <draft_chat_template>

Optional custom chat template to override the one shipped with the Hugging Face model config. If a path is provided, the file is read during config resolution and the content stored as a string. If None, the model’s default chat template is used.

--draft-config-file <draft_config_file>

--draft-data-parallel-degree <draft_data_parallel_degree>

Data-parallelism parameter. The degree to which the model is replicated is dependent on the model type.

--draft-devices <draft_devices>

Whether to run the model on CPU (–devices=cpu), GPU (–devices=gpu) or a list of GPUs (–devices=gpu:0,1) etc. An ID value can be provided optionally to indicate the device ID to target. If not provided, the model will run on the first available GPU (–devices=gpu), or CPU if no GPUs are available (–devices=cpu).

--draft-enable-echo, --no-draft-enable-echo

Whether the model should be built with echo capabilities.

--draft-force-download, --no-draft-force-download

Whether to force download a given file if it’s already present in the local cache.

--draft-huggingface-model-revision <draft_huggingface_model_revision>

Branch or Git revision of Hugging Face model repository to use.

--draft-huggingface-weight-revision <draft_huggingface_weight_revision>

Branch or Git revision of Hugging Face model repository to use.

--draft-max-length <draft_max_length>

Maximum sequence length the model can process. If not specified, defaults to the model’s max_position_embeddings. May be clamped during resolution based on available memory.

--draft-model-path <draft_model_path>

The repository ID of a Hugging Face model to use. The –model option also works as an alias.

--draft-pool-embeddings, --no-draft-pool-embeddings

Whether to pool embedding outputs.

--draft-quantization-encoding <draft_quantization_encoding>

Weight encoding type.

Options:

float32 | bfloat16 | q4_k | q4_0 | q6_k | float8_e4m3fn | float4_e2m1fnx2 | gptq

--draft-rope-type <draft_rope_type>

Force using a specific rope type: none, normal, or neox. Only matters for GGUF weights.

Options:

none | normal | neox | longrope | yarn

--draft-section-name <draft_section_name>

--draft-served-model-name <draft_served_model_name>

Optional override for client-facing model name. Defaults to model_path.

--draft-trust-remote-code, --no-draft-trust-remote-code

Whether or not to allow for custom modelling files on Hugging Face.

--draft-use-subgraphs, --no-draft-use-subgraphs

Whether to use subgraphs for the model. This can significantly reduce compile time, especially for large models with identical blocks. Default is true.

--draft-vision-config-overrides <draft_vision_config_overrides>

Model-specific vision configuration overrides. For example, for InternVL: {“max_dynamic_patch”: 24}.

--draft-weight-path <draft_weight_path>

Optional path or url of the model weights to use.

--enable-chunked-prefill, --no-enable-chunked-prefill

Enable chunked prefill to split context encoding requests into multiple chunks based on max_batch_input_tokens.

--enable-echo, --no-enable-echo

Whether the model should be built with echo capabilities.

--enable-in-flight-batching, --no-enable-in-flight-batching

When enabled, prioritizes token generation by batching it with context encoding requests.

--enable-kvcache-swapping-to-host, --no-enable-kvcache-swapping-to-host

Whether to swap paged KVCache blocks to host memory when device blocks are evicted.

--enable-lora, --no-enable-lora

Enables LoRA on the server.

--enable-min-tokens, --no-enable-min-tokens

Whether to enable min_tokens, which blocks the model from generating stopping tokens before the min_tokens count is reached.

--enable-overlap-scheduler, --no-enable-overlap-scheduler

Whether to enable the overlap scheduler. This feature allows the scheduler to run alongside GPU execution. This helps improve GPU utilization. This is an experimental feature which may crash and burn. This feature will be enabled by default for some selected architectures. You can forcibly disable this by setting –no-enable-overlap-scheduler –force.

--enable-penalties, --no-enable-penalties

Whether to apply frequency and presence penalties to the model’s output.

--enable-prefix-caching, --no-enable-prefix-caching

Whether to enable prefix caching for the paged KVCache.

--enable-prioritize-first-decode, --no-enable-prioritize-first-decode

When enabled, the scheduler always runs a TG batch immediately after a CE batch with the same requests. This may reduce time-to-first-chunk latency. Experimental for the TTS scheduler.

--enable-structured-output, --no-enable-structured-output

Enable structured generation/guided decoding for the server. This allows the user to pass a json schema in the response_format field, which the LLM will adhere to.

--enable-variable-logits, --no-enable-variable-logits

Enable the sampling graph to accept a ragged tensor of different sequences as inputs, along with their associated logit_offsets. This is needed to produce additional logits for echo and speculative decoding purposes.

--ep-size <ep_size>

The expert parallelism size. Needs to be 1 (no expert parallelism) or the total number of GPUs across nodes.

--execute-empty-batches, --no-execute-empty-batches

Whether the scheduler should execute empty batches.

--force, --no-force

Skip validation of user provided flags against the architecture’s required arguments.

--force-download, --no-force-download

Whether to force download a given file if it’s already present in the local cache.

--gpu-profiling <gpu_profiling>

Whether to enable GPU profiling of the model.

Options:

off | on | detailed

--headless

Run only the dispatcher service and model worker without the API server.

Default:

False

--host-kvcache-swap-space-gb <host_kvcache_swap_space_gb>

The amount of host memory to use for the host KVCache in GiB. This space is only allocated when kvcache_swapping_to_host is enabled.

--huggingface-model-revision <huggingface_model_revision>

Branch or Git revision of Hugging Face model repository to use.

--huggingface-weight-revision <huggingface_weight_revision>

Branch or Git revision of Hugging Face model repository to use.

--kv-cache-format <kv_cache_format>

Override the default data type for the KV cache.Supported values: float32, bfloat16, float8_e4m3fn.

--kv-cache-page-size <kv_cache_page_size>

The number of tokens in a single page in the paged KVCache.

--kvcache-ce-watermark <kvcache_ce_watermark>

Projected cache usage threshold for scheduling CE requests, considering current and incoming requests. CE is scheduled if either projected usage stays below this threshold or no active requests exist. Higher values can cause more preemptions.

--lmcache-config-file <lmcache_config_file>

Path to an LMCache YAML configuration file. When set, enables LMCache-based external KV cache tiering (CPU, disk, remote).

--log-prefix <log_prefix>

Optional prefix to add to all log messages for this server instance.

--lora-paths <lora_paths>

List of statically defined LoRA paths.

--max-batch-input-tokens <max_batch_input_tokens>

The target number of un-encoded tokens to include in each batch. This value is used for chunked prefill and memory estimation.

--max-batch-size <max_batch_size>

Maximum batch size to execute with the model. When not specified (None), this value is determined dynamically. For server launches, set this higher based on server capacity. When device_graph_capture is enabled, overlap pre-captures decode graph entries for batch sizes [1..max_batch_size].

--max-batch-total-tokens <max_batch_total_tokens>

Ensures the sum of page-aligned context lengths in a batch does not exceed max_batch_total_tokens. Alignment uses the KV cache page size. If None, the sum is not limited.

--max-length <max_length>

Maximum sequence length the model can process. If not specified, defaults to the model’s max_position_embeddings. May be clamped during resolution based on available memory.

--max-lora-rank <max_lora_rank>

Maximum rank of all possible LoRAs.

--max-num-loras <max_num_loras>

The maximum number of active LoRAs in a batch. This controls how many LoRA adapters can be active simultaneously during inference. Lower values reduce memory usage but limit concurrent adapter usage.

--max-num-steps <max_num_steps>

The number of steps to run for multi-step scheduling. -1 specifies a default value based on configuration and platform. Ignored for models which are not auto-regressive (e.g. embedding models).

--max-queue-size-tg <max_queue_size_tg>

Maximum number of requests in decode queue. By default, this is max_batch_size.

--min-batch-size-tg <min_batch_size_tg>

Soft floor on the decode batch size. If the TG batch size is larger, the scheduler continues TG batches; if it falls below, the scheduler prioritizes CE. This is not a strict minimum. By default, this is max_queue_size_tg. Experimental for the TTS scheduler.

--model-path <model_path>

The repository ID of a Hugging Face model to use. The –model option also works as an alias.

--num-speculative-tokens <num_speculative_tokens>

The number of speculative tokens.

--pipeline-role <pipeline_role>

Whether the pipeline should serve both a prefill or decode role or both.

Options:

prefill_and_decode | prefill_only | decode_only

--pool-embeddings, --no-pool-embeddings

Whether to pool embedding outputs.

--port <port>

Port to run the server on.

--prefer-module-v3, --no-prefer-module-v3

Whether to prefer the ModuleV3 architecture (default=False for backward compatibility). When False, tries the ModuleV2 architecture first and falls back to ModuleV3. When True, tries ModuleV3 first and falls back to ModuleV2.

--pretty-print-config

Pretty Print Entire Config

--profile-serve

Whether to enable pyinstrument profiling on the serving endpoint.

Default:

False

--quantization-encoding <quantization_encoding>

Weight encoding type.

Options:

float32 | bfloat16 | q4_k | q4_0 | q6_k | float8_e4m3fn | float4_e2m1fnx2 | gptq

--rope-type <rope_type>

Force using a specific rope type: none, normal, or neox. Only matters for GGUF weights.

Options:

none | normal | neox | longrope | yarn

--section-name <section_name>

--section-name <section_name>

--section-name <section_name>

--section-name <section_name>

--section-name <section_name>

--section-name <section_name>

--section-name <section_name>

--served-model-name <served_model_name>

Optional override for client-facing model name. Defaults to model_path.

--sim-failure <sim_failure>

Simulate fake-perf with failure percentage

--speculative-method <speculative_method>

The speculative decoding method to use.

Options:

standalone | eagle | mtp

--task <task>

The task to run.

--task-arg <task_arg>

Task-specific arguments to pass to the underlying model (can be used multiple times).

--trust-remote-code, --no-trust-remote-code

Whether or not to allow for custom modelling files on Hugging Face.

--use-experimental-kernels <use_experimental_kernels>

Enables using experimental mojo kernels with max serve. The kernels could be unstable or incorrect.

--use-subgraphs, --no-use-subgraphs

Whether to use subgraphs for the model. This can significantly reduce compile time, especially for large models with identical blocks. Default is true.

--use-vendor-blas <use_vendor_blas>

Enables using vendor BLAS libraries (cublas/hipblas/etc) with max serve. Currently, this just replaces matmul calls.

--vision-config-overrides <vision_config_overrides>

Model-specific vision configuration overrides. For example, for InternVL: {“max_dynamic_patch”: 24}.

--weight-path <weight_path>

Optional path or url of the model weights to use.

--zmq-endpoint-base <zmq_endpoint_base>

Prefix for ZMQ endpoints used for IPC. This ensures unique endpoints across MAX Serve instances on the same host. Example: lora_request_zmq_endpoint = f”{zmq_endpoint_base}-lora_request”.

Usage
Options

Usage​

Options​

--allow-safetensors-weights-fp32-bf6-bidirectional-cast, --no-allow-safetensors-weights-fp32-bf6-bidirectional-cast​

--cache-strategy <cache_strategy>​

--ce-delay-ms <ce_delay_ms>​

--chat-template <chat_template>​

--config-file <config_file>​

--config-file <config_file>​

--config-file <config_file>​

--config-file <config_file>​

--config-file <config_file>​

--config-file <config_file>​

--config-file <config_file>​

--custom-architectures <custom_architectures>​

--data-parallel-degree <data_parallel_degree>​

--debug-verify-replay, --no-debug-verify-replay​

--defer-resolve, --no-defer-resolve​

--device-graph-capture, --no-device-graph-capture​

--device-memory-utilization <device_memory_utilization>​

--devices <devices>​

--disk-offload-dir <disk_offload_dir>​

--disk-offload-direct-io, --no-disk-offload-direct-io​

--disk-offload-max-gb <disk_offload_max_gb>​

--draft-allow-safetensors-weights-fp32-bf6-bidirectional-cast, --no-draft-allow-safetensors-weights-fp32-bf6-bidirectional-cast​

--draft-chat-template <draft_chat_template>​

--draft-config-file <draft_config_file>​

--draft-data-parallel-degree <draft_data_parallel_degree>​

--draft-devices <draft_devices>​

--draft-enable-echo, --no-draft-enable-echo​

--draft-force-download, --no-draft-force-download​

--draft-huggingface-model-revision <draft_huggingface_model_revision>​

--draft-huggingface-weight-revision <draft_huggingface_weight_revision>​

--draft-max-length <draft_max_length>​

--draft-model-path <draft_model_path>​

--draft-pool-embeddings, --no-draft-pool-embeddings​

--draft-quantization-encoding <draft_quantization_encoding>​

--draft-rope-type <draft_rope_type>​

--draft-section-name <draft_section_name>​

--draft-served-model-name <draft_served_model_name>​

--draft-trust-remote-code, --no-draft-trust-remote-code​

--draft-use-subgraphs, --no-draft-use-subgraphs​

--draft-vision-config-overrides <draft_vision_config_overrides>​

--draft-weight-path <draft_weight_path>​

--enable-chunked-prefill, --no-enable-chunked-prefill​

--enable-echo, --no-enable-echo​

--enable-in-flight-batching, --no-enable-in-flight-batching​

--enable-kvcache-swapping-to-host, --no-enable-kvcache-swapping-to-host​

--enable-lora, --no-enable-lora​

--enable-min-tokens, --no-enable-min-tokens​

--enable-overlap-scheduler, --no-enable-overlap-scheduler​

--enable-penalties, --no-enable-penalties​

--enable-prefix-caching, --no-enable-prefix-caching​

--enable-prioritize-first-decode, --no-enable-prioritize-first-decode​

--enable-structured-output, --no-enable-structured-output​

--enable-variable-logits, --no-enable-variable-logits​

--ep-size <ep_size>​

--execute-empty-batches, --no-execute-empty-batches​

--force, --no-force​

--force-download, --no-force-download​

--gpu-profiling <gpu_profiling>​

--headless​

--host-kvcache-swap-space-gb <host_kvcache_swap_space_gb>​

--huggingface-model-revision <huggingface_model_revision>​

--huggingface-weight-revision <huggingface_weight_revision>​

--kv-cache-format <kv_cache_format>​

--kv-cache-page-size <kv_cache_page_size>​

--kvcache-ce-watermark <kvcache_ce_watermark>​

--lmcache-config-file <lmcache_config_file>​

--log-prefix <log_prefix>​

--lora-paths <lora_paths>​

--max-batch-input-tokens <max_batch_input_tokens>​

--max-batch-size <max_batch_size>​

--max-batch-total-tokens <max_batch_total_tokens>​

--max-length <max_length>​

--max-lora-rank <max_lora_rank>​

--max-num-loras <max_num_loras>​

--max-num-steps <max_num_steps>​

--max-queue-size-tg <max_queue_size_tg>​

--min-batch-size-tg <min_batch_size_tg>​

--model-path <model_path>​

Usage

Options

`--allow-safetensors-weights-fp32-bf6-bidirectional-cast, --no-allow-safetensors-weights-fp32-bf6-bidirectional-cast`

`--cache-strategy <cache_strategy>`

`--ce-delay-ms <ce_delay_ms>`

`--chat-template <chat_template>`

`--config-file <config_file>`

`--config-file <config_file>`

`--config-file <config_file>`

`--config-file <config_file>`

`--config-file <config_file>`

`--config-file <config_file>`

`--config-file <config_file>`

`--custom-architectures <custom_architectures>`

`--data-parallel-degree <data_parallel_degree>`

`--debug-verify-replay, --no-debug-verify-replay`

`--defer-resolve, --no-defer-resolve`

`--device-graph-capture, --no-device-graph-capture`

`--device-memory-utilization <device_memory_utilization>`

`--devices <devices>`

`--disk-offload-dir <disk_offload_dir>`

`--disk-offload-direct-io, --no-disk-offload-direct-io`

`--disk-offload-max-gb <disk_offload_max_gb>`

`--draft-allow-safetensors-weights-fp32-bf6-bidirectional-cast, --no-draft-allow-safetensors-weights-fp32-bf6-bidirectional-cast`

`--draft-chat-template <draft_chat_template>`

`--draft-config-file <draft_config_file>`

`--draft-data-parallel-degree <draft_data_parallel_degree>`

`--draft-devices <draft_devices>`

`--draft-enable-echo, --no-draft-enable-echo`

`--draft-force-download, --no-draft-force-download`

`--draft-huggingface-model-revision <draft_huggingface_model_revision>`

`--draft-huggingface-weight-revision <draft_huggingface_weight_revision>`

`--draft-max-length <draft_max_length>`

`--draft-model-path <draft_model_path>`

`--draft-pool-embeddings, --no-draft-pool-embeddings`

`--draft-quantization-encoding <draft_quantization_encoding>`

`--draft-rope-type <draft_rope_type>`

`--draft-section-name <draft_section_name>`

`--draft-served-model-name <draft_served_model_name>`

`--draft-trust-remote-code, --no-draft-trust-remote-code`

`--draft-use-subgraphs, --no-draft-use-subgraphs`

`--draft-vision-config-overrides <draft_vision_config_overrides>`

`--draft-weight-path <draft_weight_path>`

`--enable-chunked-prefill, --no-enable-chunked-prefill`

`--enable-echo, --no-enable-echo`

`--enable-in-flight-batching, --no-enable-in-flight-batching`

`--enable-kvcache-swapping-to-host, --no-enable-kvcache-swapping-to-host`

`--enable-lora, --no-enable-lora`

`--enable-min-tokens, --no-enable-min-tokens`

`--enable-overlap-scheduler, --no-enable-overlap-scheduler`

`--enable-penalties, --no-enable-penalties`

`--enable-prefix-caching, --no-enable-prefix-caching`

`--enable-prioritize-first-decode, --no-enable-prioritize-first-decode`

`--enable-structured-output, --no-enable-structured-output`

`--enable-variable-logits, --no-enable-variable-logits`

`--ep-size <ep_size>`

`--execute-empty-batches, --no-execute-empty-batches`

`--force, --no-force`

`--force-download, --no-force-download`

`--gpu-profiling <gpu_profiling>`

`--headless`

`--host-kvcache-swap-space-gb <host_kvcache_swap_space_gb>`

`--huggingface-model-revision <huggingface_model_revision>`

`--huggingface-weight-revision <huggingface_weight_revision>`

`--kv-cache-format <kv_cache_format>`

`--kv-cache-page-size <kv_cache_page_size>`

`--kvcache-ce-watermark <kvcache_ce_watermark>`

`--lmcache-config-file <lmcache_config_file>`

`--log-prefix <log_prefix>`

`--lora-paths <lora_paths>`

`--max-batch-input-tokens <max_batch_input_tokens>`

`--max-batch-size <max_batch_size>`

`--max-batch-total-tokens <max_batch_total_tokens>`

`--max-length <max_length>`

`--max-lora-rank <max_lora_rank>`

`--max-num-loras <max_num_loras>`

`--max-num-steps <max_num_steps>`

`--max-queue-size-tg <max_queue_size_tg>`

`--min-batch-size-tg <min_batch_size_tg>`

`--model-path <model_path>`