For the complete documentation index, see llms.txt. Markdown versions of all pages are available by appending .md to any URL (e.g. /max/get-started.md).

max generate

Generates output from a given model and prompt without using an endpoint. This is primarily useful for debugging and testing.

For example, generate a short completion from a prompt:

max generate \
  --model google/gemma-3-12b-it \
  --max-length 1024 \
  --max-new-tokens 500 \
  --top-k 40 \
  --temperature 0.7 \
  --seed 42 \
  --prompt "Explain quantum computing"

You can adjust parameters like --max-batch-size and --max-length depending on your system's available resources such as GPU memory.

For more information on how to use the generate command with vision models, see Image to text.

Usage

max generate [OPTIONS]

Options

--allow-extra-request-fields, --no-allow-extra-request-fields

When True, unknown top-level fields on OpenAI-compatible request bodies are dropped with a warning before pydantic validation, instead of producing a 400.

--allow-unsupported-logprobs, --no-allow-unsupported-logprobs

When True, OpenAI-compatible requests that ask for logprobs against a runtime configuration that cannot honor them trigger a warning and are served as if logprobs were not requested. Each response chunk carries logprobs: null. When False (default), such requests are rejected with a 400.

--ce-delay-ms <ce_delay_ms>

Duration of scheduler sleep prior to starting a prefill batch.

--chat-template <chat_template>

Optional custom chat template to override the one shipped with the Hugging Face model config. If a path is provided, the file is read during config resolution and the content stored as a string. If None, the model's default chat template is used.

--config-file <config_file>

--custom-architectures <custom_architectures>

Custom architecture implementations to register. Each input is either a path to a single custom-architecture module directory or an IMPORT_PATH:MODULE_NAME colon-form. Each module must expose a top-level ARCHITECTURES list of SupportedArchitecture instances.

--data-parallel-degree <data_parallel_degree>

Data-parallelism parameter. The degree to which the model is replicated depends on the model type.

--debug-verify-replay, --no-debug-verify-replay

When device_graph_capture is enabled, execute eager launch-trace verification before replay. Intended for debugging only.

--decode-request-ttl-s <decode_request_ttl_s>

Per-request TTL in seconds for the decode-side prefill_reqs and inflight_transfers dicts. Entries older than this are evicted individually (KV blocks released, failure surfaced to the client) before the stall watchdog fires. None (the default) disables eviction. Set with the MODULAR_DECODE_REQUEST_TTL_S environment variable.

--decode-stall-timeout-s <decode_stall_timeout_s>

Seconds of no-batch-activity after which the decode worker exits to trigger a pod restart. None (the default) disables the watchdog. Set with the MODULAR_DECODE_STALL_TIMEOUT_S environment variable.

--defer-resolve, --no-defer-resolve

Whether to defer resolving the pipeline config.

--detokenize, --no-detokenize

Whether to detokenize the output tokens into text.

--device-graph-capture, --no-device-graph-capture

Enable device graph capture and replay for graph execution. If unset, automatically enabled for some selected architectures. Use --no-device-graph-capture to explicitly disable.

--device-memory-utilization <device_memory_utilization>

The fraction of available device memory that the process should consume. The remaining headroom holds the KV cache: kv_cache_workspace = (total_free_memory * device_memory_utilization) - model_weights_size.
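As a worked example of the formula above, the following sketch computes the KV cache workspace for a hypothetical device. The function name and the sample sizes (80 GiB free, 24 GiB of weights) are illustrative assumptions, not values taken from MAX.

```python
GIB = 1024**3

def kv_cache_workspace(total_free_memory, device_memory_utilization, model_weights_size):
    """Bytes left for the KV cache, following the formula in the option description."""
    budget = round(total_free_memory * device_memory_utilization)
    return budget - model_weights_size

# Hypothetical device: 80 GiB free, 24 GiB of model weights, 90% utilization.
workspace = kv_cache_workspace(80 * GIB, 0.9, 24 * GIB)
print(f"KV cache workspace: {workspace / GIB:.1f} GiB")
```

Lowering --device-memory-utilization shrinks the budget term only, so the KV cache absorbs the whole reduction while the weights stay the same size.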

--devices <devices>

Whether to run the model on CPU (--devices=cpu), GPU (--devices=gpu), every visible GPU (--devices=gpu:all), or a list of GPUs (--devices=gpu:0,1). An ID value can be provided optionally to indicate the device ID to target. If not provided, the model or config default is used.

--draft-chat-template <draft_chat_template>

Optional custom chat template to override the one shipped with the Hugging Face model config. If a path is provided, the file is read during config resolution and the content stored as a string. If None, the model's default chat template is used.

--draft-config-file <draft_config_file>

--draft-data-parallel-degree <draft_data_parallel_degree>

Data-parallelism parameter. The degree to which the model is replicated depends on the model type.

--draft-devices <draft_devices>

Devices for the draft model in speculative decoding. If not provided, inherits from --devices. Accepts the same format as --devices.

--draft-enable-echo, --no-draft-enable-echo

Whether the model should be built with echo capabilities.

--draft-force-download, --no-draft-force-download

Whether to force download a given file even if it's already present in the local cache.

--draft-huggingface-model-revision <draft_huggingface_model_revision>

Branch or Git revision of Hugging Face model repository to use.

--draft-huggingface-weight-revision <draft_huggingface_weight_revision>

Branch or Git revision of Hugging Face model repository to use.

--draft-max-length <draft_max_length>

Maximum sequence length the model can process. If not specified, defaults to the model's max_position_embeddings. May be clamped during resolution based on available memory.

--draft-model-path <draft_model_path>

Accepts either a Hugging Face repository ID or a local path to the model.

--draft-pool-embeddings, --no-draft-pool-embeddings

Whether to pool embedding outputs.

--draft-quantization-encoding <draft_quantization_encoding>

Weight encoding type. For GGUF models, the encoding is auto-detected from the repository when unset; if set, it must match an available encoding. When the repository contains multiple quantization formats, set this to choose one.

Options:

float32 | bfloat16 | q4_k | q4_0 | q6_k | float8_e4m3fn | float4_e2m1fnx2 | gptq

--draft-rope-type <draft_rope_type>

Force using a specific rope type. Only matters for GGUF weights.

Options:

none | normal | neox | longrope | yarn

--draft-section-name <draft_section_name>

--draft-served-model-name <draft_served_model_name>

Optional override for client-facing model name. Defaults to model_path.

--draft-sliding-window <draft_sliding_window>

If set, overrides the model's attention to use a sliding-window causal mask of this many tokens. None (the default) defers to the HuggingFace config's sliding_window field, or full causal attention if the model doesn't advertise one.

--draft-subfolder <draft_subfolder>

Subdirectory within the HuggingFace repo to load config and weights from (for example, vae or text_encoder). When set, config.json and weights are resolved from {model_path}/{subfolder}/.

--draft-trust-remote-code, --no-draft-trust-remote-code

Whether or not to allow for custom modeling files on Hugging Face.

--draft-use-subgraphs, --no-draft-use-subgraphs

Whether to use subgraphs for the model. This can significantly reduce compile time, especially for large models with identical blocks. Default is true.

--draft-vision-config-overrides <draft_vision_config_overrides>

Model-specific vision configuration overrides. For example, for InternVL: {"max_dynamic_patch": 24}.

--draft-weight-path <draft_weight_path>

Optional path or URL of the model weights to use. Overrides default weight discovery.

--enable-chunked-prefill, --no-enable-chunked-prefill

Enable chunked prefill to split context encoding requests into multiple chunks based on max_batch_input_tokens.
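The idea of chunked prefill can be sketched for a single request as follows. This is a simplified illustration, assuming one prompt has the whole max_batch_input_tokens budget to itself; the real scheduler shares that budget across the requests in a batch, and the function name is hypothetical.

```python
def chunk_prompt(token_ids, max_batch_input_tokens):
    """Split one prompt into consecutive chunks of at most max_batch_input_tokens tokens."""
    if max_batch_input_tokens <= 0:
        raise ValueError("max_batch_input_tokens must be positive")
    return [
        token_ids[start:start + max_batch_input_tokens]
        for start in range(0, len(token_ids), max_batch_input_tokens)
    ]

# A 10-token prompt with a budget of 4 tokens per batch.
chunks = chunk_prompt(list(range(10)), 4)
print([len(chunk) for chunk in chunks])
```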

--enable-echo, --no-enable-echo

Whether the model should be built with echo capabilities.

--enable-in-flight-batching, --no-enable-in-flight-batching

When enabled, prioritizes token generation by batching it with context encoding requests.

--enable-lora, --no-enable-lora

Enables LoRA on the server.

--enable-min-tokens, --no-enable-min-tokens

Whether to enable min_tokens, which blocks the model from generating stopping tokens before the min_tokens count is reached.

--enable-overlap-scheduler, --no-enable-overlap-scheduler

Whether to enable the overlap scheduler, which lets the scheduler run alongside GPU execution to improve GPU utilization. This feature is experimental and may be unstable. It is enabled by default for some selected architectures. To forcibly disable it, pass --no-enable-overlap-scheduler --force.

--enable-penalties, --no-enable-penalties

Whether to apply frequency and presence penalties to the model's output.

--enable-prefix-caching, --no-enable-prefix-caching

Whether to enable prefix caching for the paged KVCache.

--enable-prioritize-first-decode, --no-enable-prioritize-first-decode

When enabled, the scheduler always runs a TG batch immediately after a CE batch with the same requests. This may reduce time-to-first-chunk latency.

--enable-structured-output, --no-enable-structured-output

Enable structured generation/guided decoding for the server. This allows the user to pass a JSON schema in the response_format field, which the LLM will adhere to.

--enable-variable-logits, --no-enable-variable-logits

Enable the sampling graph to accept a ragged tensor of different sequences as inputs, along with their associated logit_offsets. This is needed to produce additional logits for echo and speculative decoding purposes.

--ep-size <ep_size>

The expert parallelism size. Needs to be 1 (no expert parallelism) or the total number of GPUs across nodes.

--ep-use-allreduce, --no-ep-use-allreduce

Whether to use allreduce for the cross-device communication in expert parallelism.

--execute-empty-batches, --no-execute-empty-batches

Whether the scheduler should execute empty batches.

--first-block-caching, --no-first-block-caching

Enable First-Block Cache (FBCache) for step-cache denoising. When enabled, the transformer skips remaining blocks if the first-block residual is similar to the previous step.

--force, --no-force

Skip validation of user provided flags against the architecture's required arguments.

--force-download, --no-force-download

Whether to force download a given file even if it's already present in the local cache.

--frequency-penalty <frequency_penalty>

The frequency penalty to apply to the model's output. A positive value will penalize new tokens based on their frequency in the generated text.

--gpu-profiling <gpu_profiling>

Whether to enable GPU profiling of the model.

Options:

off | on | detailed

--huggingface-model-revision <huggingface_model_revision>

Branch or Git revision of Hugging Face model repository to use.

--huggingface-weight-revision <huggingface_weight_revision>

Branch or Git revision of Hugging Face model repository to use.

--ignore-eos

If True, the response will ignore the EOS token, and continue to generate until the max tokens or a stop string is hit.

--image_url <image_url>

Images to include along with prompt, specified as URLs. The images are ignored if the model does not support image inputs.

--kv-cache-format <kv_cache_format>

Override the default data type for the KV cache. Supported values: float32, bfloat16, float8_e4m3fn.

--kv-cache-page-size <kv_cache_page_size>

The number of tokens in a single page in the paged KVCache.

--kv-connector <kv_connector>

Type of KV cache connector to use. When not set, defaults to null (no external caching).

Options:

KVConnectorType.null | KVConnectorType.local | KVConnectorType.tiered | KVConnectorType.dkv

--kv-connector-config <kv_connector_config>

Connector-specific configuration overrides as inline JSON or path to a YAML/JSON file. Each connector type has sensible defaults, so this is only needed for customization.

--kvcache-ce-watermark <kvcache_ce_watermark>

Projected cache usage threshold for scheduling CE requests, considering current and incoming requests. CE is scheduled if either projected usage stays below this threshold or no active requests exist. Higher values can cause more preemptions.
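The scheduling rule described above can be sketched as a small predicate. The parameter names and the block-based accounting are assumptions made for illustration; only the two conditions (projected usage below the watermark, or no active requests) come from the description.

```python
def should_schedule_ce(used_blocks, incoming_blocks, total_blocks,
                       watermark, num_active_requests):
    """Return True if a context-encoding (CE) request may be scheduled."""
    if num_active_requests == 0:
        # With nothing running, CE is always scheduled.
        return True
    projected_usage = (used_blocks + incoming_blocks) / total_blocks
    return projected_usage < watermark

print(should_schedule_ce(60, 20, 100, watermark=0.95, num_active_requests=4))
print(should_schedule_ce(90, 20, 100, watermark=0.95, num_active_requests=4))
```

A watermark close to 1.0 admits more CE work, which leaves less room for running requests to grow and so makes preemption more likely.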

--lora-paths <lora_paths>

List of statically defined LoRA paths.

--max-batch-input-tokens <max_batch_input_tokens>

The target number of un-encoded tokens to include in each batch. This value is used for chunked prefill and memory estimation.

--max-batch-size <max_batch_size>

Maximum batch size to execute with the model. When not specified (None), this value is determined dynamically. For server launches, set this higher based on server capacity.

--max-batch-total-tokens <max_batch_total_tokens>

Ensures the sum of page-aligned context lengths in a batch does not exceed max_batch_total_tokens. Alignment uses the KV cache page size. If None, the sum is not limited.
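A minimal sketch of this check, assuming that "page-aligned" means each context length is rounded up to a multiple of the KV cache page size. The helper names are hypothetical.

```python
def page_aligned(length, page_size):
    """Round a context length up to the next multiple of page_size."""
    return -(-length // page_size) * page_size

def fits_batch(context_lengths, page_size, max_batch_total_tokens=None):
    """Check the batch against the limit; None means the sum is not limited."""
    if max_batch_total_tokens is None:
        return True
    total = sum(page_aligned(n, page_size) for n in context_lengths)
    return total <= max_batch_total_tokens

lengths = [100, 130, 257]  # aligned to 128, 256, and 384 with a page size of 128
print(fits_batch(lengths, page_size=128, max_batch_total_tokens=768))
```

Under this assumption, a batch can hit the limit well before the raw token count does, because every request is charged for whole pages.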

--max-length <max_length>

Maximum sequence length the model can process. If not specified, defaults to the model's max_position_embeddings. May be clamped during resolution based on available memory.

--max-lora-rank <max_lora_rank>

Maximum rank of all possible LoRAs.

--max-new-tokens <max_new_tokens>

Maximum number of new tokens to generate during a single inference pass of the model.

--max-num-loras <max_num_loras>

The maximum number of active LoRAs in a batch. This controls how many LoRA adapters can be active simultaneously during inference. Lower values reduce memory usage but limit concurrent adapter usage.

--max-num-steps <max_num_steps>

Deprecated. Multi-step pipeline execution is no longer supported; the pipeline always runs single-step decode. Values other than 1 (including the legacy default -1) are ignored after logging a warning.

--max-queue-size-tg <max_queue_size_tg>

Maximum number of requests in decode queue. By default, this is max_batch_size.

--max-vision-cache-entries <max_vision_cache_entries>

Maximum number of images cached in the vision encoder cache. Each entry stores the vision encoder output for one image, avoiding re-encoding across chunks and requests. Set to 0 to disable caching. Only used by VLMs.

--min-batch-size-tg <min_batch_size_tg>

Soft floor on the decode batch size. If the TG batch size is larger, the scheduler continues TG batches; if it falls below, the scheduler prioritizes CE. This is not a strict minimum. By default, this is max_queue_size_tg.

--min-new-tokens <min_new_tokens>

Minimum number of tokens to generate in the response.

--min-p <min_p>

Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.
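To illustrate the relative threshold, the sketch below filters a toy distribution. It assumes that surviving probabilities are renormalized afterwards, which is common practice but is not stated in the description above; the function name is hypothetical.

```python
def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p times the top probability."""
    if not 0.0 <= min_p <= 1.0:
        raise ValueError("min_p must be in [0, 1]")
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]  # renormalize (an assumption)

probs = [0.5, 0.3, 0.15, 0.05]
filtered = min_p_filter(probs, 0.2)  # threshold is 0.2 * 0.5 = 0.1
print(filtered)
```

With min_p set to 0 the threshold is 0 and every token survives, which matches the "set to 0 to disable" behavior.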

--model, --model-path <model_path>

Accepts either a Hugging Face repository ID or a local path to the model.

--model-override <model_override>

Per-component overrides for the ModelManifest, in the format component.field=value. Applied before resolution. Repeatable. Example: transformer.quantization_encoding=float4_e2m1fnx2.
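The component.field=value format can be read as follows. This parser is only a sketch of the syntax, not MAX's own parsing code; the function names are hypothetical, and the second override (a vae component) is an invented example.

```python
def parse_model_override(spec):
    """Split 'component.field=value' into its three parts."""
    target, eq, value = spec.partition("=")
    component, dot, field = target.partition(".")
    if not (eq and dot and component and field):
        raise ValueError(f"expected component.field=value, got {spec!r}")
    return component, field, value

def collect_overrides(specs):
    """Group repeated overrides by component."""
    overrides = {}
    for spec in specs:
        component, field, value = parse_model_override(spec)
        overrides.setdefault(component, {})[field] = value
    return overrides

overrides = collect_overrides([
    "transformer.quantization_encoding=float4_e2m1fnx2",
    "vae.quantization_encoding=bfloat16",  # hypothetical second component
])
print(overrides)
```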

--models <models>

The model manifest containing all model configs keyed by role.

--num-speculative-tokens <num_speculative_tokens>

The number of speculative tokens.

--num-warmups <num_warmups>

Number of warmup iterations to run before the final timed run.

Default:

0

--pipeline-role <pipeline_role>

Whether the pipeline should serve a prefill role, a decode role, or both.

Options:

prefill_and_decode | prefill_only | decode_only

--pool-embeddings, --no-pool-embeddings

Whether to pool embedding outputs.

--prefer-module-v3, --no-prefer-module-v3

Whether to prefer the eager API architecture over the graph API architecture. When False (default), the inference server uses the graph API architecture. When True, the server uses the eager API architecture when available and falls back to the graph API architecture.

--presence-penalty <presence_penalty>

The presence penalty to apply to the model's output. A positive value will penalize new tokens that have already appeared in the generated text at least once.
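The difference between --frequency-penalty and --presence-penalty is easiest to see side by side. The sketch below assumes the additive formulation used by OpenAI-compatible APIs (the frequency penalty scales with the token's count, the presence penalty is applied once); whether MAX applies exactly this formula is an assumption, and the function name is hypothetical.

```python
from collections import Counter

def apply_penalties(logits, generated, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that already appear in the generated text."""
    counts = Counter(generated)
    return {
        token: logit
        - counts[token] * frequency_penalty                  # grows with each repeat
        - (presence_penalty if counts[token] > 0 else 0.0)   # flat, applied once
        for token, logit in logits.items()
    }

logits = {"a": 2.0, "b": 1.0, "c": 0.5}
penalized = apply_penalties(logits, ["a", "a", "b"],
                            frequency_penalty=0.5, presence_penalty=0.2)
print(penalized)
```

In this toy case, "a" loses more than "b" because it appeared twice, while "c" is untouched because it has not been generated yet.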

--profile

Capture a rudimentary profile of the timed run. If Nsight Systems (nsys) and an NVIDIA GPU are available, captures the GPU kernel trace into an .nsys-rep file and prints the top kernels. Always captures a Python/CPU profile via cProfile.

Default:

False

--profile-output <profile_output>

Path for the .nsys-rep file when --profile is on. Default: $BUILD_WORKSPACE_DIRECTORY/max-profile.nsys-rep, or ./max-profile.nsys-rep.

--profile-top-n <profile_top_n>

Number of rows to show in the GPU kernel and Python profile tables.

Default:

15

--prompt <prompt>

The text prompt to use for further generation.

--quantization-encoding <quantization_encoding>

Weight encoding type. For GGUF models, the encoding is auto-detected from the repository when unset; if set, it must match an available encoding. When the repository contains multiple quantization formats, set this to choose one.

Options:

float32 | bfloat16 | q4_k | q4_0 | q6_k | float8_e4m3fn | float4_e2m1fnx2 | gptq

--reasoning-parser <reasoning_parser>

Name of the reasoning output parser. The parser extracts thinking blocks to populate the reasoning field in chat completion responses. When unset, the server applies the architecture's default reasoning parser, if any. Pass "none" (case-insensitive) to explicitly disable reasoning parsing even when the architecture declares a default.

--rejection-sampling-strategy <rejection_sampling_strategy>

Rejection sampling strategy for verifying draft tokens. Defaults to typical-acceptance for eagle/mtp.

Options:

greedy | residual | typical-acceptance | logit-comparison

--relaxed-delta <relaxed_delta>

Probability gap below the top-1 candidate inside which candidates remain eligible for relaxed acceptance. A draft token is accepted if it matches any top-N candidate whose probability is at least top1_prob - relaxed_delta. Ignored when use_relaxed_acceptance_for_thinking is False.

--relaxed-topk <relaxed_topk>

Top-N candidates from the target distribution to consider when relaxed acceptance is active. Ignored when use_relaxed_acceptance_for_thinking is False.

--repetition-penalty <repetition_penalty>

The repetition penalty to apply to the model's output. Values > 1 will penalize new tokens that have already appeared in the generated text at least once.

--rope-type <rope_type>

Force using a specific rope type. Only matters for GGUF weights.

Options:

none | normal | neox | longrope | yarn

--section-name <section_name>

--seed <seed>

Seed for the random number generator.

--served-model-name <served_model_name>

Optional override for client-facing model name. Defaults to model_path.

--sliding-window <sliding_window>

If set, overrides the model's attention to use a sliding-window causal mask of this many tokens. None (the default) defers to the HuggingFace config's sliding_window field, or full causal attention if the model doesn't advertise one.

--speculative-method <speculative_method>

The speculative decoding method to use.

Options:

eagle | mtp | dflash

--stop <stop>

A list of detokenized sequences that can be used as stop criteria when generating a new sequence. Can be specified multiple times.

--stop-token-ids <stop_token_ids>

A list of token ids that are used as stopping criteria when generating a new sequence. Comma-separated integers.

--subfolder <subfolder>

Subdirectory within the HuggingFace repo to load config and weights from (for example, vae or text_encoder). When set, config.json and weights are resolved from {model_path}/{subfolder}/.

--synthetic-acceptance-rate <synthetic_acceptance_rate>

Synthetic acceptance rate for benchmarking (0.0 to 1.0). When set, the rejection sampler bypasses the real draft/target comparison and accepts each draft position with a calibrated probability so the mean joint acceptance across num_speculative_tokens positions matches this value.

--task <task>

The pipeline task to run (e.g. text_generation, embeddings_generation). Used to disambiguate architectures registered under the same name for multiple tasks.

Options:

PipelineTask.TEXT_GENERATION | PipelineTask.EMBEDDINGS_GENERATION | PipelineTask.PIXEL_GENERATION | PipelineTask.UNDEFINED

--taylorseer, --no-taylorseer

Enable TaylorSeer cache optimization. Uses Taylor series prediction to skip full transformer passes on certain denoising steps.

--taylorseer-cache-interval <taylorseer_cache_interval>

Steps between full TaylorSeer computations. None uses the model-specific default (typically 5).

--taylorseer-max-order <taylorseer_max_order>

Taylor expansion order (1 or 2). Higher order uses second derivatives for more accurate prediction. None uses the model-specific default (typically 1).

--taylorseer-warmup-steps <taylorseer_warmup_steps>

Number of warmup steps before TaylorSeer prediction begins. None uses the model-specific default (typically 4).

--temperature <temperature>

Default sampling temperature. Controls randomness of token selection—higher values (e.g. 1.0) produce more random outputs, lower values (e.g. 0.2) produce more deterministic outputs. When set, this server-level default applies to all requests that do not explicitly provide temperature.

--temperature <temperature>

Controls the randomness of the model's output; higher values produce more diverse responses.
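The effect of temperature can be sketched with the standard temperature-scaled softmax, in which logits are divided by the temperature before normalization. This is the textbook formulation, shown under the assumption that MAX's sampler behaves the same way; the function name is hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.2)  # concentrates mass on the top token
flat = softmax_with_temperature(logits, 1.0)   # leaves the distribution wider
print(sharp[0], flat[0])
```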

--thinking-temperature <thinking_temperature>

Default temperature override for tokens inside <think>...</think> blocks. When set, this server-level default applies to all requests that do not explicitly provide thinking_temperature. Requires a reasoning parser to be configured; ignored otherwise.

--tool-parser <tool_parser>

Name of the tool call parser. The parser extracts tool calls from model output in chat completion responses. When unset, the server applies the architecture's default tool parser, if any. Pass "none" (case-insensitive) to explicitly disable tool parsing even when the architecture declares a default.

--top-k <top_k>

Limits the sampling to the K most probable tokens. This defaults to 255. For greedy sampling, set to 1.

--top-p <top_p>

Only use the tokens whose cumulative probability is within the top_p threshold. This applies to the top_k tokens.
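How --top-k and --top-p combine can be sketched as a two-stage filter: keep the K most probable tokens, then keep the smallest prefix of those whose cumulative probability reaches top_p. The sketch assumes probabilities are not renormalized between the two stages, which may differ from the actual sampler; the function name is hypothetical.

```python
def top_k_top_p(probs, top_k, top_p):
    """Return the tokens that survive top-k filtering followed by top-p filtering."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(top_k_top_p(probs, top_k=3, top_p=0.7))
print(top_k_top_p(probs, top_k=1, top_p=1.0))  # top_k=1 is greedy sampling
```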

--trust-remote-code, --no-trust-remote-code

Whether or not to allow for custom modeling files on Hugging Face.

--use-experimental-kernels <use_experimental_kernels>

Enables using experimental Mojo kernels with max serve. The kernels could be unstable or incorrect.

--use-relaxed-acceptance-for-thinking, --no-use-relaxed-acceptance-for-thinking

Enables relaxed acceptance for speculative decoding draft positions inside a <think>...</think> block. The target's top-N candidates (filtered by a probability threshold top1_prob - relaxed_delta) are compared against the draft token; matching any candidate accepts the draft. Outside the thinking span, the existing strict acceptance rule still applies.
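The relaxed acceptance rule, together with --relaxed-topk and --relaxed-delta, can be sketched as a membership test. The function name and the toy distribution are illustrative assumptions; the rule itself (top-N candidates with probability at least top1_prob - relaxed_delta) follows the descriptions above.

```python
def relaxed_accept(draft_token, target_probs, relaxed_topk, relaxed_delta):
    """Accept the draft token if it matches any eligible top-N target candidate."""
    ranked = sorted(target_probs.items(), key=lambda item: item[1], reverse=True)
    top1_prob = ranked[0][1]
    eligible = {
        token
        for token, p in ranked[:relaxed_topk]
        if p >= top1_prob - relaxed_delta
    }
    return draft_token in eligible

# Toy target distribution for one draft position inside a thinking block.
target_probs = {"the": 0.40, "a": 0.35, "an": 0.15, "this": 0.10}
print(relaxed_accept("a", target_probs, relaxed_topk=3, relaxed_delta=0.1))
print(relaxed_accept("an", target_probs, relaxed_topk=3, relaxed_delta=0.1))
```

Here "a" is accepted because it sits within 0.1 of the top candidate, while "an" is in the top 3 but falls outside the probability gap.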

--use-subgraphs, --no-use-subgraphs

Whether to use subgraphs for the model. This can significantly reduce compile time, especially for large models with identical blocks. Default is true.

--use-vendor-blas <use_vendor_blas>

Enables using vendor BLAS libraries (cublas, hipblas, etc.) with max serve. Currently, this just replaces matmul calls.

--use-vendor-ccl <use_vendor_ccl>

Enables using vendor CCL libraries (NCCL/RCCL) for collective operations such as allreduce in multi-GPU inference.

--vision-config-overrides <vision_config_overrides>

Model-specific vision configuration overrides. For example, for InternVL: {"max_dynamic_patch": 24}.

--weight-path <weight_path>

Optional path or URL of the model weights to use. Overrides default weight discovery.
