max generate
Generates output from a given model and prompt without using an endpoint. This is primarily useful for debugging and testing.
For example, generate a short completion from a prompt:
max generate \
--model google/gemma-3-12b-it \
--max-length 1024 \
--max-new-tokens 500 \
--top-k 40 \
--temperature 0.7 \
--seed 42 \
--prompt "Explain quantum computing"
For more information on how to use the generate command with vision models, see Image to text.
Usage
max generate [OPTIONS]
Options
- --ce-delay-ms <ce_delay_ms>: Duration of scheduler sleep prior to starting a prefill batch. Experimental for the TTS scheduler.
- --chat-template <chat_template>: Optional custom chat template to override the one shipped with the Hugging Face model config. If a path is provided, the file is read during config resolution and the content stored as a string. If None, the model's default chat template is used.
- --config-file <config_file>
- --custom-architectures <custom_architectures>: Custom architecture implementations to register. Each input can either be a raw module name or an import path followed by a colon and the module name. Each module must expose an ARCHITECTURES list of architectures to register.
- --data-parallel-degree <data_parallel_degree>: Data-parallelism parameter. The degree to which the model is replicated is dependent on the model type.
- --debug-verify-replay, --no-debug-verify-replay: When device_graph_capture is enabled, execute eager launch-trace verification before replay. Intended for debugging only.
- --decode-stall-timeout-s <decode_stall_timeout_s>: Seconds of no-batch-activity after which the decode worker exits to trigger a pod restart. None (the default) disables the watchdog. Set with the MODULAR_DECODE_STALL_TIMEOUT_S environment variable.
- --defer-resolve, --no-defer-resolve: Whether to defer resolving the pipeline config.
- --detokenize, --no-detokenize: Whether to detokenize the output tokens into text.
- --device-graph-capture, --no-device-graph-capture: Enable device graph capture and replay for graph execution. If unset, automatically enabled for some selected architectures. Use --no-device-graph-capture to explicitly disable.
- --device-memory-utilization <device_memory_utilization>: The fraction of available device memory that the process should consume. This informs the KVCache workspace size: kv_cache_workspace = (total_free_memory * device_memory_utilization) - model_weights_size.
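As a worked example of the workspace formula, the sketch below plugs in illustrative numbers. The function name and the memory figures are assumptions made for this example; they are not values reported by MAX.

```python
def kv_cache_workspace(total_free_memory: int,
                       device_memory_utilization: float,
                       model_weights_size: int) -> int:
    """Apply the formula from the option description (all sizes in bytes)."""
    budget = total_free_memory * device_memory_utilization
    return int(budget - model_weights_size)


GIB = 1024 ** 3
# Hypothetical figures: 80 GiB free, 90% utilization, 24 GiB of weights.
workspace = kv_cache_workspace(80 * GIB, 0.9, 24 * GIB)
print(f"{workspace / GIB:.1f} GiB")  # 80 * 0.9 - 24 = 48.0 GiB
```

Lowering the utilization fraction shrinks the KV cache budget directly, since the weights term is fixed for a given model.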
- --devices <devices>: Whether to run the model on CPU (--devices=cpu), GPU (--devices=gpu) or a list of GPUs (--devices=gpu:0,1). An ID value can be provided optionally to indicate the device ID to target. If not provided, the model will run on the first available GPU, or CPU if no GPUs are available.
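The accepted --devices values can be illustrated with a small parser. This is a hypothetical sketch of the string format described above, not the parser MAX uses; parse_devices and its return shape are assumptions.

```python
def parse_devices(spec: str) -> tuple[str, list[int]]:
    """Split a --devices value into a device kind and a list of device IDs."""
    kind, _, ids = spec.partition(":")
    if kind not in ("cpu", "gpu"):
        raise ValueError(f"unknown device kind: {kind!r}")
    # An empty ID list means no explicit IDs were given.
    device_ids = [int(part) for part in ids.split(",")] if ids else []
    return kind, device_ids


print(parse_devices("cpu"))      # ('cpu', [])
print(parse_devices("gpu"))      # ('gpu', [])
print(parse_devices("gpu:0,1"))  # ('gpu', [0, 1])
```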
- --draft-chat-template <draft_chat_template>: Optional custom chat template to override the one shipped with the Hugging Face model config. If a path is provided, the file is read during config resolution and the content stored as a string. If None, the model's default chat template is used.
- --draft-config-file <draft_config_file>
- --draft-data-parallel-degree <draft_data_parallel_degree>: Data-parallelism parameter. The degree to which the model is replicated is dependent on the model type.
- --draft-devices <draft_devices>: Devices for the draft model in speculative decoding. If not provided, inherits from --devices. Accepts the same format as --devices.
- --draft-enable-echo, --no-draft-enable-echo: Whether the model should be built with echo capabilities.
- --draft-force-download, --no-draft-force-download: Whether to force download a given file even if it's already present in the local cache.
- --draft-huggingface-model-revision <draft_huggingface_model_revision>: Branch or Git revision of Hugging Face model repository to use.
- --draft-huggingface-weight-revision <draft_huggingface_weight_revision>: Branch or Git revision of Hugging Face model repository to use.
- --draft-max-length <draft_max_length>: Maximum sequence length the model can process. If not specified, defaults to the model's max_position_embeddings. May be clamped during resolution based on available memory.
- --draft-model-path <draft_model_path>: Accepts either a Hugging Face repository ID or a local path to the model.
- --draft-pool-embeddings, --no-draft-pool-embeddings: Whether to pool embedding outputs.
- --draft-quantization-encoding <draft_quantization_encoding>: Weight encoding type. For GGUF models, the encoding is auto-detected from the repository when unset; if set, it must match an available encoding. When the repository contains multiple quantization formats, set this to choose one.
  - Options: float32 | bfloat16 | q4_k | q4_0 | q6_k | float8_e4m3fn | float4_e2m1fnx2 | gptq
- --draft-rope-type <draft_rope_type>: Force using a specific rope type. Only matters for GGUF weights.
  - Options: none | normal | neox | longrope | yarn
- --draft-section-name <draft_section_name>
- --draft-served-model-name <draft_served_model_name>: Optional override for client-facing model name. Defaults to model_path.
- --draft-subfolder <draft_subfolder>: Subdirectory within the Hugging Face repo to load config and weights from (for example, vae or text_encoder). When set, config.json and weights are resolved from {model_path}/{subfolder}/.
- --draft-trust-remote-code, --no-draft-trust-remote-code: Whether or not to allow for custom modeling files on Hugging Face.
- --draft-use-subgraphs, --no-draft-use-subgraphs: Whether to use subgraphs for the model. This can significantly reduce compile time, especially for large models with identical blocks. Default is true.
- --draft-vision-config-overrides <draft_vision_config_overrides>: Model-specific vision configuration overrides. For example, for InternVL: {"max_dynamic_patch": 24}.
- --draft-weight-path <draft_weight_path>: Optional path or URL of the model weights to use.
- --enable-chunked-prefill, --no-enable-chunked-prefill: Enable chunked prefill to split context encoding requests into multiple chunks based on max_batch_input_tokens.
- --enable-echo, --no-enable-echo: Whether the model should be built with echo capabilities.
- --enable-in-flight-batching, --no-enable-in-flight-batching: When enabled, prioritizes token generation by batching it with context encoding requests.
- --enable-lora, --no-enable-lora: Enables LoRA on the server.
- --enable-min-tokens, --no-enable-min-tokens: Whether to enable min_tokens, which blocks the model from generating stopping tokens before the min_tokens count is reached.
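A minimal sketch of the idea behind min_tokens: until the minimum count is reached, the logits of stop tokens are masked so they cannot be sampled. The function name and the use of plain Python lists are assumptions for illustration; the implementation inside MAX may differ.

```python
import math


def mask_stop_tokens(logits: list[float], stop_token_ids: set[int],
                     num_generated: int, min_tokens: int) -> list[float]:
    """Return logits with stop tokens disabled while below min_tokens."""
    if num_generated >= min_tokens:
        return list(logits)
    return [-math.inf if i in stop_token_ids else x
            for i, x in enumerate(logits)]


logits = [1.0, 2.0, 0.5]
print(mask_stop_tokens(logits, {1}, num_generated=3, min_tokens=5))  # [1.0, -inf, 0.5]
print(mask_stop_tokens(logits, {1}, num_generated=5, min_tokens=5))  # [1.0, 2.0, 0.5]
```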
- --enable-overlap-scheduler, --no-enable-overlap-scheduler: Whether to enable the overlap scheduler. This feature allows the scheduler to run alongside GPU execution, which helps improve GPU utilization. This is an experimental feature and may be unstable. It is enabled by default for some selected architectures. You can forcibly disable it by setting --no-enable-overlap-scheduler --force.
- --enable-penalties, --no-enable-penalties: Whether to apply frequency and presence penalties to the model's output.
- --enable-prefix-caching, --no-enable-prefix-caching: Whether to enable prefix caching for the paged KVCache.
- --enable-prioritize-first-decode, --no-enable-prioritize-first-decode: When enabled, the scheduler always runs a TG batch immediately after a CE batch with the same requests. This may reduce time-to-first-chunk latency. Experimental for the TTS scheduler.
- --enable-structured-output, --no-enable-structured-output: Enable structured generation/guided decoding for the server. This allows the user to pass a JSON schema in the response_format field, which the LLM will adhere to.
- --enable-variable-logits, --no-enable-variable-logits: Enable the sampling graph to accept a ragged tensor of different sequences as inputs, along with their associated logit_offsets. This is needed to produce additional logits for echo and speculative decoding purposes.
- --ep-size <ep_size>: The expert parallelism size. Needs to be 1 (no expert parallelism) or the total number of GPUs across nodes.
- --ep-use-allreduce, --no-ep-use-allreduce: Whether to use allreduce for the cross-device communication in expert parallelism.
- --execute-empty-batches, --no-execute-empty-batches: Whether the scheduler should execute empty batches.
- --first-block-caching, --no-first-block-caching: Enable First-Block Cache (FBCache) for step-cache denoising. When enabled, the transformer skips remaining blocks if the first-block residual is similar to the previous step.
- --force, --no-force: Skip validation of user provided flags against the architecture's required arguments.
- --force-download, --no-force-download: Whether to force download a given file even if it's already present in the local cache.
- --frequency-penalty <frequency_penalty>: The frequency penalty to apply to the model's output. A positive value will penalize new tokens based on their frequency in the generated text.
- --gpu-profiling <gpu_profiling>: Whether to enable GPU profiling of the model.
  - Options: off | on | detailed
- --huggingface-model-revision <huggingface_model_revision>: Branch or Git revision of Hugging Face model repository to use.
- --huggingface-weight-revision <huggingface_weight_revision>: Branch or Git revision of Hugging Face model repository to use.
- --ignore-eos: If True, the response will ignore the EOS token, and continue to generate until the max tokens or a stop string is hit.
- --image_url <image_url>: Images to include along with prompt, specified as URLs. The images are ignored if the model does not support image inputs.
- --kv-cache-format <kv_cache_format>: Override the default data type for the KV cache. Supported values: float32, bfloat16, float8_e4m3fn.
- --kv-cache-page-size <kv_cache_page_size>: The number of tokens in a single page in the paged KVCache.
- --kv-connector <kv_connector>: Type of KV cache connector to use. When not set, defaults to null (no external caching).
  - Options: KVConnectorType.null | KVConnectorType.local | KVConnectorType.tiered | KVConnectorType.lmcache | KVConnectorType.dkv
- --kv-connector-config <kv_connector_config>: Connector-specific configuration overrides as inline JSON or path to a YAML/JSON file. Each connector type has sensible defaults, so this is only needed for customization.
- --kvcache-ce-watermark <kvcache_ce_watermark>: Projected cache usage threshold for scheduling CE requests, considering current and incoming requests. CE is scheduled if either projected usage stays below this threshold or no active requests exist. Higher values can cause more preemptions.
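The watermark rule can be sketched as a small decision function. Measuring usage in pages, the function name, and the parameter names are all assumptions for illustration; only the "below threshold, or no active requests" rule comes from the description above. CE is read here as context encoding.

```python
def should_schedule_ce(used_pages: int, incoming_pages: int, total_pages: int,
                       num_active_requests: int, watermark: float) -> bool:
    """Decide whether a CE (context encoding) request may be scheduled."""
    if num_active_requests == 0:
        return True  # nothing is running, so CE always proceeds
    # Projected usage counts both current and incoming requests.
    projected_usage = (used_pages + incoming_pages) / total_pages
    return projected_usage < watermark


print(should_schedule_ce(70, 20, 100, num_active_requests=4, watermark=0.95))  # True
print(should_schedule_ce(90, 20, 100, num_active_requests=4, watermark=0.95))  # False
print(should_schedule_ce(90, 20, 100, num_active_requests=0, watermark=0.95))  # True
```

A higher watermark admits more CE work into a nearly full cache, which is consistent with the note that higher values can cause more preemptions.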
- --lora-paths <lora_paths>: List of statically defined LoRA paths.
- --max-batch-input-tokens <max_batch_input_tokens>: The target number of un-encoded tokens to include in each batch. This value is used for chunked prefill and memory estimation.
- --max-batch-size <max_batch_size>: Maximum batch size to execute with the model. When not specified (None), this value is determined dynamically. For server launches, set this higher based on server capacity.
- --max-batch-total-tokens <max_batch_total_tokens>: Ensures the sum of page-aligned context lengths in a batch does not exceed max_batch_total_tokens. Alignment uses the KV cache page size. If None, the sum is not limited.
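To make "page-aligned" concrete, the sketch below rounds each context length up to a multiple of the page size before summing. Rounding up is an assumption about what alignment means here, and both function names are hypothetical.

```python
from typing import Optional


def page_aligned_total(context_lengths: list[int], page_size: int) -> int:
    """Sum context lengths after rounding each up to a multiple of page_size."""
    return sum(-(-n // page_size) * page_size for n in context_lengths)


def fits_in_batch(context_lengths: list[int], page_size: int,
                  max_batch_total_tokens: Optional[int] = None) -> bool:
    """Check the batch against the limit; None means the sum is not limited."""
    if max_batch_total_tokens is None:
        return True
    return page_aligned_total(context_lengths, page_size) <= max_batch_total_tokens


lengths = [100, 130, 257]
print(page_aligned_total(lengths, page_size=128))  # 128 + 256 + 384 = 768
print(fits_in_batch(lengths, page_size=128, max_batch_total_tokens=512))  # False
```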
- --max-length <max_length>: Maximum sequence length the model can process. If not specified, defaults to the model's max_position_embeddings. May be clamped during resolution based on available memory.
- --max-lora-rank <max_lora_rank>: Maximum rank of all possible LoRAs.
- --max-new-tokens <max_new_tokens>: Maximum number of new tokens to generate during a single inference pass of the model.
- --max-num-loras <max_num_loras>: The maximum number of active LoRAs in a batch. This controls how many LoRA adapters can be active simultaneously during inference. Lower values reduce memory usage but limit concurrent adapter usage.
- --max-num-steps <max_num_steps>: The number of steps to run for multi-step scheduling. -1 specifies a default value based on configuration and platform. Ignored for models which are not auto-regressive (for example, embedding models).
- --max-queue-size-tg <max_queue_size_tg>: Maximum number of requests in decode queue. By default, this is max_batch_size.
- --max-vision-cache-entries <max_vision_cache_entries>: Maximum number of images cached in the vision encoder cache. Each entry stores the vision encoder output for one image, avoiding re-encoding across chunks and requests. Set to 0 to disable caching. Only used by VLMs.
- --min-batch-size-tg <min_batch_size_tg>: Soft floor on the decode batch size. If the TG batch size is larger, the scheduler continues TG batches; if it falls below, the scheduler prioritizes CE. This is not a strict minimum. By default, this is max_queue_size_tg. Experimental for the TTS scheduler.
- --min-new-tokens <min_new_tokens>: Minimum number of tokens to generate in the response.
- --min-p <min_p>: Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.
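The min-p rule can be sketched as follows: a token survives only if its probability is at least min_p times the probability of the most likely token. The function name and the renormalization step are assumptions; this is an illustration of the rule, not the sampler MAX ships.

```python
def min_p_filter(probs: list[float], min_p: float) -> list[float]:
    """Zero out tokens below min_p * max(probs), then renormalize."""
    if not 0.0 <= min_p <= 1.0:
        raise ValueError("min_p must be in [0, 1]")
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


probs = [0.5, 0.3, 0.15, 0.05]
# Threshold is 0.2 * 0.5 = 0.1, so only the last token (0.05) is dropped.
filtered = min_p_filter(probs, min_p=0.2)
print([round(p, 3) for p in filtered])
```

With min_p set to 0 the threshold is 0 and every token is kept, which matches the "set to 0 to disable" behavior described above.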
- --model, --model-path <model_path>: Accepts either a Hugging Face repository ID or a local path to the model.
- --model-override <model_override>: Per-component overrides for the ModelManifest, in the format component.field=value. Applied before resolution. Repeatable. Example: transformer.quantization_encoding=float4_e2m1fnx2.
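The component.field=value format can be illustrated with a short parser. This is a hypothetical sketch of the string format only; parse_model_overrides and the nested-dictionary result are assumptions, not how MAX applies overrides to the ModelManifest.

```python
def parse_model_overrides(overrides: list[str]) -> dict[str, dict[str, str]]:
    """Group repeated component.field=value strings by component."""
    parsed: dict[str, dict[str, str]] = {}
    for item in overrides:
        key, sep, value = item.partition("=")
        component, dot, field = key.partition(".")
        if not (sep and dot and component and field):
            raise ValueError(f"expected component.field=value, got {item!r}")
        parsed.setdefault(component, {})[field] = value
    return parsed


print(parse_model_overrides(["transformer.quantization_encoding=float4_e2m1fnx2"]))
# {'transformer': {'quantization_encoding': 'float4_e2m1fnx2'}}
```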
- --models <models>: The model manifest containing all model configs keyed by role.
- --num-speculative-tokens <num_speculative_tokens>: The number of speculative tokens.
- --num-warmups <num_warmups>: Number of warmup iterations to run before the final timed run.
  - Default: 0
- --pipeline-role <pipeline_role>: Whether the pipeline should serve a prefill role, a decode role, or both.
  - Options: prefill_and_decode | prefill_only | decode_only
- --pool-embeddings, --no-pool-embeddings: Whether to pool embedding outputs.
- --prefer-module-v3, --no-prefer-module-v3: Whether to prefer the eager API architecture over the graph API architecture. When False (default), the inference server uses the graph API architecture. When True, the server uses the eager API architecture when available and falls back to the graph API architecture.
- --presence-penalty <presence_penalty>: The presence penalty to apply to the model's output. A positive value will penalize new tokens that have already appeared in the generated text at least once.
- --prompt <prompt>: The text prompt to use for further generation.
- --quantization-encoding <quantization_encoding>: Weight encoding type. For GGUF models, the encoding is auto-detected from the repository when unset; if set, it must match an available encoding. When the repository contains multiple quantization formats, set this to choose one.
  - Options: float32 | bfloat16 | q4_k | q4_0 | q6_k | float8_e4m3fn | float4_e2m1fnx2 | gptq
- --reasoning-parser <reasoning_parser>: Name of the reasoning output parser. The parser extracts thinking blocks to populate the reasoning field in chat completion responses.
- --rejection-sampling-strategy <rejection_sampling_strategy>: Rejection sampling strategy for verifying draft tokens. Defaults to typical-acceptance for eagle/mtp and residual for standalone.
  - Options: greedy | residual | typical-acceptance | logit-comparison
- --relaxed-delta <relaxed_delta>: Probability gap below the top-1 candidate inside which candidates remain eligible for relaxed acceptance. A draft token is accepted if it matches any top-N candidate whose probability is at least top1_prob - relaxed_delta. Ignored when use_relaxed_acceptance_for_thinking is False.
- --relaxed-topk <relaxed_topk>: Top-N candidates from the target distribution to consider when relaxed acceptance is active. Ignored when use_relaxed_acceptance_for_thinking is False.
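The relaxed acceptance rule described by --relaxed-delta and --relaxed-topk can be sketched in a few lines. The function name and the list-based probability representation are assumptions for illustration; only the rule itself (match any top-N candidate within relaxed_delta of the top-1 probability) comes from the option descriptions.

```python
def relaxed_accept(draft_token: int, target_probs: list[float],
                   relaxed_topk: int, relaxed_delta: float) -> bool:
    """Accept the draft token if it is an eligible top-N target candidate."""
    # Rank token IDs by target probability, highest first.
    ranked = sorted(range(len(target_probs)),
                    key=lambda t: target_probs[t], reverse=True)
    top1_prob = target_probs[ranked[0]]
    candidates = [t for t in ranked[:relaxed_topk]
                  if target_probs[t] >= top1_prob - relaxed_delta]
    return draft_token in candidates


target_probs = [0.05, 0.40, 0.35, 0.20]
print(relaxed_accept(2, target_probs, relaxed_topk=2, relaxed_delta=0.1))  # True
print(relaxed_accept(3, target_probs, relaxed_topk=3, relaxed_delta=0.1))  # False
```

In the second call token 3 is inside the top 3 but its probability (0.20) falls more than 0.1 below the top-1 probability (0.40), so it is rejected.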
- --repetition-penalty <repetition_penalty>: The repetition penalty to apply to the model's output. Values > 1 will penalize new tokens that have already appeared in the generated text at least once.
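The three penalty options (--frequency-penalty, --presence-penalty, --repetition-penalty) can be compared side by side in a sketch. The formulas below are commonly used formulations and are assumptions here: the source does not state the exact arithmetic MAX applies, nor the order in which the penalties are combined.

```python
from collections import Counter


def apply_penalties(logits: list[float], generated: list[int],
                    frequency_penalty: float = 0.0,
                    presence_penalty: float = 0.0,
                    repetition_penalty: float = 1.0) -> list[float]:
    """Penalize logits of tokens that already appear in the generated text."""
    out = list(logits)
    for token, count in Counter(generated).items():
        # Assumed repetition rule: shrink positive logits, push negative ones lower.
        if out[token] > 0:
            out[token] /= repetition_penalty
        else:
            out[token] *= repetition_penalty
        # Assumed additive rule: frequency scales with count, presence is flat.
        out[token] -= frequency_penalty * count + presence_penalty
    return out


logits = [2.0, 1.0, -1.0]
print(apply_penalties(logits, [0, 0, 2], frequency_penalty=0.5, presence_penalty=0.25))
# [0.75, 1.0, -1.75]
print(apply_penalties(logits, [0, 2], repetition_penalty=2.0))
# [1.0, 1.0, -2.0]
```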
- --rope-type <rope_type>: Force using a specific rope type. Only matters for GGUF weights.
  - Options: none | normal | neox | longrope | yarn
- --section-name <section_name>
- --seed <seed>: Seed for the random number generator.
- --served-model-name <served_model_name>: Optional override for client-facing model name. Defaults to model_path.
- --speculative-method <speculative_method>: The speculative decoding method to use.
  - Options: standalone | eagle | mtp
- --stop <stop>: A list of detokenized sequences that can be used as stop criteria when generating a new sequence. Can be specified multiple times.
- --stop-token-ids <stop_token_ids>: A list of token ids that are used as stopping criteria when generating a new sequence. Comma-separated integers.
- --subfolder <subfolder>: Subdirectory within the Hugging Face repo to load config and weights from (for example, vae or text_encoder). When set, config.json and weights are resolved from {model_path}/{subfolder}/.
- --synthetic-acceptance-rate <synthetic_acceptance_rate>: Synthetic acceptance rate for benchmarking (0.0 to 1.0). When set, the rejection sampler bypasses the real draft/target comparison and accepts each draft position with a calibrated probability so the mean joint acceptance across num_speculative_tokens positions matches this value.
- --taylorseer, --no-taylorseer: Enable TaylorSeer cache optimization. Uses Taylor series prediction to skip full transformer passes on certain denoising steps.
- --taylorseer-cache-interval <taylorseer_cache_interval>: Steps between full TaylorSeer computations. None uses the model-specific default (typically 5).
- --taylorseer-max-order <taylorseer_max_order>: Taylor expansion order (1 or 2). Higher order uses second derivatives for more accurate prediction. None uses the model-specific default (typically 1).
- --taylorseer-warmup-steps <taylorseer_warmup_steps>: Number of warmup steps before TaylorSeer prediction begins. None uses the model-specific default (typically 4).
- --teacache, --no-teacache: Enable TeaCache cache optimization. Uses the timestep-aware modulated input change to decide when the FLUX.2 transformer backbone can be skipped.
- --teacache-coefficients <teacache_coefficients>: Polynomial coefficients used to rescale TeaCache's relative-L1 metric. None uses the model-specific default coefficients.
- --teacache-rel-l1-thresh <teacache_rel_l1_thresh>: Relative-L1 threshold used by TeaCache. None uses the model-specific default.
- --temperature <temperature>: Controls the randomness of the model's output; higher values produce more diverse responses.
- --top-k <top_k>: Limits the sampling to the K most probable tokens. This defaults to 255. For greedy sampling, set to 1.
- --top-p <top_p>: Only use the tokens whose cumulative probability is within the top_p threshold. This applies to the top_k tokens.
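How --top-k and --top-p interact can be sketched as a two-stage filter: keep the K most probable tokens, then keep the shortest prefix of those whose cumulative probability reaches top_p. Renormalizing over the top-k set and including the token that crosses the threshold are assumptions; the function name is hypothetical.

```python
def top_k_top_p_filter(probs: list[float], top_k: int, top_p: float) -> list[int]:
    """Return token IDs that survive top-k followed by top-p filtering."""
    ranked = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)[:top_k]
    total = sum(probs[t] for t in ranked)
    kept, cumulative = [], 0.0
    for t in ranked:
        kept.append(t)
        cumulative += probs[t] / total  # renormalize over the top-k set
        if cumulative >= top_p:
            break
    return kept


probs = [0.5, 0.25, 0.125, 0.125]
print(top_k_top_p_filter(probs, top_k=3, top_p=0.8))  # [0, 1]
print(top_k_top_p_filter(probs, top_k=1, top_p=1.0))  # [0], greedy
```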
- --trust-remote-code, --no-trust-remote-code: Whether or not to allow for custom modeling files on Hugging Face.
- --use-experimental-kernels <use_experimental_kernels>: Enables using experimental Mojo kernels with max serve. The kernels could be unstable or incorrect.
- --use-relaxed-acceptance-for-thinking, --no-use-relaxed-acceptance-for-thinking: Enables relaxed acceptance for speculative decoding draft positions inside a <think>...</think> block. The target's top-N candidates (filtered by a probability threshold top1_prob - relaxed_delta) are compared against the draft token; matching any candidate accepts the draft. Outside the thinking span, the existing strict acceptance rule still applies.
- --use-subgraphs, --no-use-subgraphs: Whether to use subgraphs for the model. This can significantly reduce compile time, especially for large models with identical blocks. Default is true.
- --use-vendor-blas <use_vendor_blas>: Enables using vendor BLAS libraries (cublas, hipblas, etc.) with max serve. Currently, this just replaces matmul calls.
- --use-vendor-ccl <use_vendor_ccl>: Enables using vendor CCL libraries (NCCL/RCCL) for collective operations such as allreduce in multi-GPU inference.
- --vision-config-overrides <vision_config_overrides>: Model-specific vision configuration overrides. For example, for InternVL: {"max_dynamic_patch": 24}.
- --weight-path <weight_path>: Optional path or URL of the model weights to use.
- --zmq-endpoint-base <zmq_endpoint_base>: Prefix for ZMQ endpoints used for IPC. This ensures unique endpoints across MAX Serve instances on the same host. Example: lora_request_zmq_endpoint = f"{zmq_endpoint_base}-lora_request".