max warm-cache
Preloads and compiles the model to optimize initialization time by:
- Pre-compiling models before deployment
- Warming up the Hugging Face cache
This command is useful to run before serving a model.
For example, compile and cache a model hosted on Hugging Face:
max warm-cache \
  --model google/gemma-3-12b-it
To compile for a target API and architecture without requiring matching physical hardware, pass --target (for example, cuda, cuda:sm_90, or hip:gfx942). MAX uses virtual devices for the compilation, which is useful when building MEF caches on a CI host that doesn't have the deployment hardware attached:
max warm-cache \
  --model google/gemma-3-12b-it \
  --target cuda:sm_90
Usage
max warm-cache [OPTIONS]
Options
- --ce-delay-ms <ce_delay_ms>: Duration of scheduler sleep, in milliseconds, prior to starting a prefill batch. Experimental for the TTS scheduler.
- --chat-template <chat_template>: Optional custom chat template to override the one shipped with the Hugging Face model config. If a path is provided, the file is read during config resolution and the content is stored as a string. If None, the model's default chat template is used.
- --config-file <config_file>
- --custom-architectures <custom_architectures>: Custom architecture implementations to register. Each input can be either a raw module name or an import path followed by a colon and the module name. Each module must expose an ARCHITECTURES list of architectures to register.
- --data-parallel-degree <data_parallel_degree>: Data-parallelism parameter. The degree to which the model is replicated depends on the model type.
- --debug-verify-replay, --no-debug-verify-replay: When device_graph_capture is enabled, execute eager launch-trace verification before replay. Intended for debugging only.
- --decode-stall-timeout-s <decode_stall_timeout_s>: Seconds of no batch activity after which the decode worker exits to trigger a pod restart. None (the default) disables the watchdog. Set with the MODULAR_DECODE_STALL_TIMEOUT_S environment variable.
- --defer-resolve, --no-defer-resolve: Whether to defer resolving the pipeline config.
- --device-graph-capture, --no-device-graph-capture: Enable device graph capture and replay for graph execution. If unset, automatically enabled for some selected architectures. Use --no-device-graph-capture to explicitly disable.
- --device-memory-utilization <device_memory_utilization>: The fraction of available device memory that the process should consume. This informs the KVCache workspace size: kv_cache_workspace = (total_free_memory * device_memory_utilization) - model_weights_size.
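The workspace formula for --device-memory-utilization can be worked through with concrete numbers. The following is a minimal arithmetic sketch, not MAX code: the function name, the byte units, the accepted range of (0, 1], and the example figures are all assumptions chosen for illustration.

```python
GIB = 1024**3  # bytes per gibibyte


def kv_cache_workspace_bytes(
    total_free_memory: int,
    device_memory_utilization: float,
    model_weights_size: int,
) -> float:
    """Evaluate the workspace formula from the option description (sizes in bytes)."""
    # Assumption: the fraction lies in (0, 1]; the source does not state the valid range.
    if not 0.0 < device_memory_utilization <= 1.0:
        raise ValueError("device_memory_utilization must be in (0, 1]")
    return total_free_memory * device_memory_utilization - model_weights_size


# Hypothetical figures: 80 GiB free, 90% utilization, 24 GiB of weights.
workspace = kv_cache_workspace_bytes(80 * GIB, 0.9, 24 * GIB)
print(f"KV cache workspace: {workspace / GIB:.1f} GiB")
```

With these hypothetical figures the workspace comes to roughly 48 GiB. A result at or below zero would mean the weights alone exceed the memory budget.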
- --devices <devices>: Whether to run the model on CPU (--devices=cpu), GPU (--devices=gpu), or a list of GPUs (--devices=gpu:0,1). A device ID can optionally be provided to indicate which device to target. If not provided, the model runs on the first available GPU, or on CPU if no GPUs are available.
- --draft-chat-template <draft_chat_template>: Optional custom chat template to override the one shipped with the Hugging Face model config. If a path is provided, the file is read during config resolution and the content is stored as a string. If None, the model's default chat template is used.
- --draft-config-file <draft_config_file>
- --draft-data-parallel-degree <draft_data_parallel_degree>: Data-parallelism parameter. The degree to which the model is replicated depends on the model type.
- --draft-devices <draft_devices>: Devices for the draft model in speculative decoding. If not provided, inherits from --devices. Accepts the same format as --devices.
- --draft-enable-echo, --no-draft-enable-echo: Whether the model should be built with echo capabilities.
- --draft-force-download, --no-draft-force-download: Whether to force the download of a file even if it's already present in the local cache.
- --draft-huggingface-model-revision <draft_huggingface_model_revision>: Branch or Git revision of Hugging Face model repository to use.
- --draft-huggingface-weight-revision <draft_huggingface_weight_revision>: Branch or Git revision of Hugging Face model repository to use.
- --draft-max-length <draft_max_length>: Maximum sequence length the model can process. If not specified, defaults to the model's max_position_embeddings. May be clamped during resolution based on available memory.
- --draft-model-path <draft_model_path>: Accepts either a Hugging Face repository ID or a local path to the model.
- --draft-pool-embeddings, --no-draft-pool-embeddings: Whether to pool embedding outputs.
- --draft-quantization-encoding <draft_quantization_encoding>: Weight encoding type. For GGUF models, the encoding is auto-detected from the repository when unset; if set, it must match an available encoding. When the repository contains multiple quantization formats, set this to choose one.
  Options: float32 | bfloat16 | q4_k | q4_0 | q6_k | float8_e4m3fn | float4_e2m1fnx2 | gptq
- --draft-rope-type <draft_rope_type>: Force using a specific rope type. Only matters for GGUF weights.
  Options: none | normal | neox | longrope | yarn
- --draft-section-name <draft_section_name>
- --draft-served-model-name <draft_served_model_name>: Optional override for the client-facing model name. Defaults to model_path.
- --draft-subfolder <draft_subfolder>: Subdirectory within the Hugging Face repo to load config and weights from (for example, vae or text_encoder). When set, config.json and weights are resolved from {model_path}/{subfolder}/.
- --draft-trust-remote-code, --no-draft-trust-remote-code: Whether to allow custom modeling files from Hugging Face.
- --draft-use-subgraphs, --no-draft-use-subgraphs: Whether to use subgraphs for the model. This can significantly reduce compile time, especially for large models with identical blocks. Default is true.
- --draft-vision-config-overrides <draft_vision_config_overrides>: Model-specific vision configuration overrides. For example, for InternVL: {"max_dynamic_patch": 24}.
- --draft-weight-path <draft_weight_path>: Optional path or URL of the model weights to use.
- --enable-chunked-prefill, --no-enable-chunked-prefill: Enable chunked prefill to split context encoding requests into multiple chunks based on max_batch_input_tokens.
- --enable-echo, --no-enable-echo: Whether the model should be built with echo capabilities.
- --enable-in-flight-batching, --no-enable-in-flight-batching: When enabled, prioritizes token generation by batching it with context encoding requests.
- --enable-lora, --no-enable-lora: Enables LoRA on the server.
- --enable-min-tokens, --no-enable-min-tokens: Whether to enable min_tokens, which blocks the model from generating stopping tokens before the min_tokens count is reached.
- --enable-overlap-scheduler, --no-enable-overlap-scheduler: Whether to enable the overlap scheduler, which lets the scheduler run alongside GPU execution to improve GPU utilization. This feature is experimental and may be unstable. It is enabled by default for some selected architectures. To forcibly disable it, pass --no-enable-overlap-scheduler --force.
- --enable-penalties, --no-enable-penalties: Whether to apply frequency and presence penalties to the model's output.
- --enable-prefix-caching, --no-enable-prefix-caching: Whether to enable prefix caching for the paged KVCache.
- --enable-prioritize-first-decode, --no-enable-prioritize-first-decode: When enabled, the scheduler always runs a token generation (TG) batch immediately after a context encoding (CE) batch with the same requests. This may reduce time-to-first-chunk latency. Experimental for the TTS scheduler.
- --enable-structured-output, --no-enable-structured-output: Enable structured generation/guided decoding for the server. This allows the user to pass a JSON schema in the response_format field, which the LLM will adhere to.
- --enable-variable-logits, --no-enable-variable-logits: Enable the sampling graph to accept a ragged tensor of different sequences as inputs, along with their associated logit_offsets. This is needed to produce additional logits for echo and speculative decoding purposes.
- --ep-size <ep_size>: The expert parallelism size. Must be either 1 (no expert parallelism) or the total number of GPUs across nodes.
- --ep-use-allreduce, --no-ep-use-allreduce: Whether to use allreduce for the cross-device communication in expert parallelism.
- --execute-empty-batches, --no-execute-empty-batches: Whether the scheduler should execute empty batches.
- --first-block-caching, --no-first-block-caching: Enable First-Block Cache (FBCache) for step-cache denoising. When enabled, the transformer skips remaining blocks if the first-block residual is similar to the previous step.
- --force, --no-force: Skip validation of user-provided flags against the architecture's required arguments.
- --force-download, --no-force-download: Whether to force the download of a file even if it's already present in the local cache.
- --gpu-profiling <gpu_profiling>: Whether to enable GPU profiling of the model.
  Options: off | on | detailed
- --huggingface-model-revision <huggingface_model_revision>: Branch or Git revision of Hugging Face model repository to use.
- --huggingface-weight-revision <huggingface_weight_revision>: Branch or Git revision of Hugging Face model repository to use.
- --kv-cache-format <kv_cache_format>: Override the default data type for the KV cache. Supported values: float32, bfloat16, float8_e4m3fn.
- --kv-cache-page-size <kv_cache_page_size>: The number of tokens in a single page in the paged KVCache.
- --kv-connector <kv_connector>: Type of KV cache connector to use. When not set, defaults to null (no external caching).
  Options: KVConnectorType.null | KVConnectorType.local | KVConnectorType.tiered | KVConnectorType.lmcache | KVConnectorType.dkv
- --kv-connector-config <kv_connector_config>: Connector-specific configuration overrides as inline JSON or path to a YAML/JSON file. Each connector type has sensible defaults, so this is only needed for customization.
- --kvcache-ce-watermark <kvcache_ce_watermark>: Projected cache usage threshold for scheduling CE requests, considering current and incoming requests. CE is scheduled if either projected usage stays below this threshold or no active requests exist. Higher values can cause more preemptions.
- --lora-paths <lora_paths>: List of statically defined LoRA paths.
- --max-batch-input-tokens <max_batch_input_tokens>: The target number of un-encoded tokens to include in each batch. This value is used for chunked prefill and memory estimation.
- --max-batch-size <max_batch_size>: Maximum batch size to execute with the model. When not specified (None), this value is determined dynamically. For server launches, set this higher based on server capacity.
- --max-batch-total-tokens <max_batch_total_tokens>: Ensures the sum of page-aligned context lengths in a batch does not exceed max_batch_total_tokens. Alignment uses the KV cache page size. If None, the sum is not limited.
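The page-aligned sum that --max-batch-total-tokens limits can be illustrated with a short sketch. This is not the scheduler's code: the helper names are hypothetical, and it assumes that alignment means rounding each context length up to the next multiple of the KV cache page size.

```python
from typing import Optional, Sequence


def page_aligned(length: int, page_size: int) -> int:
    """Round a context length up to a multiple of the page size."""
    # Assumption: alignment rounds up; the source only says it uses the page size.
    return -(-length // page_size) * page_size


def within_token_budget(
    context_lengths: Sequence[int],
    page_size: int,
    max_batch_total_tokens: Optional[int],
) -> bool:
    """Check a batch against the limit; None means the sum is not limited."""
    if max_batch_total_tokens is None:
        return True
    total = sum(page_aligned(n, page_size) for n in context_lengths)
    return total <= max_batch_total_tokens


# Hypothetical batch: lengths 100, 300, and 129 align to 128, 384, and 256.
lengths = [100, 300, 129]
print(within_token_budget(lengths, page_size=128, max_batch_total_tokens=768))
print(within_token_budget(lengths, page_size=128, max_batch_total_tokens=767))
```

Under this assumption the raw lengths sum to 529 tokens but occupy 768 page-aligned tokens, so a limit sized from raw lengths alone could be too tight.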
- --max-length <max_length>: Maximum sequence length the model can process. If not specified, defaults to the model's max_position_embeddings. May be clamped during resolution based on available memory.
- --max-lora-rank <max_lora_rank>: Maximum rank of all possible LoRAs.
- --max-num-loras <max_num_loras>: The maximum number of active LoRAs in a batch. This controls how many LoRA adapters can be active simultaneously during inference. Lower values reduce memory usage but limit concurrent adapter usage.
- --max-num-steps <max_num_steps>: The number of steps to run for multi-step scheduling. -1 specifies a default value based on configuration and platform. Ignored for models that are not auto-regressive (for example, embedding models).
- --max-queue-size-tg <max_queue_size_tg>: Maximum number of requests in the decode queue. By default, this is max_batch_size.
- --max-vision-cache-entries <max_vision_cache_entries>: Maximum number of images cached in the vision encoder cache. Each entry stores the vision encoder output for one image, avoiding re-encoding across chunks and requests. Set to 0 to disable caching. Only used by VLMs.
- --min-batch-size-tg <min_batch_size_tg>: Soft floor on the decode batch size. If the TG batch size is larger, the scheduler continues TG batches; if it falls below, the scheduler prioritizes CE. This is not a strict minimum. By default, this is max_queue_size_tg. Experimental for the TTS scheduler.
- --model, --model-path <model_path>: Accepts either a Hugging Face repository ID or a local path to the model.
- --model-override <model_override>: Per-component overrides for the ModelManifest, in the format component.field=value. Applied before resolution. Repeatable. Example: transformer.quantization_encoding=float4_e2m1fnx2.
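To make the component.field=value shape of --model-override concrete, here is a small hypothetical parser. It is not how MAX parses the flag: the function name is invented, and splitting on the first "=" and the first "." is an assumption about how the three parts separate.

```python
def parse_model_override(override: str) -> tuple[str, str, str]:
    """Split 'component.field=value' into its three parts (illustrative only)."""
    # Assumption: the first '=' separates the key from the value,
    # and the first '.' separates the component from the field.
    key, eq, value = override.partition("=")
    component, dot, field = key.partition(".")
    if not (eq and dot and component and field):
        raise ValueError(f"expected component.field=value, got {override!r}")
    return component, field, value


# The example string from the option description.
print(parse_model_override("transformer.quantization_encoding=float4_e2m1fnx2"))
```

Because the flag is repeatable, each occurrence would carry one such triple.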
- --models <models>: The model manifest containing all model configs keyed by role.
- --num-speculative-tokens <num_speculative_tokens>: The number of speculative tokens.
- --pipeline-role <pipeline_role>: Whether the pipeline should serve a prefill role, a decode role, or both.
  Options: prefill_and_decode | prefill_only | decode_only
- --pool-embeddings, --no-pool-embeddings: Whether to pool embedding outputs.
- --prefer-module-v3, --no-prefer-module-v3: Whether to prefer the eager API architecture over the graph API architecture. When False (default), the inference server uses the graph API architecture. When True, the server uses the eager API architecture when available and falls back to the graph API architecture.
- --quantization-encoding <quantization_encoding>: Weight encoding type. For GGUF models, the encoding is auto-detected from the repository when unset; if set, it must match an available encoding. When the repository contains multiple quantization formats, set this to choose one.
  Options: float32 | bfloat16 | q4_k | q4_0 | q6_k | float8_e4m3fn | float4_e2m1fnx2 | gptq
- --reasoning-parser <reasoning_parser>: Name of the reasoning output parser. The parser extracts thinking blocks to populate the reasoning field in chat completion responses.
- --rejection-sampling-strategy <rejection_sampling_strategy>: Rejection sampling strategy for verifying draft tokens. Defaults to typical-acceptance for eagle/mtp and residual for standalone.
  Options: greedy | residual | typical-acceptance | logit-comparison
- --relaxed-delta <relaxed_delta>: Probability gap below the top-1 candidate inside which candidates remain eligible for relaxed acceptance. A draft token is accepted if it matches any top-N candidate whose probability is at least top1_prob - relaxed_delta. Ignored when use_relaxed_acceptance_for_thinking is False.
- --relaxed-topk <relaxed_topk>: Top-N candidates from the target distribution to consider when relaxed acceptance is active. Ignored when use_relaxed_acceptance_for_thinking is False.
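The way --relaxed-delta and --relaxed-topk interact can be sketched for a single draft position. This is an illustration of the rule as described, not the MAX sampler: the function name, the dict-of-probabilities input, and the example values are assumptions.

```python
def relaxed_accept(
    draft_token: int,
    target_probs: dict[int, float],
    relaxed_topk: int,
    relaxed_delta: float,
) -> bool:
    """Accept the draft token if it matches an eligible top-N target candidate."""
    # Rank target candidates by probability and keep the top N.
    ranked = sorted(target_probs.items(), key=lambda item: item[1], reverse=True)
    candidates = ranked[:relaxed_topk]
    if not candidates:
        return False
    # Candidates stay eligible only within relaxed_delta of the top-1 probability.
    threshold = candidates[0][1] - relaxed_delta
    return any(
        token == draft_token and prob >= threshold for token, prob in candidates
    )


# Hypothetical target distribution over four token IDs.
probs = {7: 0.5, 3: 0.3, 9: 0.15, 1: 0.05}
print(relaxed_accept(3, probs, relaxed_topk=3, relaxed_delta=0.25))  # in top 3, above threshold
print(relaxed_accept(9, probs, relaxed_topk=3, relaxed_delta=0.25))  # in top 3, below threshold
print(relaxed_accept(1, probs, relaxed_topk=3, relaxed_delta=0.25))  # outside top 3
```

In this sketch the threshold is 0.5 - 0.25 = 0.25, so token 3 is accepted while tokens 9 and 1 are rejected. A larger relaxed_delta or relaxed_topk widens the set of draft tokens that pass.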
- --rope-type <rope_type>: Force using a specific rope type. Only matters for GGUF weights.
  Options: none | normal | neox | longrope | yarn
- --section-name <section_name>
- --served-model-name <served_model_name>: Optional override for the client-facing model name. Defaults to model_path.
- --speculative-method <speculative_method>: The speculative decoding method to use.
  Options: standalone | eagle | mtp
- --subfolder <subfolder>: Subdirectory within the Hugging Face repo to load config and weights from (for example, vae or text_encoder). When set, config.json and weights are resolved from {model_path}/{subfolder}/.
- --synthetic-acceptance-rate <synthetic_acceptance_rate>: Synthetic acceptance rate for benchmarking (0.0 to 1.0). When set, the rejection sampler bypasses the real draft/target comparison and accepts each draft position with a calibrated probability so the mean joint acceptance across num_speculative_tokens positions matches this value.
- --target <target>: Target API and architecture to compile for (for example, cuda, cuda:sm_90, or hip:gfx942). When specified, MAX uses virtual devices for compilation and does not require physical hardware.
- --taylorseer, --no-taylorseer: Enable TaylorSeer cache optimization. Uses Taylor series prediction to skip full transformer passes on certain denoising steps.
- --taylorseer-cache-interval <taylorseer_cache_interval>: Steps between full TaylorSeer computations. None uses the model-specific default (typically 5).
- --taylorseer-max-order <taylorseer_max_order>: Taylor expansion order (1 or 2). Higher order uses second derivatives for more accurate prediction. None uses the model-specific default (typically 1).
- --taylorseer-warmup-steps <taylorseer_warmup_steps>: Number of warmup steps before TaylorSeer prediction begins. None uses the model-specific default (typically 4).
- --teacache, --no-teacache: Enable TeaCache cache optimization. Uses the timestep-aware modulated input change to decide when the FLUX.2 transformer backbone can be skipped.
- --teacache-coefficients <teacache_coefficients>: Polynomial coefficients used to rescale TeaCache's relative-L1 metric. None uses the model-specific default coefficients.
- --teacache-rel-l1-thresh <teacache_rel_l1_thresh>: Relative-L1 threshold used by TeaCache. None uses the model-specific default.
- --trust-remote-code, --no-trust-remote-code: Whether to allow custom modeling files from Hugging Face.
- --use-experimental-kernels <use_experimental_kernels>: Enables using experimental Mojo kernels with max serve. The kernels could be unstable or incorrect.
- --use-relaxed-acceptance-for-thinking, --no-use-relaxed-acceptance-for-thinking: Enables relaxed acceptance for speculative decoding draft positions inside a <think>...</think> block. The target's top-N candidates (filtered by a probability threshold top1_prob - relaxed_delta) are compared against the draft token; matching any candidate accepts the draft. Outside the thinking span, the existing strict acceptance rule still applies.
- --use-subgraphs, --no-use-subgraphs: Whether to use subgraphs for the model. This can significantly reduce compile time, especially for large models with identical blocks. Default is true.
- --use-vendor-blas <use_vendor_blas>: Enables using vendor BLAS libraries (cublas, hipblas, etc.) with max serve. Currently, this just replaces matmul calls.
- --use-vendor-ccl <use_vendor_ccl>: Enables using vendor CCL libraries (NCCL/RCCL) for collective operations such as allreduce in multi-GPU inference.
- --vision-config-overrides <vision_config_overrides>: Model-specific vision configuration overrides. For example, for InternVL: {"max_dynamic_patch": 24}.
- --weight-path <weight_path>: Optional path or URL of the model weights to use.
- --zmq-endpoint-base <zmq_endpoint_base>: Prefix for ZMQ endpoints used for IPC. This ensures unique endpoints across MAX Serve instances on the same host. Example: lora_request_zmq_endpoint = f"{zmq_endpoint_base}-lora_request".