max warm-cache

Preloads and compiles the model to optimize initialization time by:

  • Pre-compiling models before deployment
  • Warming up the Hugging Face cache

This command is useful to run before serving a model.

For example, compile and cache a model hosted on Hugging Face:

max warm-cache \
  --model google/gemma-3-12b-it

To compile for a target API and architecture without requiring matching physical hardware, pass --target (for example, cuda, cuda:sm_90, or hip:gfx942). MAX uses virtual devices for the compilation, which is useful when building MEF caches on a CI host that doesn't have the deployment hardware attached:

max warm-cache \
  --model google/gemma-3-12b-it \
  --target cuda:sm_90
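
A CI job that warms caches for several targets could assemble these invocations programmatically. The following is a minimal Python sketch, not part of MAX: the helper name warm_cache_command and the list of targets are hypothetical, and only the --model and --target flags shown above are used. It prints the commands rather than running them.

```python
import shlex
from typing import List, Optional


def warm_cache_command(model: str, target: Optional[str] = None) -> List[str]:
    """Build the argument list for one `max warm-cache` invocation."""
    args = ["max", "warm-cache", "--model", model]
    if target is not None:
        # --target selects the API and architecture to compile for.
        args += ["--target", target]
    return args


# Hypothetical CI matrix; the target strings are the examples from the text above.
targets = ["cuda:sm_90", "hip:gfx942"]
commands = [warm_cache_command("google/gemma-3-12b-it", t) for t in targets]

for cmd in commands:
    # Printed only. On a host with MAX installed, `cmd` could instead be
    # passed to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Keeping each command as an argument list avoids shell-quoting problems if a model path contains spaces.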

Usage

max warm-cache [OPTIONS]

Options

  • --ce-delay-ms <ce_delay_ms>​

    Duration, in milliseconds, that the scheduler sleeps before starting a prefill batch. Experimental for the TTS scheduler.

  • --chat-template <chat_template>​

    Optional custom chat template to override the one shipped with the Hugging Face model config. If a path is provided, the file is read during config resolution and the content stored as a string. If None, the model's default chat template is used.

  • --config-file <config_file>​

  • --custom-architectures <custom_architectures>​

    Custom architecture implementations to register. Each input can either be a raw module name or an import path followed by a colon and the module name. Each module must expose an ARCHITECTURES list of architectures to register.
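
    As a rough illustration of the two accepted input forms, the sketch below splits a value into an optional import path and a module name. The function name and the example values are hypothetical, and MAX's own parsing may differ (this version simply splits on the last colon).

```python
from typing import Optional, Tuple


def split_custom_architecture(spec: str) -> Tuple[Optional[str], str]:
    """Split 'import_path:module' or a bare 'module' into (import_path, module).

    Assumption: the last colon separates the import path from the module name.
    """
    path, sep, module = spec.rpartition(":")
    if not sep:
        # No colon present: the whole value is a raw module name.
        return None, spec
    return path, module


# Hypothetical inputs, for illustration only.
print(split_custom_architecture("my_archs"))
print(split_custom_architecture("/opt/models:my_archs"))
```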

  • --data-parallel-degree <data_parallel_degree>​

    Data-parallelism degree. How many times the model is replicated depends on the model type.

  • --debug-verify-replay, --no-debug-verify-replay​

    When device_graph_capture is enabled, execute eager launch-trace verification before replay. Intended for debugging only.

  • --decode-stall-timeout-s <decode_stall_timeout_s>​

    Number of seconds without batch activity after which the decode worker exits to trigger a pod restart. None (the default) disables the watchdog. Can also be set with the MODULAR_DECODE_STALL_TIMEOUT_S environment variable.

  • --defer-resolve, --no-defer-resolve​

    Whether to defer resolving the pipeline config.

  • --device-graph-capture, --no-device-graph-capture​

    Enable device graph capture and replay for graph execution. If unset, it is enabled automatically for selected architectures. Use --no-device-graph-capture to disable it explicitly.

  • --device-memory-utilization <device_memory_utilization>​

    The fraction of available device memory that the process should consume. This informs the KVCache workspace size: kv_cache_workspace = (total_free_memory * device_memory_utilization) - model_weights_size.
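
    A quick worked example of this formula, as a sketch only: the memory figures below are hypothetical, and the function simply restates the expression above.

```python
GIB = 1024**3


def kv_cache_workspace(
    total_free_memory: int,
    device_memory_utilization: float,
    model_weights_size: int,
) -> float:
    """Bytes left for the KV cache, per the formula in the option description."""
    return total_free_memory * device_memory_utilization - model_weights_size


# Hypothetical device: 80 GiB free, 90% utilization, 24 GiB of weights.
workspace = kv_cache_workspace(80 * GIB, 0.9, 24 * GIB)
print(f"{workspace / GIB:.1f} GiB")  # 80 * 0.9 - 24 = 48.0 GiB
```

    Lowering the utilization fraction shrinks the KV cache workspace directly, since the weight size is fixed.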

  • --devices <devices>​

    Where to run the model: on CPU (--devices=cpu), on GPU (--devices=gpu), or on a list of GPUs (--devices=gpu:0,1). An optional ID selects the device to target. If not provided, the model runs on the first available GPU, or on CPU if no GPUs are available.
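
    To make the three value shapes concrete, here is a small parsing sketch. It is not MAX's parser: the function name is hypothetical, and it assumes only the forms shown above (cpu, gpu, and gpu followed by a colon and comma-separated IDs).

```python
from typing import List, Tuple


def parse_devices(spec: str) -> Tuple[str, List[int]]:
    """Parse 'cpu', 'gpu', or 'gpu:0,1' into (kind, device_ids).

    An empty ID list means no explicit device IDs were given.
    """
    kind, sep, ids = spec.partition(":")
    if kind not in ("cpu", "gpu"):
        raise ValueError(f"unsupported device kind: {kind!r}")
    if not sep:
        return kind, []
    return kind, [int(i) for i in ids.split(",")]


print(parse_devices("cpu"))
print(parse_devices("gpu"))
print(parse_devices("gpu:0,1"))
```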

  • --draft-chat-template <draft_chat_template>​

    Optional custom chat template to override the one shipped with the Hugging Face model config. If a path is provided, the file is read during config resolution and the content stored as a string. If None, the model's default chat template is used.

  • --draft-config-file <draft_config_file>​

  • --draft-data-parallel-degree <draft_data_parallel_degree>​

    Data-parallelism degree. How many times the model is replicated depends on the model type.

  • --draft-devices <draft_devices>​

    Devices for the draft model in speculative decoding. If not provided, inherits from --devices. Accepts the same format as --devices.

  • --draft-enable-echo, --no-draft-enable-echo​

    Whether the model should be built with echo capabilities.

  • --draft-force-download, --no-draft-force-download​

    Whether to force a download of a file even if it's already present in the local cache.

  • --draft-huggingface-model-revision <draft_huggingface_model_revision>​

    Branch or Git revision of Hugging Face model repository to use.

  • --draft-huggingface-weight-revision <draft_huggingface_weight_revision>​

    Branch or Git revision of Hugging Face model repository to use.

  • --draft-max-length <draft_max_length>​

    Maximum sequence length the model can process. If not specified, defaults to the model's max_position_embeddings. May be clamped during resolution based on available memory.

  • --draft-model-path <draft_model_path>​

    Accepts either a Hugging Face repository ID or a local path to the model.

  • --draft-pool-embeddings, --no-draft-pool-embeddings​

    Whether to pool embedding outputs.

  • --draft-quantization-encoding <draft_quantization_encoding>​

    Weight encoding type. For GGUF models, the encoding is auto-detected from the repository when unset; if set, it must match an available encoding. When the repository contains multiple quantization formats, set this to choose one.

    Options:

    float32 | bfloat16 | q4_k | q4_0 | q6_k | float8_e4m3fn | float4_e2m1fnx2 | gptq

  • --draft-rope-type <draft_rope_type>​

    Force using a specific rope type. Only matters for GGUF weights.

    Options:

    none | normal | neox | longrope | yarn

  • --draft-section-name <draft_section_name>​

  • --draft-served-model-name <draft_served_model_name>​

    Optional override for client-facing model name. Defaults to model_path.

  • --draft-subfolder <draft_subfolder>​

    Subdirectory within the Hugging Face repo to load config and weights from (for example, vae or text_encoder). When set, config.json and weights are resolved from {model_path}/{subfolder}/.

  • --draft-trust-remote-code, --no-draft-trust-remote-code​

    Whether to allow custom modeling files from Hugging Face.

  • --draft-use-subgraphs, --no-draft-use-subgraphs​

    Whether to use subgraphs for the model. This can significantly reduce compile time, especially for large models with identical blocks. Default is true.

  • --draft-vision-config-overrides <draft_vision_config_overrides>​

    Model-specific vision configuration overrides. For example, for InternVL: {"max_dynamic_patch": 24}.

  • --draft-weight-path <draft_weight_path>​

    Optional path or URL of the model weights to use.

  • --enable-chunked-prefill, --no-enable-chunked-prefill​

    Enable chunked prefill to split context encoding requests into multiple chunks based on max_batch_input_tokens.
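
    The splitting can be pictured with a small sketch. It assumes a single request that gets the whole max_batch_input_tokens budget on every step; a real scheduler shares that budget across requests, so treat the function (a hypothetical helper) as an illustration only.

```python
from typing import List


def chunk_prompt(num_prompt_tokens: int, max_batch_input_tokens: int) -> List[int]:
    """Return the chunk sizes used to encode a prompt of the given length."""
    return [
        min(max_batch_input_tokens, num_prompt_tokens - start)
        for start in range(0, num_prompt_tokens, max_batch_input_tokens)
    ]


# A hypothetical 10,000-token prompt with a 4,096-token budget per batch.
print(chunk_prompt(10_000, 4_096))  # [4096, 4096, 1808]
```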

  • --enable-echo, --no-enable-echo​

    Whether the model should be built with echo capabilities.

  • --enable-in-flight-batching, --no-enable-in-flight-batching​

    When enabled, prioritizes token generation by batching it with context encoding requests.

  • --enable-lora, --no-enable-lora​

    Enables LoRA on the server.

  • --enable-min-tokens, --no-enable-min-tokens​

    Whether to enable min_tokens, which blocks the model from generating stopping tokens before the min_tokens count is reached.

  • --enable-overlap-scheduler, --no-enable-overlap-scheduler​

    Whether to enable the overlap scheduler, which lets the scheduler run alongside GPU execution to improve GPU utilization. This feature is experimental and may be unstable. It is enabled by default for some architectures; to forcibly disable it, pass --no-enable-overlap-scheduler --force.

  • --enable-penalties, --no-enable-penalties​

    Whether to apply frequency and presence penalties to the model's output.

  • --enable-prefix-caching, --no-enable-prefix-caching​

    Whether to enable prefix caching for the paged KVCache.

  • --enable-prioritize-first-decode, --no-enable-prioritize-first-decode​

    When enabled, the scheduler always runs a token generation (TG) batch immediately after a context encoding (CE) batch with the same requests. This may reduce time-to-first-chunk latency. Experimental for the TTS scheduler.

  • --enable-structured-output, --no-enable-structured-output​

    Enable structured generation/guided decoding for the server. This allows the user to pass a JSON schema in the response_format field, which the LLM will adhere to.

  • --enable-variable-logits, --no-enable-variable-logits​

    Enable the sampling graph to accept a ragged tensor of different sequences as inputs, along with their associated logit_offsets. This is needed to produce additional logits for echo and speculative decoding purposes.

  • --ep-size <ep_size>​

    The expert parallelism size. Must be either 1 (no expert parallelism) or the total number of GPUs across nodes.

  • --ep-use-allreduce, --no-ep-use-allreduce​

    Whether to use allreduce for the cross-device communication in expert parallelism.

  • --execute-empty-batches, --no-execute-empty-batches​

    Whether the scheduler should execute empty batches.

  • --first-block-caching, --no-first-block-caching​

    Enable First-Block Cache (FBCache) for step-cache denoising. When enabled, the transformer skips remaining blocks if the first-block residual is similar to the previous step.

  • --force, --no-force​

    Skip validation of user provided flags against the architecture's required arguments.

  • --force-download, --no-force-download​

    Whether to force a download of a file even if it's already present in the local cache.

  • --gpu-profiling <gpu_profiling>​

    Whether to enable GPU profiling of the model.

    Options:

    off | on | detailed

  • --huggingface-model-revision <huggingface_model_revision>​

    Branch or Git revision of Hugging Face model repository to use.

  • --huggingface-weight-revision <huggingface_weight_revision>​

    Branch or Git revision of Hugging Face model repository to use.

  • --kv-cache-format <kv_cache_format>​

    Override the default data type for the KV cache. Supported values: float32, bfloat16, float8_e4m3fn.

  • --kv-cache-page-size <kv_cache_page_size>​

    The number of tokens in a single page in the paged KVCache.

  • --kv-connector <kv_connector>​

    Type of KV cache connector to use. When not set, defaults to null (no external caching).

    Options:

    KVConnectorType.null | KVConnectorType.local | KVConnectorType.tiered | KVConnectorType.lmcache | KVConnectorType.dkv

  • --kv-connector-config <kv_connector_config>​

    Connector-specific configuration overrides as inline JSON or path to a YAML/JSON file. Each connector type has sensible defaults, so this is only needed for customization.

  • --kvcache-ce-watermark <kvcache_ce_watermark>​

    Projected cache usage threshold for scheduling CE requests, considering current and incoming requests. CE is scheduled if either projected usage stays below this threshold or no active requests exist. Higher values can cause more preemptions.
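
    The decision rule can be sketched as follows. The function and its page-count parameters are hypothetical: this assumes usage is measured as a fraction of KV cache pages and that "stays below" means a strict comparison, neither of which is confirmed here.

```python
def should_schedule_ce(
    used_pages: int,
    incoming_pages: int,
    total_pages: int,
    kvcache_ce_watermark: float,
    num_active_requests: int,
) -> bool:
    """Decide whether a context encoding (CE) request may be scheduled."""
    if num_active_requests == 0:
        # With nothing running, CE is always scheduled.
        return True
    projected_usage = (used_pages + incoming_pages) / total_pages
    return projected_usage < kvcache_ce_watermark


# Hypothetical cache of 100 pages with a 0.95 watermark.
print(should_schedule_ce(60, 25, 100, 0.95, num_active_requests=4))  # True
print(should_schedule_ce(60, 40, 100, 0.95, num_active_requests=4))  # False
print(should_schedule_ce(60, 40, 100, 0.95, num_active_requests=0))  # True
```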

  • --lora-paths <lora_paths>​

    List of statically defined LoRA paths.

  • --max-batch-input-tokens <max_batch_input_tokens>​

    The target number of un-encoded tokens to include in each batch. This value is used for chunked prefill and memory estimation.

  • --max-batch-size <max_batch_size>​

    Maximum batch size to execute with the model. When not specified (None), this value is determined dynamically. For server launches, set this higher based on server capacity.

  • --max-batch-total-tokens <max_batch_total_tokens>​

    Ensures the sum of page-aligned context lengths in a batch does not exceed max_batch_total_tokens. Alignment uses the KV cache page size. If None, the sum is not limited.
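
    A short sketch of the check, assuming "page-aligned" means each context length is rounded up to the next multiple of the KV cache page size. The function names and the numbers are hypothetical.

```python
from typing import Optional, Sequence


def page_aligned(length: int, page_size: int) -> int:
    """Round a context length up to a whole number of pages."""
    return -(-length // page_size) * page_size


def batch_fits(
    context_lengths: Sequence[int],
    page_size: int,
    max_batch_total_tokens: Optional[int],
) -> bool:
    """Check the batch against max_batch_total_tokens (None means unlimited)."""
    if max_batch_total_tokens is None:
        return True
    total = sum(page_aligned(n, page_size) for n in context_lengths)
    return total <= max_batch_total_tokens


# Raw lengths sum to 487, but the aligned sum is 128 + 256 + 384 = 768.
lengths = [100, 130, 257]
print(batch_fits(lengths, 128, 768))   # True
print(batch_fits(lengths, 128, 767))   # False
print(batch_fits(lengths, 128, None))  # True
```

    The gap between the raw sum and the aligned sum is why a batch can hit the limit earlier than its token count alone suggests.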

  • --max-length <max_length>​

    Maximum sequence length the model can process. If not specified, defaults to the model's max_position_embeddings. May be clamped during resolution based on available memory.

  • --max-lora-rank <max_lora_rank>​

    Maximum rank of all possible LoRAs.

  • --max-num-loras <max_num_loras>​

    The maximum number of active LoRAs in a batch. This controls how many LoRA adapters can be active simultaneously during inference. Lower values reduce memory usage but limit concurrent adapter usage.

  • --max-num-steps <max_num_steps>​

    The number of steps to run for multi-step scheduling. -1 specifies a default value based on configuration and platform. Ignored for models which are not auto-regressive (for example, embedding models).

  • --max-queue-size-tg <max_queue_size_tg>​

    Maximum number of requests in the decode queue. By default, this is max_batch_size.

  • --max-vision-cache-entries <max_vision_cache_entries>​

    Maximum number of images cached in the vision encoder cache. Each entry stores the vision encoder output for one image, avoiding re-encoding across chunks and requests. Set to 0 to disable caching. Only used by VLMs.

  • --min-batch-size-tg <min_batch_size_tg>​

    Soft floor on the decode batch size. If the TG batch size is larger, the scheduler continues TG batches; if it falls below, the scheduler prioritizes CE. This is not a strict minimum. By default, this is max_queue_size_tg. Experimental for the TTS scheduler.

  • --model, --model-path <model_path>​

    Accepts either a Hugging Face repository ID or a local path to the model.

  • --model-override <model_override>​

    Per-component overrides for the ModelManifest, in the format component.field=value. Applied before resolution. Repeatable. Example: transformer.quantization_encoding=float4_e2m1fnx2.
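
    For illustration, the component.field=value format can be taken apart as below. This is a sketch with a hypothetical function name; it assumes the first "=" separates the key from the value and the first "." separates the component from the field, which may not match MAX's own parser.

```python
from typing import Tuple


def parse_model_override(text: str) -> Tuple[str, str, str]:
    """Split 'component.field=value' into (component, field, value)."""
    key, eq, value = text.partition("=")
    component, dot, field = key.partition(".")
    if not (eq and dot and component and field):
        raise ValueError(f"expected component.field=value, got {text!r}")
    return component, field, value


# The example value from the option description.
print(parse_model_override("transformer.quantization_encoding=float4_e2m1fnx2"))
```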

  • --models <models>​

    The model manifest containing all model configs keyed by role.

  • --num-speculative-tokens <num_speculative_tokens>​

    The number of speculative tokens.

  • --pipeline-role <pipeline_role>​

    Whether the pipeline serves the prefill role, the decode role, or both.

    Options:

    prefill_and_decode | prefill_only | decode_only

  • --pool-embeddings, --no-pool-embeddings​

    Whether to pool embedding outputs.

  • --prefer-module-v3, --no-prefer-module-v3​

    Whether to prefer the eager API architecture over the graph API architecture. When False (default), the inference server uses the graph API architecture. When True, the server uses the eager API architecture when available and falls back to the graph API architecture.

  • --quantization-encoding <quantization_encoding>​

    Weight encoding type. For GGUF models, the encoding is auto-detected from the repository when unset; if set, it must match an available encoding. When the repository contains multiple quantization formats, set this to choose one.

    Options:

    float32 | bfloat16 | q4_k | q4_0 | q6_k | float8_e4m3fn | float4_e2m1fnx2 | gptq

  • --reasoning-parser <reasoning_parser>​

    Name of the reasoning output parser. The parser extracts thinking blocks to populate the reasoning field in chat completion responses.

  • --rejection-sampling-strategy <rejection_sampling_strategy>​

    Rejection sampling strategy for verifying draft tokens. Defaults to typical-acceptance for eagle/mtp and residual for standalone.

    Options:

    greedy | residual | typical-acceptance | logit-comparison

  • --relaxed-delta <relaxed_delta>​

    Probability gap below the top-1 candidate inside which candidates remain eligible for relaxed acceptance. A draft token is accepted if it matches any top-N candidate whose probability is at least top1_prob - relaxed_delta. Ignored when use_relaxed_acceptance_for_thinking is False.

  • --relaxed-topk <relaxed_topk>​

    Top-N candidates from the target distribution to consider when relaxed acceptance is active. Ignored when use_relaxed_acceptance_for_thinking is False.

  • --rope-type <rope_type>​

    Force using a specific rope type. Only matters for GGUF weights.

    Options:

    none | normal | neox | longrope | yarn

  • --section-name <section_name>​

  • --served-model-name <served_model_name>​

    Optional override for client-facing model name. Defaults to model_path.

  • --speculative-method <speculative_method>​

    The speculative decoding method to use.

    Options:

    standalone | eagle | mtp

  • --subfolder <subfolder>​

    Subdirectory within the Hugging Face repo to load config and weights from (for example, vae or text_encoder). When set, config.json and weights are resolved from {model_path}/{subfolder}/.

  • --synthetic-acceptance-rate <synthetic_acceptance_rate>​

    Synthetic acceptance rate for benchmarking (0.0 to 1.0). When set, the rejection sampler bypasses the real draft/target comparison and accepts each draft position with a calibrated probability so the mean joint acceptance across num_speculative_tokens positions matches this value.

  • --target <target>​

    Target API and architecture to compile for (e.g., cuda, cuda:sm_90, hip:gfx942). When specified, uses virtual devices for compilation without requiring physical hardware.

  • --taylorseer, --no-taylorseer​

    Enable TaylorSeer cache optimization. Uses Taylor series prediction to skip full transformer passes on certain denoising steps.

  • --taylorseer-cache-interval <taylorseer_cache_interval>​

    Steps between full TaylorSeer computations. None uses the model-specific default (typically 5).

  • --taylorseer-max-order <taylorseer_max_order>​

    Taylor expansion order (1 or 2). Higher order uses second derivatives for more accurate prediction. None uses the model-specific default (typically 1).

  • --taylorseer-warmup-steps <taylorseer_warmup_steps>​

    Number of warmup steps before TaylorSeer prediction begins. None uses the model-specific default (typically 4).

  • --teacache, --no-teacache​

    Enable TeaCache cache optimization. Uses the timestep-aware modulated input change to decide when the FLUX.2 transformer backbone can be skipped.

  • --teacache-coefficients <teacache_coefficients>​

    Polynomial coefficients used to rescale TeaCache's relative-L1 metric. None uses the model-specific default coefficients.

  • --teacache-rel-l1-thresh <teacache_rel_l1_thresh>​

    Relative-L1 threshold used by TeaCache. None uses the model-specific default.
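
    How the coefficients and the threshold could interact is sketched below. This follows the general TeaCache idea (accumulate a rescaled relative-L1 change and skip while it stays under the threshold) rather than MAX's implementation. The class name, the highest-degree-first coefficient order, and the example numbers are all assumptions.

```python
from typing import List, Optional, Sequence


def rescale(coefficients: Sequence[float], x: float) -> float:
    """Evaluate a polynomial at x (assumed order: highest degree first)."""
    result = 0.0
    for c in coefficients:
        result = result * x + c
    return result


def relative_l1(current: Sequence[float], previous: Sequence[float]) -> float:
    """Relative L1 change between two modulated inputs."""
    change = sum(abs(c - p) for c, p in zip(current, previous))
    return change / sum(abs(p) for p in previous)


class TeaCacheGate:
    """Decides, per denoising step, whether the backbone could be skipped."""

    def __init__(self, coefficients: Sequence[float], rel_l1_thresh: float):
        self.coefficients = list(coefficients)
        self.rel_l1_thresh = rel_l1_thresh
        self.accumulated = 0.0
        self.previous: Optional[List[float]] = None

    def should_skip(self, modulated_input: Sequence[float]) -> bool:
        if self.previous is None:
            skip = False  # The first step always runs the full backbone.
        else:
            delta = relative_l1(modulated_input, self.previous)
            self.accumulated += rescale(self.coefficients, delta)
            skip = self.accumulated < self.rel_l1_thresh
        if not skip:
            self.accumulated = 0.0  # A full pass resets the accumulator.
        self.previous = list(modulated_input)
        return skip


# Hypothetical settings: identity rescaling (1*x + 0) and a 0.1 threshold.
gate = TeaCacheGate(coefficients=[1.0, 0.0], rel_l1_thresh=0.1)
steps = [[1.0, 1.0], [1.02, 1.0], [1.5, 1.0]]
decisions = [gate.should_skip(s) for s in steps]
print(decisions)  # [False, True, False]
```

    In this sketch, a larger threshold skips more steps, trading accuracy for speed.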

  • --trust-remote-code, --no-trust-remote-code​

    Whether to allow custom modeling files from Hugging Face.

  • --use-experimental-kernels <use_experimental_kernels>​

    Enables using experimental Mojo kernels with max serve. The kernels could be unstable or incorrect.

  • --use-relaxed-acceptance-for-thinking, --no-use-relaxed-acceptance-for-thinking​

    Enables relaxed acceptance for speculative decoding draft positions inside a <think>...</think> block. The target's top-N candidates (filtered by a probability threshold top1_prob - relaxed_delta) are compared against the draft token; matching any candidate accepts the draft. Outside the thinking span, the existing strict acceptance rule still applies.
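
    The acceptance rule inside a thinking block can be sketched in a few lines. This is an illustration under stated assumptions, not MAX's sampler: the function name is hypothetical, probabilities are passed as a plain token-to-probability mapping, and relaxed_topk and relaxed_delta stand for the values of --relaxed-topk and --relaxed-delta.

```python
from typing import Dict


def relaxed_accept(
    draft_token: int,
    target_probs: Dict[int, float],
    relaxed_topk: int,
    relaxed_delta: float,
) -> bool:
    """Accept the draft token if it matches an eligible top-N target candidate."""
    ranked = sorted(target_probs.items(), key=lambda kv: kv[1], reverse=True)
    top_n = ranked[:relaxed_topk]
    top1_prob = top_n[0][1]
    # Keep only candidates within relaxed_delta of the top-1 probability.
    eligible = {tok for tok, p in top_n if p >= top1_prob - relaxed_delta}
    return draft_token in eligible


# Hypothetical target distribution over four token IDs.
probs = {10: 0.5, 11: 0.3, 12: 0.15, 13: 0.05}
print(relaxed_accept(11, probs, relaxed_topk=3, relaxed_delta=0.25))  # True
print(relaxed_accept(12, probs, relaxed_topk=3, relaxed_delta=0.25))  # False
```

    Token 12 is in the top 3 but falls below the 0.5 - 0.25 threshold, so only tokens 10 and 11 are eligible.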

  • --use-subgraphs, --no-use-subgraphs​

    Whether to use subgraphs for the model. This can significantly reduce compile time, especially for large models with identical blocks. Default is true.

  • --use-vendor-blas <use_vendor_blas>​

    Enables using vendor BLAS libraries (cublas, hipblas, etc.) with max serve. Currently, this just replaces matmul calls.

  • --use-vendor-ccl <use_vendor_ccl>​

    Enables using vendor CCL libraries (NCCL/RCCL) for collective operations such as allreduce in multi-GPU inference.

  • --vision-config-overrides <vision_config_overrides>​

    Model-specific vision configuration overrides. For example, for InternVL: {"max_dynamic_patch": 24}.

  • --weight-path <weight_path>​

    Optional path or URL of the model weights to use.

  • --zmq-endpoint-base <zmq_endpoint_base>​

    Prefix for ZMQ endpoints used for IPC. This ensures unique endpoints across MAX Serve instances on the same host. Example: lora_request_zmq_endpoint = f"{zmq_endpoint_base}-lora_request".
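
    The naming pattern in the example above means two instances with different prefixes never share an endpoint. A small sketch, in which the helper name and the prefix values are hypothetical and only the "-lora_request" suffix comes from the example:

```python
def lora_request_endpoint(zmq_endpoint_base: str) -> str:
    """Derive the LoRA request endpoint from the shared prefix."""
    return f"{zmq_endpoint_base}-lora_request"


# Hypothetical prefixes for two MAX Serve instances on one host.
instance_a = lora_request_endpoint("ipc:///tmp/max-a")
instance_b = lora_request_endpoint("ipc:///tmp/max-b")

print(instance_a)  # ipc:///tmp/max-a-lora_request
print(instance_b)  # ipc:///tmp/max-b-lora_request
```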