max warm-cache

Preloads and compiles the model to reduce initialization time by:

  • Pre-compiling models before deployment
  • Warming up the Hugging Face cache

This command is useful to run before serving a model.

For example:

max warm-cache \
  --model google/gemma-3-12b-it
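
For example, a typical workflow is to warm the cache and then serve the same model. The serving step below is a sketch that assumes the companion max serve command accepts the same --model flag:

max warm-cache \
  --model google/gemma-3-12b-it

max serve \
  --model google/gemma-3-12b-it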

Usage

max warm-cache [OPTIONS]

Options

  • --allow-safetensors-weights-fp32-bf16-bidirectional-cast, --no-allow-safetensors-weights-fp32-bf16-bidirectional-cast

    Specify whether to allow automatic bidirectional casting between float32 and bfloat16 safetensors weights, if needed. Currently only supported in Llama3 models.

  • --cache-strategy <cache_strategy>

    Force a specific cache strategy: 'paged' or 'continuous'. If not provided, the optimal caching strategy for the requested model is selected.

    Options:

    KVCacheStrategy.MODEL_DEFAULT | KVCacheStrategy.PAGED

  • --ce-delay-ms <ce_delay_ms>

    Duration, in milliseconds, of the scheduler sleep prior to starting a prefill batch. This is an experimental flag solely for the TTS scheduler. Default is 0.0.

  • --chat-template <chat_template>

  • --custom-architectures <custom_architectures>

    A list of custom architecture implementations to register. Each input can either be a raw module name or an import path followed by a colon and the module name.

  • --data-parallel-degree <data_parallel_degree>

  • --device-memory-utilization <device_memory_utilization>

    The fraction of available device memory that the process should consume. This is used to size the KVCache workspace: kv_cache_workspace = (total_free_memory * device_memory_utilization) - model_weights_size. Default is 0.9. See the worked sizing example after this options list.

  • --devices <devices>

    Whether to run the model on CPU (--devices=cpu), GPU (--devices=gpu), or a list of GPUs (--devices=gpu:0,1), etc. A device ID can optionally be provided to indicate the device to target. If not provided, the model runs on the first available GPU (--devices=gpu), or on CPU if no GPUs are available (--devices=cpu).

  • --draft-allow-safetensors-weights-fp32-bf16-bidirectional-cast, --no-draft-allow-safetensors-weights-fp32-bf16-bidirectional-cast

    Specify whether to allow automatic bidirectional casting between float32 and bfloat16 safetensors weights, if needed. Currently only supported in Llama3 models.

  • --draft-data-parallel-degree <draft_data_parallel_degree>

  • --draft-devices <draft_devices>

    Whether to run the model on CPU (--devices=cpu), GPU (--devices=gpu), or a list of GPUs (--devices=gpu:0,1), etc. A device ID can optionally be provided to indicate the device to target. If not provided, the model runs on the first available GPU (--devices=gpu), or on CPU if no GPUs are available (--devices=cpu).

  • --draft-force-download, --no-draft-force-download

    Specify whether to forcefully download a file even if it already exists in local cache. Set this to true if you want to ensure you have the latest version.

  • --draft-huggingface-model-revision <draft_huggingface_model_revision>

    Branch or Git revision of Hugging Face model repository to use.

  • --draft-huggingface-weight-revision <draft_huggingface_weight_revision>

    Branch or Git revision of Hugging Face weight repository to use.

  • --draft-model <draft_model>

    Specify the repository ID of a Hugging Face model repository to use. This is used to load the tokenizer, model architecture, and model weights. See the speculative decoding example after this options list.

  • --draft-model-path <draft_model_path>

    Specify the repository ID of a Hugging Face model repository to use. This is used to load the tokenizer, model architecture, and model weights. Equivalent to the --model flag.

  • --draft-quantization-encoding <draft_quantization_encoding>

    Define the weight encoding type for quantization. This can help optimize performance and memory usage during inference, e.g. q4_k, bfloat16.

    Options:

    SupportedEncoding.float32 | SupportedEncoding.bfloat16 | SupportedEncoding.q4_k | SupportedEncoding.q4_0 | SupportedEncoding.q6_k | SupportedEncoding.float8_e4m3fn | SupportedEncoding.gptq

  • --draft-rope-type <draft_rope_type>

    Force using a specific rope type: 'none' | 'normal' | 'neox'. Only matters for GGUF weights.

    Options:

    RopeType.none | RopeType.normal | RopeType.neox | RopeType.longrope | RopeType.yarn

  • --draft-served-model-name <draft_served_model_name>

    Optional override for client-facing model name. Defaults to model_path.

  • --draft-trust-remote-code, --no-draft-trust-remote-code

    Indicate whether to allow custom modelling files from Hugging Face repositories. Set this to true with caution, as it may introduce security risks.

  • --draft-use-subgraphs, --no-draft-use-subgraphs

    Whether to use subgraphs for the model. This could significantly reduce compile time especially for a large model with several identical blocks. Default is true.

  • --draft-vision-config-overrides <draft_vision_config_overrides>

    Model-specific vision configuration overrides. For example, for InternVL: {'max_dynamic_patch': 24}.

  • --draft-weight-path <draft_weight_path>

    Provide an optional local path, or a path relative to the root of a Hugging Face repo, to the model weights you want to use. This allows you to specify custom weights instead of using the defaults. You may pass this flag multiple times, e.g. --weight-path=model-00001-of-00002.safetensors --weight-path=model-00002-of-00002.safetensors.

  • --enable-chunked-prefill, --no-enable-chunked-prefill

    Enable chunked prefill to split context encoding requests into multiple chunks based on prefill-chunk-size. Default is true.

  • --enable-echo, --no-enable-echo

    Whether the model should be built with echo capabilities. This defaults to false.

  • --enable-in-flight-batching, --no-enable-in-flight-batching

    When enabled, prioritizes token generation by batching it with context encoding requests. Default is false.

  • --enable-kvcache-swapping-to-host, --no-enable-kvcache-swapping-to-host

    Whether to enable swapping the paged attention KVCache blocks to host memory when device blocks are evicted. This defaults to false.

  • --enable-lora, --no-enable-lora

    Enables LoRA on the server.

  • --enable-min-tokens, --no-enable-min-tokens

    Whether to enable min_tokens, which blocks the model from generating stopping tokens before the min_tokens count is reached. This defaults to false.

  • --enable-penalties, --no-enable-penalties

    Whether to apply frequency and presence penalties to the model’s output. This defaults to false.

  • --enable-prefix-caching, --no-enable-prefix-caching

    Whether to enable prefix caching for the paged attention KVCache. Prefix caching is enabled by default for supported models.

  • --enable-prioritize-first-decode, --no-enable-prioritize-first-decode

    When enabled, the scheduler will always run a TG batch immediately after a CE batch, with the same requests. This may be useful for decreasing time-to-first-chunk latency. This is an experimental flag solely for the TTS scheduler. Default is false.

  • --enable-structured-output, --no-enable-structured-output

    Whether to enable constrained decoding in the text generation pipeline. This defaults to false.

  • --enable-variable-logits, --no-enable-variable-logits

    Whether to enable the sampling graph to accept a ragged tensor of different sequences as inputs, along with their associated logit_offsets. This is needed to produce additional logits for echo and speculative decoding purposes. This defaults to false.

  • --ep-size <ep_size>

  • --execute-empty-batches, --no-execute-empty-batches

  • --experimental-background-queue, --no-experimental-background-queue

    When enabled, offloads queue draining to a background thread for improved performance. This is an experimental flag. Default is false.

  • --force, --no-force

  • --force-download, --no-force-download

    Specify whether to forcefully download a file even if it already exists in local cache. Set this to true if you want to ensure you have the latest version.

  • --gpu-profiling <gpu_profiling>

    Whether to turn on GPU profiling for the model. This defaults to ‘off’.

    Options:

    GPUProfilingMode.OFF | GPUProfilingMode.ON | GPUProfilingMode.DETAILED

  • --host-kvcache-swap-space-gb <host_kvcache_swap_space_gb>

    The amount of host memory to use for the host KVCache in GiB. This is only used when kvcache_swapping_to_host is enabled. Default is set to 50.0.

  • --huggingface-model-revision <huggingface_model_revision>

    Branch or Git revision of Hugging Face model repository to use.

  • --huggingface-weight-revision <huggingface_weight_revision>

    Branch or Git revision of Hugging Face weight repository to use.

  • --kv-cache-page-size <kv_cache_page_size>

    The number of tokens in a single page in the paged KVCache. Default is set to 128.

  • --lora-paths <lora_paths>

    List of paths to the LoRAs.

  • --max-batch-context-length <max_batch_context_length>

    Ensures that the sum of the context length in a batch does not exceed max_batch_context_length. If None, the sum of the context length in batch is not limited.

  • --max-batch-size <max_batch_size>

    Define the maximum batch size to execute with the model. When not specified (None), we determine this value dynamically. For users launching in a server scenario, the expectation is that this value should be set higher based on server capacity.

  • --max-ce-batch-size <max_ce_batch_size>

    Set the maximum cache size reserved for a single context encoding batch. The effective limit will be the lesser of this value and max-batch-size. Default is 192.

  • --max-length <max_length>

    Set the maximum sequence length for input data processed by the model. This must be less than the value specified in the Hugging Face configuration file. The default is derived from the Hugging Face configuration value. Larger values may consume more memory.

  • --max-lora-rank <max_lora_rank>

    The maximum rank of all possible LoRAs. Typically 8 or 16. Default is 16.

  • --max-num-loras <max_num_loras>

    The maximum number of active LoRAs in a batch. Default is 1.

  • --max-num-steps <max_num_steps>

    Specify the number of steps to run for multi-step scheduling during inference. Default is -1 which specifies a default value based on configuration and platform. Ignored for models which are not auto-regressive (e.g. embedding models).

  • --max-queue-size-tg <max_queue_size_tg>

    Maximum number of requests in decode queue. By default, this is max-batch-size.

  • --min-batch-size-tg <min_batch_size_tg>

    Specifies a soft floor on the decode batch size. If the TG batch size is larger than this value, the scheduler will continue to run TG batches. If it falls below, the scheduler will prioritize CE. This is an experimental flag solely for the TTS scheduler.

  • --model <model>

    Specify the repository ID of a Hugging Face model repository to use. This is used to load the tokenizer, model architecture, and model weights.

  • --model-path <model_path>

    Specify the repository ID of a Hugging Face model repository to use. This is used to load the tokenizer, model architecture, and model weights. Equivalent to the --model flag.

  • --pipeline-role <pipeline_role>

    Whether the pipeline should serve a prefill role, a decode role, or both.

    Options:

    PipelineRole.PrefillAndDecode | PipelineRole.PrefillOnly | PipelineRole.DecodeOnly

  • --pool-embeddings, --no-pool-embeddings

    Whether to pool embedding outputs. Default is true.

  • --prefill-chunk-size <prefill_chunk_size>

    The target number of un-encoded tokens to include in each batch. This value is used for chunked prefill and memory estimation. Default is 8192.

  • --quantization-encoding <quantization_encoding>

    Define the weight encoding type for quantization. This can help optimize performance and memory usage during inference, e.g. q4_k, bfloat16.

    Options:

    SupportedEncoding.float32 | SupportedEncoding.bfloat16 | SupportedEncoding.q4_k | SupportedEncoding.q4_0 | SupportedEncoding.q6_k | SupportedEncoding.float8_e4m3fn | SupportedEncoding.gptq

  • --rope-type <rope_type>

    Force using a specific rope type: 'none' | 'normal' | 'neox'. Only matters for GGUF weights.

    Options:

    RopeType.none | RopeType.normal | RopeType.neox | RopeType.longrope | RopeType.yarn

  • --served-model-name <served_model_name>

    Optional override for client-facing model name. Defaults to model_path.

  • --trust-remote-code, --no-trust-remote-code

    Indicate whether to allow custom modelling files from Hugging Face repositories. Set this to true with caution, as it may introduce security risks.

  • --use-experimental-kernels <use_experimental_kernels>

    Whether to use experimental kernels. Default is false.

  • --use-subgraphs, --no-use-subgraphs

    Whether to use subgraphs for the model. This could significantly reduce compile time especially for a large model with several identical blocks. Default is true.

  • --use-vendor-blas <use_vendor_blas>

  • --vision-config-overrides <vision_config_overrides>

    Model-specific vision configuration overrides. For example, for InternVL: {'max_dynamic_patch': 24}.

  • --weight-path <weight_path>

    Provide an optional local path, or a path relative to the root of a Hugging Face repo, to the model weights you want to use. This allows you to specify custom weights instead of using the defaults. You may pass this flag multiple times, e.g. --weight-path=model-00001-of-00002.safetensors --weight-path=model-00002-of-00002.safetensors.

  • --zmq-endpoint-base <zmq_endpoint_base>
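
As a rough, worked illustration of the --device-memory-utilization sizing formula above (all numbers are hypothetical): with 80 GiB of free device memory, a utilization of 0.9, and roughly 24 GiB of model weights, the KVCache workspace would be about (80 * 0.9) - 24 = 48 GiB.

# Hypothetical sizing: 80 GiB free device memory, 0.9 utilization, ~24 GiB of weights
# kv_cache_workspace = (80 GiB * 0.9) - 24 GiB = 48 GiB
max warm-cache \
  --model google/gemma-3-12b-it \
  --device-memory-utilization 0.9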
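
As a sketch of warming the cache for a speculative decoding setup, combining the --model, --draft-model, --devices, and --quantization-encoding flags documented above (the model names and device IDs are illustrative placeholders, not recommendations):

max warm-cache \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --draft-model meta-llama/Llama-3.2-1B-Instruct \
  --devices=gpu:0,1 \
  --quantization-encoding bfloat16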