Python module
null_cache_manager
Null KV cache manager for compile-only mode.
This module provides a no-op KV cache manager that is used during compile-only mode when running with virtual devices. It avoids GPU memory allocation while still providing the necessary interface for graph construction.
NullKVCacheManager
class max.kv_cache.null_cache_manager.NullKVCacheManager(params, max_batch_size, max_seq_len, num_layers, devices, session, available_cache_memory, page_size=128)
A no-op KV cache manager for compile-only mode.
This manager is used when compiling models with virtual devices and does not allocate any GPU memory. It provides dummy implementations of the KV cache interface to allow graph construction and compilation without requiring physical GPU hardware or actual memory allocation.
This is particularly useful for cross-compilation scenarios where you want to compile models for GPU execution on a machine without a physical GPU present.
Initializes the null KV cache manager.
Parameters:
- params (KVCacheParams) – The KV cache parameters for the pipeline.
- max_batch_size (int) – The maximum batch size to support.
- max_seq_len (int) – The maximum sequence length to support.
- num_layers (int) – The number of transformer layers in the model.
- devices (Sequence[Device]) – The list of virtual devices.
- session (InferenceSession) – The inference session for graph operations.
- available_cache_memory (int) – The nominal available cache memory in bytes.
- page_size (int) – The page size in tokens. Defaults to 128.
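The null-object idea behind this class can be illustrated without MAX installed. The sketch below is not the real NullKVCacheManager: `NullManagerSketch` is a hypothetical stand-in that accepts the same constructor argument names, records them, and implements the methods this page documents as no-ops (`claim`, `release`, `step`, `reset_prefix_cache`). The placeholder arguments (`None`, empty lists, string request IDs) are assumptions made only so the example runs on its own.

```python
from typing import Any, Optional, Sequence


class NullManagerSketch:
    """Hypothetical stand-in for a no-op cache manager; not the MAX class."""

    def __init__(
        self,
        params: Any,
        max_batch_size: int,
        max_seq_len: int,
        num_layers: int,
        devices: Sequence[Any],
        session: Any,
        available_cache_memory: int,
        page_size: int = 128,
    ) -> None:
        # Only record the configuration; nothing is allocated on any device.
        self.params = params
        self.max_batch_size = max_batch_size
        self.max_seq_len = max_seq_len
        self.num_layers = num_layers
        self.devices = list(devices)
        self.session = session
        self.available_cache_memory = available_cache_memory
        self.page_size = page_size

    def claim(self, request_id: str, replica_idx: Optional[int] = None) -> None:
        """No-op: there are no blocks to claim."""

    def release(self, request_id: str) -> None:
        """No-op: there are no blocks to release."""

    def step(self, batch: Sequence[Any]) -> None:
        """No-op: there is no cache state to advance."""

    def reset_prefix_cache(self) -> None:
        """No-op: there is no prefix cache to reset."""


# Calling code can drive the usual request lifecycle without any real cache.
manager = NullManagerSketch(
    params=None,
    max_batch_size=4,
    max_seq_len=2048,
    num_layers=32,
    devices=[],
    session=None,
    available_cache_memory=0,
)
manager.claim("req-0")
manager.step(["req-0"])
manager.release("req-0")
manager.reset_prefix_cache()
```

The point of the pattern is that the caller does not branch on "compile-only" mode; it calls the same interface and the null implementation absorbs the calls.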
alloc()
alloc(data, num_steps=1)
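Allocates blocks for a request to run for num_steps steps.
Parameters:
- data (TextGenerationContext) – The request context to allocate blocks for.
- num_steps (int) – The number of steps to allocate for. Defaults to 1.
Return type:
None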
claim()
claim(request_id, replica_idx=None)
Externally claim cache blocks (no-op for null manager).
contains()
contains(request_id)
Check if a request is in the cache.
estimated_memory_size()
classmethod estimated_memory_size(params, max_batch_size, max_seq_len, num_layers, available_cache_memory, devices, **kwargs)
Estimate memory size (returns 0 for null manager).
Parameters:
- params (KVCacheParams) – KV cache parameters
- max_batch_size (int) – Maximum batch size
- max_seq_len (int) – Maximum sequence length
- num_layers (int) – Number of layers
- available_cache_memory (int) – Available cache memory
- devices (Sequence[Device]) – List of devices
- **kwargs (Any) – Additional arguments
Returns:
Always returns 0 (no memory used)
Return type:
int
free_blocks_pct
property free_blocks_pct: float
Get percentage of free blocks.
Returns:
Always returns 1.0 (100%)
get_data_parallel_splits()
get_data_parallel_splits(batch)
Get data parallel splits for a batch.
Parameters:
batch (Sequence[TextGenerationContext]) – Batch of contexts
Returns:
Single split containing all batch indices
Return type:
get_or_recommend_replica()
get_or_recommend_replica(context)
Gets or recommends a replica index for a request context.
Parameters:
context (TextGenerationContext) – The text generation context containing the request.
Returns:
Always returns 0, as the null cache manager operates in single-replica mode.
Return type:
int
get_replica()
get_replica(context)
Gets the replica index for a request context.
Parameters:
context (TextGenerationContext) – The text generation context containing the request.
Returns:
Always returns 0, as the null cache manager operates in single-replica mode.
Return type:
int
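To see what single-replica mode means for calling code, consider a caller that buckets requests by replica index. The sketch below is an assumption-laden illustration: `get_replica_sketch` is a hypothetical stand-in that mirrors the documented "always returns 0" behaviour, `group_by_replica` is a hypothetical helper, and plain strings take the place of real TextGenerationContext objects.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence


def get_replica_sketch(context: Any) -> int:
    """Hypothetical stand-in for the documented behaviour: always replica 0."""
    return 0


def group_by_replica(batch: Sequence[Any]) -> Dict[int, List[Any]]:
    """Bucket request contexts by the replica index assigned to each one."""
    groups: Dict[int, List[Any]] = defaultdict(list)
    for context in batch:
        groups[get_replica_sketch(context)].append(context)
    return dict(groups)


# Placeholder strings instead of real request contexts.
batch = ["req-a", "req-b", "req-c"]
groups = group_by_replica(batch)
```

With a constant replica index, every request lands in the same bucket, so data-parallel routing logic degenerates to a single group.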
get_req_blocks()
get_req_blocks(request_id)
Get blocks for a request.
get_runtime_inputs()
get_runtime_inputs(batch, num_steps=1)
Fetch KV cache blocks (returns dummy tensors).
Parameters:
- batch (Sequence[TextGenerationContext]) – Batch of contexts
- num_steps (int) – Number of steps to fetch
Returns:
List containing a single RaggedKVCacheInputs with dummy tensors
Return type:
get_symbolic_inputs()
get_symbolic_inputs(devices=None, num_layers=None)
Get input symbols for graph construction.
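Parameters:
- devices – Optional; defaults to None.
- num_layers – Optional; defaults to None.
Returns:
Sequence of PagedCacheInputSymbols for graph construction
Return type:
Sequence[PagedCacheInputSymbols]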
host_committed_block_pct
property host_committed_block_pct: float
Get percentage of host committed blocks.
Returns:
Always returns 0.0 (0%)
increment_cache_lengths()
increment_cache_lengths(kv_cache_inputs, prev_model_inputs)
Increment cache lengths (no-op for null manager).
Parameters:
- kv_cache_inputs (Sequence[RaggedKVCacheInputs]) – Current cache state tuples
- prev_model_inputs (Any) – Previous model inputs
Returns:
Unchanged cache inputs (no-op implementation)
Return type:
Sequence[RaggedKVCacheInputs]
infer_optimal_batch_size()
classmethod infer_optimal_batch_size(params, max_seq_len, num_layers, available_cache_memory, devices, **kwargs)
Infer optimal batch size (returns 1 for null manager).
Parameters:
- params
- max_seq_len
- num_layers
- available_cache_memory
- devices
- **kwargs
Returns:
Always returns 1
Return type:
int
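Because estimated_memory_size and infer_optimal_batch_size are class methods that return constants, planning code can call them before any manager instance exists. The sketch below shows that flow under stated assumptions: `NullSizingSketch` is a hypothetical stand-in with the same method names and argument names as documented here, and `plan_memory` is a hypothetical helper, not part of MAX.

```python
from typing import Any, Dict, Sequence


class NullSizingSketch:
    """Hypothetical stand-in mirroring the documented constant results."""

    @classmethod
    def estimated_memory_size(
        cls,
        params: Any,
        max_batch_size: int,
        max_seq_len: int,
        num_layers: int,
        available_cache_memory: int,
        devices: Sequence[Any],
        **kwargs: Any,
    ) -> int:
        return 0  # no cache memory is ever used

    @classmethod
    def infer_optimal_batch_size(
        cls,
        params: Any,
        max_seq_len: int,
        num_layers: int,
        available_cache_memory: int,
        devices: Sequence[Any],
        **kwargs: Any,
    ) -> int:
        return 1  # constant batch size


def plan_memory(manager_cls: Any, weights_bytes: int, budget_bytes: int) -> Dict[str, int]:
    """Hypothetical planner: pick a batch size, then total up weights plus cache."""
    cache_budget = budget_bytes - weights_bytes
    batch_size = manager_cls.infer_optimal_batch_size(
        params=None,
        max_seq_len=2048,
        num_layers=32,
        available_cache_memory=cache_budget,
        devices=[],
    )
    cache_bytes = manager_cls.estimated_memory_size(
        params=None,
        max_batch_size=batch_size,
        max_seq_len=2048,
        num_layers=32,
        available_cache_memory=cache_budget,
        devices=[],
    )
    return {
        "batch_size": batch_size,
        "cache_bytes": cache_bytes,
        "total_bytes": weights_bytes + cache_bytes,
    }


plan = plan_memory(NullSizingSketch, weights_bytes=8_000, budget_bytes=10_000)
```

In this sketch the total equals the weight size alone, which is the practical effect of a zero cache estimate in compile-only mode.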
metrics
property metrics: KVCacheMetrics
Get cache metrics.
Returns:
Current metrics
num_free_blocks
property num_free_blocks: int
Get number of free blocks.
Returns:
Dummy value of 1000
release()
release(request_id)
Release cache blocks (no-op for null manager).
Parameters:
request_id (RequestID) – Request ID to release
Return type:
None
reset_metrics()
reset_metrics()
Reset cache metrics.
Return type:
None
reset_prefix_cache()
reset_prefix_cache()
Reset prefix cache (no-op for null manager).
Return type:
None
step()
step(batch)
Step the cache manager (no-op for null manager).
Parameters:
batch (Sequence[TextGenerationContext]) – Batch of contexts
Return type:
None
total_num_host_pages
property total_num_host_pages: int
Get total number of host pages.
Returns:
Always returns 0
used_blocks_pct
property used_blocks_pct: float
Get percentage of used blocks.
Returns:
Always returns 0.0 (0%)
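The constant property values documented on this page (free_blocks_pct 1.0, used_blocks_pct 0.0, host_committed_block_pct 0.0, num_free_blocks 1000, total_num_host_pages 0) mean that capacity checks written against a real cache manager behave predictably against the null one. The sketch below is illustrative only: `NullCacheStatsSketch` is a hypothetical stand-in exposing those constants, and `can_admit` is a hypothetical admission check, not a MAX API.

```python
class NullCacheStatsSketch:
    """Hypothetical stand-in exposing the constant values documented above."""

    @property
    def free_blocks_pct(self) -> float:
        return 1.0

    @property
    def used_blocks_pct(self) -> float:
        return 0.0

    @property
    def host_committed_block_pct(self) -> float:
        return 0.0

    @property
    def num_free_blocks(self) -> int:
        return 1000  # dummy value, not a real block count

    @property
    def total_num_host_pages(self) -> int:
        return 0


def can_admit(manager: NullCacheStatsSketch, blocks_needed: int) -> bool:
    """Hypothetical admission check: is there room for a request of this size?"""
    return manager.num_free_blocks >= blocks_needed


stats = NullCacheStatsSketch()
small_request_ok = can_admit(stats, blocks_needed=8)
huge_request_ok = can_admit(stats, blocks_needed=1001)
```

Note that the free-block count is a fixed dummy value, so a check like this one would still reject a request that asks for more than 1000 blocks even though nothing is actually allocated.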