Python module
null_cache_manager
Null KV cache manager for compile-only mode.
This module provides a no-op KV cache manager for compile-only mode with virtual devices. It allocates no GPU memory while still exposing the interface that graph construction requires.
NullKVCacheManager
class max.kv_cache.null_cache_manager.NullKVCacheManager(params, max_batch_size, max_seq_len, num_layers, devices, session, available_cache_memory, page_size=128)
A no-op KV cache manager for compile-only mode.
This manager is used when compiling for virtual devices and allocates no GPU memory. Its methods are dummy implementations of the KV cache interface, so a graph can be constructed without any real cache allocation.
Initialize the null KV cache manager.
Parameters:
- params (KVCacheParams) – KV cache parameters
- max_batch_size (int) – Maximum batch size
- max_seq_len (int) – Maximum sequence length
- num_layers (int) – Number of model layers
- devices (Sequence[Device]) – List of devices
- session (InferenceSession) – Inference session
- available_cache_memory (int) – Available cache memory
- page_size (int) – Page size in tokens
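As a rough illustration of the constructor described above, the following sketch uses a hypothetical stand-in class, `SketchNullManager`, rather than the real `NullKVCacheManager`. It assumes only what the signature states: eight arguments, with `page_size` defaulting to 128, and no memory allocated. The `allocated_bytes` helper and the placeholder `object()` arguments are inventions for this example.

```python
from dataclasses import dataclass
from typing import Any, Sequence


@dataclass
class SketchNullManager:
    """Hypothetical stand-in, NOT the real NullKVCacheManager.

    It mirrors the documented constructor arguments and stores them
    without allocating any cache memory.
    """

    params: Any                  # stands in for KVCacheParams
    max_batch_size: int
    max_seq_len: int
    num_layers: int
    devices: Sequence[Any]       # stands in for Sequence[Device]
    session: Any                 # stands in for InferenceSession
    available_cache_memory: int
    page_size: int = 128         # default taken from the documented signature

    def allocated_bytes(self) -> int:
        """Hypothetical helper: a null manager holds no cache memory."""
        return 0


# Placeholder objects replace the real params, devices, and session.
manager = SketchNullManager(
    params=object(),
    max_batch_size=4,
    max_seq_len=2048,
    num_layers=32,
    devices=[object()],
    session=object(),
    available_cache_memory=0,
)
```

In this sketch the arguments are accepted only so that the stand-in has the same shape as a real cache manager; none of them triggers an allocation.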
contains()
contains(request_id)
Check if a request is in the cache.
estimated_memory_size()
classmethod estimated_memory_size(params, max_batch_size, max_seq_len, num_layers, available_cache_memory, devices, **kwargs)
Estimate memory size (returns 0 for null manager).
Parameters:
- params (KVCacheParams) – KV cache parameters
- max_batch_size (int) – Maximum batch size
- max_seq_len (int) – Maximum sequence length
- num_layers (int) – Number of layers
- available_cache_memory (int) – Available cache memory
- devices (Sequence[Device]) – List of devices
- **kwargs (Any) – Additional arguments
Returns:
Always returns 0 (no memory used)
Return type:
external_claim()
external_claim(request_id, replica_idx=None)
Externally claim cache blocks (no-op for null manager).
fetch()
fetch(batch, num_steps=1)
Fetch KV cache blocks (returns dummy tensors).
Parameters:
- batch (Sequence[TextGenerationContext]) – Batch of contexts
- num_steps (int) – Number of steps to fetch
Returns:
List containing a single RaggedKVCacheInputs with dummy tensors
Return type:
NOTE
Tensors are kept on host since this is only used in compile-only mode with virtual devices that don’t support device operations.
free_blocks_pct
property free_blocks_pct: float
Get percentage of free blocks.
Returns:
Always returns 1.0 (100%)
get_data_parallel_splits()
get_data_parallel_splits(batch)
Get data parallel splits for a batch.
Parameters:
- batch (Sequence[TextGenerationContext]) – Batch of contexts
Returns:
Single split containing all batch indices
Return type:
get_or_recommend_replica()
get_or_recommend_replica(context)
Get or recommend a replica index for a context.
Parameters:
- context (TextGenerationContext) – Text generation context
Returns:
Always returns 0 (single replica)
Return type:
get_replica()
get_replica(context)
Get the replica index for a context.
Parameters:
- context (TextGenerationContext) – Text generation context
Returns:
Always returns 0 (single replica)
Return type:
get_req_blocks()
get_req_blocks(request_id)
Get blocks for a request.
host_committed_block_pct
property host_committed_block_pct: float
Get percentage of host committed blocks.
Returns:
Always returns 0.0 (0%)
increment_cache_lengths()
increment_cache_lengths(kv_cache_inputs, prev_model_inputs)
Increment cache lengths (no-op for null manager).
Parameters:
- kv_cache_inputs (Sequence[RaggedKVCacheInputs]) – Current cache state tuples
- prev_model_inputs (Any) – Previous model inputs
Returns:
Unchanged cache inputs (no-op implementation)
Return type:
infer_optimal_batch_size()
classmethod infer_optimal_batch_size(params, max_seq_len, num_layers, available_cache_memory, devices, **kwargs)
Infer optimal batch size (returns 1 for null manager).
Returns:
Always returns 1
Return type:
input_symbols()
input_symbols(devices=None, num_layers=None)
Get input symbols for graph construction.
Returns:
Sequence of PagedCacheInputSymbols for graph construction
Return type:
maybe_reserve()
maybe_reserve(data, num_steps=1)
Reserve cache blocks (no-op for null manager).
Parameters:
- data (TextGenerationContext) – Text generation context
- num_steps (int) – Number of steps to reserve
Returns:
Always returns True
Return type:
metrics
property metrics: KVCacheMetrics
Get cache metrics.
Returns:
Current metrics
num_free_blocks
property num_free_blocks: int
Get number of free blocks.
Returns:
Dummy value of 1000
release()
release(request_id)
Release cache blocks (no-op for null manager).
Parameters:
- request_id (RequestID) – Request ID to release
Return type:
None
reset_metrics()
reset_metrics()
Reset cache metrics.
Return type:
None
reset_prefix_cache()
reset_prefix_cache()
Reset prefix cache (no-op for null manager).
Return type:
None
step()
step(batch)
Step the cache manager (no-op for null manager).
Parameters:
- batch (Sequence[TextGenerationContext]) – Batch of contexts
Return type:
None
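The no-op methods documented above (`maybe_reserve`, `step`, `release`) can be pictured with a small stand-in. The sketch below does not import the real library: `SketchNullLifecycle`, its `calls` log, and the `run_request` driver are hypothetical names for this example, and the call order shown is an assumption about how a caller might sequence these methods, not a description of the actual scheduler.

```python
from typing import Any, List, Sequence


class SketchNullLifecycle:
    """Hypothetical stand-in mirroring the documented no-op behavior."""

    def __init__(self) -> None:
        # Call log kept only so the sketch can be inspected; the real
        # manager is not documented to record anything.
        self.calls: List[str] = []

    def maybe_reserve(self, data: Any, num_steps: int = 1) -> bool:
        self.calls.append("maybe_reserve")
        return True  # documented: always returns True

    def step(self, batch: Sequence[Any]) -> None:
        self.calls.append("step")  # documented: no-op, returns None

    def release(self, request_id: Any) -> None:
        self.calls.append("release")  # documented: no-op, returns None


def run_request(manager: SketchNullLifecycle, context: Any, request_id: Any) -> bool:
    """Hypothetical driver: reserve, step once, then release."""
    if not manager.maybe_reserve(context, num_steps=1):
        return False
    manager.step([context])
    manager.release(request_id)
    return True


lifecycle = SketchNullLifecycle()
completed = run_request(lifecycle, context=object(), request_id="req-0")
```

Because reservation always succeeds and the other calls do nothing, a caller written against the full cache interface can run unchanged in this sketch without any block bookkeeping.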
total_num_host_pages
property total_num_host_pages: int
Get total number of host pages.
Returns:
Always returns 0
used_blocks_pct
property used_blocks_pct: float
Get percentage of used blocks.
Returns:
Always returns 0.0 (0%)
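The constant property values documented above can be gathered into one stand-in to show how a caller might read them. This is a minimal sketch: `SketchNullCacheStats` and the `can_admit` check are hypothetical and are not part of `max.kv_cache`; only the returned constants come from this page.

```python
class SketchNullCacheStats:
    """Hypothetical stand-in exposing the documented constant values."""

    @property
    def free_blocks_pct(self) -> float:
        return 1.0  # documented: always 1.0 (100%)

    @property
    def used_blocks_pct(self) -> float:
        return 0.0  # documented: always 0.0 (0%)

    @property
    def host_committed_block_pct(self) -> float:
        return 0.0  # documented: always 0.0 (0%)

    @property
    def num_free_blocks(self) -> int:
        return 1000  # documented: dummy value of 1000

    @property
    def total_num_host_pages(self) -> int:
        return 0  # documented: always 0


def can_admit(stats: SketchNullCacheStats, blocks_needed: int) -> bool:
    """Hypothetical admission check based on the reported free blocks."""
    return stats.num_free_blocks >= blocks_needed


stats = SketchNullCacheStats()
admitted = can_admit(stats, blocks_needed=8)
```

Under these constants, any check of this kind that asks for 1000 blocks or fewer passes, and the free and used percentages always sum to 1.0.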