Python module

null_cache_manager

Null KV cache manager for compile-only mode.

This module provides a no-op KV cache manager for compile-only mode, where a model is compiled for virtual devices. It allocates no GPU memory but still exposes the interface that graph construction requires.

NullKVCacheManager

class max.kv_cache.null_cache_manager.NullKVCacheManager(params, max_batch_size, max_seq_len, num_layers, devices, session, available_cache_memory, page_size=128)

A no-op KV cache manager for compile-only mode.

This manager is used when compiling for virtual devices and does not allocate any GPU memory. It provides dummy implementations of the KV cache interface to allow graph construction without actual memory allocation.

Initialize the null KV cache manager.

Parameters:

  • params (KVCacheParams) – KV cache parameters
  • max_batch_size (int) – Maximum batch size
  • max_seq_len (int) – Maximum sequence length
  • num_layers (int) – Number of model layers
  • devices (Sequence[Device]) – List of devices
  • session (InferenceSession) – Inference session
  • available_cache_memory (int) – Available cache memory
  • page_size (int) – Page size in tokens
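The idea behind a no-op manager can be illustrated with the null-object pattern. The following is a minimal sketch, not the MAX implementation: `CacheManagerLike`, `NullManagerSketch`, and `drive` are hypothetical names, the interface is reduced to three of the methods documented on this page, and request IDs are modeled as plain strings.

```python
from typing import Protocol, Sequence


class CacheManagerLike(Protocol):
    """Hypothetical, reduced interface (assumption: the real one is larger)."""

    def maybe_reserve(self, data: object, num_steps: int = 1) -> bool: ...
    def get_req_blocks(self, request_id: str) -> list[int]: ...
    def release(self, request_id: str) -> None: ...


class NullManagerSketch:
    """Stand-in that satisfies the interface without allocating anything."""

    def maybe_reserve(self, data: object, num_steps: int = 1) -> bool:
        return True  # reservation always "succeeds"; nothing is reserved

    def get_req_blocks(self, request_id: str) -> list[int]:
        return []  # no blocks are ever allocated

    def release(self, request_id: str) -> None:
        return None  # nothing to free


def drive(manager: CacheManagerLike, request_ids: Sequence[str]) -> int:
    """Caller code that runs unchanged against a real or a null manager."""
    admitted = 0
    for rid in request_ids:
        if manager.maybe_reserve(rid):
            admitted += 1
        manager.release(rid)
    return admitted


admitted = drive(NullManagerSketch(), ["req-0", "req-1"])
```

Because every call returns a fixed, harmless value, code written against the interface needs no compile-only special cases.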

contains()

contains(request_id)

Check if a request is in the cache.

Parameters:

request_id (RequestID) – Request ID to check

Returns:

True if request is tracked, False otherwise

Return type:

bool

estimated_memory_size()

classmethod estimated_memory_size(params, max_batch_size, max_seq_len, num_layers, available_cache_memory, devices, **kwargs)

Estimate memory size (returns 0 for null manager).

Parameters:

  • params (KVCacheParams) – KV cache parameters
  • max_batch_size (int) – Maximum batch size
  • max_seq_len (int) – Maximum sequence length
  • num_layers (int) – Number of layers
  • available_cache_memory (int) – Available cache memory
  • devices (Sequence[Device]) – List of devices
  • **kwargs (Any) – Additional arguments

Returns:

Always returns 0 (no memory used)

Return type:

int

external_claim()

external_claim(request_id, replica_idx=None)

Externally claim cache blocks (no-op for null manager).

Parameters:

  • request_id (RequestID) – Request ID
  • replica_idx (int | None) – Replica index (defaults to 0 if None)

Return type:

None

fetch()

fetch(batch, num_steps=1)

Fetch KV cache blocks (returns dummy tensors).

Parameters:

  • batch – Batch to fetch cache inputs for
  • num_steps – Number of steps (defaults to 1)

Returns:

List containing a single RaggedKVCacheInputs with dummy tensors

Return type:

list[RaggedKVCacheInputs]

NOTE

Tensors are kept on host since this is only used in compile-only mode with virtual devices that don’t support device operations.

free_blocks_pct

property free_blocks_pct: float

Get the share of free blocks, expressed as a fraction between 0.0 and 1.0.

Returns:

Always returns 1.0 (100%)

get_data_parallel_splits()

get_data_parallel_splits(batch)

Get data parallel splits for a batch.

Parameters:

batch (Sequence[TextGenerationContext]) – Batch of contexts

Returns:

Single split containing all batch indices

Return type:

Sequence[Sequence[int]]
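One reading of "single split containing all batch indices" is sketched below. `single_split` is a hypothetical helper and not part of the MAX API; the exact container types returned by the real method may differ.

```python
from typing import Sequence


def single_split(batch: Sequence[object]) -> list[list[int]]:
    """Return one split that holds every index of the batch, in order."""
    return [list(range(len(batch)))]


# Three contexts, modeled here as plain strings for illustration.
splits = single_split(["ctx-a", "ctx-b", "ctx-c"])
```

With only one replica, there is nothing to partition, so every request lands in the same split.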

get_or_recommend_replica()

get_or_recommend_replica(context)

Get or recommend a replica index for a context.

Parameters:

context (TextGenerationContext) – Text generation context

Returns:

Always returns 0 (single replica)

Return type:

int

get_replica()

get_replica(context)

Get the replica index for a context.

Parameters:

context (TextGenerationContext) – Text generation context

Returns:

Always returns 0 (single replica)

Return type:

int

get_req_blocks()

get_req_blocks(request_id)

Get blocks for a request.

Parameters:

request_id (RequestID) – Request ID

Returns:

Empty list (no blocks allocated)

Return type:

list[int]

host_committed_block_pct

property host_committed_block_pct: float

Get the share of host-committed blocks, expressed as a fraction between 0.0 and 1.0.

Returns:

Always returns 0.0 (0%)

increment_cache_lengths()

increment_cache_lengths(kv_cache_inputs, prev_model_inputs)

Increment cache lengths (no-op for null manager).

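Parameters:

  • kv_cache_inputs – KV cache inputs, returned unchanged
  • prev_model_inputs – Model inputs from the previous step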

Returns:

Unchanged cache inputs (no-op implementation)

Return type:

Sequence[RaggedKVCacheInputs]

infer_optimal_batch_size()

classmethod infer_optimal_batch_size(params, max_seq_len, num_layers, available_cache_memory, devices, **kwargs)

Infer optimal batch size (returns 1 for null manager).

Parameters:

  • params (KVCacheParams) – KV cache parameters
  • max_seq_len (int) – Maximum sequence length
  • num_layers (int) – Number of layers
  • available_cache_memory (int) – Available cache memory
  • devices (Sequence[Device]) – List of devices
  • **kwargs (Any) – Additional arguments

Returns:

Always returns 1

Return type:

int
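The two sizing classmethods, `estimated_memory_size()` and `infer_optimal_batch_size()`, return constants, so planning code that queries them sees a cache that costs nothing. The sketch below uses a hypothetical stand-in class with simplified keyword-only signatures; `NullSizingSketch` and `plan` are illustrative names, and the real methods take the parameters listed above.

```python
class NullSizingSketch:
    """Stand-in with the documented constant results (simplified signatures)."""

    @classmethod
    def estimated_memory_size(cls, **kwargs) -> int:
        return 0  # documented: always 0, no memory used

    @classmethod
    def infer_optimal_batch_size(cls, **kwargs) -> int:
        return 1  # documented: always 1


def plan(manager_cls, available_cache_memory: int) -> dict:
    """Hypothetical planner that asks the manager class for its needs."""
    needed = manager_cls.estimated_memory_size(
        available_cache_memory=available_cache_memory
    )
    batch = manager_cls.infer_optimal_batch_size(
        available_cache_memory=available_cache_memory
    )
    return {
        "cache_memory": needed,
        "batch_size": batch,
        "fits": needed <= available_cache_memory,
    }


# Even with no memory available, the null manager's estimate fits.
result = plan(NullSizingSketch, available_cache_memory=0)
```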

input_symbols()

input_symbols(devices=None, num_layers=None)

Get input symbols for graph construction.

Parameters:

  • devices (Sequence[Device] | None) – Devices to use (defaults to self.devices)
  • num_layers (int | None) – Number of layers (defaults to self.num_layers)

Returns:

Sequence of PagedCacheInputSymbols for graph construction

Return type:

Sequence[PagedCacheInputSymbols]

maybe_reserve()

maybe_reserve(data, num_steps=1)

Reserve cache blocks (no-op for null manager).

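Parameters:

  • data – Request data for which blocks would be reserved
  • num_steps – Number of steps (defaults to 1)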

Returns:

Always returns True

Return type:

bool

metrics

property metrics: KVCacheMetrics

Get cache metrics.

Returns:

Current metrics

num_free_blocks

property num_free_blocks: int

Get number of free blocks.

Returns:

Always returns 1000 (a placeholder value)

release()

release(request_id)

Release cache blocks (no-op for null manager).

Parameters:

request_id (RequestID) – Request ID to release

Return type:

None

reset_metrics()

reset_metrics()

Reset cache metrics.

Return type:

None

reset_prefix_cache()

reset_prefix_cache()

Reset prefix cache (no-op for null manager).

Return type:

None

step()

step(batch)

Step the cache manager (no-op for null manager).

Parameters:

batch (Sequence[TextGenerationContext]) – Batch of contexts

Return type:

None

total_num_host_pages

property total_num_host_pages: int

Get total number of host pages.

Returns:

Always returns 0

used_blocks_pct

property used_blocks_pct: float

Get the share of used blocks, expressed as a fraction between 0.0 and 1.0.

Returns:

Always returns 0.0 (0%)
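Taken together, the block-statistics properties describe a cache that always looks idle. The sketch below collects the documented constants in a hypothetical dataclass, `NullBlockStatsSketch`, which is not part of the MAX API. Note that the `*_pct` values are fractions (1.0 means 100%), not values on a 0 to 100 scale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NullBlockStatsSketch:
    """Documented constant values of the null manager's properties."""

    free_blocks_pct: float = 1.0           # 100% free
    used_blocks_pct: float = 0.0           # 0% used
    host_committed_block_pct: float = 0.0  # 0% host committed
    num_free_blocks: int = 1000            # placeholder value
    total_num_host_pages: int = 0


stats = NullBlockStatsSketch()

# Free and used shares account for the whole cache between them.
looks_idle = stats.free_blocks_pct + stats.used_blocks_pct == 1.0
```

Monitoring or scheduling code that reads these properties will therefore never see memory pressure in compile-only mode.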
