Python module

null_cache_manager

Null KV cache manager for compile-only mode.

This module provides a no-op KV cache manager for compile-only mode, where models are compiled against virtual devices. It allocates no GPU memory but still exposes the interface that graph construction requires.

NullKVCacheManager

class max.kv_cache.null_cache_manager.NullKVCacheManager(params, max_batch_size, max_seq_len, num_layers, devices, session, available_cache_memory, page_size=128)

A no-op KV cache manager for compile-only mode.

This manager is used when compiling models with virtual devices and does not allocate any GPU memory. It provides dummy implementations of the KV cache interface to allow graph construction and compilation without requiring physical GPU hardware or actual memory allocation.

This is particularly useful for cross-compilation, where you compile a model for GPU execution on a machine that has no physical GPU.

Initializes the null KV cache manager.

Parameters:

  • params (KVCacheParams) – The KV cache parameters for the pipeline.
  • max_batch_size (int) – The maximum batch size to support.
  • max_seq_len (int) – The maximum sequence length to support.
  • num_layers (int) – The number of transformer layers in the model.
  • devices (Sequence[Device]) – The list of virtual devices.
  • session (InferenceSession) – The inference session for graph operations.
  • available_cache_memory (int) – The nominal available cache memory in bytes.
  • page_size (int) – The page size in tokens. Defaults to 128.
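The no-op behavior described above follows the general null-object pattern. The sketch below illustrates that pattern only: `CacheManagerLike`, `SketchNullCache`, and `run_request` are hypothetical stand-ins that borrow method names from this page, and request IDs are assumed to be plain strings. It does not import or reproduce the real class.

```python
from typing import Protocol


class CacheManagerLike(Protocol):
    """Hypothetical slice of the KV cache interface (names mirror this page)."""

    def alloc(self, data: object, num_steps: int = 1) -> None: ...
    def get_req_blocks(self, request_id: str) -> list[int]: ...
    def release(self, request_id: str) -> None: ...


class SketchNullCache:
    """Stand-in null object: every call succeeds and touches no memory."""

    def alloc(self, data: object, num_steps: int = 1) -> None:
        return None  # nothing to allocate

    def get_req_blocks(self, request_id: str) -> list[int]:
        return []  # no blocks are ever handed out

    def release(self, request_id: str) -> None:
        return None  # nothing to free


def run_request(cache: CacheManagerLike, request_id: str) -> int:
    """Caller code that does not know whether the cache is real or null."""
    cache.alloc(data=request_id, num_steps=4)
    num_blocks = len(cache.get_req_blocks(request_id))
    cache.release(request_id)
    return num_blocks


blocks_used = run_request(SketchNullCache(), "req-0")
```

Because the caller depends only on the interface, the same code path can run during compilation with the null object substituted for a cache manager that allocates real memory.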

alloc()

alloc(data, num_steps=1)

Allocates blocks for a request to run for N steps.

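Parameters:

  • data – The request to allocate blocks for.
  • num_steps (int) – The number of steps to allocate for. Defaults to 1.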

Return type:

None

claim()

claim(request_id, replica_idx=None)

Externally claim cache blocks (no-op for null manager).

Parameters:

  • request_id (RequestID) – Request ID
  • replica_idx (int | None) – Replica index (defaults to 0 if None)

Return type:

None

contains()

contains(request_id)

Check if a request is in the cache.

Parameters:

request_id (RequestID) – Request ID to check

Returns:

True if request is tracked, False otherwise

Return type:

bool

estimated_memory_size()

classmethod estimated_memory_size(params, max_batch_size, max_seq_len, num_layers, available_cache_memory, devices, **kwargs)

Estimate memory size (returns 0 for null manager).

Parameters:

  • params (KVCacheParams) – KV cache parameters
  • max_batch_size (int) – Maximum batch size
  • max_seq_len (int) – Maximum sequence length
  • num_layers (int) – Number of layers
  • available_cache_memory (int) – Available cache memory
  • devices (Sequence[Device]) – List of devices
  • **kwargs (Any) – Additional arguments

Returns:

Always returns 0 (no memory used)

Return type:

int

free_blocks_pct

property free_blocks_pct: float

Get the share of free blocks, expressed as a fraction between 0.0 and 1.0.

Returns:

Always returns 1.0 (100%)

get_data_parallel_splits()

get_data_parallel_splits(batch)

Get data parallel splits for a batch.

Parameters:

batch (Sequence[TextGenerationContext]) – Batch of contexts

Returns:

Single split containing all batch indices

Return type:

Sequence[Sequence[int]]
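A "single split containing all batch indices" can be pictured as one inner list that covers every position in the batch. The helper below is a minimal sketch under that reading. `single_replica_splits` is a hypothetical name, and strings stand in for the context objects.

```python
from collections.abc import Sequence


def single_replica_splits(batch: Sequence[object]) -> list[list[int]]:
    """Return one split that holds every index of the batch."""
    return [list(range(len(batch)))]


# Three placeholder contexts produce a single split with three indices.
splits = single_replica_splits(["ctx-a", "ctx-b", "ctx-c"])
```

With only one replica there is nothing to partition, so the outer sequence always has length one regardless of batch size.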

get_or_recommend_replica()

get_or_recommend_replica(context)

Gets or recommends a replica index for a request context.

Parameters:

context (TextGenerationContext) – The text generation context containing the request.

Returns:

Always returns 0, as the null cache manager operates in single-replica mode.

Return type:

int

get_replica()

get_replica(context)

Gets the replica index for a request context.

Parameters:

context (TextGenerationContext) – The text generation context containing the request.

Returns:

Always returns 0, as the null cache manager operates in single-replica mode.

Return type:

int

get_req_blocks()

get_req_blocks(request_id)

Get blocks for a request.

Parameters:

request_id (RequestID) – Request ID

Returns:

Empty list (no blocks allocated)

Return type:

list[int]

get_runtime_inputs()

get_runtime_inputs(batch, num_steps=1)

Fetch KV cache blocks (returns dummy tensors).

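Parameters:

  • batch – The batch of requests to fetch inputs for.
  • num_steps (int) – The number of steps to run. Defaults to 1.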

Returns:

List containing a single RaggedKVCacheInputs with dummy tensors

Return type:

list[RaggedKVCacheInputs]

get_symbolic_inputs()

get_symbolic_inputs(devices=None, num_layers=None)

Get input symbols for graph construction.

Parameters:

  • devices (Sequence[Device] | None) – Devices to use (defaults to self.devices)
  • num_layers (int | None) – Number of layers (defaults to self.num_layers)

Returns:

Sequence of PagedCacheInputSymbols for graph construction

Return type:

Sequence[PagedCacheInputSymbols]

host_committed_block_pct

property host_committed_block_pct: float

Get the share of host committed blocks, expressed as a fraction between 0.0 and 1.0.

Returns:

Always returns 0.0 (0%)

increment_cache_lengths()

increment_cache_lengths(kv_cache_inputs, prev_model_inputs)

Increment cache lengths (no-op for null manager).

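Parameters:

  • kv_cache_inputs – The KV cache inputs, which are returned unchanged.
  • prev_model_inputs – The model inputs from the previous step.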

Returns:

Unchanged cache inputs (no-op implementation)

Return type:

Sequence[RaggedKVCacheInputs]

infer_optimal_batch_size()

classmethod infer_optimal_batch_size(params, max_seq_len, num_layers, available_cache_memory, devices, **kwargs)

Infer optimal batch size (returns 1 for null manager).

Parameters:

  • params (KVCacheParams) – KV cache parameters
  • max_seq_len (int) – Maximum sequence length
  • num_layers (int) – Number of layers
  • available_cache_memory (int) – Available cache memory
  • devices (Sequence[Device]) – List of devices
  • **kwargs (Any) – Additional arguments

Returns:

Always returns 1

Return type:

int
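Both infer_optimal_batch_size() and estimated_memory_size() are class methods that return fixed values, so planning code can call them before any manager instance exists. The sketch below shows that calling pattern with a hypothetical stand-in class, `SketchNullSizing`, and a hypothetical `plan_cache` helper. The real methods take more arguments than shown here; the stand-in absorbs them through `**kwargs`.

```python
class SketchNullSizing:
    """Stand-in, not the real class: returns the constants documented here."""

    @classmethod
    def estimated_memory_size(cls, available_cache_memory: int, **kwargs: object) -> int:
        return 0  # documented constant: no memory used

    @classmethod
    def infer_optimal_batch_size(cls, available_cache_memory: int, **kwargs: object) -> int:
        return 1  # documented constant


def plan_cache(manager_cls: type, available_cache_memory: int) -> dict[str, int]:
    """Hypothetical planner that sizes the cache through class methods only."""
    reserved = manager_cls.estimated_memory_size(
        available_cache_memory=available_cache_memory
    )
    batch_size = manager_cls.infer_optimal_batch_size(
        available_cache_memory=available_cache_memory
    )
    return {
        "reserved_bytes": reserved,
        "batch_size": batch_size,
        "remaining_bytes": available_cache_memory - reserved,
    }


plan = plan_cache(SketchNullSizing, available_cache_memory=8 * 1024**3)
```

Since the reserved size is zero, the full nominal budget remains available in this sketch, which matches the page's statement that the null manager uses no memory.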

metrics

property metrics: KVCacheMetrics

Get cache metrics.

Returns:

Current metrics

num_free_blocks

property num_free_blocks: int

Get number of free blocks.

Returns:

Dummy value of 1000

release()

release(request_id)

Release cache blocks (no-op for null manager).

Parameters:

request_id (RequestID) – Request ID to release

Return type:

None

reset_metrics()

reset_metrics()

Reset cache metrics.

Return type:

None

reset_prefix_cache()

reset_prefix_cache()

Reset prefix cache (no-op for null manager).

Return type:

None

step()

step(batch)

Step the cache manager (no-op for null manager).

Parameters:

batch (Sequence[TextGenerationContext]) – Batch of contexts

Return type:

None

total_num_host_pages

property total_num_host_pages: int

Get total number of host pages.

Returns:

Always returns 0

used_blocks_pct

property used_blocks_pct: float

Get the share of used blocks, expressed as a fraction between 0.0 and 1.0.

Returns:

Always returns 0.0 (0%)
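Taken together, the block-accounting properties on this page describe a cache that always looks empty. The sketch below collects the documented constants in a hypothetical stand-in class, `SketchNullStats`, and shows how a hypothetical admission check, `has_capacity`, might read them. Neither name comes from the library.

```python
class SketchNullStats:
    """Stand-in exposing the constant property values listed on this page."""

    @property
    def free_blocks_pct(self) -> float:
        return 1.0  # fully free

    @property
    def used_blocks_pct(self) -> float:
        return 0.0  # nothing in use

    @property
    def host_committed_block_pct(self) -> float:
        return 0.0  # nothing committed on the host

    @property
    def num_free_blocks(self) -> int:
        return 1000  # dummy value

    @property
    def total_num_host_pages(self) -> int:
        return 0  # no host pages


def has_capacity(stats: SketchNullStats, blocks_needed: int) -> bool:
    """Hypothetical admission check based only on the free-block count."""
    return stats.num_free_blocks >= blocks_needed


stats = SketchNullStats()
can_admit = has_capacity(stats, blocks_needed=8)
```

A check of this kind succeeds for any request that needs up to 1000 blocks, so callers that gate work on free capacity are not blocked by the dummy values.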
