Python module

null_cache_manager

Null KV cache manager for compile-only mode.

This module provides a no-op KV cache manager that is used during compile-only mode when running with virtual devices. It avoids GPU memory allocation while still providing the necessary interface for graph construction.

`NullKVCacheManager`

class max.kv_cache.null_cache_manager.NullKVCacheManager(params)

A no-op KV cache manager for compile-only mode.

This manager is used when compiling models with virtual devices and does not allocate any GPU memory. It provides dummy implementations of the KV cache interface to allow graph construction and compilation without requiring physical GPU hardware or actual memory allocation.

This is particularly useful for cross-compilation scenarios where you want to compile models for GPU execution on a machine without a physical GPU present.

Initializes the null KV cache manager.

Parameters:

params (KVCacheParams) – The KV cache parameters for the pipeline.
session – The inference session for graph operations.
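
The no-op design described above can be illustrated with a small stand-in. This is a minimal sketch, not the library's implementation: `NullManagerSketch` and its string request IDs are hypothetical, and only the return values documented on this page (an empty block list, 100% free blocks) are mirrored. Treating `alloc()` as a pure no-op is an assumption.

```python
from __future__ import annotations


class NullManagerSketch:
    """Hypothetical stand-in for a no-op KV cache manager.

    It keeps the call surface a pipeline expects but never reserves memory.
    """

    def alloc(self, data: object, num_steps: int = 1) -> None:
        # Assumption: nothing is reserved in this sketch, so the call just returns.
        return None

    def get_req_blocks(self, request_id: str) -> list[int]:
        # Documented behavior: no blocks are ever allocated.
        return []

    @property
    def free_blocks_pct(self) -> float:
        # Documented behavior: the cache always reports itself as 100% free.
        return 1.0


manager = NullManagerSketch()
manager.alloc(data="request-0", num_steps=4)
blocks = manager.get_req_blocks("request-0")
```

Because every method has the same shape as its real counterpart, calling code does not need a separate branch for compile-only mode.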

`alloc()`

alloc(data, num_steps=1)

Allocates blocks for a request to run for `num_steps` steps.

Parameters:

data (TextGenerationContext)
num_steps (int)

Return type:

None

`claim()`

claim(request_id, replica_idx=None)

Externally claim cache blocks (no-op for null manager).

Parameters:

request_id (RequestID) – Request ID
replica_idx (int | None) – Replica index (defaults to 0 if None)

Return type:

None

`contains()`

contains(request_id)

Check if a request is in the cache.

Parameters: request_id (RequestID) – Request ID to check
Returns: True if request is tracked, False otherwise
Return type: bool

`free_blocks_pct`

property free_blocks_pct: float

Get percentage of free blocks.

Returns: Always returns 1.0 (100%)

`get_data_parallel_splits()`

get_data_parallel_splits(batch)

Get data parallel splits for a batch.

Parameters: batch (Sequence[TextGenerationContext]) – Batch of contexts
Returns: Single split containing all batch indices
Return type: Sequence[Sequence[int]]
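
One plausible reading of "single split containing all batch indices" is a one-element list that holds every index of the batch in order. The helper below is a hypothetical sketch of that shape, not the library's code; plain strings stand in for `TextGenerationContext` objects.

```python
from __future__ import annotations

from collections.abc import Sequence


def single_split(batch: Sequence[object]) -> list[list[int]]:
    """Return one split that covers every position in the batch."""
    # With a single replica there is nothing to partition, so all
    # indices land in the same (and only) split.
    return [list(range(len(batch)))]


splits = single_split(["ctx-a", "ctx-b", "ctx-c"])
```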

`get_or_recommend_replica()`

get_or_recommend_replica(context)

Gets or recommends a replica index for a request context.

Parameters: context (TextGenerationContext) – The text generation context containing the request.
Returns: Always returns 0, as the null cache manager operates in single-replica mode.
Return type: int

`get_replica()`

get_replica(request_id)

Gets the replica index for a request.

Parameters: request_id (RequestID) – The request ID to get the replica for.
Returns: Always returns 0, as the null cache manager operates in single-replica mode.
Return type: int

`get_replica_request_count()`

get_replica_request_count(replica_idx)

Get the number of active requests for a replica.

Parameters: replica_idx (int) – The replica index to query
Returns: Always returns 0 for null cache manager (compile-only mode)
Return type: int
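
The three replica methods above all collapse to constants in single-replica mode. A minimal sketch of that contract follows, using a hypothetical `SingleReplicaSketch` class and plain strings in place of `RequestID` and `TextGenerationContext`; it is an illustration, not the library's implementation.

```python
from __future__ import annotations


class SingleReplicaSketch:
    """Hypothetical stand-in for the replica-related methods."""

    def get_or_recommend_replica(self, context: object) -> int:
        # Only replica 0 exists, so it is always the recommendation.
        return 0

    def get_replica(self, request_id: str) -> int:
        # Every request maps to replica 0.
        return 0

    def get_replica_request_count(self, replica_idx: int) -> int:
        # Documented behavior: the count is always 0 in compile-only mode.
        return 0


router = SingleReplicaSketch()
replica = router.get_or_recommend_replica("ctx-a")
active = router.get_replica_request_count(replica)
```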

`get_req_blocks()`

get_req_blocks(request_id)

Get blocks for a request.

Parameters: request_id (RequestID) – Request ID
Returns: Empty list (no blocks allocated)
Return type: list[int]

`get_runtime_inputs()`

get_runtime_inputs(batch, num_steps=1)

Fetch KV cache blocks (returns dummy tensors).

Parameters:

batch (Sequence[TextGenerationContext]) – Batch of contexts
num_steps (int) – Number of steps to fetch

Returns:

List containing a single RaggedKVCacheInputs with dummy tensors

Return type:

list[RaggedKVCacheInputs]

Note

Tensors are kept on host since this is only used in compile-only mode with virtual devices that don’t support device operations.

`host_committed_block_pct`

property host_committed_block_pct: float

Get percentage of host committed blocks.

Returns: Always returns 0.0 (0%)

`increment_cache_lengths()`

increment_cache_lengths(kv_cache_inputs, prev_model_inputs)

Increment cache lengths (no-op for null manager).

Parameters:

kv_cache_inputs (Sequence[RaggedKVCacheInputs]) – Current cache state tuples
prev_model_inputs (Any) – Previous model inputs

Returns:

Unchanged cache inputs (no-op implementation)

Return type:

Sequence[RaggedKVCacheInputs]

`metrics`

property metrics: KVCacheMetrics

Get cache metrics.

Returns: Current metrics

`num_free_blocks`

property num_free_blocks: int

Get number of free blocks.

Returns: Dummy value of 1000

`release()`

release(request_id)

Release cache blocks (no-op for null manager).

Parameters: request_id (RequestID) – Request ID to release
Return type: None

`reset_metrics()`

reset_metrics()

Reset cache metrics.

Return type: None

`reset_prefix_cache()`

reset_prefix_cache()

Reset prefix cache (no-op for null manager).

Return type: None

`step()`

step(batch)

Step the cache manager (no-op for null manager).

Parameters: batch (Sequence[TextGenerationContext]) – Batch of contexts
Return type: None

`total_num_host_pages`

property total_num_host_pages: int

Get total number of host pages.

Returns: Always returns 0

`used_blocks_pct`

property used_blocks_pct: float

Get percentage of used blocks.

Returns: Always returns 0.0 (0%)
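
The capacity properties on this page return fixed values, so code that reads them always sees an empty cache. The sketch below collects the documented constants in a hypothetical `NullCapacitySketch` class; `can_admit` is an illustrative helper invented for this example, not part of the library.

```python
class NullCapacitySketch:
    """Hypothetical stand-in exposing the documented constant properties."""

    @property
    def free_blocks_pct(self) -> float:
        return 1.0  # always 100% free

    @property
    def used_blocks_pct(self) -> float:
        return 0.0  # always 0% used

    @property
    def host_committed_block_pct(self) -> float:
        return 0.0  # always 0% committed on host

    @property
    def num_free_blocks(self) -> int:
        return 1000  # documented dummy value

    @property
    def total_num_host_pages(self) -> int:
        return 0  # no host pages exist


def can_admit(cache: NullCapacitySketch, blocks_needed: int) -> bool:
    """Illustrative check: does the reported free count cover the request?"""
    return cache.num_free_blocks >= blocks_needed


cache = NullCapacitySketch()
admitted = can_admit(cache, blocks_needed=16)
```

In this sketch the check succeeds for any request of up to 1000 blocks, since the reported free count never changes.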
