Python module

context

KVCacheAwareContext

class max.nn.kv_cache.context.KVCacheAwareContext(*args, **kwargs)

A Protocol identifying the minimum API necessary for interacting with a KV Cache.
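Because this is a Protocol, any context object that exposes the members documented below can be used wherever a KVCacheAwareContext is expected; no subclassing is required. A minimal sketch, using only members from this page:

```python
from max.nn.kv_cache.context import KVCacheAwareContext

def describe(ctx: KVCacheAwareContext) -> str:
    # Any object that satisfies the protocol can be passed here.
    slot = ctx.cache_seq_id if ctx.is_assigned_to_cache else None
    return f"length={ctx.current_length}, done={ctx.is_done}, cache_slot={slot}"
```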

active_idx

property active_idx: int

active_length

property active_length: int

Current sequence length: the number of tokens input this iteration.

This will be the prompt size for context encoding, and simply 1 for token generation.

assign_to_cache()

assign_to_cache(cache_seq_id)

Assigns the context to a cache slot.

Parameters:

cache_seq_id (int)

Return type:

None
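A hedged sketch of claiming a slot, where the pool of free slot IDs is an illustrative structure owned by the caller rather than part of this API:

```python
from max.nn.kv_cache.context import KVCacheAwareContext

def claim_slot(ctx: KVCacheAwareContext, free_slots: list[int]) -> None:
    # `free_slots` is an illustrative pool of slot IDs managed by the caller.
    # After assignment, ctx.cache_seq_id returns the claimed slot.
    if not ctx.is_assigned_to_cache:
        ctx.assign_to_cache(free_slots.pop())
```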

bump_token_indices()

bump_token_indices(start_idx=0, active_idx=0, end_idx=0, committed_idx=0)

Update the start_idx, active_idx, end_idx, and committed_idx without manipulating the token array.

Parameters:

  • start_idx (int)
  • active_idx (int)
  • end_idx (int)
  • committed_idx (int)

Return type:

None
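A sketch of advancing the indices after a decode step, under the assumption that the arguments are increments applied to the current indices (set_token_indices, below, takes absolute positions):

```python
from max.nn.kv_cache.context import KVCacheAwareContext

def advance_one_token(ctx: KVCacheAwareContext) -> None:
    # Assumption: arguments are offsets added to the current indices.
    # After a token-generation step, the active window slides forward by 1.
    ctx.bump_token_indices(active_idx=1, end_idx=1)
```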

cache_seq_id

property cache_seq_id: int

Returns the cache slot assigned to the context, raising an error if not assigned.

committed_idx

property committed_idx: int

compute_num_available_steps()

compute_num_available_steps(max_seq_len)

Compute the maximum number of steps that can be executed for this context without exceeding max_seq_len.

Parameters:

max_seq_len (int)

Return type:

int
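A small sketch of how a scheduler might use this to clamp a requested step count; max_seq_len and requested are illustrative parameters:

```python
from max.nn.kv_cache.context import KVCacheAwareContext

def steps_to_schedule(ctx: KVCacheAwareContext, max_seq_len: int, requested: int) -> int:
    # Clamp the requested number of generation steps so the sequence can
    # never grow past max_seq_len.
    return min(requested, ctx.compute_num_available_steps(max_seq_len))
```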

current_length

property current_length: int

The current length of the sequence, including completed and active tokens.

end_idx

property end_idx: int

eos_token_ids

property eos_token_ids: set[int]

is_assigned_to_cache

property is_assigned_to_cache: bool

Returns True if the context is assigned to a cache slot, False otherwise.

is_done

property is_done: bool

json_schema

property json_schema: str | None

A JSON schema to use during constrained decoding.

matcher

property matcher: xgr.GrammarMatcher | None

An optional xgr.GrammarMatcher provided when using structured output.

max_length

property max_length: int | None

The maximum length of this sequence.

next_tokens

property next_tokens: ndarray

The next prompt tokens to be input during this iteration.

This should be a 1D array of tokens of length active_length.
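A sketch of reading the per-iteration input slice, checking the documented shape invariant:

```python
import numpy as np

from max.nn.kv_cache.context import KVCacheAwareContext

def next_token_batch(ctx: KVCacheAwareContext) -> np.ndarray:
    # The slice fed to the model this iteration: the full prompt during
    # context encoding, or a single token during token generation.
    tokens = ctx.next_tokens
    assert tokens.ndim == 1 and tokens.shape[0] == ctx.active_length
    return tokens
```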

reset()

reset()

Resets the context's state by combining all tokens into a new prompt. This method is used when a request is evicted, meaning the context needs to be re-encoded in the following context-encoding (CE) iteration.

Return type:

None
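A hedged sketch of eviction handling, assuming the caller also releases the cache slot:

```python
from max.nn.kv_cache.context import KVCacheAwareContext

def evict(ctx: KVCacheAwareContext) -> None:
    # Release the cache slot, then fold all tokens back into a fresh
    # prompt so the request can be re-encoded from scratch later.
    if ctx.is_assigned_to_cache:
        ctx.unassign_from_cache()
    ctx.reset()
```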

set_matcher()

set_matcher(matcher)

Set a grammar matcher for use during constrained decoding.

Parameters:

matcher (xgr.GrammarMatcher)

Return type:

None
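A sketch of wiring up structured output; constructing the matcher itself (via the xgrammar package) is outside this API and is assumed to happen elsewhere:

```python
from max.nn.kv_cache.context import KVCacheAwareContext

def enable_constrained_decoding(ctx: KVCacheAwareContext, matcher) -> None:
    # `matcher` is assumed to be an xgr.GrammarMatcher built elsewhere,
    # e.g. from the schema exposed by ctx.json_schema. Once set, it is
    # readable back via the `matcher` property.
    ctx.set_matcher(matcher)
```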

set_token_indices()

set_token_indices(start_idx=None, active_idx=None, end_idx=None, committed_idx=None)

Set the token indices without manipulating the token array.

Parameters:

  • start_idx (int | None)
  • active_idx (int | None)
  • end_idx (int | None)
  • committed_idx (int | None)

Return type:

None
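Unlike bump_token_indices, this takes absolute positions. A sketch, assuming that indices left at None are not changed:

```python
from max.nn.kv_cache.context import KVCacheAwareContext

def rewind_active_window(ctx: KVCacheAwareContext, new_active_idx: int) -> None:
    # Move only the active index to an absolute position; indices left at
    # None (the default) are assumed to stay as they are.
    ctx.set_token_indices(active_idx=new_active_idx)
```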

start_idx

property start_idx: int

status

property status: GenerationStatus

tokens

property tokens: ndarray

All tokens in the context.

unassign_from_cache()

unassign_from_cache()

Unassigns the context from a cache slot.

Return type:

None

update_status()

update_status(status)

Parameters:

status (GenerationStatus)

Return type:

None
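A sketch of ending a sequence once an EOS token appears; the end-of-sequence member of GenerationStatus is passed in rather than named, since its import path is not shown on this page:

```python
from max.nn.kv_cache.context import KVCacheAwareContext

def finish_if_eos(ctx: KVCacheAwareContext, last_token: int, eos_status) -> None:
    # `eos_status` is assumed to be the end-of-sequence member of
    # GenerationStatus.
    if last_token in ctx.eos_token_ids and not ctx.is_done:
        ctx.update_status(eos_status)
```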