Python class

TextContext

class max.pipelines.TextContext(*, max_length, tokens, request_id=<factory>, eos_tracker=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, _spec_decoding_state=None, target_endpoint=None, external_block_metadata=None, cached_prefix_length=None)

Bases: object

A base class for model context, specifically for Text model variants.

This class manages the state and processing of text generation, including token management, caching, and generation parameters.

Parameters:

  • max_length (int) – Maximum allowed length of the generated sequence
  • tokens (TokenBuffer) – NumPy array containing the token IDs
  • request_id (RequestID) – A unique identifier for this sequence.
  • eos_tracker (EOSTracker) – Holds the EOS configuration and performs end-of-sequence checks
  • log_probabilities (int) – Number of top token log probabilities to return per generated token (0 disables them)
  • log_probabilities_echo (bool) – Whether to return log probabilities for prompt tokens
  • ignore_eos (bool) – Whether to ignore end of sequence tokens and continue generating
  • json_schema (str | None) – Optional JSON schema for structured output
  • sampling_params (SamplingParams) – Parameters controlling the token sampling strategy
  • model_name (str)
  • _matcher (Any | None) – Optional grammar matcher for constrained decoding
  • status (GenerationStatus) – Current generation status of the sequence
  • _log_probabilities_data (dict[int, LogProbabilities]) – Token log probabilities data
  • _is_initial_prompt (bool) – Whether this is the initial prompt encoding
  • _draft_offset (int) – Offset for draft decoding
  • _spec_decoding_state (SpecDecodingState | None) – Optional per-request speculative decoding state
  • target_endpoint (str | None) – Optional target endpoint identifier for routing requests
  • external_block_metadata (Any) – Block metadata from the Orchestrator for distributed KV cache (dKV)
  • cached_prefix_length (int | None) – Number of prompt tokens served from the KV prefix cache
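
The sketch below shows a typical lifecycle. TokenBuffer construction and the model call are not documented on this page, so make_token_buffer and model_step are hypothetical stand-ins:

import numpy as np
from max.pipelines import TextContext

prompt_ids = np.array([1, 15043, 29892], dtype=np.int64)  # example token IDs

# make_token_buffer is a hypothetical helper standing in for however your
# pipeline wraps prompt IDs in a TokenBuffer; its constructor is not shown here.
ctx = TextContext(max_length=256, tokens=make_token_buffer(prompt_ids))

while not ctx.is_done:                   # stops on EOS or max_length
    next_token = model_step(ctx)         # hypothetical model call
    ctx.update(next_token)               # advance token buffer and FSM together
    output = ctx.to_generation_output()  # drain tokens ready for the user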

advance_fsm()

advance_fsm(token)

Advance the FSM matcher state by one token.

This method advances only the FSM state for constrained decoding. It does NOT modify the token buffer. Use advance_token_buffer() separately if token buffer advancement is needed, or use update() for the common case of advancing both together.

Parameters:

token (int) – The token to consume in the FSM.

Returns:

True if the token was accepted by the matcher, False if no matcher is present.

Raises:

AssertionError – If the matcher rejects the token, indicating a mismatch between the bitmask and FSM state.

Return type:

bool

advance_token_buffer()

advance_token_buffer(new_token, log_probabilities=None, mark_previous_as_processed=True)

Advance the token buffer without touching FSM state.

This method handles token buffer mutations including:

  • Chunked prefill advancement
  • Log probability storage
  • Token buffer advancement
  • EOS/max-length status updates

It does NOT advance the FSM matcher. Use advance_fsm() separately if FSM advancement is needed, or use update() for the common case of advancing both together.

Parameters:

  • new_token (int) – The token to append to the buffer.
  • log_probabilities (LogProbabilities | None) – Optional log probabilities for this token.
  • mark_previous_as_processed (bool) – If True, mark previous tokens as processed (standard behavior). If False, keep them unprocessed so they’re returned to the user (used for jump-ahead tokens).

Return type:

None
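
For multi-step or constrained decoding, the two halves can be driven separately. A minimal sketch, assuming ctx already carries a matcher, and where all_step_logits, compute_bitmask, and sample_with_bitmask are hypothetical stand-ins for your masking and sampling code:

for step_logits in all_step_logits:               # one logits row per step
    bitmask = compute_bitmask(ctx.matcher)        # hypothetical: mask illegal tokens
    token = sample_with_bitmask(step_logits, bitmask)

    # Phase 1: record the token (chunked prefill, log probs, EOS checks)...
    ctx.advance_token_buffer(token)
    # Phase 2: ...then advance the grammar FSM with the same token.
    accepted = ctx.advance_fsm(token)             # False only if no matcher is set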

apply_processing_offset()

apply_processing_offset(offset)

Applies a processing offset to the token buffer.

Parameters:

offset (int)

Return type:

None

cached_prefix_length

cached_prefix_length: int | None = None

How many prompt tokens were served from the KV prefix cache.

Set by the block manager when a request is admitted to a CE batch (0 if the cache had no matching prefix). BatchMetrics.create consumes the value to emit a per-request cache hit rate observation, then resets it to None so that follow-up chunked-prefill calls do not re-emit it.
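
The consume-then-reset protocol might look like this sketch (the BatchMetrics internals are paraphrased, not quoted; emit is a hypothetical metrics callback):

def record_prefix_cache_hit_rate(ctx, prompt_len, emit):
    # One-shot read: the block manager wrote the value on admission;
    # consume it once, then clear it so chunked-prefill follow-ups skip it.
    if ctx.cached_prefix_length is not None and prompt_len > 0:
        emit(ctx.cached_prefix_length / prompt_len)  # per-request hit rate
        ctx.cached_prefix_length = None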

compute_num_available_steps()

compute_num_available_steps(max_seq_len)

Computes the maximum number of steps without exceeding max_seq_len.

Takes the current context length into account.

Parameters:

max_seq_len (int)

Return type:

int
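
A typical use is clamping a scheduler's preferred step count, as in this sketch:

preferred_steps = 8  # e.g. the scheduler's multi-step batch size (assumed)
steps = min(preferred_steps, ctx.compute_num_available_steps(max_seq_len=4096))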

eos_tracker

eos_tracker: EOSTracker

external_block_metadata

external_block_metadata: Any = None

Block metadata from the Orchestrator for distributed KV cache (dKV).

When set, the DKVConnector reads this during lookup() to determine which blocks are available in the external BlockStore system.

get_min_token_logit_mask()

get_min_token_logit_mask(num_steps)

Returns per-step masks for logits that should be masked (e.g. EOS during min_tokens).

This is primarily used for the min_tokens setting, where we mask EOS tokens in the logits to avoid generating them before we reach min_tokens.

Returns:

A list of arrays, one per step; each array has shape (N, 2) with (batch index, token ID) pairs for logits to mask.

Parameters:

num_steps (int)

Return type:

list[ndarray[tuple[Any, …], dtype[int32]]]
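
A sketch of applying the returned masks to per-step logits, assuming NumPy arrays of shape (batch, vocab); all_step_logits is a hypothetical stand-in:

import numpy as np

masks = ctx.get_min_token_logit_mask(num_steps=4)
for step_logits, mask in zip(all_step_logits, masks):
    if mask.size:
        # Columns are (batch index, token ID); suppress those logits so
        # e.g. EOS cannot be sampled before min_tokens is reached.
        step_logits[mask[:, 0], mask[:, 1]] = np.finfo(step_logits.dtype).min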

ignore_eos

ignore_eos: bool = False

is_done

property is_done: bool

Whether text generation has finished.

is_initial_prompt

property is_initial_prompt: bool

Returns True if the context has not been updated with tokens.

json_schema

json_schema: str | None = None

jump_ahead()

jump_ahead(new_token)

Advance both token buffer and FSM, keeping token visible to user.

Unlike update(), this method does not mark previous tokens as processed, so the new token will be included in the output returned to the user. This is used for grammar-forced tokens that the model didn’t generate but need to be part of the response.

Parameters:

new_token (int) – The forced token to append and consume.

Return type:

None
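
A sketch of feeding grammar-forced tokens; forced_tokens is a hypothetical helper yielding whatever tokens the grammar requires next (llguidance exposes similar fast-forward functionality, but the exact call is not part of this page):

# Append tokens the grammar forces (e.g. closing braces in JSON mode) so
# they reach the user even though the model never sampled them.
for forced in forced_tokens(ctx.matcher):  # hypothetical helper
    ctx.jump_ahead(forced)  # token stays visible in to_generation_output()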

log_probabilities

log_probabilities: int = 0

log_probabilities_echo

log_probabilities_echo: bool = False

matcher

property matcher: LLMatcher | None

The optional grammar matcher for constrained decoding.

max_length

max_length: int

min_tokens

property min_tokens: int

The minimum number of new tokens to generate.

model_name

model_name: str = ''

realize_future_token()

realize_future_token(new_token, log_probabilities=None)

Overwrite the placeholder future token with the actual token.

This is primarily used for overlap scheduling.

Parameters:

  • new_token (int) – The actual generated token that replaces the placeholder.
  • log_probabilities (LogProbabilities | None) – Optional log probabilities for this token.

Return type:

None

request_id

request_id: RequestID

reset()

reset()

Resets the context’s state by combining all tokens into a new prompt.

Return type:

None
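
One possible use is multi-turn reuse of a context, sketched below with model_step as a hypothetical model call:

while not ctx.is_done:
    ctx.update(model_step(ctx))  # generate the current turn
ctx.reset()  # prompt + completion become the new prompt for the next encode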

sampling_params

sampling_params: SamplingParams

set_matcher()

set_matcher(matcher)

Sets the grammar matcher for constrained decoding.

Parameters:

matcher (LLMatcher)

Return type:

None

spec_decoding_state

property spec_decoding_state: SpecDecodingState

Gets or creates the per-request speculative decoding state.

status

status: GenerationStatus = 'active'

target_endpoint

target_endpoint: str | None = None

to_generation_output()

to_generation_output()

Get completion tokens that are ready to be returned to the user.

This method retrieves tokens that have been generated but not yet delivered to the user, along with their associated log probability data.

Returns:

The completion tokens and their associated log probabilities, if available.

Return type:

TextGenerationOutput
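
A sketch of a streaming loop that drains ready tokens after every update; model_step, detokenize, and stream are hypothetical stand-ins, and the .tokens field on the output is an assumption:

while not ctx.is_done:
    ctx.update(model_step(ctx))
    output = ctx.to_generation_output()
    for token_id in output.tokens:   # assumption: output exposes token IDs
        stream(detokenize(token_id))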

tokens

tokens: TokenBuffer

update()

update(new_token, log_probabilities=None)

Advance both token buffer and FSM state.

This is the standard single-step update that most callers should use. It combines advance_token_buffer() and advance_fsm() for the common case where both need to be advanced together.

For multi-step execution where FSM is advanced separately (e.g., to compute bitmasks between steps), use the individual methods directly.

Parameters:

  • new_token (int) – The token to append and consume.
  • log_probabilities (LogProbabilities | None) – Optional log probabilities for this token.

Return type:

None

update_with_future_token()

update_with_future_token()

Append a placeholder future token to the generated tokens.

This is primarily used for overlap scheduling. For structured output contexts (those with a matcher), only the token buffer is advanced. The FSM will be advanced later when the future token is realized with the actual generated token.

Return type:

None
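
Together with realize_future_token(), this supports an overlap-scheduling sketch like the following, where launch_next_step is a hypothetical asynchronous model launch:

ctx.update_with_future_token()   # placeholder appended; FSM not advanced yet
future = launch_next_step(ctx)   # hypothetical: schedule the next step now
token, logprobs = future.result()
# Overwrite the placeholder; for matcher contexts the FSM is advanced at
# realization, per the note above.
ctx.realize_future_token(token, logprobs)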