Python class

TextContext

class max.pipelines.TextContext(*, max_length, tokens, request_id=<factory>, eos_tracker=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, _spec_decoding_state=None, target_endpoint=None, external_block_metadata=None, cached_prefix_length=None)

Bases: object

A base class for model context, specifically for Text model variants.

This class manages the state and processing of text generation, including token management, caching, and generation parameters.

Parameters:

  • max_length (int) – Maximum allowed length of the generated sequence
  • tokens (TokenBuffer) – NumPy array containing the token IDs
  • request_id (RequestID) – A unique identifier for this sequence.
  • eos_tracker (EOSTracker) – Holds the EOS configuration and performs end-of-sequence checks
  • log_probabilities (int) – Number of top token log probabilities to return per generated token (0 disables them)
  • log_probabilities_echo (bool) – Whether to return log probabilities for prompt tokens
  • ignore_eos (bool) – Whether to ignore end of sequence tokens and continue generating
  • json_schema (str | None) – Optional JSON schema for structured output
  • sampling_params (SamplingParams) – Parameters controlling the token sampling strategy
  • model_name (str)
  • _matcher (Any | None) – Optional grammar matcher for constrained decoding
  • status (GenerationStatus) – Current generation status of the sequence
  • _log_probabilities_data (dict[int, LogProbabilities]) – Token log probabilities data
  • _is_initial_prompt (bool) – Whether this is the initial prompt encoding
  • _draft_offset (int) – Offset for draft decoding
  • _spec_decoding_state (SpecDecodingState | None) – Optional per-request speculative decoding state
  • target_endpoint (str | None) – Optional target endpoint identifier for routing requests
  • external_block_metadata (Any) – Block metadata from the Orchestrator for distributed KV cache (dKV)
  • cached_prefix_length (int | None) – Number of prompt tokens served from the KV prefix cache
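
The sketch below shows a typical lifecycle. TokenBuffer construction and the model call are not documented on this page, so make_token_buffer and model_step are hypothetical stand-ins:

import numpy as np
from max.pipelines import TextContext

prompt_ids = np.array([1, 15043, 29892], dtype=np.int64)  # example token IDs

# make_token_buffer is a hypothetical helper standing in for however your
# pipeline wraps prompt IDs in a TokenBuffer; its constructor is not shown here.
ctx = TextContext(max_length=256, tokens=make_token_buffer(prompt_ids))

while not ctx.is_done:                   # stops on EOS or max_length
    next_token = model_step(ctx)         # hypothetical model call
    ctx.update(next_token)               # advance token buffer and FSM together
    output = ctx.to_generation_output()  # drain tokens ready for the user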

advance_fsm()

advance_fsm(token)

Advance the FSM matcher state by one token.

This method advances only the FSM state for constrained decoding. It does NOT modify the token buffer. Use advance_token_buffer() separately if token buffer advancement is needed, or use update() for the common case of advancing both together.

Parameters:

token (int) – The token to consume in the FSM.

Returns:

True if the token was accepted by the matcher, False if no matcher is present.

Raises:

AssertionError – If the matcher rejects the token, indicating a mismatch between the bitmask and FSM state.

Return type:

bool

advance_token_buffer()

advance_token_buffer(new_token, log_probabilities=None, mark_previous_as_processed=True)

Advance the token buffer without touching FSM state.

This method handles token buffer mutations including:

  • Chunked prefill advancement
  • Log probability storage
  • Token buffer advancement
  • EOS/max-length status updates

It does NOT advance the FSM matcher. Use advance_fsm() separately if FSM advancement is needed, or use update() for the common case of advancing both together.

Parameters:

  • new_token (int) – The token to append to the buffer.
  • log_probabilities (LogProbabilities | None) – Optional log probabilities for this token.
  • mark_previous_as_processed (bool) – If True, mark previous tokens as processed (standard behavior). If False, keep them unprocessed so they’re returned to the user (used for jump-ahead tokens).

Return type:

None
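
For multi-step or constrained decoding, the two halves can be driven separately. A minimal sketch, assuming ctx already carries a matcher, and where all_step_logits, compute_bitmask, and sample_with_bitmask are hypothetical stand-ins for your masking and sampling code:

for step_logits in all_step_logits:               # one logits row per step
    bitmask = compute_bitmask(ctx.matcher)        # hypothetical: mask illegal tokens
    token = sample_with_bitmask(step_logits, bitmask)

    # Phase 1: record the token (chunked prefill, log probs, EOS checks)...
    ctx.advance_token_buffer(token)
    # Phase 2: ...then advance the grammar FSM with the same token.
    accepted = ctx.advance_fsm(token)             # False only if no matcher is set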

apply_processing_offset()

apply_processing_offset(offset)

Applies a processing offset to the token buffer.

Parameters:

offset (int)

Return type:

None

cached_prefix_length

cached_prefix_length: int | None = None

How many prompt tokens were served from the KV prefix cache.

Set by the block manager when a request is admitted to a CE batch (0 if the cache had no matching prefix). BatchMetrics.create consumes the value to emit a per-request cache hit rate observation, then resets it to None so that follow-up chunked-prefill calls do not re-emit it.
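
The consume-then-reset protocol might look like this sketch (the BatchMetrics internals are paraphrased, not quoted; emit is a hypothetical metrics callback):

def record_prefix_cache_hit_rate(ctx, prompt_len, emit):
    # One-shot read: the block manager wrote the value on admission;
    # consume it once, then clear it so chunked-prefill follow-ups skip it.
    if ctx.cached_prefix_length is not None and prompt_len > 0:
        emit(ctx.cached_prefix_length / prompt_len)  # per-request hit rate
        ctx.cached_prefix_length = None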

compute_num_available_steps()

compute_num_available_steps(max_seq_len)

Computes the maximum number of steps without exceeding max_seq_len.

Takes the current context length into account.

Parameters:

max_seq_len (int)

Return type:

int
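
A typical use is clamping a scheduler's preferred step count, as in this sketch:

preferred_steps = 8  # e.g. the scheduler's multi-step batch size (assumed)
steps = min(preferred_steps, ctx.compute_num_available_steps(max_seq_len=4096))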

eos_tracker

eos_tracker: EOSTracker

external_block_metadata

external_block_metadata: Any = None

Block metadata from the Orchestrator for distributed KV cache (dKV).

When set, the DKVConnector reads this during lookup() to determine which blocks are available in the external BlockStore system.

get_min_token_logit_mask()

get_min_token_logit_mask(num_steps)

Returns per-step masks for logits that should be masked (e.g. EOS during min_tokens).

This is primarily used for the min_tokens setting, where we mask EOS tokens in the logits to avoid generating them before we reach min_tokens.

Returns:

A list of arrays, one per step; each array has shape (N, 2) with (batch index, token ID) pairs for logits to mask.

Parameters:

num_steps (int)

Return type:

list[ndarray[tuple[Any, …], dtype[int32]]]
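
A sketch of applying the returned masks to per-step logits, assuming NumPy arrays of shape (batch, vocab); all_step_logits is a hypothetical stand-in:

import numpy as np

masks = ctx.get_min_token_logit_mask(num_steps=4)
for step_logits, mask in zip(all_step_logits, masks):
    if mask.size:
        # Columns are (batch index, token ID); suppress those logits so
        # e.g. EOS cannot be sampled before min_tokens is reached.
        step_logits[mask[:, 0], mask[:, 1]] = np.finfo(step_logits.dtype).min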

ignore_eos

ignore_eos: bool = False

is_done

property is_done: bool

Whether text generation has finished.

is_initial_prompt

property is_initial_prompt: bool

Returns True if the context has not been updated with tokens.

json_schema

json_schema: str | None = None

jump_ahead()

jump_ahead(new_token)

Advance both token buffer and FSM, keeping token visible to user.

Unlike update(), this method does not mark previous tokens as processed, so the new token will be included in the output returned to the user. This is used for grammar-forced tokens that the model didn’t generate but need to be part of the response.

Parameters:

new_token (int) – The forced token to append and consume.

Return type:

None
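
A sketch of feeding grammar-forced tokens; forced_tokens is a hypothetical helper yielding whatever tokens the grammar requires next (llguidance exposes similar fast-forward functionality, but the exact call is not part of this page):

# Append tokens the grammar forces (e.g. closing braces in JSON mode) so
# they reach the user even though the model never sampled them.
for forced in forced_tokens(ctx.matcher):  # hypothetical helper
    ctx.jump_ahead(forced)  # token stays visible in to_generation_output()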

log_probabilities

log_probabilities: int = 0

log_probabilities_echo

log_probabilities_echo: bool = False

matcher

property matcher: LLMatcher | None

The optional grammar matcher for constrained decoding.

max_length

max_length: int

min_tokens

property min_tokens: int

The minimum number of new tokens to generate.

model_name

model_name: str = ''

realize_future_token()

realize_future_token(new_token, log_probabilities=None)

Overwrite the placeholder future token with the actual token.

This is primarily used for overlap scheduling.

Parameters:

  • new_token (int) – The actual generated token that replaces the placeholder.
  • log_probabilities (LogProbabilities | None) – Optional log probabilities for this token.

Return type:

None

request_id

request_id: RequestID

reset()

reset()

Resets the context’s state by combining all tokens into a new prompt.

Return type:

None
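
One possible use is multi-turn reuse of a context, sketched below with model_step as a hypothetical model call:

while not ctx.is_done:
    ctx.update(model_step(ctx))  # generate the current turn
ctx.reset()  # prompt + completion become the new prompt for the next encode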

sampling_params

sampling_params: SamplingParams

set_matcher()

set_matcher(matcher)

Sets the grammar matcher for constrained decoding.

Parameters:

matcher (LLMatcher)

Return type:

None

spec_decoding_state

property spec_decoding_state: SpecDecodingState

Gets or creates the per-request speculative decoding state.

status

status: GenerationStatus = 'active'

target_endpoint

target_endpoint: str | None = None

to_generation_output()

to_generation_output()

Get completion tokens that are ready to be returned to the user.

This method retrieves tokens that have been generated but not yet delivered to the user, along with their associated log probability data.

Returns:

The completion tokens and their associated log probabilities, if available.

Return type:

TextGenerationOutput
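
A sketch of a streaming loop that drains ready tokens after every update; model_step, detokenize, and stream are hypothetical stand-ins, and the .tokens field on the output is an assumption:

while not ctx.is_done:
    ctx.update(model_step(ctx))
    output = ctx.to_generation_output()
    for token_id in output.tokens:   # assumption: output exposes token IDs
        stream(detokenize(token_id))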

tokens

tokens: TokenBuffer

update()

update(new_token, log_probabilities=None)

Advance both token buffer and FSM state.

This is the standard single-step update that most callers should use. It combines advance_token_buffer() and advance_fsm() for the common case where both need to be advanced together.

For multi-step execution where FSM is advanced separately (e.g., to compute bitmasks between steps), use the individual methods directly.

Parameters:

  • new_token (int) – The token to append and consume.
  • log_probabilities (LogProbabilities | None) – Optional log probabilities for this token.

Return type:

None

update_with_future_token()

update_with_future_token()

Append a placeholder future token to the generated tokens.

This is primarily used for overlap scheduling. For structured output contexts (those with a matcher), only the token buffer is advanced. The FSM will be advanced later when the future token is realized with the actual generated token.

Return type:

None
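
Together with realize_future_token(), this supports an overlap-scheduling sketch like the following, where launch_next_step is a hypothetical asynchronous model launch:

ctx.update_with_future_token()   # placeholder appended; FSM not advanced yet
future = launch_next_step(ctx)   # hypothetical: schedule the next step now
token, logprobs = future.result()
# Overwrite the placeholder; for matcher contexts the FSM is advanced at
# realization, per the note above.
ctx.realize_future_token(token, logprobs)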