TextContext
class max.pipelines.TextContext(*, max_length, tokens, request_id=<factory>, eos_tracker=<factory>, log_probabilities=0, log_probabilities_echo=False, ignore_eos=False, json_schema=None, sampling_params=<factory>, model_name='', _matcher=None, status=GenerationStatus.ACTIVE, _log_probabilities_data=<factory>, _is_initial_prompt=True, _draft_offset=0, _spec_decoding_state=None, target_endpoint=None, external_block_metadata=None, cached_prefix_length=None)
Bases: object
A base class for model context, specifically for text model variants.
This class manages the state and processing of text generation, including token management, caching, and generation parameters.
Parameters:
- max_length (int) – Maximum allowed length of the generated sequence.
- tokens (TokenBuffer) – Buffer holding the prompt and generated token IDs (backed by a NumPy array).
- request_id (RequestID) – A unique identifier for this sequence.
- eos_tracker (EOSTracker) – Holds EOS configuration and performs checks for EOS conditions.
- log_probabilities (int) – Number of top token log probabilities to return; 0 disables log probabilities.
- log_probabilities_echo (bool) – Whether to return log probabilities for prompt tokens.
- ignore_eos (bool) – Whether to ignore end-of-sequence tokens and continue generating.
- json_schema (str | None) – Optional JSON schema for structured output.
- sampling_params (SamplingParams) – Parameters controlling the token sampling strategy.
- model_name (str) – Name of the model handling this request.
- _matcher (Any | None) – Optional grammar matcher for constrained decoding.
- status (GenerationStatus) – Current generation status.
- _log_probabilities_data (dict[int, LogProbabilities]) – Token log probabilities data.
- _is_initial_prompt (bool) – Whether this is the initial prompt encoding.
- _draft_offset (int) – Offset for draft decoding.
- _spec_decoding_state (SpecDecodingState | None) – Optional per-request speculative decoding state.
- target_endpoint (str | None) – Optional target endpoint identifier for routing requests.
- external_block_metadata (Any) – Block metadata from the Orchestrator for distributed KV cache (dKV).
- cached_prefix_length (int | None) – Number of prompt tokens served from the KV prefix cache.
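A minimal construction sketch follows. The import locations of TokenBuffer and SamplingParams, and the idea that TokenBuffer wraps a NumPy array of token IDs, are assumptions here, not confirmed API:

import numpy as np

from max.pipelines import TextContext
# Assumed import path; TokenBuffer and SamplingParams may live elsewhere.
from max.pipelines import SamplingParams, TokenBuffer

prompt_ids = np.array([1, 15043, 2787], dtype=np.int64)  # pre-tokenized prompt
ctx = TextContext(
    max_length=512,                    # cap on total sequence length
    tokens=TokenBuffer(prompt_ids),    # assumption: wraps the token ID array
    sampling_params=SamplingParams(),  # assumption: defaults are acceptable
)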
advance_fsm()
advance_fsm(token)
Advance the FSM matcher state by one token.
This method advances only the FSM state for constrained decoding.
It does NOT modify the token buffer. Use advance_token_buffer()
separately if token buffer advancement is needed, or use update()
for the common case of advancing both together.
Parameters:
token (int) – The token to consume in the FSM.
Returns:
True if the token was accepted by the matcher, False if no matcher is present.
Raises:
AssertionError – If the matcher rejects the token, indicating a mismatch between the bitmask and FSM state.
Return type:
bool
advance_token_buffer()
advance_token_buffer(new_token, log_probabilities=None, mark_previous_as_processed=True)
Advance the token buffer without touching FSM state.
This method handles token buffer mutations including:
- Chunked prefill advancement
- Log probability storage
- Token buffer advancement
- EOS/max-length status updates
It does NOT advance the FSM matcher. Use advance_fsm() separately
if FSM advancement is needed, or use update() for the common case
of advancing both together.
Parameters:
- new_token (int) – The token to append to the buffer.
- log_probabilities (LogProbabilities | None) – Optional log probabilities for this token.
- mark_previous_as_processed (bool) – If True, mark previous tokens as processed (standard behavior). If False, keep them unprocessed so they’re returned to the user (used for jump-ahead tokens).
Return type:
None
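A sketch of the split pattern for multi-step execution, where the FSM is advanced between steps so a fresh bitmask can be computed. sample_next_token and compute_bitmask_from_matcher are hypothetical helpers, not part of this API:

# Sketch: advance FSM and token buffer separately so a new bitmask
# can be computed between steps.
for _ in range(num_steps):
    token = sample_next_token(logits)   # hypothetical sampler
    accepted = ctx.advance_fsm(token)   # FSM only; buffer untouched
    ctx.advance_token_buffer(token)     # buffer only; FSM untouched
    if ctx.matcher is not None:
        logits_mask = compute_bitmask_from_matcher(ctx.matcher)  # hypothetical
    if ctx.is_done:
        break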
apply_processing_offset()
apply_processing_offset(offset)
Applies a processing offset to the token buffer.
Parameters:
offset (int)
Return type:
None
cached_prefix_length
cached_prefix_length: int | None = None
How many prompt tokens were served from the KV prefix cache.
Set by the block manager when a request is admitted to a CE batch (0 if the cache had no matching prefix). BatchMetrics.create consumes the value to emit a per-request cache hit rate observation, then resets it to None so chunked-prefill follow-up calls do not re-emit.
compute_num_available_steps()
compute_num_available_steps(max_seq_len)
Computes the maximum number of steps without exceeding max_seq_len.
Takes the current context length into account.
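For example, assuming the result is max_seq_len minus the current context length, a context holding 100 tokens against a max_seq_len of 512 can take at most 412 more steps; a scheduler might clamp its step budget accordingly (the 8-step budget is illustrative):

# Clamp the scheduler's step budget to what the sequence can still hold.
available = ctx.compute_num_available_steps(max_seq_len=512)
num_steps = min(8, available)  # e.g. 100 tokens used -> 412 available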
eos_tracker
eos_tracker: EOSTracker
external_block_metadata
external_block_metadata: Any = None
Block metadata from the Orchestrator for distributed KV cache (dKV).
When set, the DKVConnector reads this during lookup() to determine which blocks are available in the external BlockStore system.
get_min_token_logit_mask()
get_min_token_logit_mask(num_steps)
Returns per-step masks identifying logits that should be suppressed (e.g. EOS during min_tokens).
This is primarily used for the min_tokens setting, where EOS tokens are masked in the logits to avoid generating them before min_tokens is reached.
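A sketch of applying the masks during sampling. The return format assumed here (one array of token IDs to suppress per step) and the step_logits array are assumptions:

# Assumption: one array of token IDs per step; suppress those logits
# so EOS cannot be sampled before min_tokens is reached.
masks = ctx.get_min_token_logit_mask(num_steps=4)
for step, token_ids in enumerate(masks):
    step_logits[step, token_ids] = float("-inf")  # step_logits: hypothetical [steps, vocab] array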
ignore_eos
ignore_eos: bool = False
is_done
property is_done: bool
Whether text generation has finished.
is_initial_prompt
property is_initial_prompt: bool
Returns True if the context has not yet been updated with tokens.
json_schema
json_schema: str | None = None
Optional JSON schema for structured output.
jump_ahead()
jump_ahead(new_token)
Advance both token buffer and FSM, keeping token visible to user.
Unlike update(), this method does not mark previous tokens as
processed, so the new token will be included in the output returned
to the user. This is used for grammar-forced tokens that the model
didn’t generate but need to be part of the response.
Parameters:
new_token (int) – The forced token to append and consume.
Return type:
None
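A sketch of the grammar-forced case: when the grammar admits only one continuation (say, a closing brace in JSON-constrained output), the token can be injected without a model step. The token value and the matcher query that would produce it are illustrative:

# Sketch: inject a token the grammar forces (e.g. a closing '}' in
# JSON-constrained output) without running the model for it.
forced_token = 92  # illustrative token ID, e.g. for '}'
ctx.jump_ahead(forced_token)  # advances buffer + FSM; token stays in the output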
log_probabilities
log_probabilities: int = 0
log_probabilities_echo
log_probabilities_echo: bool = False
matcher
property matcher: LLMatcher | None
The optional grammar matcher for constrained decoding.
max_length
max_length: int
min_tokens
property min_tokens: int
The minimum number of new tokens to generate.
model_name
model_name: str = ''
realize_future_token()
realize_future_token(new_token, log_probabilities=None)
Overwrite the placeholder future token with the actual token.
This is primarily used for overlap scheduling.
Parameters:
- new_token (int) – The actual generated token that replaces the placeholder.
- log_probabilities (LogProbabilities | None) – Optional log probabilities for this token.
Return type:
None
request_id
request_id: RequestID
reset()
reset()
Resets the context’s state by combining all tokens into a new prompt.
Return type:
None
sampling_params
sampling_params: SamplingParams
set_matcher()
set_matcher(matcher)
Sets the grammar matcher for constrained decoding.
Parameters:
matcher (LLMatcher)
Return type:
None
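A sketch of wiring up constrained decoding from the request's JSON schema. build_llmatcher stands in for whatever grammar-compilation step produces an LLMatcher; it is a hypothetical helper, not part of this class:

# Sketch: attach a grammar matcher compiled from the request's JSON schema.
# build_llmatcher is a hypothetical helper that returns an LLMatcher.
if ctx.json_schema is not None:
    matcher = build_llmatcher(tokenizer, ctx.json_schema)  # hypothetical
    ctx.set_matcher(matcher)
assert ctx.matcher is not None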
spec_decoding_state
property spec_decoding_state: SpecDecodingState
Gets or creates the per-request speculative decoding state.
status
status: GenerationStatus = 'active'
target_endpoint
target_endpoint: str | None = None
Optional target endpoint identifier for routing requests.
to_generation_output()
to_generation_output()
Get completion tokens that are ready to be returned to the user.
This method retrieves tokens that have been generated but not yet delivered to the user, along with their associated log probability data.
Returns:
The completion tokens and their associated log probabilities, if available.
Return type:
TextGenerationOutput
tokens
tokens: TokenBuffer
update()
update(new_token, log_probabilities=None)
Advance both token buffer and FSM state.
This is the standard single-step update that most callers should use.
It combines advance_token_buffer() and advance_fsm() for the
common case where both need to be advanced together.
For multi-step execution where FSM is advanced separately (e.g., to compute bitmasks between steps), use the individual methods directly.
Parameters:
- new_token (int) – The token to append and consume.
- log_probabilities (LogProbabilities | None) – Optional log probabilities for this token.
Return type:
None
None
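A sketch of the common single-step loop: sample, update, and stream whatever is ready back to the caller. sample_next_token and stream_to_user are hypothetical hooks:

# Sketch of the standard decode loop using update().
while not ctx.is_done:
    token = sample_next_token(logits)  # hypothetical sampler
    ctx.update(token)                  # advances buffer and FSM together
    out = ctx.to_generation_output()   # tokens not yet delivered to the user
    stream_to_user(out)                # hypothetical delivery hook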
update_with_future_token()
update_with_future_token()
Append a placeholder future token to the generated tokens.
This is primarily used for overlap scheduling. For structured output contexts (those with a matcher), only the token buffer is advanced. The FSM will be advanced later when the future token is realized with the actual generated token.
Return type:
None
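A sketch of the overlap-scheduling handshake: reserve a placeholder before the device result is back, then patch it once the real token arrives. The await_device_result boundary is illustrative:

# Sketch: overlap scheduling. Reserve a slot now so the next step can
# be scheduled, then fill it in when the device returns the real token.
ctx.update_with_future_token()   # placeholder appended; FSM untouched
# ... schedule the next step while the current one finishes ...
actual = await_device_result()   # hypothetical: real token from the device
ctx.realize_future_token(actual) # placeholder overwritten; FSM advanced here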