TextGenerationContext

class max.interfaces.TextGenerationContext(*args, **kwargs)

Bases: BaseContext, Protocol

Protocol defining the interface for text generation contexts used in token generation.

A TextGenerationContext represents model inputs for text generation pipelines, managing the state of tokens throughout the generation process. It handles token arrays, generation status, sampling parameters, and various indices that track different stages of token processing.

advance_fsm()

advance_fsm(token)

Advance the FSM matcher state by one token.

This method advances only the FSM state for constrained decoding. It does NOT modify the token buffer. Use advance_token_buffer() separately if token buffer advancement is needed, or use update() for the common case of advancing both together.

Parameters:

token (int) – The token to consume in the FSM.

Returns:

True if the token was accepted by the matcher, False if no matcher is present.

Return type:

bool
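
For illustration, a minimal sketch of the decoupled multi-step flow this method enables. Here ctx and step_tokens are stand-ins (an object implementing this protocol and a list of sampled token IDs), not part of this API:

    from max.interfaces import TextGenerationContext

    def advance_constrained_steps(ctx: TextGenerationContext, step_tokens: list[int]) -> None:
        # Decoupled flow: advance the FSM and the token buffer separately,
        # e.g. so a fresh logit bitmask can be computed between steps.
        for token in step_tokens:
            ctx.advance_fsm(token)  # returns False when no matcher is installed
            ctx.advance_token_buffer(token)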

advance_token_buffer()

advance_token_buffer(new_token, log_probabilities=None, mark_previous_as_processed=True)

Advance the token buffer without touching FSM state.

This method handles token buffer mutations including log probability storage, token buffer advancement, and EOS/max-length status updates. It does NOT advance the FSM matcher.

Use advance_fsm() separately if FSM advancement is needed, or use update() for the common case of advancing both together.

Parameters:

  • new_token (int) – The token to append to the buffer.
  • log_probabilities (LogProbabilities | None) – Optional log probabilities for this token.
  • mark_previous_as_processed (bool) – If True, mark previous tokens as processed. If False, keep them unprocessed so they’re returned to the user (used for jump-ahead tokens).

Return type:

None
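
A hedged sketch of buffer-only advancement for a jump-ahead token that should still be returned to the user (ctx and token are stand-ins):

    from max.interfaces import TextGenerationContext

    def append_jump_ahead_token(ctx: TextGenerationContext, token: int) -> None:
        # Advance the buffer only; earlier tokens stay unprocessed so they
        # are still returned to the user together with this one.
        ctx.advance_token_buffer(
            token,
            log_probabilities=None,
            mark_previous_as_processed=False,
        )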

cached_prefix_length

cached_prefix_length: int | None

The number of prompt tokens served from the KV prefix cache when the request is first admitted.

Set by the block manager when a request is admitted to a CE (context encoding) batch; 0 if the cache had no matching prefix. BatchMetrics.create consumes the value to emit a per-request cache hit rate observation, then resets it to None so that chunked-prefill follow-up calls do not re-emit it.
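
As a rough sketch of that consume-and-reset pattern (reading the field outside BatchMetrics.create is for illustration only, and the len() call on the token buffer is an assumption):

    from max.interfaces import TextGenerationContext

    def consume_prefix_cache_metric(ctx: TextGenerationContext) -> float | None:
        if ctx.cached_prefix_length is None:
            return None  # already consumed by an earlier call
        prompt_len = len(ctx.tokens)  # assumption: TokenBuffer supports len()
        hit_rate = ctx.cached_prefix_length / max(prompt_len, 1)
        ctx.cached_prefix_length = None  # avoid re-emitting on chunked prefill
        return hit_rate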

compute_num_available_steps()

compute_num_available_steps(max_seq_len)

Computes the maximum number of generation steps available.

This method calculates how many tokens can be generated without exceeding the specified maximum sequence length limit.

Parameters:

max_seq_len (int) – The maximum allowed sequence length for this context.

Returns:

The number of generation steps that can be executed before reaching the sequence length limit.

Return type:

int
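
A small sketch of capping a multi-step decode by the remaining length budget (requested_steps and max_seq_len are arbitrary stand-ins supplied by the caller):

    from max.interfaces import TextGenerationContext

    def plan_steps(ctx: TextGenerationContext, requested_steps: int, max_seq_len: int) -> int:
        # Never schedule more decode steps than the length budget allows.
        budget = ctx.compute_num_available_steps(max_seq_len)
        return min(requested_steps, budget)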

eos_tracker

property eos_tracker: EOSTracker

Holds EOS-related settings for this sequence and performs EOS/stop checks.

Returns:

The EOSTracker for this sequence.

get_min_token_logit_mask()

get_min_token_logit_mask(num_steps)

Returns the token indices that should be masked in the output logits.

This method is primarily used to implement the min_tokens constraint, where certain tokens (typically EOS tokens) are masked to prevent early termination before the minimum token count is reached.

Parameters:

num_steps (int) – The number of generation steps to compute masks for.

Returns:

A list of NumPy arrays, where each array contains token indices that should be masked (set to negative infinity) in the logits for the corresponding generation step.

Return type:

list[ndarray[tuple[Any, …], dtype[int32]]]
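
For instance, a hedged sketch of applying the per-step masks to a logits array; the (num_steps, vocab_size) layout of logits is an assumption:

    import numpy as np

    from max.interfaces import TextGenerationContext

    def apply_min_token_masks(ctx: TextGenerationContext, logits: np.ndarray) -> None:
        # One mask per step; each array holds token IDs (typically EOS) to
        # suppress until min_tokens new tokens have been produced.
        masks = ctx.get_min_token_logit_mask(logits.shape[0])
        for step, masked_ids in enumerate(masks):
            logits[step, masked_ids] = -np.inf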

is_initial_prompt

property is_initial_prompt: bool

Whether this context contains only the initial prompt.

This property indicates if the context has not yet been updated with any generated tokens and still contains only the original input.

Returns:

True if no tokens have been generated yet, False if generation has begun and tokens have been added.

json_schema

property json_schema: str | None

The JSON schema for constrained decoding, if configured.

When set, this schema constrains token generation to produce valid JSON output that conforms to the specified structure.

Returns:

The JSON schema string, or None if no schema constraint is active.

jump_ahead()

jump_ahead(new_token)

Jump ahead in generation by adding a token and updating indices.

This method is used in speculative decoding scenarios to quickly advance the generation state when draft tokens are accepted.

Parameters:

new_token (int) – The token ID to add when jumping ahead in the sequence.

Return type:

None
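
A brief sketch of folding verified draft tokens into the context during speculative decoding (accepted is a stand-in for the verifier's output):

    from max.interfaces import TextGenerationContext

    def accept_draft_tokens(ctx: TextGenerationContext, accepted: list[int]) -> None:
        # Fold verified draft tokens into the context without running a
        # full decode step for each position.
        for token in accepted:
            ctx.jump_ahead(token)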

log_probabilities

property log_probabilities: int

The number of top tokens to return log probabilities for.

When greater than 0, the system returns log probabilities for the top N most likely tokens at each generation step.

Returns:

The number of top tokens to include in log probability output. Returns 0 if log probabilities are disabled.

log_probabilities_echo

property log_probabilities_echo: bool

Whether to include input tokens in the returned log probabilities.

When True, log probabilities will be computed and returned for input (prompt) tokens in addition to generated tokens.

Returns:

True if input tokens should be included in log probability output, False otherwise.
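
Together with log_probabilities above, this property tells a serving layer what to compute. A minimal sketch of reading both settings:

    from max.interfaces import TextGenerationContext

    def logprob_settings(ctx: TextGenerationContext) -> tuple[int, bool]:
        # A top-N of 0 disables logprob output entirely; the echo flag adds
        # prompt tokens to whatever is returned.
        return ctx.log_probabilities, ctx.log_probabilities_echo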

matcher

property matcher: Any | None

The grammar matcher for structured output generation, if configured.

The matcher enforces structural constraints (like JSON schema) during generation to ensure valid formatted output.

Returns:

The grammar matcher instance, or None if no structured generation is configured for this context.

max_length

property max_length: int | None

The maximum allowed length for this sequence.

When set, generation will stop when this length is reached, regardless of other stopping criteria.

Returns:

The maximum sequence length limit, or None if no limit is set.

min_tokens

property min_tokens: int

The minimum number of new tokens that must be generated.

Generation will continue until at least this many new tokens have been produced, even if other stopping criteria are met (for example, EOS tokens).

Returns:

The minimum number of new tokens to generate.

realize_future_token()

realize_future_token(new_token, log_probabilities=None)

Overwrite the placeholder future token with the actual token.

This is primarily used for overlap scheduling.

Parameters:

  • new_token (int) – The actual token that replaces the placeholder future token.
  • log_probabilities (LogProbabilities | None) – Optional log probabilities for this token.

Return type:

None

reset()

reset()

Resets the context’s state by combining all tokens into a new prompt.

This method is used when a request is evicted, meaning the context needs to be re-encoded in the following CE iteration.

Return type:

None

sampling_params

property sampling_params: SamplingParams

The sampling parameters configured for this generation request.

These parameters control how tokens are selected during generation, including temperature, top-k/top-p filtering, and stopping criteria.

Returns:

The SamplingParams instance containing all sampling configuration for this context.

set_matcher()

set_matcher(matcher)

Set a grammar matcher for constrained decoding.

This method configures structured output generation by installing a grammar matcher that enforces format constraints during token generation.

Parameters:

matcher (Any) – The grammar matcher instance to use for constraining output. The specific type depends on the structured generation backend.

Return type:

None
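
A hedged sketch of wiring a matcher from the context's JSON schema; build_matcher stands in for whatever the structured-generation backend provides and is not part of this API:

    from collections.abc import Callable
    from typing import Any

    from max.interfaces import TextGenerationContext

    def install_matcher(ctx: TextGenerationContext, build_matcher: Callable[[str], Any]) -> None:
        # Configure constrained decoding only when a JSON schema was requested.
        if ctx.json_schema is not None:
            ctx.set_matcher(build_matcher(ctx.json_schema))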

spec_decoding_state

property spec_decoding_state: SpecDecodingState

The speculative decoding state for this context.

to_generation_output()

to_generation_output()

Converts this context to a TextGenerationOutput object.

Provides a standardized way to extract the final output of the text generation process from the context, including generated text, tokens, and any associated metadata.

Returns:

The output object containing the results of the text generation for this context.

Return type:

TextGenerationOutput
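
A minimal sketch of pulling the final result; deciding when to call this (e.g. after the EOS tracker reports completion) is left to the caller:

    from max.interfaces import TextGenerationContext

    def finish(ctx: TextGenerationContext):
        # Standardized extraction of the generated text, tokens, and
        # metadata; returns a TextGenerationOutput.
        return ctx.to_generation_output()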

tokens

property tokens: TokenBuffer

The token buffer for the context.

update()

update(new_token, log_probabilities=None)

Advance both token buffer and FSM state.

This is the standard single-step update that most callers should use. It combines advance_token_buffer() and advance_fsm() for the common case where both need to be advanced together.

For multi-step execution where FSM is advanced separately (e.g., to compute bitmasks between steps), use the individual methods directly.

Parameters:

  • new_token (int) – The token ID to add to the generation sequence.
  • log_probabilities (LogProbabilities | None) – Optional log probability data for the new token and alternatives. Used for analysis and debugging.

Return type:

None
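
By way of illustration, a sketch of the standard single-step decode loop; sample_next_token is a hypothetical stand-in for the model/sampler call, not part of this API:

    from collections.abc import Callable

    from max.interfaces import TextGenerationContext

    def decode(ctx: TextGenerationContext, max_seq_len: int,
               sample_next_token: Callable[[TextGenerationContext], int]) -> None:
        # Standard path: update() advances the token buffer and FSM together,
        # one token per step (EOS/stop checks via eos_tracker are omitted).
        for _ in range(ctx.compute_num_available_steps(max_seq_len)):
            ctx.update(sample_next_token(ctx))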

update_with_future_token()

update_with_future_token()

Append a placeholder future token to the generated tokens.

This is primarily used for overlap scheduling.

Return type:

None
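
Taken together with realize_future_token() above, a hedged sketch of the overlap-scheduling handshake; wait_for_token is a hypothetical stand-in for retrieving the real model output:

    from collections.abc import Callable

    from max.interfaces import TextGenerationContext

    def overlapped_step(ctx: TextGenerationContext, wait_for_token: Callable[[], int]) -> None:
        # Enqueue a placeholder so the next step can be scheduled before the
        # current model result is ready...
        ctx.update_with_future_token()
        # ...then overwrite the placeholder once the real token arrives.
        ctx.realize_future_token(wait_for_token())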