TokenBuffer
class max.interfaces.TokenBuffer(array)
Bases: object
A dynamically resizable container for managing token sequences.
TokenBuffer provides efficient storage and access to token sequences
during text generation. It maintains the prompt tokens (initial input) and
generated tokens (model output) separately, while handling automatic memory
management as new tokens are added.
TokenBuffer organizes tokens across three related views:
- The full stored sequence (all), split into prompt and generated tokens.
- The processing window (active versus processed and pending tokens).
- The streaming window over newly generated tokens consumed by callers.
The first diagram shows how prompt and generated tokens share a single backing array. Later diagrams explain how processing and streaming walk over that array during generation:
+-------------------- all --------------------+
+-----------------+---------------------------+
|     prompt      |         generated         |
+-----------------+---------------------------+
0   prompt_length ^          generated_length ^
0                                   len(self) ^

This includes three attributes for accessing tokens:
- all: The slice of the array containing all valid tokens.
- prompt: The slice of the array containing the prompt tokens.
- generated: The slice of the array containing the generated tokens.
Along with three attributes for tracking their lengths:
- prompt_length: The number of tokens in the prompt.
- generated_length: The number of generated tokens.
- len(self): The total number of valid tokens in the buffer.
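To make this first view concrete, here is a minimal sketch (it assumes a freshly constructed buffer treats the entire input array as prompt, with nothing generated yet):

import numpy as np
from max.interfaces import TokenBuffer

# Hypothetical prompt token IDs; any non-empty 1D int64 array works.
prompt_ids = np.array([101, 2023, 2003, 1037, 3231], dtype=np.int64)
buf = TokenBuffer(prompt_ids)

print(buf.prompt_length)     # 5
print(buf.generated_length)  # 0 before any generation (assumed)
print(len(buf))              # 5: prompt_length + generated_length

full = buf.all               # all valid tokens (prompt + generated)
prompt = buf.prompt          # prompt tokens only
generated = buf.generated    # generated tokens only (empty here)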
Processing window (what the model will process next):
+-------------------------------- all --------------------------------+
+-------------------+---------------------------+---------------------+
|     processed     |          active           |       pending       |
+-------------------+---------------------------+---------------------+
0  processed_length ^             active_length ^
0                              current_position ^
0                                                           len(self) ^

In the above, processed tracks tokens that have already been processed,
active tracks tokens that are scheduled to be processed in the next batch,
and pending tracks tokens that have not yet been processed but are not
actively scheduled for the next batch (this commonly occurs during
chunked prefill).
This includes one attribute for accessing tokens:
active: The slice of the array containing the tokens scheduled for processing in the next batch.
Along with three additional attributes for tracking their lengths:
- processed_length: The number of tokens that have already been processed.
- active_length: The number of tokens that are currently scheduled for processing in the next batch.
- current_position: The global index marking the end of the current active processing window.
This processing view is updated by methods such as rewind_processing,
skip_processing, chunk, and advance_chunk/advance_with_token,
which control how much of the existing sequence is reprocessed or advanced
at each step.
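For example, a chunked prefill pass over an eight-token sequence might look like this sketch, which follows the method descriptions below (the exact post-conditions are assumptions):

import numpy as np
from max.interfaces import TokenBuffer

buf = TokenBuffer(np.arange(8, dtype=np.int64))

buf.chunk(3)                  # limit the next step to 3 active tokens
assert buf.actively_chunked
first = buf.active            # the 3 tokens to process in this batch
# ... run the model over `first` ...

buf.advance_chunk()           # the remaining 5 tokens become active
second = buf.active
print(buf.processed_length, buf.active_length, buf.current_position)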
It also maintains a completion window over the generated tokens for completion streaming:
+------------- generated -------------+
+------------+------------------------+
|  streamed  |  ready to stream next  |
+------------+------------------------+
|     (1)    |          (2)           |

Generated tokens are conceptually split into:
- streamed: tokens that have already been returned to the caller.
- ready to stream: the newest generated tokens that have not yet been returned.
Each call to consume_recently_generated_tokens() returns the (2) region
and advances the boundary between (1) and (2), so subsequent calls only
see newly generated tokens.
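A sketch of that streaming pattern (it assumes advance_with_token appends each new token to the generated region):

import numpy as np
from max.interfaces import TokenBuffer

buf = TokenBuffer(np.array([1, 2, 3], dtype=np.int64))
buf.advance_with_token(42)    # stand-ins for model output
buf.advance_with_token(43)

if buf.has_outstanding_generated_tokens:
    new_tokens = buf.consume_recently_generated_tokens()  # region (2): [42, 43]
# Calling consume again before more tokens arrive raises ValueError.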
Together, these three views let TokenBuffer support efficient prompt
handling, chunked processing, and incremental streaming while exposing a small,
consistent public API.
Initialize a TokenBuffer with the given token array.
Parameters:
- array (ndarray[tuple[Any, ...], dtype[int64]]) – A 1D numpy array of int64 token IDs. Must be non-empty.

Raises:
- ValueError – If the array is not 1-dimensional, not int64 dtype, or empty.
active
Return the tokens queued for the next processing step.
active_length
property active_length: int
Count of tokens currently scheduled for processing.
actively_chunked
property actively_chunked: bool
Check if the buffer has active chunk limits applied.
Returns:
- True if chunk limits are active, False otherwise.
advance_chunk()
advance_chunk()
Move to the next set of tokens after a limited chunk.
Call this after chunk when you have finished working with the current active tokens and want the remaining tokens in the sequence to become active.

Raises:
- ValueError – If called before chunk has limited the active tokens (that is, when no chunk is currently active).

Return type:
- None
advance_with_token()
advance_with_token(token, mark_previous_as_processed=True)
Add a new token to the buffer.
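A sketch of a decode loop built on this method; sample_next is a hypothetical stand-in for real model sampling, not part of this API:

import numpy as np
from max.interfaces import TokenBuffer

def sample_next(active_tokens):
    # Placeholder: a real implementation would run the model and sample.
    return 7

buf = TokenBuffer(np.array([1, 2, 3], dtype=np.int64))
for _ in range(4):
    token = sample_next(buf.active)
    buf.advance_with_token(token)  # append; by default the previously
                                   # active tokens are marked as processed
print(buf.generated)               # the four appended tokens (assumed)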
all
Return every valid token currently stored (prompt + generated).
Use this when downstream components need the full sequence for scoring, logging, or serialization.
apply_processing_offset()
apply_processing_offset(value)
Set the processing offset.
Parameters:
- value (int) – The new processing offset.

Return type:
- None
array
In-place storage holding the prompt plus any generated tokens.
chunk()
chunk(chunk_size)
Limit the upcoming processing step to at most chunk_size tokens.

Parameters:
- chunk_size (int) – Maximum number of tokens to process.

Raises:
- ValueError – If chunk_size is not between 1 and the current number of active tokens.

Return type:
- None
consume_recently_generated_tokens()
consume_recently_generated_tokens()
Return newly generated tokens since the last consumption.
Returns:
- A slice containing tokens ready to stream to the caller.

Raises:
- ValueError – If no new tokens are available.

Return type:
- ndarray[tuple[Any, ...], dtype[int64]]
current_position
property current_position: int
Global index marking the end of the current active processing window.
Equal to processed_length + active_length; represents the index of the next token to be processed, which may be less than the total length when processing is limited by chunking.
generated
Return all tokens produced after the prompt.
Use this slice for stop checks, repetition penalties, or any logic that should consider only newly generated content.
generated_length
property generated_length: int
Number of tokens generated after the prompt.
has_outstanding_generated_tokens
property has_outstanding_generated_tokens: bool
Indicates whether there are generated tokens that have not yet been consumed.
Returns:
- True if there are outstanding generated tokens to be streamed or processed, False otherwise.
overwrite_last_token()
overwrite_last_token(token)
Overwrite the last token in the buffer.
Parameters:
- token (int)

Return type:
- None
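One plausible use, sketched here, is correcting a provisional token in place (for example, a placeholder written before the real token is known):

import numpy as np
from max.interfaces import TokenBuffer

buf = TokenBuffer(np.array([1, 2, 3], dtype=np.int64))
buf.advance_with_token(0)      # provisional placeholder token
buf.overwrite_last_token(77)   # replace it with the real token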
processed_length
property processed_length: int
Number of tokens that have already been processed.
prompt
Return only the original prompt tokens.
Helpful for echo suppression, prompt-side metrics, or offset calculations that should exclude generated output.
prompt_length
property prompt_length: int
Number of tokens that belong to the prompt.
reset_as_new_prompt()
reset_as_new_prompt(delete_last_generated_token=False)
Treat the current sequence as a fresh prompt.
Marks all existing tokens as prompt tokens so the next generation pass starts from this state.
Parameters:
- delete_last_generated_token (bool) – If True, deletes the last generated token before resetting the buffer. This is useful when the last token is a placeholder future token.

Raises:
- ValueError – If the buffer state is invalid.

Return type:
- None
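A sketch of multi-turn reuse (the post-conditions asserted here are assumptions based on the description above):

import numpy as np
from max.interfaces import TokenBuffer

buf = TokenBuffer(np.array([1, 2, 3], dtype=np.int64))
buf.advance_with_token(42)

buf.reset_as_new_prompt()            # all 4 tokens are now prompt tokens
assert buf.generated_length == 0     # assumed post-condition
assert buf.prompt_length == len(buf)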
rewind_processing()
rewind_processing(n)
Re-expose n earlier tokens so they can be processed again.
Parameters:
- n (int) – Number of tokens to move back into the active window.

Raises:
- ValueError – If n is negative.

Return type:
- None
skip_processing()
skip_processing(n)
Advance the active window start by n tokens.
Parameters:
- n (int) – Number of tokens to drop from the active window.

Raises:
- ValueError – If n exceeds the number of available tokens to process, or if skipping n tokens would leave 0 active tokens.

Return type:
- None
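One plausible scenario combining the two window adjustments, sketched here (whether skipped tokens count toward processed_length is an assumption):

import numpy as np
from max.interfaces import TokenBuffer

buf = TokenBuffer(np.arange(8, dtype=np.int64))

buf.skip_processing(4)     # e.g., a cached prefix that needs no recompute;
                           # the active window now starts 4 tokens later
buf.rewind_processing(2)   # re-expose 2 of those tokens for reprocessing
print(buf.processed_length, buf.active_length)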