TokenBuffer

class max.interfaces.TokenBuffer(array)

Bases: object

A dynamically resizable container for managing token sequences.

TokenBuffer provides efficient storage and access to token sequences during text generation. It maintains the prompt tokens (initial input) and generated tokens (model output) separately, while handling automatic memory management as new tokens are added.

TokenBuffer organizes tokens across three related views:

  1. The full stored sequence (all), split into prompt and generated tokens.
  2. The processing window (active versus processed and pending tokens).
  3. The streaming window over newly generated tokens consumed by callers.

The first diagram shows how prompt and generated tokens share a single backing array. Later diagrams explain how processing and streaming walk over that array during generation:

+-------------------- all --------------------+
+-----------------+---------------------------+
|     prompt      |        generated          |
+-----------------+---------------------------+
0   prompt_length ^          generated_length ^
0                                   len(self) ^

The buffer exposes three attributes for accessing these tokens:

  • all: The slice of the array containing all valid tokens.
  • prompt: The slice of the array containing the prompt tokens.
  • generated: The slice of the array containing the generated tokens.

Along with three attributes for tracking their lengths:

  • prompt_length: The number of tokens in the prompt.
  • generated_length: The number of generated tokens.
  • len(self): The total number of valid tokens in the buffer.
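
For example, a buffer built from a short prompt keeps all three views consistent as tokens are added. This is a minimal sketch, assuming the initial array is treated entirely as the prompt; the token IDs are made up:

import numpy as np
from max.interfaces import TokenBuffer

buf = TokenBuffer(np.array([11, 12, 13], dtype=np.int64))
print(buf.prompt_length, buf.generated_length, len(buf))  # 3 0 3 (expected)

buf.advance_with_token(99)   # append one generated token
print(list(buf.all))         # [11, 12, 13, 99] (expected)
print(list(buf.prompt))      # [11, 12, 13]
print(list(buf.generated))   # [99]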

Processing window (what the model will process next):

+-------------------------------- all  -------------------------+
+-------------------+---------------------------+---------------+
|     processed     |          active           |    pending    |
+-------------------+---------------------------+---------------+
0  processed_length ^             active_length ^
0                              current_position ^
0                                                     len(self) ^

In the above, processed tracks tokens that have already been processed, active tracks tokens that are scheduled to be processed in the next batch, and pending tracks tokens that have not yet been processed but are not scheduled for the next batch (this commonly occurs during chunked prefill).

The processing view adds one attribute for accessing tokens:

  • active: The slice of the array containing the tokens scheduled for processing in the next batch.

Along with three additional attributes for tracking their lengths:

  • processed_length: The number of tokens that have already been processed.
  • active_length: The number of tokens currently scheduled for processing in the next batch.
  • current_position: The global index marking the end of the current active processing window.

This processing view is updated by methods such as rewind_processing, skip_processing, chunk, and advance_chunk/advance_with_token, which control how much of the existing sequence is reprocessed or advanced at each step.
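
For instance, a chunked prefill loop might look like the following sketch, assuming a fresh buffer starts with the entire prompt active; run_model and sample are hypothetical stand-ins, not part of this API:

import numpy as np
from max.interfaces import TokenBuffer

def run_model(tokens):   # placeholder for a real forward pass
    return tokens

def sample(logits):      # placeholder for real sampling
    return int(logits[-1]) + 1

buf = TokenBuffer(np.arange(10, dtype=np.int64))
CHUNK = 4
while buf.active_length > CHUNK:
    buf.chunk(CHUNK)        # cap the next step at CHUNK tokens
    run_model(buf.active)   # process only the active slice
    buf.advance_chunk()     # the remaining pending tokens become active
logits = run_model(buf.active)          # final chunk fits within CHUNK
buf.advance_with_token(sample(logits))  # first generated token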

It also maintains a streaming window over the generated tokens for completion streaming:

+------------- generated -------------+
+------------+------------------------+
|  streamed  |  ready to stream next  |
+------------+------------------------+
|     (1)    |          (2)           |

Generated tokens are conceptually split into:

  1. streamed: tokens that have already been returned to the caller.
  2. ready to stream: the newest generated tokens that have not yet been returned.

Each call to consume_recently_generated_tokens() returns the (2) region and advances the boundary between (1) and (2), so subsequent calls only see newly generated tokens.
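
A short streaming sketch, where next_token is a placeholder sampler rather than part of this API; the guard avoids the ValueError that consume_recently_generated_tokens() raises when nothing new is available:

import numpy as np
from max.interfaces import TokenBuffer

def next_token():   # placeholder sampler
    return 7

buf = TokenBuffer(np.array([1, 2, 3], dtype=np.int64))
for _ in range(3):
    buf.advance_with_token(next_token())
    if buf.has_outstanding_generated_tokens:
        print(buf.consume_recently_generated_tokens())  # only tokens since the last call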

Together, these three views let TokenBuffer support efficient prompt handling, chunked processing, and incremental streaming while exposing a small, consistent public API.

Initialize a TokenBuffer with the given token array.

Parameters:

array (ndarray[tuple[Any, ...], dtype[int64]]) – A 1D numpy array of int64 token IDs. Must be non-empty.

Raises:

ValueError – If the array is not 1-dimensional, not int64 dtype, or empty.
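
These validation rules can be exercised directly; a small illustrative sketch:

import numpy as np
from max.interfaces import TokenBuffer

TokenBuffer(np.array([1, 2, 3], dtype=np.int64))   # valid: 1D, int64, non-empty

for bad in (np.array([], dtype=np.int64),          # empty
            np.zeros((2, 2), dtype=np.int64),      # not 1-dimensional
            np.array([1.5, 2.5])):                 # not int64
    try:
        TokenBuffer(bad)
    except ValueError as err:
        print("rejected:", err)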

active

property active: ndarray[tuple[Any, ...], dtype[int64]]

Return the tokens queued for the next processing step.

active_length

property active_length: int

Count of tokens currently scheduled for processing.

actively_chunked

property actively_chunked: bool

Check if the buffer has active chunk limits applied.

Returns:

True if chunk limits are active, False otherwise.

advance_chunk()

advance_chunk()

Move to the next set of tokens after a limited chunk.

Call this after chunk when you have finished working with the current active tokens and want the remaining tokens in the sequence to become active.

Raises:

ValueError – If called before chunk has limited the active tokens (that is, when no chunk is currently active).

Return type:

None

advance_with_token()

advance_with_token(token, mark_previous_as_processed=True)

Add a new token to the buffer.

Parameters:

  • token (int) – The token ID to add.
  • mark_previous_as_processed (bool) – If False, expands the set of active tokens instead of shifting forward. This is useful for speculative execution scenarios where multiple tokens may be generated.

Return type:

None
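
For example, a sketch contrasting the two modes; the token IDs are made up:

import numpy as np
from max.interfaces import TokenBuffer

buf = TokenBuffer(np.array([1, 2, 3], dtype=np.int64))

# Ordinary decoding: the previously active tokens are marked processed.
buf.advance_with_token(42)

# Speculative-style decoding: keep earlier draft tokens active so they
# can be verified together in one batch.
for draft in (7, 8, 9):
    buf.advance_with_token(draft, mark_previous_as_processed=False)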

all

property all: ndarray[tuple[Any, ...], dtype[int64]]

Return every valid token currently stored (prompt + generated).

Use this when downstream components need the full sequence for scoring, logging, or serialization.

apply_processing_offset()

apply_processing_offset(value)

Set the processing offset.

Parameters:

value (int) – The new processing offset.

Return type:

None

array

array: ndarray[tuple[Any, ...], dtype[int64]]

In-place storage holding the prompt plus any generated tokens.

chunk()

chunk(chunk_size)

Limit the upcoming processing step to at most chunk_size tokens.

Parameters:

chunk_size (int) – Maximum number of tokens to process.

Raises:

ValueError – If chunk_size is not between 1 and the current number of active tokens.

Return type:

None

consume_recently_generated_tokens()

consume_recently_generated_tokens()

Return newly generated tokens since the last consumption.

Returns:

A slice containing tokens ready to stream to the caller.

Raises:

ValueError – If no new tokens are available.

Return type:

ndarray[tuple[Any, ...], dtype[int64]]

current_position

property current_position: int

Global index marking the end of the current active processing window.

Equal to processed_length + active_length; represents the index of the next token to be processed, which may be less than the total length when processing is limited by chunking.

generated

property generated: ndarray[tuple[Any, ...], dtype[int64]]

Return all tokens produced after the prompt.

Use this slice for stop checks, repetition penalties, or any logic that should consider only newly generated content.

generated_length

property generated_length: int

Number of tokens generated after the prompt.

has_outstanding_generated_tokens

property has_outstanding_generated_tokens: bool

Indicates whether there are generated tokens that have not yet been consumed.

Returns:

True if there are outstanding generated tokens to be streamed or processed, False otherwise.

overwrite_last_token()

overwrite_last_token(token)

Overwrite the last token in the buffer.

Parameters:

token (int) – The token ID that replaces the current last token in the buffer.

Return type:

None

processed_length

property processed_length: int

Number of tokens that have already been processed.

prompt

property prompt: ndarray[tuple[Any, ...], dtype[int64]]

Return only the original prompt tokens.

Helpful for echo suppression, prompt-side metrics, or offset calculations that should exclude generated output.

prompt_length

property prompt_length: int

Number of tokens that belong to the prompt.

reset_as_new_prompt()

reset_as_new_prompt(delete_last_generated_token=False)

Treat the current sequence as a fresh prompt.

Marks all existing tokens as prompt tokens so the next generation pass starts from this state.

Parameters:

delete_last_generated_token (bool) – If True, deletes the last generated token before resetting the buffer. This is useful when the last token is a placeholder future token.

Raises:

ValueError – If the buffer state is invalid.

Return type:

None
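
For example, a multi-turn sketch in which a finished turn becomes the prompt for the next one (token IDs are made up):

import numpy as np
from max.interfaces import TokenBuffer

buf = TokenBuffer(np.array([1, 2, 3], dtype=np.int64))
buf.advance_with_token(4)   # one generated token
buf.reset_as_new_prompt()   # reclassify the whole sequence as prompt
print(buf.prompt_length, buf.generated_length)   # 4 0 (expected)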

rewind_processing()

rewind_processing(n)

Re-expose n earlier tokens so they can be processed again.

Parameters:

n (int) – Number of tokens to move back into the active window.

Raises:

ValueError – If n is negative.

Return type:

None

skip_processing()

skip_processing(n)

Advance the active window start by n tokens.

Parameters:

n (int) – Number of tokens to drop from the active window.

Raises:

ValueError – If n exceeds the number of available tokens to process, or if skipping n tokens would leave 0 active tokens.

Return type:

None
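
A small sketch of how skip_processing and rewind_processing adjust the active window, assuming a fresh buffer starts with the entire prompt active:

import numpy as np
from max.interfaces import TokenBuffer

buf = TokenBuffer(np.arange(5, dtype=np.int64))
buf.skip_processing(2)     # the first 2 tokens will not be processed
buf.rewind_processing(1)   # bring 1 of them back into the active window
print(buf.active_length)   # 4 (expected)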