Python class

FusedSamplingProcessor

class max.pipelines.sampling.FusedSamplingProcessor(sampler, pipeline_config, context_batch, device, bitmask=None, vocab_size=None, pinned_new_tokens=None, identity_logit_offsets=None)

Bases: object

Applies sampling parameters to logits and stores the chosen tokens.

Parameters:

sampler (Model)
pipeline_config (PipelineConfig)
context_batch (list[Any])
device (Device)
bitmask (npt.NDArray[np.int32] | None)
vocab_size (int | None)
pinned_new_tokens (Buffer | None)
identity_logit_offsets (Buffer | None)

`allocate_identity_logit_offsets()`

static allocate_identity_logit_offsets(pipeline_config, device, max_batch_size)

Returns a preallocated [0, 1, ..., max_batch_size] index buffer.

Used by logits_for_sampling when sampling from next_token_logits with a variable-logit sampler. Returns None when the buffer is not needed (variable-logit sampling disabled, or running in virtual-device mode).

Parameters:

pipeline_config (PipelineConfig)
device (Device)
max_batch_size (int)

Return type:

Buffer | None
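
To make the contents of the returned buffer concrete, here is a minimal NumPy sketch. It is an illustration only, not the MAX implementation: the real method returns a device Buffer (or None), and the helper name `identity_logit_offsets`, the `uint32` dtype, and the reading of offsets as per-request row boundaries are all assumptions.

```python
import numpy as np


def identity_logit_offsets(max_batch_size: int) -> np.ndarray:
    """Hypothetical helper: builds [0, 1, ..., max_batch_size]."""
    # The buffer has max_batch_size + 1 entries. Assuming entries i and
    # i + 1 bound the logits rows of request i, every request owns one row.
    return np.arange(max_batch_size + 1, dtype=np.uint32)  # dtype is assumed


offsets = identity_logit_offsets(4)
# Number of logits rows attributed to each request under that assumption.
rows_per_request = np.diff(offsets.astype(np.int64))
```

Under this reading, an identity buffer describes a batch in which each request contributes exactly one row of next-token logits, which is why it can be allocated once for the largest batch size and reused.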

`generated_tokens`

generated_tokens: Buffer

The generated tokens that have been sampled so far.

`get_new_tokens_numpy()`

get_new_tokens_numpy()

Waits for the device-to-host (D2H) copy and returns the new tokens as a NumPy array.

If an async copy was started via start_async_token_copy(), this waits for the copy event. Otherwise, it falls back to a synchronous copy.

Returns:

NumPy array of the new tokens with shape (batch_size,).

Return type:

ndarray[tuple[Any, ...], dtype[int64]]

`logits_for_sampling()`

logits_for_sampling(*, logits, next_token_logits, logit_offsets)

Returns the logits and offsets to pass to logits processors.

Parameters:

logits (Buffer)
next_token_logits (Buffer | None)
logit_offsets (Buffer | None)

Return type:

tuple[Buffer, Buffer | None]

`new_tokens`

new_tokens: Buffer | None = None

The new tokens that were sampled.

`start_async_token_copy()`

start_async_token_copy()

Starts a D2H copy of new_tokens to the pinned buffer on the default stream.

The copy happens on the default stream after sampling completes. An event is recorded after the copy so that get_new_tokens_numpy() can wait for just the copy without waiting for subsequent GPU operations (such as the next forward pass).

Return type:

None
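
The call order implied by start_async_token_copy() and get_new_tokens_numpy() can be sketched as below. This is a hedged illustration, not MAX code: `StubProcessor` is a hypothetical stand-in that only mimics the two documented method names so the example runs without a GPU, and `collect_tokens` and `host_work` are names invented for this sketch.

```python
import numpy as np


class StubProcessor:
    """Hypothetical stand-in exposing the two documented method names."""

    def __init__(self, tokens):
        self._tokens = np.asarray(tokens, dtype=np.int64)
        self.copy_started = False

    def start_async_token_copy(self) -> None:
        # The real method enqueues a D2H copy and records an event.
        self.copy_started = True

    def get_new_tokens_numpy(self) -> np.ndarray:
        # The real method waits for the copy event before returning.
        return self._tokens


def collect_tokens(processor, host_work):
    """Starts the copy, overlaps host-side work, then reads the tokens."""
    processor.start_async_token_copy()
    host_work()  # CPU work that can proceed while the copy is in flight
    return processor.get_new_tokens_numpy()


calls = []
stub = StubProcessor([11, 42, 7])
tokens = collect_tokens(stub, lambda: calls.append("host work"))
```

Starting the copy early and reading the result late is what lets host-side work overlap with the transfer. Calling get_new_tokens_numpy() without a prior start_async_token_copy() is still valid per the description above, but it takes the synchronous path.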

`update_bitmask()`

update_bitmask(packed_bitmask)

Updates the GPU bitmask with the new FSM state for multi-step execution.

Copies the packed int32 bitmask from llguidance into the pinned host buffer and transfers it to the GPU, keeping the bitmask synchronized with the FSM state after each token is sampled. Unpacking is done on the GPU by the sampler graph (apply_packed_bitmask), not here.

Parameters:

packed_bitmask (ndarray[tuple[Any, ...], dtype[int32]]) – Packed int32 bitmask from llguidance.numpy.allocate_token_bitmask. Shape is [batch_size, ceil(vocab_size/32)].

Return type:

None
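
The packed shape [batch_size, ceil(vocab_size/32)] can be illustrated with plain NumPy. This sketch does not call llguidance or MAX. The helpers `allow_token` and `unpack` are hypothetical, and the bit layout (token t stored in word t // 32 at bit t % 32, with a set bit meaning "allowed") is an assumption made for illustration.

```python
import math

import numpy as np

batch_size, vocab_size = 2, 70
words_per_row = math.ceil(vocab_size / 32)  # 32 tokens per int32 word

# Same shape and dtype as the documented packed_bitmask argument.
packed = np.zeros((batch_size, words_per_row), dtype=np.int32)


def allow_token(mask: np.ndarray, row: int, token_id: int) -> None:
    """Hypothetical helper: sets the bit for token_id in the given row."""
    word, bit = divmod(token_id, 32)
    # Work on an unsigned view so that bit 31 does not overflow int32.
    mask.view(np.uint32)[row, word] |= np.uint32(1 << bit)


def unpack(mask: np.ndarray, vocab_size: int) -> np.ndarray:
    """Hypothetical helper: expands the mask to [batch_size, vocab_size] bools."""
    token_ids = np.arange(vocab_size)
    words = mask.view(np.uint32)[:, token_ids // 32]
    shifts = (token_ids % 32).astype(np.uint32)
    return ((words >> shifts) & 1).astype(bool)


for token_id in (0, 33, 69):
    allow_token(packed, 0, token_id)

allowed = unpack(packed, vocab_size)
```

Packing keeps the host-to-device transfer 32 times smaller than a one-byte-per-token mask, which is consistent with the description above that unpacking is left to the sampler graph on the GPU.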
