Python class
FusedSamplingProcessor
class max.pipelines.sampling.FusedSamplingProcessor(sampler, pipeline_config, context_batch, num_steps, device, bitmask=None, vocab_size=None, pinned_new_tokens=None, identity_logit_offsets=None)
Bases: object
Applies sampling parameters to logits and stores the chosen tokens.
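Parameters:
- sampler
- pipeline_config
- context_batch
- num_steps
- device
- bitmask
- vocab_size
- pinned_new_tokens
- identity_logit_offsets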
allocate_identity_logit_offsets()
static allocate_identity_logit_offsets(pipeline_config, device)
Returns a preallocated [0, 1, ..., max_batch_size] index buffer.
Used by logits_for_sampling when sampling from next_token_logits
with a variable-logit sampler. Returns None when the buffer is not
needed (variable-logit sampling disabled, or running in virtual-device
mode).
Parameters:
- pipeline_config (PipelineConfig)
- device (Device)
Return type:
Buffer | None
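To make the shape of the identity buffer concrete, here is a minimal NumPy sketch. It is not the MAX implementation: the helper names `identity_logit_offsets` and `rows_for_sequence` are hypothetical, and it assumes that logit offsets delimit the rows of the logits tensor belonging to each sequence, so that an identity buffer means exactly one logit row per sequence.

```python
import numpy as np


def identity_logit_offsets(max_batch_size: int) -> np.ndarray:
    """Build the index buffer [0, 1, ..., max_batch_size] (hypothetical helper)."""
    return np.arange(max_batch_size + 1, dtype=np.int64)


def rows_for_sequence(logits: np.ndarray, offsets: np.ndarray, i: int) -> np.ndarray:
    """Return the logit rows for sequence i, assuming offsets[i]:offsets[i + 1]
    delimits that sequence's rows (an assumption, not confirmed by the source)."""
    return logits[offsets[i]:offsets[i + 1]]


offsets = identity_logit_offsets(max_batch_size=4)

# Four sequences, one next-token logit row each, vocabulary of three tokens.
next_token_logits = np.arange(12).reshape(4, 3)
row = rows_for_sequence(next_token_logits, offsets, 2)
```

Because the buffer depends only on the maximum batch size, it can be allocated once and sliced to `offsets[:batch_size + 1]` for smaller batches rather than rebuilt on every step.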
generated_tokens
generated_tokens: Buffer
The generated tokens that have been sampled so far.
get_new_tokens_numpy()
get_new_tokens_numpy()
Waits for the device-to-host (D2H) copy and returns the new tokens as a NumPy array.
If an asynchronous copy was started via start_async_token_copy(), this waits for the copy event. Otherwise, it falls back to a synchronous copy.
logits_for_sampling()
logits_for_sampling(*, logits, next_token_logits, logit_offsets)
Returns the logits and offsets to pass to logits processors.
new_tokens
The new tokens that were sampled.
start_async_token_copy()
start_async_token_copy()
Starts a device-to-host (D2H) copy of new_tokens to the pinned buffer on the default stream.
The copy happens on the default stream after sampling completes. An event is recorded after the copy so that get_new_tokens_numpy() can wait for just the copy, without waiting for subsequent GPU operations (such as the next forward pass).
Return type:
None
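The start-then-wait pattern described here can be illustrated with a CPU-only analogy. The sketch below is not the MAX implementation and uses no GPU APIs: the class `AsyncTokenCopy` is hypothetical, a background thread stands in for the copy enqueued on the stream, and a `threading.Event` stands in for the recorded event. Only the two method names are taken from this page.

```python
import threading

import numpy as np


class AsyncTokenCopy:
    """Hypothetical CPU-only stand-in for the asynchronous token copy pattern."""

    def __init__(self, device_tokens: np.ndarray) -> None:
        self._device_tokens = device_tokens                # stands in for GPU memory
        self._host_buffer = np.empty_like(device_tokens)   # stands in for the pinned buffer
        self._copy_done = None                             # set once a copy is in flight

    def start_async_token_copy(self) -> None:
        done = threading.Event()
        self._copy_done = done

        def _copy() -> None:
            np.copyto(self._host_buffer, self._device_tokens)
            done.set()  # analogous to recording an event right after the copy

        threading.Thread(target=_copy, daemon=True).start()

    def get_new_tokens_numpy(self) -> np.ndarray:
        if self._copy_done is None:
            # No asynchronous copy was started: fall back to a synchronous copy.
            np.copyto(self._host_buffer, self._device_tokens)
        else:
            # Wait only for the copy, not for any work queued after it.
            self._copy_done.wait()
            self._copy_done = None
        return self._host_buffer


copier = AsyncTokenCopy(np.array([3, 1, 4]))
copier.start_async_token_copy()
async_tokens = copier.get_new_tokens_numpy()

sync_tokens = AsyncTokenCopy(np.array([7, 8])).get_new_tokens_numpy()
```

The point of waiting on a dedicated event rather than on the whole queue is that the caller can read the sampled tokens as soon as the copy finishes, while later work continues in the background.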
update_bitmask()
update_bitmask(packed_bitmask)
Updates the GPU bitmask with the new FSM state for multi-step execution.
This method unpacks the packed-int bitmask from llguidance, copies it to the pinned host buffer, and transfers it to the GPU. This keeps the bitmask synchronized with the FSM state after each token is sampled.
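The unpacking step can be sketched in NumPy as follows. This is an illustration, not the MAX implementation: the helper names `unpack_bitmask` and `apply_mask` are hypothetical, and the layout is an assumption, namely that the packed bitmask has shape [batch, n_words] of 32-bit integers and that bit i of word w marks token w * 32 + i as allowed.

```python
import numpy as np


def unpack_bitmask(packed: np.ndarray, vocab_size: int) -> np.ndarray:
    """Expand a [batch, n_words] int32 bitmask into a [batch, vocab_size] bool mask.

    Assumes bit i of word w corresponds to token w * 32 + i (an assumption
    about the layout, not confirmed by the source).
    """
    packed = np.asarray(packed, dtype=np.int32)
    shifts = np.arange(32, dtype=np.int32)
    # bits[b, w, i] is bit i of word w; "& 1" keeps the result correct even
    # when the sign bit is set and the shift is arithmetic.
    bits = (packed[:, :, None] >> shifts) & 1
    return bits.reshape(packed.shape[0], -1)[:, :vocab_size].astype(bool)


def apply_mask(logits: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Set the logits of disallowed tokens to -inf so they cannot be sampled."""
    return np.where(mask, logits, -np.inf)


# One sequence, two packed words, vocabulary of 40 tokens.
packed = np.array([[0b101, 0b1]], dtype=np.int32)
mask = unpack_bitmask(packed, vocab_size=40)
allowed = np.flatnonzero(mask[0]).tolist()

masked_logits = apply_mask(np.zeros((1, 40), dtype=np.float32), mask)
```

Packing 32 tokens per integer keeps the host-to-device transfer small, which matters when the mask has to be refreshed after every sampled token.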