Mojo function

topk_softmax_sample

def topk_softmax_sample[dtype: DType, out_idx_type: DType, block_size: Int = 1024, TopKArrLayoutType: TensorLayout = Layout[*?, *?], TemperatureLayoutType: TensorLayout = Layout[*?, *?], SeedLayoutType: TensorLayout = Layout[*?, *?]](ctx: DeviceContext, logits: TileTensor[dtype, linear_idx_type=logits.linear_idx_type, element_size=logits.element_size], sampled_indices: TileTensor[out_idx_type, linear_idx_type=sampled_indices.linear_idx_type, element_size=sampled_indices.element_size], top_k_val: Int, temperature_val: Float32 = 1, seed_val: UInt64 = UInt64(0), top_k_arr: Optional[TileTensor[out_idx_type, TopKArrLayoutType, MutUntrackedOrigin]] = None, temperature: Optional[TileTensor[DType.float32, TemperatureLayoutType, MutUntrackedOrigin]] = None, seed: Optional[TileTensor[DType.uint64, SeedLayoutType, MutUntrackedOrigin]] = None)

Samples token indices from top-K logits using softmax probabilities.

This kernel performs single-pass top-K selection and categorical sampling:

1. Finds the k-th largest logit via ternary search.
2. Computes softmax over top-K elements and caches them in shared memory.
3. Samples a single token index from the categorical distribution.
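
The three steps above can be sketched on the CPU for a single row of logits. This is a plain-Python illustration, not the kernel itself: the helper names `kth_largest_threshold` and `topk_softmax_sample_row` are hypothetical, the threshold search is one plausible reading of "ternary search" (narrowing a value range with two pivots per step), ties at the threshold are all kept, and Python's `random` module stands in for whatever generator the GPU kernel uses, so sampled values will not match the kernel's output for the same seed.

```python
import math
import random


def kth_largest_threshold(logits, k, iters=64):
    """Return a threshold t such that at least k logits satisfy x >= t.

    Assumption: the search narrows a value interval [lo, hi) by thirds,
    keeping count(x >= lo) >= k and count(x >= hi) < k at every step.
    """
    k = max(1, min(k, len(logits)))
    lo = min(logits)  # every element is >= lo, so the count is n >= k
    top = max(logits)
    hi = top + abs(top) + 1.0  # strictly above the maximum, so the count is 0
    for _ in range(iters):
        third = (hi - lo) / 3.0
        p1, p2 = lo + third, lo + 2.0 * third
        if sum(x >= p2 for x in logits) >= k:
            lo = p2  # k-th largest lies in the upper third
        elif sum(x >= p1 for x in logits) >= k:
            lo, hi = p1, p2  # k-th largest lies in the middle third
        else:
            hi = p1  # k-th largest lies in the lower third
    return lo


def topk_softmax_sample_row(logits, top_k, temperature=1.0, seed=0):
    """Sample one index from the softmax over the top_k largest logits."""
    # Step 1: find the cut-off value and keep every index at or above it.
    threshold = kth_largest_threshold(logits, top_k)
    kept = [i for i, x in enumerate(logits) if x >= threshold]

    # Step 2: temperature-scaled softmax weights over the kept logits only
    # (subtracting the maximum keeps exp() from overflowing).
    peak = max(logits[i] for i in kept)
    weights = [math.exp((logits[i] - peak) / temperature) for i in kept]

    # Step 3: categorical sampling by walking the cumulative weights.
    u = random.Random(seed).random() * sum(weights)
    acc = 0.0
    for i, w in zip(kept, weights):
        acc += w
        if u < acc:
            return i
    return kept[-1]  # guard against floating-point round-off


row = [1.0, 3.0, 0.5, 2.0, -1.0]
print(topk_softmax_sample_row(row, top_k=2, temperature=1.0, seed=0))
```

With `top_k=2` the sketch can only return index 1 or 3, the positions of the two largest logits, and with `top_k=1` it reduces to an argmax.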

Parameters:

dtype (DType): The data type of the input logits tensor.
out_idx_type (DType): The data type of the output sampled indices.
block_size (Int): The number of threads per block (default is 1024).
TopKArrLayoutType (TensorLayout): The layout type of the optional top_k_arr tensor.
TemperatureLayoutType (TensorLayout): The layout type of the optional temperature tensor.
SeedLayoutType (TensorLayout): The layout type of the optional seed tensor.

Args:

ctx (DeviceContext): The context for GPU execution.
logits (TileTensor[dtype, linear_idx_type=logits.linear_idx_type, element_size=logits.element_size]): Input logits tensor with shape [batch_size, vocab_size].
sampled_indices (TileTensor[out_idx_type, linear_idx_type=sampled_indices.linear_idx_type, element_size=sampled_indices.element_size]): Output buffer for sampled token indices with shape [batch_size].
top_k_val (Int): Default number of top elements to sample from for each batch element.
temperature_val (Float32): Temperature for softmax scaling (default is 1.0).
seed_val (UInt64): Seed for the random number generator (default is 0).
top_k_arr (Optional[TileTensor[out_idx_type, TopKArrLayoutType, MutUntrackedOrigin]]): Optional per-batch top-K values. If provided, overrides top_k_val for each batch element.
temperature (Optional[TileTensor[DType.float32, TemperatureLayoutType, MutUntrackedOrigin]]): Optional per-batch temperature values. If provided, overrides temperature_val for each batch element.
seed (Optional[TileTensor[DType.uint64, SeedLayoutType, MutUntrackedOrigin]]): Optional per-batch seed values. If provided, overrides seed_val for each batch element.
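
The override rule for the three optional tensors can be illustrated with a small plain-Python sketch. It only models the documented behavior (a per-batch tensor, when provided, replaces the corresponding scalar for every batch element); the helper name `resolve_row_params` is hypothetical, and ordinary lists stand in for `TileTensor` values.

```python
def resolve_row_params(batch_size, top_k_val, temperature_val=1.0, seed_val=0,
                       top_k_arr=None, temperature=None, seed=None):
    """Return the (top_k, temperature, seed) triple used for each batch row.

    Each optional per-batch sequence, if given, overrides its scalar default.
    """
    params = []
    for b in range(batch_size):
        k = top_k_arr[b] if top_k_arr is not None else top_k_val
        t = temperature[b] if temperature is not None else temperature_val
        s = seed[b] if seed is not None else seed_val
        params.append((k, t, s))
    return params


# Scalars only: every row shares the same settings.
print(resolve_row_params(2, top_k_val=40))
# Per-batch top-K and seed override the scalars; temperature keeps its default.
print(resolve_row_params(2, top_k_val=40, top_k_arr=[5, 50], seed=[11, 12]))
```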