Mojo function

topk_gpu

topk_gpu[dtype: DType, out_idx_type: DType, //, sampling: Bool = True, largest: Bool = True, _force_old_impl: Bool = False, KLayoutType: TensorLayout = Layout[RuntimeInt[DType.int64], ComptimeInt[1]], TemperatureLayoutType: TensorLayout = Layout[RuntimeInt[DType.int64], ComptimeInt[1]], TopPLayoutType: TensorLayout = Layout[RuntimeInt[DType.int64], ComptimeInt[1]], SeedLayoutType: TensorLayout = Layout[RuntimeInt[DType.int64], ComptimeInt[1]]](ctx: DeviceContext, max_k: Int, input: TileTensor[dtype, LayoutType, origin, address_space=address_space, linear_idx_type=linear_idx_type, element_shape_types=element_shape_types], out_vals: TileTensor[dtype, LayoutType, origin, address_space=address_space, linear_idx_type=linear_idx_type, element_shape_types=element_shape_types], out_idxs: TileTensor[out_idx_type, LayoutType, origin, address_space=address_space, linear_idx_type=linear_idx_type, element_shape_types=element_shape_types], block_size: Optional[Int] = None, num_blocks_per_input: Optional[Int] = None, k: Optional[TileTensor[DType.int64, KLayoutType, ImmutAnyOrigin]] = None, temperature: Optional[TileTensor[DType.float32, TemperatureLayoutType, ImmutAnyOrigin]] = None, top_p: Optional[TileTensor[DType.float32, TopPLayoutType, ImmutAnyOrigin]] = None, seed: Optional[TileTensor[DType.uint64, SeedLayoutType, ImmutAnyOrigin]] = None)

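Generalized GPU implementation of the top-K algorithm, with or without sampling. When sampling is True, it writes the sampled index from the innermost dimension of the input tensor for each row/subvolume to out_idxs. When sampling is False, it writes the top K values and their indices for each row/subvolume to out_vals and out_idxs.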
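The non-sampling behavior described above can be illustrated with a small CPU reference in plain Python. This is only a sketch of the intended semantics, not the GPU kernel: the helper name `topk_rows` is hypothetical, and the tie-breaking rule (lower index first) is an assumption that the kernel may not share.

```python
def topk_rows(rows, k, largest=True):
    """Reference top-K along the innermost dimension (hypothetical helper).

    Returns (values, indices), each with one list of length k per row.
    """
    vals_out, idxs_out = [], []
    for row in rows:
        # Order positions by value. sorted() is stable even with reverse=True,
        # so equal values keep the lower index first (an assumption here).
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=largest)
        top = order[:k]
        vals_out.append([row[i] for i in top])
        idxs_out.append(top)
    return vals_out, idxs_out


logits = [[0.1, 2.5, -1.0, 0.7],
          [3.0, 0.2, 3.5, 1.1]]

vals, idxs = topk_rows(logits, k=2)
print(vals)  # [[2.5, 0.7], [3.5, 3.0]]
print(idxs)  # [[1, 3], [2, 0]]

small_vals, small_idxs = topk_rows(logits, k=2, largest=False)
print(small_idxs)  # [[2, 0], [1, 3]]
```

With sampling disabled, the last dimension of both outputs has length K, which matches the shape described for out_vals and out_idxs.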

Parameters:

dtype (DType): The data type of the input tensor.
out_idx_type (DType): The data type of the output indices (inferred from out_idxs).
sampling (Bool): Whether to return token samples drawn from the top-K distribution (default is True).
largest (Bool): Whether to select the largest (True) or smallest (False) values (default is True).
_force_old_impl (Bool): Whether to force use of the old implementation (default is False).
KLayoutType (TensorLayout): Layout type of the k buffer.
TemperatureLayoutType (TensorLayout): Layout type of the temperature buffer.
TopPLayoutType (TensorLayout): Layout type of the top_p buffer.
SeedLayoutType (TensorLayout): Layout type of the seed buffer.

Args:

ctx (DeviceContext): The context for GPU execution.
max_k (Int): Largest number of top elements to keep for each batch element.
input (TileTensor): Input tensor on the device.
out_vals (TileTensor): Output buffer on device for the top K values.
out_idxs (TileTensor): Output buffer on device for the indices of the top K values, or for the sampled token indices. The last dimension is 1 if sampling is True, otherwise K.
block_size (Optional): The number of threads per block (default is 256, chosen from TRT and empirical testing).
num_blocks_per_input (Optional): Number of blocks per input (default is computed from the input size and block size). This is the equivalent of "BLOCKS_PER_BEAM" in the TRT-LLM kernel; it allows much larger batch sizes by packing several elements per thread in the first stage.
k (Optional): Device buffer holding the number of top elements to keep for each batch element.
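temperature (Optional): Buffer of temperature values used for temperature-based scaling.
top_p (Optional): Buffer of top-p (nucleus) thresholds. Only the smallest set of highest-probability tokens whose cumulative probability exceeds this threshold is used.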
seed (Optional): The seed to use for the random number generator.
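To make the roles of k, temperature, top_p, and seed concrete, here is a minimal CPU sketch in plain Python of one common way to combine them for a single row. The function name `sample_topk`, the order of operations (top-K, then temperature, then top-p), and the use of Python's `random.Random` are all assumptions for illustration; the GPU kernel may order the steps differently and uses its own random number generator, so sampled indices will not match.

```python
import math
import random


def sample_topk(row, k, temperature=1.0, top_p=1.0, seed=0):
    """Sample one index from the top-k entries of `row` (hypothetical helper)."""
    # Keep the k largest logits, ordered by descending value.
    top = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]

    # Temperature scaling followed by a numerically stable softmax.
    # temperature must be non-zero in this sketch.
    scaled = [row[i] / temperature for i in top]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the shortest prefix whose cumulative probability
    # reaches the threshold.
    kept, cumulative = [], 0.0
    for idx, p in zip(top, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Seeded draw over the kept tokens, renormalized by their total mass.
    rng = random.Random(seed)
    r = rng.random() * sum(p for _, p in kept)
    acc = 0.0
    for idx, p in kept:
        acc += p
        if r < acc:
            return idx
    return kept[-1][0]  # guard against floating-point round-off


row = [0.1, 2.5, -1.0, 0.7]
greedy = sample_topk(row, k=1)                 # k=1 reduces to argmax
nucleus = sample_topk(row, k=3, top_p=0.01)    # tiny top_p keeps one token
first = sample_topk(row, k=2, temperature=0.8, seed=42)
second = sample_topk(row, k=2, temperature=0.8, seed=42)
```

Here `greedy` and `nucleus` both select index 1, the position of the largest logit, and `first` equals `second` because the same seed is reused. In sampling mode each row produces a single index, consistent with the last dimension of out_idxs being 1.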