Mojo function
topk_gpu
topk_gpu[dtype: DType, out_idx_type: DType, //, sampling: Bool = True, largest: Bool = True, _force_old_impl: Bool = False, KLayoutType: TensorLayout = Layout[*?, *?], TemperatureLayoutType: TensorLayout = Layout[*?, *?], TopPLayoutType: TensorLayout = Layout[*?, *?], MinPLayoutType: TensorLayout = Layout[*?, *?], SeedLayoutType: TensorLayout = Layout[*?, *?]](ctx: DeviceContext, max_k: Int, input: TileTensor[dtype, address_space=input.address_space, linear_idx_type=input.linear_idx_type, element_size=input.element_size], out_vals: TileTensor[dtype, address_space=out_vals.address_space, linear_idx_type=out_vals.linear_idx_type, element_size=out_vals.element_size], out_idxs: TileTensor[out_idx_type, address_space=out_idxs.address_space, linear_idx_type=out_idxs.linear_idx_type, element_size=out_idxs.element_size], block_size: Optional[Int] = None, num_blocks_per_input: Optional[Int] = None, k: Optional[TileTensor[DType.int64, KLayoutType, ImmutAnyOrigin]] = None, temperature: Optional[TileTensor[DType.float32, TemperatureLayoutType, ImmutAnyOrigin]] = None, top_p: Optional[TileTensor[DType.float32, TopPLayoutType, ImmutAnyOrigin]] = None, min_p: Optional[TileTensor[DType.float32, MinPLayoutType, ImmutAnyOrigin]] = None, seed: Optional[TileTensor[DType.uint64, SeedLayoutType, ImmutAnyOrigin]] = None)
Generalized implementation of the Top K algorithm, with or without sampling. When sampling is True, returns the sampled index from the innermost dimension of the input tensor for each row/subvolume; otherwise, returns the top K values and indices across the tensor.
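The non-sampling behavior described above can be illustrated with a small CPU reference in Python. This is only a sketch of the selection semantics (values plus indices, with a `largest` switch), not the GPU kernel; the helper names `topk_reference` and `topk_batched` are hypothetical and do not appear in the library.

```python
import heapq


def topk_reference(row, k, largest=True):
    """Hypothetical CPU reference: (values, indices) of the k extreme entries of one row."""
    pick = heapq.nlargest if largest else heapq.nsmallest
    # Pair every value with its position so the indices survive the selection.
    pairs = pick(k, enumerate(row), key=lambda pair: pair[1])
    vals = [value for _, value in pairs]
    idxs = [index for index, _ in pairs]
    return vals, idxs


def topk_batched(batch, k, largest=True):
    """Apply topk_reference independently to each row (the innermost dimension)."""
    results = [topk_reference(row, k, largest) for row in batch]
    out_vals = [vals for vals, _ in results]
    out_idxs = [idxs for _, idxs in results]
    return out_vals, out_idxs


if __name__ == "__main__":
    vals, idxs = topk_batched([[0.1, 2.0, -1.0, 0.7], [3.0, 1.0, 4.0, 1.5]], k=2)
    print(vals, idxs)
```

In this sketch the two return lists play the role of `out_vals` and `out_idxs`, each with a last dimension of K. How the kernel orders ties is not stated on this page, so the sketch makes no claim about it.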
Parameters:
- dtype (DType): The data type of the input tensor.
- out_idx_type (DType): The data type of the output indices (default == DType.int).
- sampling (Bool): Whether to return token samples from the top-K distribution (default is True).
- largest (Bool): Whether to find the maximum or minimum value.
- _force_old_impl (Bool): Whether to force use of the old implementation.
- KLayoutType (TensorLayout): Layout type of the k buffer.
- TemperatureLayoutType (TensorLayout): Layout type of the temperature buffer.
- TopPLayoutType (TensorLayout): Layout type of the top_p buffer.
- MinPLayoutType (TensorLayout): Layout type of the min_p buffer.
- SeedLayoutType (TensorLayout): Layout type of the seed buffer.
Args:
- ctx (DeviceContext): The context for GPU execution.
- max_k (Int): Largest number of top elements to keep for each batch element.
- input (TileTensor[dtype, address_space=input.address_space, linear_idx_type=input.linear_idx_type, element_size=input.element_size]): Input tensor as a device TileTensor.
- out_vals (TileTensor[dtype, address_space=out_vals.address_space, linear_idx_type=out_vals.linear_idx_type, element_size=out_vals.element_size]): Output buffer on device for the K largest values.
- out_idxs (TileTensor[out_idx_type, address_space=out_idxs.address_space, linear_idx_type=out_idxs.linear_idx_type, element_size=out_idxs.element_size]): Output buffer on device for the indices of the K largest values, or for the sampled token indices. Last dimension is 1 if sampling is True, otherwise K.
- block_size (Optional[Int]): The number of threads per block (default is 256 from TRT and empirical testing).
- num_blocks_per_input (Optional[Int]): Number of blocks per input (default computed from input size and block size). This is the equivalent of "BLOCKS_PER_BEAM" in the TRT-LLM kernel, allowing for much larger batch sizes through packing several elements per thread in the first stage.
- k (Optional[TileTensor[DType.int64, KLayoutType, ImmutAnyOrigin]]): Device buffer holding the number of top elements to keep for each batch element.
- temperature (Optional[TileTensor[DType.float32, TemperatureLayoutType, ImmutAnyOrigin]]): The temperature-based scaling.
- top_p (Optional[TileTensor[DType.float32, TopPLayoutType, ImmutAnyOrigin]]): Only use the smallest set of top tokens whose cumulative probability exceeds this threshold.
- min_p (Optional[TileTensor[DType.float32, MinPLayoutType, ImmutAnyOrigin]]): Per-row min-p threshold. Tokens with probability below min_p * max_prob are excluded from sampling.
- seed (Optional[TileTensor[DType.uint64, SeedLayoutType, ImmutAnyOrigin]]): The seed to use for the random number generator.
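To make the interaction of `k`, `temperature`, `top_p`, `min_p`, and `seed` concrete, here is a minimal single-row CPU sketch in Python. It rests on several assumptions this page does not confirm: that the filters run in the order top-k, temperature-scaled softmax, min-p, then top-p; that probabilities are normalized over the top-k candidates only; and that `temperature > 0` and `0 <= min_p <= 1`. The function name `sample_reference` is hypothetical, and Python's `random.Random` stands in for whatever generator the kernel uses, so results will not match the GPU output.

```python
import math
import random


def sample_reference(logits, k, temperature=1.0, top_p=1.0, min_p=0.0, seed=0):
    """Hypothetical reference: sample one index from a filtered top-k distribution."""
    # 1. Top-k: indices of the k largest logits, in descending order of logit.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # 2. Temperature-scaled softmax over the kept candidates (assumes temperature > 0).
    scaled = [logits[i] / temperature for i in order]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 3. min-p cutoff relative to the most likely candidate (probs[0] is the max).
    cutoff = min_p * probs[0]

    # 4. Walk candidates in descending probability, applying min-p and top-p.
    kept = []
    cumulative = 0.0
    for idx, p in zip(order, probs):
        if p < cutoff:
            break  # every later candidate is even less likely
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break  # smallest prefix whose mass reaches top_p

    # 5. Seeded draw proportional to the surviving probabilities.
    rng = random.Random(seed)
    target = rng.random() * sum(p for _, p in kept)
    running = 0.0
    for idx, p in kept:
        running += p
        if target < running:
            return idx
    return kept[-1][0]  # guard against floating-point round-off


if __name__ == "__main__":
    print(sample_reference([1.0, 5.0, 2.0], k=3, temperature=0.8, top_p=0.9, seed=42))
```

The sketch returns a single index per row, which corresponds to `out_idxs` having a last dimension of 1 when sampling is True. In the real function these controls are per-row device buffers rather than Python scalars.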