Mojo function

topk_gpu

topk_gpu[
    type: DType, rank: Int, out_idx_type: DType, //,
    sampling: Bool = True, largest: Bool = True
](
    ctx: DeviceContext,
    max_k: Int,
    input: NDBuffer[type, rank, origin],
    out_vals: NDBuffer[type, rank, origin],
    out_idxs: NDBuffer[out_idx_type, rank, origin],
    block_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}),
    num_blocks_per_input: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}),
    k: OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]]({:i1 0, 1}),
    temperature: OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]({:i1 0, 1}),
    top_p: OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]({:i1 0, 1}),
    seed: OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]]({:i1 0, 1})
)

Generalized GPU implementation of the Top-K algorithm, with or without sampling. When sampling is True, it produces one sampled index from the innermost dimension of the input tensor for each row/subvolume. Otherwise it produces the top K values and their indices along that dimension. Results are written to out_vals and out_idxs.
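To make the non-sampling mode concrete, here is a minimal CPU reference sketch in plain Python. It is not the Mojo kernel: `topk_rows` is a hypothetical helper name, the input is modelled as a list of rows (the innermost dimension), and the tie-breaking rule (lower index first) is an assumption that the GPU implementation may not share.

```python
def topk_rows(rows, k, largest=True):
    """Return (values, indices) of the k largest (or smallest) entries per row.

    Hypothetical CPU reference for illustration only; it mirrors the
    documented meaning of `largest` and of the out_vals / out_idxs outputs.
    """
    out_vals, out_idxs = [], []
    for row in rows:
        # Sort positions by value. sorted() is stable, so equal values keep
        # the lower index first (an assumption, not documented kernel behaviour).
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=largest)[:k]
        out_idxs.append(order)
        out_vals.append([row[i] for i in order])
    return out_vals, out_idxs


rows = [[0.1, 2.0, -1.0, 0.5],
        [3.0, 3.5, 0.0, 1.0]]
vals, idxs = topk_rows(rows, k=2)
print(vals)  # [[2.0, 0.5], [3.5, 3.0]]
print(idxs)  # [[1, 3], [1, 0]]
```

Each output row has length K here, which matches the documented shape of out_idxs when sampling is False.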

Parameters:

  • type (DType): The data type of the input tensor.
  • rank (Int): The rank of the input tensor.
  • out_idx_type (DType): The data type of the output indices. It precedes `//` in the signature, so it is inferred from out_idxs (described as defaulting to DType.index).
  • sampling (Bool): Whether to return a token sampled from the top-K distribution instead of the top-K values and indices (default is True).
  • largest (Bool): Whether to select the largest values (True) or the smallest values (False) (default is True).

Args:

  • ctx (DeviceContext): The context for GPU execution.
  • max_k (Int): Largest number of top elements to keep for each batch element.
  • input (NDBuffer[type, rank, origin]): Input tensor as a device NDBuffer.
  • out_vals (NDBuffer[type, rank, origin]): Output buffer on device for the K largest values.
  • out_idxs (NDBuffer[out_idx_type, rank, origin]): Output buffer on device for the indices of the K largest values, or for the sampled token indices. The last dimension is 1 if sampling is True, otherwise K.
  • block_size (OptionalReg[Int]): The number of threads per block (default is 256, chosen from TRT and empirical testing).
  • num_blocks_per_input (OptionalReg[Int]): Number of blocks per input (by default computed from the input size and block size). This is the equivalent of "BLOCKS_PER_BEAM" in the TRT-LLM kernel and allows much larger batch sizes by packing several elements per thread in the first stage.
  • k (OptionalReg[NDBuffer[int64, 1, MutableAnyOrigin]]): Optional device buffer giving the number of top elements to keep for each batch element.
  • temperature (OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]): Optional device buffer of temperatures used for temperature-based scaling.
  • top_p (OptionalReg[NDBuffer[float32, 1, MutableAnyOrigin]]): Optional device buffer of top-p thresholds. Sampling is restricted to the highest-probability tokens whose cumulative probability exceeds this threshold.
  • seed (OptionalReg[NDBuffer[uint64, 1, MutableAnyOrigin]]): Optional device buffer holding the seed for the random number generator.
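The interaction of k, temperature, top_p, and seed in sampling mode can be sketched on the CPU for a single row. This is a hedged illustration in plain Python, not the kernel's implementation: `sample_top_k_top_p` is a hypothetical name, the order of operations (top-K filter, temperature-scaled softmax, then top-p cutoff) follows the common convention and is assumed rather than confirmed by this page, and Python's `random.Random` stands in for whatever generator the GPU code uses.

```python
import math
import random


def sample_top_k_top_p(logits, k, temperature=1.0, top_p=1.0, seed=0):
    """Sample one index from a row of logits (hypothetical CPU reference).

    Assumes temperature > 0 and 1 <= k <= len(logits).
    """
    # 1. Keep the k largest logits, in descending order.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # 2. Temperature-scaled softmax over the survivors
    #    (subtracting the max keeps exp() numerically stable).
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 3. Top-p cutoff: keep the shortest prefix whose cumulative probability
    #    reaches top_p (using >= here is an assumption about the boundary case).
    kept_idx, kept_p, cum = [], [], 0.0
    for idx, p in zip(top, probs):
        kept_idx.append(idx)
        kept_p.append(p)
        cum += p
        if cum >= top_p:
            break

    # 4. Seeded draw, so the same seed reproduces the same sample.
    rng = random.Random(seed)
    return rng.choices(kept_idx, weights=kept_p, k=1)[0]


logits = [1.0, 3.0, 0.5, 2.0]
print(sample_top_k_top_p(logits, k=1))             # k=1 reduces to argmax: 1
print(sample_top_k_top_p(logits, k=4, top_p=0.1))  # tiny top_p keeps only index 1
```

In this sketch one index is returned per row, which corresponds to the documented out_idxs last dimension of 1 when sampling is True.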