Mojo function
topk_gpu
topk_gpu[type: DType, rank: Int, out_idx_type: DType, //, sampling: Bool = True, largest: Bool = True](ctx: DeviceContext, K: Int, input: NDBuffer[type, rank, origin], out_vals: NDBuffer[type, rank, origin], out_idxs: NDBuffer[out_idx_type, rank, origin], block_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), num_blocks_per_input: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), temperature: SIMD[type, 1] = __init__[__mlir_type.!pop.int_literal](1))
Generalized implementation of the Top-K algorithm, with or without sampling. When sampling is True, returns one sampled index from the innermost dimension of the input tensor for each row/subvolume. Otherwise, returns the top K values and their indices.
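To make the non-sampling behavior concrete, here is a minimal CPU reference sketch in plain Python. It is not the GPU kernel and says nothing about how the kernel is implemented. The name `topk_reference` is hypothetical, and breaking ties in favor of the lower index is an assumption, since the documentation above does not specify tie handling.

```python
def topk_reference(rows, k, largest=True):
    """Hypothetical CPU reference: top-k values and indices along the innermost dimension."""
    out_vals, out_idxs = [], []
    for row in rows:
        # Sort positions by value. Python's sort is stable, so ties keep the
        # lower index first (an assumption; the GPU kernel may differ).
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=largest)
        top = order[:k]
        out_vals.append([row[i] for i in top])
        out_idxs.append(top)
    return out_vals, out_idxs


rows = [[0.1, 2.0, -1.0, 0.7], [3.0, 1.0, 4.0, 1.5]]
vals, idxs = topk_reference(rows, k=2)
print(vals)  # [[2.0, 0.7], [4.0, 3.0]]
print(idxs)  # [[1, 3], [2, 0]]
```

Passing `largest=False` selects the K smallest values instead, mirroring the `largest` parameter.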
Parameters:
- type (DType): The data type of the input tensor.
- rank (Int): The rank of the input tensor.
- out_idx_type (DType): The data type of the output indices (default is DType.index).
- sampling (Bool): Whether to return token samples from the top-K distribution (default is True).
- largest (Bool): Whether to find the largest (True) or smallest (False) values (default is True).
Args:
- ctx (DeviceContext): The context for GPU execution.
- K (Int): The number of top elements to keep.
- input (NDBuffer[type, rank, origin]): Input tensor as a device NDBuffer.
- out_vals (NDBuffer[type, rank, origin]): Output buffer on device for the K largest values.
- out_idxs (NDBuffer[out_idx_type, rank, origin]): Output buffer on device for the indices of the K largest values, or for the sampled token indices. The last dimension is 1 if sampling is True, otherwise K.
- block_size (OptionalReg[Int]): The number of threads per block (default is 256, based on TRT and empirical testing).
- num_blocks_per_input (OptionalReg[Int]): The number of blocks per input (default is computed from the input size and block size). This is the equivalent of "BLOCKS_PER_BEAM" in the TRT-LLM kernel, and allows much larger batch sizes by packing several elements per thread in the first stage.
- temperature (SIMD[type, 1]): The temperature used for scaling (default is 1).
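The sampling path can be sketched on the CPU in the same spirit. This is a hedged illustration only: the name `sample_topk_reference` is hypothetical, and dividing the top-K values by the temperature before a softmax is the conventional approach, assumed here because the documentation above does not say exactly how the kernel applies `temperature`. It returns one index per row, matching an `out_idxs` last dimension of 1 when sampling is True.

```python
import math
import random


def sample_topk_reference(row, k, temperature=1.0, rng=None):
    """Hypothetical CPU reference: sample one index from the top-k of a row."""
    rng = rng or random.Random()
    # Keep the positions of the k largest values.
    top = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
    # Assumed convention: scale by 1/temperature, then apply a softmax.
    scaled = [row[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    # Draw one index in proportion to the softmax weights.
    return rng.choices(top, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 0.7]
rng = random.Random(0)
token = sample_topk_reference(logits, k=2, temperature=0.8, rng=rng)
print(token)  # one of the two top indices, 1 or 3
```

Under this convention, a lower temperature concentrates probability on the largest value, and with `k=1` the result is always the argmax.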