Mojo function
topk_gpu
topk_gpu[dtype: DType, out_idx_type: DType, //, sampling: Bool = True, largest: Bool = True](ctx: DeviceContext, max_k: Int, input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], out_vals: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], out_idxs: LayoutTensor[out_idx_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], block_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), num_blocks_per_input: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), k: OptionalReg[LayoutTensor[DType.int64, Layout.row_major(-1), MutableAnyOrigin]] = OptionalReg[LayoutTensor[DType.int64, Layout.row_major(-1), MutableAnyOrigin]]({:i1 0, 1}), temperature: OptionalReg[LayoutTensor[DType.float32, Layout.row_major(-1), MutableAnyOrigin]] = OptionalReg[LayoutTensor[DType.float32, Layout.row_major(-1), MutableAnyOrigin]]({:i1 0, 1}), top_p: OptionalReg[LayoutTensor[DType.float32, Layout.row_major(-1), MutableAnyOrigin]] = OptionalReg[LayoutTensor[DType.float32, Layout.row_major(-1), MutableAnyOrigin]]({:i1 0, 1}), seed: OptionalReg[LayoutTensor[DType.uint64, Layout.row_major(-1), MutableAnyOrigin]] = OptionalReg[LayoutTensor[DType.uint64, Layout.row_major(-1), MutableAnyOrigin]]({:i1 0, 1}))
Generalized GPU implementation of the Top K algorithm, with or without sampling. When sampling is True, it produces one sampled index from the innermost dimension of the input tensor for each row/subvolume. When sampling is False, it produces the top K values and their indices along that dimension.
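As a rough illustration of the non-sampling mode described above, the following is a CPU reference sketch in plain Python, not the GPU kernel. The helper name `topk_rows`, the nested-list input, and the best-first ordering of each output row are assumptions made for this example only.

```python
def topk_rows(rows, k, largest=True):
    """Return (values, indices) of the k best entries of each row.

    Hypothetical CPU reference: the real kernel writes into device
    tensors (out_vals, out_idxs) instead of returning lists.
    """
    vals, idxs = [], []
    for row in rows:
        # Sort positions by value; reverse=True puts the largest first.
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=largest)[:k]
        idxs.append(order)
        vals.append([row[i] for i in order])
    return vals, idxs


batch = [[1.0, 5.0, 3.0], [4.0, 2.0, 6.0]]
top_vals, top_idxs = topk_rows(batch, k=2)
```

Each output row has length K, which matches the documented shape of out_idxs when sampling is False.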
Parameters:
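- dtype (DType): The data type of the input tensor.
- out_idx_type (DType): The data type of the output indices. It appears before the `//` marker in the signature, so it is inferred from out_idxs rather than passed explicitly.
- sampling (Bool): Whether to sample a token from the top-K distribution instead of returning the top K values and indices (default is True).
- largest (Bool): Whether to select the largest values (True) or the smallest values (False) (default is True).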
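To make the sampling parameter more concrete, here is a minimal Python sketch of top-K sampling combined with the temperature, top_p, and seed arguments of this function. It is an assumption-laden reference, not the kernel's code: the helper name `sample_top_k_top_p` is hypothetical, Python's `random` module stands in for the GPU random number generator, and the order of operations (top-K, then temperature, then softmax, then top-p) is the common convention rather than something this page confirms.

```python
import math
import random


def sample_top_k_top_p(logits, k, temperature=1.0, top_p=1.0, seed=0):
    """Sample one index from a single row of logits (hypothetical helper)."""
    # Positions of the k largest logits, best first.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [logits[i] / temperature for i in order]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus filter: keep the shortest prefix whose cumulative
    # probability reaches top_p (assumed interpretation of top_p).
    kept, cumulative = 0, 0.0
    for p in probs:
        kept += 1
        cumulative += p
        if cumulative >= top_p:
            break

    # Draw one index from the surviving candidates; random.choices
    # renormalizes the weights internally.
    rng = random.Random(seed)
    return rng.choices(order[:kept], weights=probs[:kept], k=1)[0]


row = [0.1, 2.0, -1.0, 1.5]
token = sample_top_k_top_p(row, k=2, temperature=0.8, top_p=0.9, seed=42)
```

With k=1, or with a very small top_p, the sketch always returns the index of the largest logit, so sampling reduces to greedy selection.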
Args:
- ctx (DeviceContext): The context for GPU execution.
- max_k (Int): The largest number of top elements to keep for any batch element.
- input (LayoutTensor): Input tensor on the device, with element type dtype.
- out_vals (LayoutTensor): Output tensor on the device for the top K values, with element type dtype.
- out_idxs (LayoutTensor): Output tensor on the device, with element type out_idx_type, for the indices of the top K values or for the sampled token indices. Its last dimension is 1 if sampling is True, otherwise K.
- block_size (OptionalReg): Optional Int giving the number of threads per block (defaults to 256, a value taken from TRT and empirical testing).
- num_blocks_per_input (OptionalReg): Optional Int giving the number of blocks per input (by default computed from the input size and the block size). This is the equivalent of "BLOCKS_PER_BEAM" in the TRT-LLM kernel, and it allows much larger batch sizes by packing several elements per thread in the first stage.
- k (OptionalReg): Optional one-dimensional device tensor of DType.int64 values giving the number of top elements to keep for each batch element.
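- temperature (OptionalReg): Optional one-dimensional device tensor of DType.float32 values giving the temperature used for temperature-based scaling.
- top_p (OptionalReg): Optional one-dimensional device tensor of DType.float32 values giving the top-p threshold. Sampling is restricted to the smallest set of most probable tokens whose cumulative probability exceeds this threshold.
- seed (OptionalReg): Optional one-dimensional device tensor of DType.uint64 values giving the seed for the random number generator.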
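The role of num_blocks_per_input can be pictured with a small Python sketch, assuming a two-stage reduction in the style the TRT-LLM reference suggests: each block first finds a local top K over its slice of the row, and a second stage merges those candidates. The helper name `two_stage_topk` and the even chunking are assumptions for illustration and say nothing about how the kernel actually partitions work among threads.

```python
def two_stage_topk(row, k, num_blocks):
    """Top-k of one row via per-block candidates (hypothetical sketch)."""
    n = len(row)
    chunk = -(-n // num_blocks)  # ceiling division: elements per block

    # Stage 1: every block keeps the indices of its local top-k.
    candidates = []
    for block in range(num_blocks):
        start = block * chunk
        end = min(start + chunk, n)
        local = sorted(range(start, end), key=lambda i: row[i], reverse=True)[:k]
        candidates.extend(local)

    # Stage 2: reduce the num_blocks * k candidates to the global top-k.
    final = sorted(candidates, key=lambda i: row[i], reverse=True)[:k]
    return [row[i] for i in final], final


values, indices = two_stage_topk([3.0, 9.0, 1.0, 7.0, 8.0, 2.0, 6.0], k=3, num_blocks=3)
```

Because every global top-K element is also in the local top K of its own block, the merged result matches a single-block pass while letting several blocks share one row.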
