Mojo module
reduction
Functions
- `block_reduce`: Performs a block-level reduction of a single SIMD value across all threads in a GPU thread block using warp-level primitives and shared memory.
- `reduce_kernel`: GPU kernel that reduces rows along a given axis. Each block reduces one row at a time using `row_reduce` and writes the result via `output_fn`. Uses a grid-stride loop to handle more rows than blocks.
- `reduce_launch`: Selects and launches the appropriate GPU reduction kernel based on the tensor shape, axis, and device saturation level. Dispatches to `saturated_reduce_kernel` for non-contiguous axes with enough rows, `small_reduce_kernel` for rows smaller than the warp size, or `reduce_kernel` otherwise.
- `row_reduce`: Reduces a single row along the given axis using block-level cooperative reduction. Delegates to the multi-reduction `row_reduce` overload with `num_reductions=1`.
- `saturated_reduce_kernel`: GPU kernel for reductions when the device is saturated with enough rows. Each thread independently reduces an entire row using SIMD packing, avoiding shared-memory synchronization entirely. Used when reducing along a non-contiguous axis.
- `small_reduce_kernel`: GPU kernel optimized for rows smaller than the warp size. Each warp reduces an entire row independently, allowing multiple rows to be reduced per block without shared-memory synchronization.
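The descriptions above can be made concrete with a small CPU-side model. The sketch below is plain Python, not the Mojo API: it imitates a two-stage block reduction (per-warp tree reduction, then a reduction of the per-warp partials standing in for shared memory), a grid-stride loop over rows, and the three-way kernel choice described for `reduce_launch`. The helper names `warp_reduce`, `reduce_rows`, and `pick_kernel`, the warp width of 32, the `block_size` default, the `saturation_rows` threshold, and the order of the dispatch checks are all assumptions made for illustration. The shuffle-down tree is a common way to build such a reduction and is not confirmed to be what the module does internally.

```python
from operator import add

WARP_SIZE = 32  # assumption: a typical warp width; the real value is device-dependent


def warp_reduce(lanes, op=add):
    """Tree-reduce one warp's lane values, imitating shuffle-down steps."""
    vals = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Every lane combines its value with the lane `offset` positions above it.
        # Lanes without a partner (partial warp) keep their value unchanged.
        vals = [
            op(vals[i], vals[i + offset]) if i + offset < len(vals) else vals[i]
            for i in range(len(vals))
        ]
        offset //= 2
    return vals[0]  # lane 0 ends up holding the warp's result


def block_reduce(thread_vals, op=add):
    """Two-stage reduction of one value per thread in a block."""
    # Stage 2 is a single warp, so it can combine at most WARP_SIZE partials.
    assert 0 < len(thread_vals) <= WARP_SIZE * WARP_SIZE
    # Stage 1: each warp reduces its lanes; the list stands in for shared memory.
    shared = [
        warp_reduce(thread_vals[w : w + WARP_SIZE], op)
        for w in range(0, len(thread_vals), WARP_SIZE)
    ]
    # Stage 2: the first warp reduces the per-warp partials.
    return warp_reduce(shared, op)


def reduce_rows(rows, num_blocks, block_size=128, op=add, init=0):
    """Model of one-row-per-block reduction with a grid-stride loop."""
    out = [None] * len(rows)
    for block_id in range(num_blocks):
        # Grid-stride loop: block b handles rows b, b + num_blocks, ...
        for r in range(block_id, len(rows), num_blocks):
            # Each thread first accumulates a strided slice of the row.
            partials = [init] * block_size
            for i, x in enumerate(rows[r]):
                partials[i % block_size] = op(partials[i % block_size], x)
            out[r] = block_reduce(partials, op)
    return out


def pick_kernel(row_len, num_rows, axis_is_contiguous, saturation_rows=4096):
    """Hypothetical dispatch rule; `saturation_rows` is an invented threshold."""
    if not axis_is_contiguous and num_rows >= saturation_rows:
        return "saturated_reduce_kernel"
    if row_len < WARP_SIZE:
        return "small_reduce_kernel"
    return "reduce_kernel"
```

For example, `reduce_rows([[1, 2, 3], [4, 5, 6], [7, 8, 9]], num_blocks=2)` sums three rows with only two blocks, so block 0 processes rows 0 and 2 while block 1 processes row 1. In this model `init` must be the identity of `op` (0 for addition), because threads that receive no elements still contribute their initial value.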