Mojo module

reduction

Functions

block_reduce: Performs a block-level reduction of a single SIMD value across all threads in a GPU thread block using warp-level primitives and shared memory.
reduce_kernel: GPU kernel that reduces rows along a given axis. Each block reduces one row at a time using row_reduce and writes the result via output_fn. A grid-stride loop lets the kernel handle more rows than there are blocks.
reduce_launch: Selects and launches the appropriate GPU reduction kernel based on the tensor shape, axis, and device saturation level. Dispatches to saturated_reduce_kernel for non-contiguous axes with enough rows, small_reduce_kernel for rows smaller than the warp size, or reduce_kernel otherwise.
row_reduce: Reduces a single row along the given axis using block-level cooperative reduction. Delegates to the multi-reduction row_reduce overload with num_reductions=1.
saturated_reduce_kernel: GPU kernel for reductions where there are enough rows to saturate the device. Each thread independently reduces an entire row using SIMD packing, avoiding shared-memory synchronization entirely. Used when reducing along a non-contiguous axis.
small_reduce_kernel: GPU kernel optimized for rows smaller than the warp size. Each warp reduces an entire row independently, allowing multiple rows to be reduced per block without shared-memory synchronization.
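
The block-level path described above (block_reduce, row_reduce, reduce_kernel) can be illustrated with a small CPU-side simulation. This is only a sketch in Python, not the module's Mojo implementation. The warp width of 32, the strided per-thread accumulation in row_reduce, the helper names warp_reduce and reduce_rows, and all signatures are assumptions made for illustration. A plain list stands in for shared memory, and the returned list stands in for output_fn.

```python
import operator

WARP_SIZE = 32  # assumed warp width; real hardware may differ


def warp_reduce(lanes, op):
    """Combine exactly WARP_SIZE lane values with a shuffle-down style tree."""
    vals = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Lane i reads the pre-step value of lane i + offset, as a shuffle would.
        vals = [
            op(v, vals[i + offset]) if i + offset < WARP_SIZE else v
            for i, v in enumerate(vals)
        ]
        offset //= 2
    return vals[0]  # lane 0 holds the warp's result


def block_reduce(thread_vals, op, identity):
    """Two-stage reduction: per-warp partials, then one warp over the partials."""
    num_warps = -(-len(thread_vals) // WARP_SIZE)  # ceiling division
    assert num_warps <= WARP_SIZE, "partials must fit in a single warp"
    pad = num_warps * WARP_SIZE - len(thread_vals)
    padded = list(thread_vals) + [identity] * pad
    # Stage 1: each warp reduces its lanes; this list stands in for shared memory.
    shared = [
        warp_reduce(padded[w * WARP_SIZE:(w + 1) * WARP_SIZE], op)
        for w in range(num_warps)
    ]
    # Stage 2: the first warp reduces the per-warp partials.
    shared += [identity] * (WARP_SIZE - num_warps)
    return warp_reduce(shared, op)


def row_reduce(row, block_size, op, identity):
    """Reduce one row with block_size simulated threads (hypothetical signature)."""
    accs = [identity] * block_size
    for t in range(block_size):
        # Assumption: thread t folds elements t, t + block_size, ... of the row.
        for x in row[t::block_size]:
            accs[t] = op(accs[t], x)
    return block_reduce(accs, op, identity)


def reduce_rows(rows, num_blocks, block_size, op, identity):
    """Simulate the kernel: each block walks the rows in a grid-stride loop."""
    out = [identity] * len(rows)
    for block_id in range(num_blocks):
        # Block b handles rows b, b + num_blocks, b + 2 * num_blocks, ...
        for r in range(block_id, len(rows), num_blocks):
            out[r] = row_reduce(rows[r], block_size, op, identity)
    return out


# Five rows reduced by only two simulated blocks of 64 threads each.
rows = [list(range(r, r + 100)) for r in range(5)]
sums = reduce_rows(rows, num_blocks=2, block_size=64, op=operator.add, identity=0)
```

The sketch assumes an associative and commutative operator such as addition or max, since the tree and strided steps reorder the operands. The paths taken by small_reduce_kernel and saturated_reduce_kernel skip the shared-memory stage and are not modeled here.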