Mojo function

lane_group_reduce

lane_group_reduce[val_type: DType, simd_width: Int, //, shuffle: fn[DType, Int](val: SIMD[$0, $1], offset: SIMD[uint32, 1]) -> SIMD[$0, $1], func: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1], num_lanes: Int, *, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]

Performs a generic warp-level reduction operation using shuffle operations.

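This function implements a parallel reduction across the threads of a warp using a butterfly pattern. Both the shuffle operation and the reduction function are supplied as parameters, so the same routine can be used with different shuffle variants and reduction operators (e.g. add, max, min).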

Example:

    from gpu.warp import lane_group_reduce, shuffle_down

    # Sum one float32 value per thread across a group of 16 lanes using shuffle_down.
    # Intended to run inside a GPU kernel, where warp shuffles are available.
    @parameter
    fn add[dtype: DType, width: Int](x: SIMD[dtype, width], y: SIMD[dtype, width]) -> SIMD[dtype, width]:
        return x + y

    var val = SIMD[DType.float32, 1](42.0)
    var result = lane_group_reduce[shuffle_down, add, num_lanes=16](val)
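
To make the data flow of the example easier to follow, the sketch below models the reduction on the host in plain Python. It is not the Mojo implementation. It assumes that the reduction combines each lane with a shuffled copy at offsets num_lanes/2, ..., 2, 1 (each multiplied by stride), and that a shuffle-down leaves a lane unchanged when its source lane would fall outside the warp. The names `model_shuffle_down` and `model_lane_group_reduce` are hypothetical helpers for illustration only.

```python
def model_shuffle_down(values, offset):
    """Model of a shuffle-down: lane i reads the value held by lane i + offset.

    Assumption: a lane whose source is outside the warp keeps its own value.
    """
    n = len(values)
    return [values[i + offset] if i + offset < n else values[i] for i in range(n)]


def model_lane_group_reduce(values, shuffle, func, num_lanes, stride=1):
    """Host-side model of the reduction loop (assumed halving-offset scheme)."""
    if num_lanes <= 0 or num_lanes & (num_lanes - 1) != 0:
        raise ValueError("num_lanes must be a power of 2")
    res = list(values)
    offset = num_lanes // 2
    while offset >= 1:
        # Every lane combines its running value with the shuffled neighbor.
        shuffled = shuffle(res, offset * stride)
        res = [func(a, b) for a, b in zip(res, shuffled)]
        offset //= 2
    return res


# One value per lane of a 32-lane warp; lane i holds the integer i.
warp = list(range(32))
result = model_lane_group_reduce(
    warp, model_shuffle_down, lambda x, y: x + y, num_lanes=16
)

# With a shuffle-down, the first lane of each 16-lane group holds the group sum.
print(result[0], result[16])
```

Under these assumptions, only the first lane of each group is guaranteed to hold the complete result when a shuffle-down is used; the other lanes hold partial combinations.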

Parameters:

val_type (DType): The data type of the SIMD elements (e.g. float32, int32).
simd_width (Int): The number of elements in the SIMD vector.
shuffle (fn[DType, Int](val: SIMD[$0, $1], offset: SIMD[uint32, 1]) -> SIMD[$0, $1]): A function that performs the warp shuffle operation. Takes a SIMD value and offset and returns the shuffled result.
func (fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]): A binary function that combines two SIMD values during reduction. This defines the reduction operation (e.g. add, max, min).
num_lanes (Int): The number of lanes in a group. The reduction is done within each group. Must be a power of 2.
stride (Int): The stride between lanes participating in the reduction. Defaults to 1.
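
The interaction between num_lanes and stride can be illustrated with a small host-side model in Python. This is a sketch rather than the library's code. It assumes shuffle offsets of num_lanes/2, ..., 2, 1, each multiplied by stride, and uses an XOR-style shuffle (lane i reads lane i ^ offset) so that every lane ends up with its group's full result. The helpers `model_shuffle_xor` and `model_reduce` are hypothetical. Each lane starts with a set containing only its own index, and the reduction function is set union, so the output shows which lanes contributed to each result.

```python
def model_shuffle_xor(values, offset):
    """Model of a butterfly (XOR) shuffle: lane i reads lane i ^ offset."""
    return [values[i ^ offset] for i in range(len(values))]


def model_reduce(values, shuffle, func, num_lanes, stride=1):
    """Host-side model of the reduction loop (assumed halving-offset scheme)."""
    res = list(values)
    offset = num_lanes // 2
    while offset >= 1:
        shuffled = shuffle(res, offset * stride)
        res = [func(a, b) for a, b in zip(res, shuffled)]
        offset //= 2
    return res


# Each lane of a 32-lane warp starts with a set holding only its own index.
warp = [frozenset([lane]) for lane in range(32)]


def union(a, b):
    return a | b


# stride=1: groups are 4 consecutive lanes.
contiguous = model_reduce(warp, model_shuffle_xor, union, num_lanes=4)

# stride=2: groups are 4 lanes spaced 2 apart.
strided = model_reduce(warp, model_shuffle_xor, union, num_lanes=4, stride=2)

print(sorted(contiguous[5]))  # lanes that contributed to lane 5 (stride=1)
print(sorted(strided[0]))     # lanes that contributed to lane 0 (stride=2)
print(sorted(strided[1]))     # lanes that contributed to lane 1 (stride=2)
```

In this model, a stride of 1 reduces over adjacent lanes, while a larger stride reduces over lanes that are stride apart, leaving the interleaved lanes to form their own groups.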

Args:

val (SIMD): The SIMD value to reduce. Each lane contributes its value.

Returns:

SIMD: A SIMD value containing the reduction result.