Mojo function

max

max[dtype: DType, width: Int, //, *, block_size: Int, broadcast: Bool = True](val: SIMD[dtype, width]) -> SIMD[dtype, width]

Computes the maximum value across all threads in a block.

Performs a parallel reduction using warp-level operations and shared memory to find the maximum across all threads in the block.
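The two-stage structure described above can be modelled on the CPU. The sketch below is plain Python, not Mojo GPU code, and only illustrates the idea of reducing within warps first and then across warps. The warp size of 32 and the name `block_max_model` are assumptions for illustration; the actual implementation may organize the stages differently.

```python
WARP_SIZE = 32  # assumption: a common hardware warp size, not stated on this page


def block_max_model(values, warp_size=WARP_SIZE):
    """Model a block-wide max: one value per thread, reduced in two stages."""
    if not values:
        raise ValueError("a block must contain at least one thread")

    # Stage 1: each warp reduces its own threads
    # (stands in for the warp-level operations).
    warp_maxima = [
        max(values[start:start + warp_size])
        for start in range(0, len(values), warp_size)
    ]

    # Stage 2: the per-warp results (stands in for the shared-memory buffer)
    # are reduced to a single block-wide maximum.
    return max(warp_maxima)


# Hypothetical block of 64 threads; thread i contributes (i * 37) % 101.
values = [(i * 37) % 101 for i in range(64)]
print(block_max_model(values) == max(values))  # True
```

The result matches a plain `max` over all values; splitting the work into warps changes how the reduction is carried out, not the answer.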

Parameters:

dtype (DType): The data type of the SIMD elements.
width (Int): The number of elements in each SIMD vector.
block_size (Int): The total number of threads in the block.
broadcast (Bool): If True, the final reduced value is broadcast to all threads in the block. If False, only the first thread (thread 0) will have the complete result.

Args:

val (SIMD): The SIMD value to reduce. Each thread contributes its value to find the maximum.

Returns:

SIMD: If broadcast is True, each thread in the block receives the maximum value across the entire block. Otherwise, only the first thread (thread 0) will have the complete result.
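The effect of the broadcast parameter on what each thread ends up holding can be sketched with a small CPU-side model in Python. This is not the Mojo API: `block_max_results` is a hypothetical helper, and `None` marks threads whose return value this page does not specify when broadcast is False.

```python
def block_max_results(values, broadcast=True):
    """Return what each modelled thread holds after the reduction.

    None marks a thread whose result this page does not specify.
    """
    block_maximum = max(values)
    if broadcast:
        # Every thread receives the block-wide maximum.
        return [block_maximum] * len(values)
    # Only the first thread is documented to hold the complete result.
    return [block_maximum] + [None] * (len(values) - 1)


per_thread = [4.0, 9.5, -1.0, 7.25]
print(block_max_results(per_thread, broadcast=True))   # [9.5, 9.5, 9.5, 9.5]
print(block_max_results(per_thread, broadcast=False))  # [9.5, None, None, None]
```

In practice this suggests passing broadcast=False only when a single thread consumes the result, for example when the first thread writes it to an output buffer.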

max[dtype: DType, width: Int, //, *, block_dim_x: Int, block_dim_y: Int, block_dim_z: Int = 1, broadcast: Bool = True](val: SIMD[dtype, width]) -> SIMD[dtype, width]

Computes the maximum value across all threads in a multi-dimensional block.

Performs a parallel reduction using warp-level operations and shared memory to find the maximum across all threads in the block. Thread IDs are linearized in row-major order, with x varying fastest: x + y * block_dim_x + z * block_dim_x * block_dim_y.
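The linearization formula can be checked with a few lines of Python. This is a CPU-side illustration only: `linear_thread_id` is a hypothetical helper, and the 8 x 4 x 2 block shape is an arbitrary example, not a value taken from this page.

```python
def linear_thread_id(x, y, z, block_dim_x, block_dim_y):
    """Linearize a 3D thread index with x varying fastest."""
    return x + y * block_dim_x + z * block_dim_x * block_dim_y


# Hypothetical 8 x 4 x 2 block (64 threads).
block_dim_x, block_dim_y, block_dim_z = 8, 4, 2

ids = [
    linear_thread_id(x, y, z, block_dim_x, block_dim_y)
    for z in range(block_dim_z)
    for y in range(block_dim_y)
    for x in range(block_dim_x)
]

print(linear_thread_id(3, 2, 1, block_dim_x, block_dim_y))  # 3 + 2*8 + 1*8*4 = 51
# Every ID from 0 to 63 appears exactly once, in order.
print(ids == list(range(block_dim_x * block_dim_y * block_dim_z)))  # True
```

Because each thread maps to a unique ID in the range 0 to block_dim_x * block_dim_y * block_dim_z - 1, the multi-dimensional block can be treated as a flat sequence of threads for the purposes of the reduction.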

Parameters:

dtype (DType): The data type of the SIMD elements.
width (Int): The number of elements in each SIMD vector.
block_dim_x (Int): The number of threads along the X dimension.
block_dim_y (Int): The number of threads along the Y dimension.
block_dim_z (Int): The number of threads along the Z dimension (default: 1).
broadcast (Bool): If True, the final reduced value is broadcast to all threads in the block. If False, only the first thread (linear thread ID 0) will have the complete result.

Args:

val (SIMD): The SIMD value to reduce. Each thread contributes its value to find the maximum.

Returns:

SIMD: If broadcast is True, each thread in the block receives the maximum value across the entire block. Otherwise, only the first thread (linear thread ID 0) will have the complete result.