function

# vectorize

vectorize[func: fn[Int](Int, /) capturing -> None, simd_width: Int, /, *, unroll_factor: Int = 1](size: Int)

Simplifies SIMD optimized loops by mapping a function across a range from 0 to size, incrementing by simd_width at each step. The remainder of size % simd_width will run in separate iterations.

The below example demonstrates how you could improve the performance of a loop, by setting multiple values at the same time using SIMD registers on the machine:

from algorithm.functional import vectorize# The amount of elements to loop throughalias size = 10# How many Dtype.int32 elements fit into the SIMD register (4 on 128bit)alias simd_width = simdwidthof[DType.int32]()fn main():    var p = DTypePointer[DType.int32].alloc(size)    # @parameter allows the closure to capture the p pointer    @parameter    fn closure[simd_width: Int](i: Int):        print("storing", simd_width, "els at pos", i)        p.store[width=simd_width](i, i)    vectorize[closure, simd_width](size)    print(p.load[width=size]())

On a machine with a SIMD register size of 128, this will set 4xInt32 values on each iteration. The remainder of 10 % 4 is 2, so those last two elements will be set in two separate iterations:

storing 4 els at pos 0storing 4 els at pos 4storing 1 els at pos 8storing 1 els at pos 9[0, 0, 0, 0, 4, 4, 4, 4, 8, 9]

You can also unroll the loop to potentially improve performance at the cost of binary size:

vectorize[closure, width, unroll_factor=2](size)

In the generated assembly the function calls will be repeated, resulting in fewer arithmetic, comparison, and conditional jump operations. The assembly would look like this in psuedocode:

closure[4](0)closure[4](4)# Remainder loop won't unroll unless size is passed as a parameterfor i in range(8, 10):    closure[1](i)    closure[1](i)

You can pass size as a parameter if it's compile time known to reduce the iterations for the remainder. This only occurs if the remainder is an exponent of 2 (2, 4, 8, 16, ...). The remainder loop will still unroll for performance improvements if not an exponent of 2.

Parameters:

• func (fn[Int](Int, /) capturing -> None): The function that will be called in the loop body.
• simd_width (Int): The SIMD vector width.
• unroll_factor (Int): The unroll factor for the main loop (Default 1).

Args:

• size (Int): The upper limit for the loop.

vectorize[func: fn[Int](Int, /) capturing -> None, simd_width: Int, /, *, size: Int, unroll_factor: Int = 1]()

Simplifies SIMD optimized loops by mapping a function across a range from 0 to size, incrementing by simd_width at each step. The remainder of size % simd_width will run in a single iteration if it's an exponent of 2.

The below example demonstrates how you could improve the performance of a loop, by setting multiple values at the same time using SIMD registers on the machine:

from algorithm.functional import vectorize# The amount of elements to loop throughalias size = 10# How many Dtype.int32 elements fit in SIMD register (4 on 128bit)alias simd_width = simdwidthof[DType.int32]()fn main():    var p = DTypePointer[DType.int32].alloc(size)    # @parameter allows the closure to capture the p pointer    @parameter    fn closure[simd_width: Int](i: Int):        print("storing", simd_width, "els at pos", i)        p.store[width=simd_width](i, i)    vectorize[closure, simd_width, size=size]()    print(p.load[width=size]())

On a machine with a SIMD register size of 128, this will set 4xInt32 values on each iteration. The remainder of 10 % 4 is 2, so those last two elements will be set in a single iteration:

storing 4 els at pos 0storing 4 els at pos 4storing 2 els at pos 8[0, 0, 0, 0, 4, 4, 4, 4, 8, 8]

If the remainder is not an exponent of 2 (2, 4, 8, 16 ...) there will be a seperate iteration for each element. However passing size as a parameter also allows the loop for the remaining elements to be unrolled.

You can also unroll the main loop to potentially improve performance at the cost of binary size:

vectorize[closure, width, size=size, unroll_factor=2]()

In the generated assembly the function calls will be repeated, resulting in fewer arithmetic, comparison, and conditional jump operations. The assembly would look like this in psuedocode:

closure[4](0)closure[4](4)closure[2](8)

Parameters:

• func (fn[Int](Int, /) capturing -> None): The function that will be called in the loop body.
• simd_width (Int): The SIMD vector width.
• size (Int): The upper limit for the loop.
• unroll_factor (Int): The unroll factor for the main loop (Default 1).