Mojo function

vectorize

vectorize[origins: OriginSet, //, func: fn[Int](Int) capturing -> None, simd_width: Int, /, *, unroll_factor: Int = 1](size: Int)

Simplifies SIMD-optimized loops by mapping a function across the range from 0 to size, advancing by simd_width at each step. The size % simd_width elements left over at the end are processed one at a time, in separate iterations of width 1.

The example below shows how you could improve the performance of a loop by setting multiple values at once using the machine's SIMD registers:

from algorithm.functional import vectorize
from sys import simdwidthof

# The number of elements to loop through
alias size = 10
# How many DType.int32 elements fit into a SIMD register (4 for a 128-bit register)
alias simd_width = simdwidthof[DType.int32]()  # assumed to be 4 in this example

fn main():
    var p = UnsafePointer[Int32].alloc(size)

    # @parameter allows the closure to capture the `p` pointer
    @parameter
    fn closure[width: Int](i: Int):
        print("storing", width, "els at pos", i)
        p.store[width=width](i, i)

    vectorize[closure, simd_width](size)
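    # Print the first and second groups of simd_width elements
    print(p.load[width=simd_width]())
    print(p.load[width=simd_width](simd_width))
    # Release the buffer allocated above
    p.free()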

On a machine with 128-bit SIMD registers, this sets 4 Int32 values on each iteration. The remainder of 10 % 4 is 2, so the last two elements are set in two separate iterations of width 1, leaving the buffer holding 0, 0, 0, 0, 4, 4, 4, 4, 8, 9. The two print calls load the first eight of those elements:

storing 4 els at pos 0
storing 4 els at pos 4
storing 1 els at pos 8
storing 1 els at pos 9
[0, 0, 0, 0]
[4, 4, 4, 4]
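
The iteration pattern in the output above can be reproduced by hand with two ordinary loops, which may help when reasoning about which positions and widths the closure receives. This is only a sketch of the schedule described here, not Mojo's actual implementation of vectorize; the name `vector_end` is hypothetical and introduced for illustration.

```mojo
from sys import simdwidthof

alias size = 10
alias simd_width = simdwidthof[DType.int32]()  # assumed to be 4 in this example
# Hypothetical helper name: the largest multiple of simd_width that fits in size
alias vector_end = (size // simd_width) * simd_width


fn main():
    var p = UnsafePointer[Int32].alloc(size)

    @parameter
    fn closure[width: Int](i: Int):
        print("storing", width, "els at pos", i)
        p.store[width=width](i, i)

    # Main loop: full-width steps at 0, simd_width, 2 * simd_width, ...
    for i in range(0, vector_end, simd_width):
        closure[simd_width](i)

    # Remainder loop: the leftover elements, one at a time
    for i in range(vector_end, size):
        closure[1](i)

    p.free()
```

Assuming simd_width is 4, these loops call the closure with the same widths and positions as the four "storing" lines above.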

You can also unroll the loop to potentially improve performance at the cost of binary size:

vectorize[closure, simd_width, unroll_factor=2](size)

In the generated assembly, the calls in the main loop are repeated inline, which reduces the number of arithmetic, comparison, and conditional jump operations. In pseudocode, the assembly would look like this:

closure[4](0)
closure[4](4)
# The remainder loop isn't unrolled unless `size` is passed as a parameter
for i in range(8, 10):
    closure[1](i)

If size is known at compile time, you can pass it as a parameter to reduce the number of iterations needed for the remainder. The reduction only happens when the remainder is a power of 2 (2, 4, 8, 16, ...); otherwise the remainder still runs one element per iteration, but that loop is unrolled for better performance.
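
As a rough illustration of the two call forms, the sketch below runs the same closure first with size as a runtime argument and then with size as a compile-time keyword parameter. It reuses the names from the example above and relies only on the two signatures documented on this page; the output is not shown because it depends on the machine's SIMD width.

```mojo
from algorithm.functional import vectorize
from sys import simdwidthof

# Known at compile time, so it can be passed either way
alias size = 10
alias simd_width = simdwidthof[DType.int32]()


fn main():
    var p = UnsafePointer[Int32].alloc(size)

    @parameter
    fn closure[width: Int](i: Int):
        print("storing", width, "els at pos", i)
        p.store[width=width](i, i)

    # Runtime form: size is an ordinary argument
    vectorize[closure, simd_width](size)

    # Compile-time form: size is a keyword parameter and the call takes no arguments
    vectorize[closure, simd_width, size=size]()

    p.free()
```

The compile-time form is only available when size is an alias or another compile-time value; a length computed at run time has to use the first form.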

Parameters:

origins (OriginSet): The capture origins.
func (fn[Int](Int) capturing -> None): The function that will be called in the loop body.
simd_width (Int): The SIMD vector width.
unroll_factor (Int): The unroll factor for the main loop (Default 1).

Args:

size (Int): The upper limit for the loop.

vectorize[origins: OriginSet, //, func: fn[Int](Int) capturing -> None, simd_width: Int, /, *, size: Int, unroll_factor: Int = size if is_nvidia_gpu() else 1]()

Simplifies SIMD-optimized loops by mapping a function across the range from 0 to size, advancing by simd_width at each step. The remainder of size % simd_width runs in a single iteration if it is a power of 2.

The example below shows how you could improve the performance of a loop by setting multiple values at once using the machine's SIMD registers:

from algorithm.functional import vectorize
from sys import simdwidthof

# The number of elements to loop through
alias size = 10
# How many DType.int32 elements fit into a SIMD register (4 for a 128-bit register)
alias simd_width = simdwidthof[DType.int32]()  # assumed to be 4 in this example

fn main():
    var p = UnsafePointer[Int32].alloc(size)

    # @parameter allows the closure to capture the `p` pointer
    @parameter
    fn closure[width: Int](i: Int):
        print("storing", width, "els at pos", i)
        p.store[width=width](i, i)

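    # `size` is passed as a keyword parameter, so the call takes no arguments
    vectorize[closure, simd_width, size=size]()
    # Print the first and second groups of simd_width elements
    print(p.load[width=simd_width]())
    print(p.load[width=simd_width](simd_width))
    # Release the buffer allocated above
    p.free()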

On a machine with 128-bit SIMD registers, this sets 4 Int32 values on each iteration. The remainder of 10 % 4 is 2, so the last two elements are set in a single iteration of width 2, leaving the buffer holding 0, 0, 0, 0, 4, 4, 4, 4, 8, 8. The two print calls load the first eight of those elements:

storing 4 els at pos 0
storing 4 els at pos 4
storing 2 els at pos 8
[0, 0, 0, 0]
[4, 4, 4, 4]

If the remainder is not a power of 2 (2, 4, 8, 16, ...), each remaining element gets its own iteration. However, passing size as a parameter also allows the loop over those remaining elements to be unrolled.
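
To make the per-element case concrete, here is a small variation of the example above with a hypothetical size of 11. Assuming simd_width is 4, the remainder is 3, so by the rule just stated the trailing elements at positions 8, 9, and 10 would each be handled by their own width-1 call. On a machine with a different SIMD width the split differs, so no output is shown.

```mojo
from algorithm.functional import vectorize
from sys import simdwidthof

# Hypothetical size: with simd_width == 4 the remainder is 11 % 4 == 3
alias size = 11
alias simd_width = simdwidthof[DType.int32]()  # assumed to be 4 in this example


fn main():
    var p = UnsafePointer[Int32].alloc(size)

    @parameter
    fn closure[width: Int](i: Int):
        # Prints the width chosen for each call so the split is visible
        print("storing", width, "els at pos", i)
        p.store[width=width](i, i)

    # size is a compile-time parameter, so the remainder handling is fixed at compile time
    vectorize[closure, simd_width, size=size]()

    p.free()
```

Changing size to 10 or 12 in this sketch is a quick way to compare against the power-of-2 remainder and the no-remainder cases.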

You can also unroll the main loop to potentially improve performance at the cost of binary size:

vectorize[closure, simd_width, size=size, unroll_factor=2]()

In the generated assembly, the calls in the main loop are repeated inline, which reduces the number of arithmetic, comparison, and conditional jump operations. In pseudocode, the assembly would look like this:

closure[4](0)
closure[4](4)
closure[2](8)

Parameters:

origins (OriginSet): The capture origins.
func (fn[Int](Int) capturing -> None): The function that will be called in the loop body.
simd_width (Int): The SIMD vector width.
size (Int): The upper limit for the loop.
unroll_factor (Int): The unroll factor for the main loop (Default: size on NVIDIA GPUs, otherwise 1).