Mojo function

vectorize

vectorize[func: fn, //, simd_width: Int, /, *, unroll_factor: Int = 1](size: Int, closure: func)

Simplifies SIMD optimized loops by mapping a function across a range from 0 to size, incrementing by simd_width at each step. The remainder of size % simd_width will run in separate iterations.

The below example demonstrates how you could improve the performance of a loop, by setting multiple values at the same time using SIMD registers on the machine:

from algorithm.functional import vectorize
from sys import simd_width_of

# The amount of elements to loop through
comptime size = 10
# How many Dtype.int32 elements fit into the SIMD register (4 on 128bit)
comptime simd_width = simd_width_of[DType.int32]()  # assumed to be 4 in this example

fn main():
    var p = alloc[Int32](size)

    fn closure[width: Int](i: Int) unified {mut}:
        print("storing", width, "els at pos", i)
        p.store[width=width](i, i)

    vectorize[simd_width](size, closure)
    print(p.load[width=simd_width]())
    print(p.load[width=simd_width](simd_width))

On a machine with a SIMD register size of 128, this will set 4xInt32 values on each iteration. The remainder of 10 % 4 is 2, so those last two elements will be set in two separate iterations:

storing 4 els at pos 0
storing 4 els at pos 4
storing 1 els at pos 8
storing 1 els at pos 9
[0, 0, 0, 0, 4, 4, 4, 4, 8, 9]

You can also unroll the loop to potentially improve performance at the cost of binary size:

vectorize[closure, width, unroll_factor=2](size)

In the generated assembly the function calls will be repeated, resulting in fewer arithmetic, comparison, and conditional jump operations. The assembly would look like this in pseudocode:

closure[4](0)
closure[4](4)
# Remainder loop won't unroll unless `size` is passed as a parameter
for i in range(8, 10):
    closure[1](i)
    closure[1](i)

You can pass size as a parameter if it's compile time known to reduce the iterations for the remainder. This only occurs if the remainder is an exponent of 2 (2, 4, 8, 16, ...). The remainder loop will still unroll for performance improvements if not an exponent of 2.

Parameters:

func (fn): The function that will be called in the loop body.
simd_width (Int): The SIMD vector width.
unroll_factor (Int): The unroll factor for the main loop (Default 1).

Args:

size (Int): The upper limit for the loop.
closure (func): The captured state of the function bound to func.

vectorize[func: fn, //, simd_width: Int, /, *, unroll_factor: Int = 1](size: Int, closure: func)

Simplifies SIMD optimized loops by mapping a function across a range from 0 to size, incrementing by simd_width at each step. The main loop runs with a fixed SIMD width of simd_width. Any remainder (size % simd_width) is executed with a single final call using predication via the evl (effective vector length) argument.

Compared to vectorize variants that run the remainder as scalar iterations (width=1), this version keeps the SIMD width fixed and passes the number of active lanes in evl for the last (partial) vector. The closure is responsible for honoring evl (e.g. using masked loads/stores) to avoid out-of-bounds accesses.

The below example demonstrates how to set multiple values at the same time using SIMD registers, while handling the tail with evl by generating a mask:

from algorithm.functional import vectorize
from sys import simd_width_of
from math import iota
from sys.intrinsics import masked_store

comptime size = 10
comptime simd_width = simd_width_of[DType.int32]()  # assumed 4 in this example

fn main():
    var p = alloc[Int32](size)

    fn closure[width: Int](i: Int, evl: Int) unified {mut}:
        print("storing", evl, "of", width, "els at pos", i)
        var val = SIMD[DType.int32, width](i)

        # Optimization: Constant propagation eliminates this check in the main loop
        if evl == width:
            p.store[width=width](i, val)
        else:
            # Tail loop: Generate mask from EVL to prevent OOB
            var mask = iota[DType.int32, width]().lt(evl)
            masked_store[width](val, p + i, mask)

    vectorize[simd_width](size, closure)

    print(p.load[width=simd_width]())
    print(p.load[width=simd_width](simd_width))
    print(p.load[width=2](2 * simd_width))
    p.free()

On a machine with a SIMD register size of 128, this will set 4xInt32 values on each full iteration. The remainder of 10 % 4 is 2, so the tail will be handled by a single call with evl=2:

storing 4 of 4 els at pos 0
storing 4 of 4 els at pos 4
storing 2 of 4 els at pos 8
[0, 0, 0, 0]
[4, 4, 4, 4]
[8, 8]

You can also unroll the main loop to potentially improve performance at the cost of binary size:

vectorize[simd_width, unroll_factor=2](size, closure)

In the generated assembly the full-width calls will be repeated, resulting in fewer arithmetic, comparison, and conditional jump operations. In pseudocode:

closure[4](0, 4)
closure[4](4, 4)
closure[4](8, 2)  # single predicated tail call

Notes:

This implementation does not execute the remainder as scalar iterations. The closure must correctly handle evl to keep memory accesses in-bounds.
If size < simd_width, the loop will consist of a single call: closure[simd_width](0, size).

Parameters:

func (fn): The function that will be called in the loop body. It must accept an idx and an evl (effective vector length). For all full SIMD iterations evl == simd_width. For the final partial iteration 0 < evl < simd_width.
simd_width (Int): The SIMD vector width.
unroll_factor (Int): The unroll factor for the main loop (Default 1).

Args:

size (Int): The upper limit for the loop.
closure (func): The captured state of the function bound to func.

vectorize[func: fn, //, simd_width: Int, /, *, size: Int, unroll_factor: Int = size if is_gpu() else 1](closure: func)

Simplifies SIMD optimized loops by mapping a function across a range from 0 to size, incrementing by simd_width at each step. The remainder of size % simd_width will run in a single iteration if it's an exponent of 2.

The below example demonstrates how you could improve the performance of a loop, by setting multiple values at the same time using SIMD registers on the machine:

from algorithm.functional import vectorize
from sys import simd_width_of

# The amount of elements to loop through
comptime size = 10
# How many Dtype.int32 elements fit into the SIMD register (4 on 128bit)
comptime simd_width = simd_width_of[DType.int32]()  # assumed to be 4 in this example

fn main():
    var p = UnsafePointer[Int32].alloc(size)

    # The closure can capture the `p` pointer with unified {mut}
    fn closure[width: Int](i: Int) unified {mut}:
        print("storing", width, "els at pos", i)
        p.store[width=width](i, i)

    vectorize[simd_width](size, closure)
    print(p.load[width=simd_width]())
    print(p.load[width=simd_width](simd_width))

On a machine with a SIMD register size of 128, this will set 4xInt32 values on each iteration. The remainder of 10 % 4 is 2, so those last two elements will be set in a single iteration:

storing 4 els at pos 0
storing 4 els at pos 4
storing 2 els at pos 8
[0, 0, 0, 0, 4, 4, 4, 4, 8, 8]

If the remainder is not an exponent of 2 (2, 4, 8, 16 ...) there will be a separate iteration for each element. However passing size as a parameter also allows the loop for the remaining elements to be unrolled.

You can also unroll the main loop to potentially improve performance at the cost of binary size:

vectorize[width, size=size, unroll_factor=2](closure)

In the generated assembly the function calls will be repeated, resulting in fewer arithmetic, comparison, and conditional jump operations. The assembly would look like this in pseudocode:

closure[4](0)
closure[4](4)
closure[2](8)

Parameters:

func (fn): The function that will be called in the loop body.
simd_width (Int): The SIMD vector width.
size (Int): The upper limit for the loop.
unroll_factor (Int): The unroll factor for the main loop (Default 1).

Args:

closure (func): The captured state of the function bound to func.