Skip to main content
Log in

Mojo function

vectorize

vectorize[func: fn[Int](Int) capturing -> None, simd_width: Int, /, *, unroll_factor: Int = 1](size: Int)

Simplifies SIMD optimized loops by mapping a function across a range from 0 to size, incrementing by simd_width at each step. The remainder of size % simd_width will run in separate iterations.

The below example demonstrates how you could improve the performance of a loop, by setting multiple values at the same time using SIMD registers on the machine:

from algorithm.functional import vectorize

# The amount of elements to loop through
alias size = 10
# How many Dtype.int32 elements fit into the SIMD register (4 on 128bit)
alias simd_width = simdwidthof[DType.int32]()

fn main():
var p = UnsafePointer[Int32].alloc(size)

# @parameter allows the closure to capture the `p` pointer
@parameter
fn closure[simd_width: Int](i: Int):
print("storing", simd_width, "els at pos", i)
p.store[width=simd_width](i, i)

vectorize[closure, simd_width](size)
print(p.load[width=size]())
from algorithm.functional import vectorize

# The amount of elements to loop through
alias size = 10
# How many Dtype.int32 elements fit into the SIMD register (4 on 128bit)
alias simd_width = simdwidthof[DType.int32]()

fn main():
var p = UnsafePointer[Int32].alloc(size)

# @parameter allows the closure to capture the `p` pointer
@parameter
fn closure[simd_width: Int](i: Int):
print("storing", simd_width, "els at pos", i)
p.store[width=simd_width](i, i)

vectorize[closure, simd_width](size)
print(p.load[width=size]())

On a machine with a SIMD register size of 128, this will set 4xInt32 values on each iteration. The remainder of 10 % 4 is 2, so those last two elements will be set in two separate iterations:

storing 4 els at pos 0
storing 4 els at pos 4
storing 1 els at pos 8
storing 1 els at pos 9
[0, 0, 0, 0, 4, 4, 4, 4, 8, 9]
storing 4 els at pos 0
storing 4 els at pos 4
storing 1 els at pos 8
storing 1 els at pos 9
[0, 0, 0, 0, 4, 4, 4, 4, 8, 9]

You can also unroll the loop to potentially improve performance at the cost of binary size:

vectorize[closure, width, unroll_factor=2](size)
vectorize[closure, width, unroll_factor=2](size)

In the generated assembly the function calls will be repeated, resulting in fewer arithmetic, comparison, and conditional jump operations. The assembly would look like this in psuedocode:

closure[4](0)
closure[4](4)
# Remainder loop won't unroll unless `size` is passed as a parameter
for i in range(8, 10):
closure[1](i)
closure[1](i)
closure[4](0)
closure[4](4)
# Remainder loop won't unroll unless `size` is passed as a parameter
for i in range(8, 10):
closure[1](i)
closure[1](i)

You can pass size as a parameter if it's compile time known to reduce the iterations for the remainder. This only occurs if the remainder is an exponent of 2 (2, 4, 8, 16, ...). The remainder loop will still unroll for performance improvements if not an exponent of 2.

Parameters:

  • func (fn[Int](Int) capturing -> None): The function that will be called in the loop body.
  • simd_width (Int): The SIMD vector width.
  • unroll_factor (Int): The unroll factor for the main loop (Default 1).

Args:

  • size (Int): The upper limit for the loop.

vectorize[func: fn[Int](Int) capturing -> None, simd_width: Int, /, *, size: Int, unroll_factor: Int = $2 if triple_is_nvidia_cuda() else 1]()

Simplifies SIMD optimized loops by mapping a function across a range from 0 to size, incrementing by simd_width at each step. The remainder of size % simd_width will run in a single iteration if it's an exponent of 2.

The below example demonstrates how you could improve the performance of a loop, by setting multiple values at the same time using SIMD registers on the machine:

from algorithm.functional import vectorize

# The amount of elements to loop through
alias size = 10
# How many Dtype.int32 elements fit in SIMD register (4 on 128bit)
alias simd_width = simdwidthof[DType.int32]()

fn main():
var p = UnsafePointer[Int32].alloc(size)

# @parameter allows the closure to capture the `p` pointer
@parameter
fn closure[simd_width: Int](i: Int):
print("storing", simd_width, "els at pos", i)
p.store[width=simd_width](i, i)

vectorize[closure, simd_width, size=size]()
print(p.load[width=size]())
from algorithm.functional import vectorize

# The amount of elements to loop through
alias size = 10
# How many Dtype.int32 elements fit in SIMD register (4 on 128bit)
alias simd_width = simdwidthof[DType.int32]()

fn main():
var p = UnsafePointer[Int32].alloc(size)

# @parameter allows the closure to capture the `p` pointer
@parameter
fn closure[simd_width: Int](i: Int):
print("storing", simd_width, "els at pos", i)
p.store[width=simd_width](i, i)

vectorize[closure, simd_width, size=size]()
print(p.load[width=size]())

On a machine with a SIMD register size of 128, this will set 4xInt32 values on each iteration. The remainder of 10 % 4 is 2, so those last two elements will be set in a single iteration:

storing 4 els at pos 0
storing 4 els at pos 4
storing 2 els at pos 8
[0, 0, 0, 0, 4, 4, 4, 4, 8, 8]
storing 4 els at pos 0
storing 4 els at pos 4
storing 2 els at pos 8
[0, 0, 0, 0, 4, 4, 4, 4, 8, 8]

If the remainder is not an exponent of 2 (2, 4, 8, 16 ...) there will be a seperate iteration for each element. However passing size as a parameter also allows the loop for the remaining elements to be unrolled.

You can also unroll the main loop to potentially improve performance at the cost of binary size:

vectorize[closure, width, size=size, unroll_factor=2]()
vectorize[closure, width, size=size, unroll_factor=2]()

In the generated assembly the function calls will be repeated, resulting in fewer arithmetic, comparison, and conditional jump operations. The assembly would look like this in psuedocode:

closure[4](0)
closure[4](4)
closure[2](8)
closure[4](0)
closure[4](4)
closure[2](8)

Parameters:

  • func (fn[Int](Int) capturing -> None): The function that will be called in the loop body.
  • simd_width (Int): The SIMD vector width.
  • size (Int): The upper limit for the loop.
  • unroll_factor (Int): The unroll factor for the main loop (Default 1).

Was this page helpful?