Learn how to write high-performance Mojo code and import Python packages.
Not only is Mojo great for writing high-performance code, but it also allows us to leverage the huge Python ecosystem of libraries and tools. With seamless Python interoperability, Mojo can use Python for what it’s good at, especially GUIs, without sacrificing performance in critical code. Let’s take the classic Mandelbrot set algorithm and implement it in Mojo.
This tutorial shows two aspects of Mojo. First, it shows that Mojo can be used to develop fast programs for irregular applications. It also shows how we can leverage Python for visualizing the results.
The core Mandelbrot algorithm involves computing an iterative complex function for each pixel until it “escapes” the complex circle of radius 2, counting the number of iterations to escape:
\[z_{i+1} = z_i^2 + c\]
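The code on this page references a few module-level names that are assumed to be defined earlier in the program: the pixel dtype, the image resolution, the iteration limit, the bounds of the complex plane, and a `simd_width` alias. The sketch below shows one possible setup; the values are illustrative and the import paths are assumptions for an older Mojo release, so they may differ in your version.

```mojo
# Assumed setup for the examples below; values and import paths are illustrative.
from complex import ComplexFloat64, ComplexSIMD
from math import iota
from sys.info import simdwidthof
from tensor import Tensor
from utils.index import Index

alias float_type = DType.float64
alias simd_width = simdwidthof[float_type]()  # pixels processed per vector

alias width = 960       # image width in pixels
alias height = 960      # image height in pixels
alias MAX_ITERS = 200   # iteration cap before a point counts as "inside" the set

alias min_x = -2.0      # bounds of the region of the complex plane to render
alias max_x = 0.6
alias min_y = -1.5
alias max_y = 1.5
```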
```mojo
# Compute the number of steps to escape.
def mandelbrot_kernel(c: ComplexFloat64) -> Int:
    z = c
    for i in range(MAX_ITERS):
        z = z * z + c
        if z.squared_norm() > 4:
            return i
    return MAX_ITERS


def compute_mandelbrot() -> Tensor[float_type]:
    # Create a matrix. Each element of the matrix corresponds to a pixel.
    t = Tensor[float_type](height, width)

    dx = (max_x - min_x) / width
    dy = (max_y - min_y) / height

    y = min_y
    for row in range(height):
        x = min_x
        for col in range(width):
            t[Index(row, col)] = mandelbrot_kernel(ComplexFloat64(x, y))
            x += dx
        y += dy
    return t
```
Plotting the number of iterations to escape with some color gives us the canonical Mandelbrot set plot. To render it we can directly leverage Python’s matplotlib right from Mojo!
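The plotting helper itself isn't shown here; a `show_plot` function (the name used by the benchmark code later on) can be written with Mojo's Python interop. The sketch below is an illustration of that approach rather than the tutorial's exact helper: it assumes numpy and matplotlib are installed in the active Python environment, and the `itemset` call assumes NumPy 1.x.

```mojo
from python import Python

def show_plot(t: Tensor[float_type]):
    # Import Python packages through Mojo's Python interop layer.
    np = Python.import_module("numpy")
    plt = Python.import_module("matplotlib.pyplot")

    # Copy the escape counts into a NumPy array, one pixel at a time.
    numpy_array = np.zeros((height, width), np.float64)
    for row in range(height):
        for col in range(width):
            numpy_array.itemset((row, col), t[Index(row, col)])

    # Hand the finished buffer to matplotlib for rendering.
    plt.imshow(numpy_array)
    plt.axis("off")
    plt.show()

show_plot(compute_mandelbrot())
```

Calling Python this way keeps the heavy numeric work in Mojo and hands the finished buffer to matplotlib only for rendering.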
We showed a naive implementation of the Mandelbrot algorithm, but there are two things we can do to speed it up. We can early-stop the loop iteration when a pixel is known to have escaped, and we can leverage Mojo’s access to hardware by vectorizing the loop, computing multiple pixels simultaneously. To do that we will use the vectorize higher order generator.
We start by defining our main iteration loop in a vectorized fashion:
```mojo
fn mandelbrot_kernel_SIMD[
    simd_width: Int
](c: ComplexSIMD[float_type, simd_width]) -> SIMD[float_type, simd_width]:
    """A vectorized implementation of the inner mandelbrot computation."""
    let cx = c.re
    let cy = c.im
    var x = SIMD[float_type, simd_width](0)
    var y = SIMD[float_type, simd_width](0)
    var y2 = SIMD[float_type, simd_width](0)
    var iters = SIMD[float_type, simd_width](0)

    var t: SIMD[DType.bool, simd_width] = True
    for i in range(MAX_ITERS):
        if not t.reduce_or():
            break
        y2 = y * y
        y = x.fma(y + y, cy)
        t = x.fma(x, y2) <= 4
        x = x.fma(x, cx - y2)
        iters = t.select(iters + 1, iters)
    return iters
```
The function above is parameterized on `simd_width` and processes `simd_width` pixels at a time; it only exits the loop once every pixel in the vector lane has escaped (or the iteration cap is reached). We can use the same iteration structure as before, but this time we vectorize within each row. The vectorize generator makes this a simple function call.
```mojo
fn vectorized():
    let t = Tensor[float_type](height, width)

    @parameter
    fn worker(row: Int):
        let scale_x = (max_x - min_x) / width
        let scale_y = (max_y - min_y) / height

        @parameter
        fn compute_vector[simd_width: Int](col: Int):
            """Each time we operate on a `simd_width` vector of pixels."""
            let cx = min_x + (col + iota[float_type, simd_width]()) * scale_x
            let cy = min_y + row * scale_y
            let c = ComplexSIMD[float_type, simd_width](cx, cy)
            t.data().simd_store[simd_width](
                row * width + col, mandelbrot_kernel_SIMD[simd_width](c)
            )

        # Vectorize the call to compute_vector where each call gets a chunk of pixels.
        vectorize[simd_width, compute_vector](width)

    @parameter
    fn bench[simd_width: Int]():
        for row in range(height):
            worker(row)

    let vectorized = Benchmark().run[bench[simd_width]]() / 1e6
    print("Vectorized", ":", vectorized, "ms")

    try:
        _ = show_plot(t)
    except e:
        print("failed to show plot:", e.value)

vectorized()
```
Vectorized : 12.177345000000001 ms
Parallelizing Mandelbrot
While the vectorized implementation above is efficient, we can get better performance by parallelizing across the rows. This again is simple in Mojo using the parallelize higher-order function. Only the function that performs the invocation needs to change.
```mojo
fn parallelized():
    let t = Tensor[float_type](height, width)

    @parameter
    fn worker(row: Int):
        let scale_x = (max_x - min_x) / width
        let scale_y = (max_y - min_y) / height

        @parameter
        fn compute_vector[simd_width: Int](col: Int):
            """Each time we operate on a `simd_width` vector of pixels."""
            let cx = min_x + (col + iota[float_type, simd_width]()) * scale_x
            let cy = min_y + row * scale_y
            let c = ComplexSIMD[float_type, simd_width](cx, cy)
            t.data().simd_store[simd_width](
                row * width + col, mandelbrot_kernel_SIMD[simd_width](c)
            )

        # Vectorize the call to compute_vector where each call gets a chunk of pixels.
        vectorize[simd_width, compute_vector](width)

    with Runtime() as rt:
        @parameter
        fn bench_parallel[simd_width: Int]():
            parallelize[worker](rt, height, height)

        let parallelized = Benchmark().run[bench_parallel[simd_width]]() / 1e6
        print("Parallelized:", parallelized, "ms")

    try:
        _ = show_plot(t)
    except e:
        print("failed to show plot:", e.value)

parallelized()
```
Parallelized: 1.4245639999999999 ms
Benchmarking
In this section, we increase the size to 4096x4096 and run 1000 iterations to create a larger test that stresses the CPU.
```mojo
fn compare():
    let t = Tensor[float_type](height, width)

    @parameter
    fn worker(row: Int):
        let scale_x = (max_x - min_x) / width
        let scale_y = (max_y - min_y) / height

        @parameter
        fn compute_vector[simd_width: Int](col: Int):
            """Each time we operate on a `simd_width` vector of pixels."""
            let cx = min_x + (col + iota[float_type, simd_width]()) * scale_x
            let cy = min_y + row * scale_y
            let c = ComplexSIMD[float_type, simd_width](cx, cy)
            t.data().simd_store[simd_width](
                row * width + col, mandelbrot_kernel_SIMD[simd_width](c)
            )

        # Vectorize the call to compute_vector where each call gets a chunk of pixels.
        vectorize[simd_width, compute_vector](width)

    # Vectorized
    @parameter
    fn bench[simd_width: Int]():
        for row in range(height):
            worker(row)

    let vectorized = Benchmark().run[bench[simd_width]]() / 1e6
    print("Number of threads:", num_cores())
    print("Vectorized:", vectorized, "ms")

    # Parallelized
    with Runtime() as rt:
        @parameter
        fn bench_parallel[simd_width: Int]():
            parallelize[worker](rt, height, height)

        let parallelized = Benchmark().run[bench_parallel[simd_width]]() / 1e6
        print("Parallelized:", parallelized, "ms")
        print("Parallel speedup:", vectorized / parallelized)

    _ = t  # Make sure tensor isn't destroyed before benchmark is finished

compare()
```
Number of threads: 16
Vectorized: 12.171849 ms
Parallelized: 1.3043979999999999 ms
Parallel speedup: 9.3313919524562294