Skip to main content

Mandelbrot in Mojo with Python plots

Not only is Mojo great for writing high-performance code, but it also allows us to leverage the huge Python ecosystem of libraries and tools. With seamless Python interoperability, Mojo can use Python for what it's good at, especially GUIs, without sacrificing performance in critical code. Let's take the classic Mandelbrot set algorithm and implement it in Mojo.

This tutorial shows two aspects of Mojo. First, it shows that Mojo can be used to develop fast programs for irregular applications. It also shows how we can leverage Python for visualizing the results.

Code
import benchmark
from complex import ComplexSIMD, ComplexFloat64
from math import iota
from python import Python
from sys.info import num_physical_cores
from algorithm import parallelize, vectorize
from tensor import Tensor
from utils.index import Index

alias float_type = DType.float64
alias simd_width = 2 * simdwidthof[float_type]()

First set some parameters, you can try changing these to see different results:

alias width = 960
alias height = 960
alias MAX_ITERS = 200

alias min_x = -2.0
alias max_x = 0.6
alias min_y = -1.5
alias max_y = 1.5

The core Mandelbrot algorithm involves computing an iterative complex function for each pixel until it "escapes" the complex circle of radius 2, counting the number of iterations to escape:

zi+1=zi2+cz_{i+1} = z_i^2 + c

# Compute the number of steps to escape.
def mandelbrot_kernel(c: ComplexFloat64) -> Int:
z = c
for i in range(MAX_ITERS):
z = z * z + c
if z.squared_norm() > 4:
return i
return MAX_ITERS


def compute_mandelbrot() -> Tensor[float_type]:
# create a matrix. Each element of the matrix corresponds to a pixel
t = Tensor[float_type](height, width)

dx = (max_x - min_x) / width
dy = (max_y - min_y) / height

y = min_y
for row in range(height):
x = min_x
for col in range(width):
t[Index(row, col)] = mandelbrot_kernel(ComplexFloat64(x, y))
x += dx
y += dy
return t

Plotting the number of iterations to escape with some color gives us the canonical Mandelbrot set plot. To render it we can directly leverage Python's matplotlib right from Mojo!

First install the required libraries:

%%python
from importlib.util import find_spec
import shutil
import subprocess

fix = """
-------------------------------------------------------------------------
fix following the steps here:
https://github.com/modularml/mojo/issues/1085#issuecomment-1771403719
-------------------------------------------------------------------------
"""

def install_if_missing(name: str):
if find_spec(name):
return

print(f"{name} not found, installing...")
try:
if shutil.which('python3'): python = "python3"
elif shutil.which('python'): python = "python"
else: raise ("python not on path" + fix)
subprocess.check_call([python, "-m", "pip", "install", name])
except:
raise ImportError(f"{name} not found" + fix)

install_if_missing("numpy")
install_if_missing("matplotlib")
def show_plot(tensor: Tensor[float_type]):
alias scale = 10
alias dpi = 64

np = Python.import_module("numpy")
plt = Python.import_module("matplotlib.pyplot")
colors = Python.import_module("matplotlib.colors")

numpy_array = np.zeros((height, width), np.float64)

for row in range(height):
for col in range(width):
numpy_array.itemset((col, row), tensor[col, row])

fig = plt.figure(1, [scale, scale * height // width], dpi)
ax = fig.add_axes([0.0, 0.0, 1.0, 1.0], False, 1)
light = colors.LightSource(315, 10, 0, 1, 1, 0)

image = light.shade(numpy_array, plt.cm.hot, colors.PowerNorm(0.3), "hsv", 0, 0, 1.5)
plt.imshow(image)
plt.axis("off")
plt.show()

show_plot(compute_mandelbrot())

png

Vectorizing Mandelbrot

We showed a naive implementation of the Mandelbrot algorithm, but there are two things we can do to speed it up. We can early-stop the loop iteration when a pixel is known to have escaped, and we can leverage Mojo's access to hardware by vectorizing the loop, computing multiple pixels simultaneously. To do that we will use the vectorize higher order generator.

We start by defining our main iteration loop in a vectorized fashion

fn mandelbrot_kernel_SIMD[
simd_width: Int
](c: ComplexSIMD[float_type, simd_width]) -> SIMD[float_type, simd_width]:
"""A vectorized implementation of the inner mandelbrot computation."""
var cx = c.re
var cy = c.im
var x = SIMD[float_type, simd_width](0)
var y = SIMD[float_type, simd_width](0)
var y2 = SIMD[float_type, simd_width](0)
var iters = SIMD[float_type, simd_width](0)

var t: SIMD[DType.bool, simd_width] = True
for i in range(MAX_ITERS):
if not t.reduce_or():
break
y2 = y*y
y = x.fma(y + y, cy)
t = x.fma(x, y2) <= 4
x = x.fma(x, cx - y2)
iters = t.select(iters + 1, iters)
return iters

The above function is parameterized on the simd_width and processes simd_width pixels. It only escapes once all pixels within the vector lane are done. We can use the same iteration loop as above, but this time we vectorize within each row instead. We use the vectorize generator to make this a simple function call.

fn vectorized():
var t = Tensor[float_type](height, width)

@parameter
fn worker(row: Int):
var scale_x = (max_x - min_x) / width
var scale_y = (max_y - min_y) / height

@__copy_capture(scale_x, scale_y)
@parameter
fn compute_vector[simd_width: Int](col: Int):
"""Each time we oeprate on a `simd_width` vector of pixels."""
var cx = min_x + (col + iota[float_type, simd_width]()) * scale_x
var cy = min_y + row * scale_y
var c = ComplexSIMD[float_type, simd_width](cx, cy)
t.data().store(
row * width + col, mandelbrot_kernel_SIMD[simd_width](c)
)

# Vectorize the call to compute_vector where call gets a chunk of pixels.
vectorize[compute_vector, simd_width](width)

@parameter
fn bench[simd_width: Int]():
for row in range(height):
worker(row)

var vectorized = benchmark.run[bench[simd_width]](
max_runtime_secs=0.5
).mean(benchmark.Unit.ms)

print("Vectorized", ":", vectorized, "ms")

try:
_ = show_plot(t)
except e:
print("failed to show plot:", e)

vectorized()
Vectorized : 12.177345000000001 ms

png

Parallelizing Mandelbrot

While the vectorized implementation above is efficient, we can get better performance by parallelizing on the cols. This again is simple in Mojo using the parallelize higher order function. Only the function that performs the invocation needs to change.

fn parallelized():
var t = Tensor[float_type](height, width)

@parameter
fn worker(row: Int):
var scale_x = (max_x - min_x) / width
var scale_y = (max_y - min_y) / height

@__copy_capture(scale_x, scale_y)
@parameter
fn compute_vector[simd_width: Int](col: Int):
"""Each time we oeprate on a `simd_width` vector of pixels."""
var cx = min_x + (col + iota[float_type, simd_width]()) * scale_x
var cy = min_y + row * scale_y
var c = ComplexSIMD[float_type, simd_width](cx, cy)
t.data().store(row * width + col, mandelbrot_kernel_SIMD[simd_width](c))

# Vectorize the call to compute_vector where call gets a chunk of pixels.
vectorize[compute_vector, simd_width](width)


@parameter
fn bench_parallel[simd_width: Int]():
parallelize[worker](height, height)

var parallelized = benchmark.run[bench_parallel[simd_width]](
max_runtime_secs=0.5
).mean(benchmark.Unit.ms)

print("Parallelized:", parallelized, benchmark.Unit.ms)

try:
_ = show_plot(t)
except e:
print("failed to show plot:", e)

parallelized()
Parallelized: 1.4245639999999999 ms

png

Benchmarking

In this section we increase the size to 4096x4096 and run 1000 iterations for a larger test to stress the CPU

fn compare():
var t = Tensor[float_type](height, width)

@parameter
fn worker(row: Int):
var scale_x = (max_x - min_x) / width
var scale_y = (max_y - min_y) / height

@__copy_capture(scale_x, scale_y)
@parameter
fn compute_vector[simd_width: Int](col: Int):
"""Each time we oeprate on a `simd_width` vector of pixels."""
var cx = min_x + (col + iota[float_type, simd_width]()) * scale_x
var cy = min_y + row * scale_y
var c = ComplexSIMD[float_type, simd_width](cx, cy)
t.data().store(
row * width + col, mandelbrot_kernel_SIMD[simd_width](c)
)

# Vectorize the call to compute_vector where call gets a chunk of pixels.
vectorize[compute_vector, simd_width](width)

@parameter
fn bench[simd_width: Int]():
for row in range(height):
worker(row)

var vectorized = benchmark.run[bench[simd_width]](
max_runtime_secs=0.5
).mean(benchmark.Unit.ms)
print("Number of threads:", num_physical_cores())
print("Vectorized:", vectorized, "ms")

# Parallelized
@parameter
fn bench_parallel[simd_width: Int]():
parallelize[worker](height, height)

var parallelized = benchmark.run[bench_parallel[simd_width]](
max_runtime_secs=0.5
).mean(benchmark.Unit.ms)
print("Parallelized:", parallelized, "ms")
print("Parallel speedup:", vectorized / parallelized)

_ = t # Make sure tensor isn't destroyed before benchmark is finished
compare()
Number of threads: 16 Vectorized: 12.171849 ms Parallelized: 1.3043979999999999 ms Parallel speedup: 9.3313919524562294