
Get started with GPU programming with Mojo and the MAX Driver

This tutorial introduces you to GPU programming using Mojo and the MAX Driver. You'll learn how to write a simple program that performs vector addition on a GPU, exploring fundamental concepts of GPU programming along the way.

By the end of this tutorial, you will:

  • Understand basic GPU programming concepts like grids and thread blocks
  • Learn how to move data between CPU and GPU memory
  • Write and compile a simple GPU kernel function
  • Execute parallel computations on the GPU

We'll build everything step-by-step, starting with the basics and gradually adding more complexity. The concepts you learn here will serve as a foundation for more advanced GPU programming with Mojo.

System requirements: a system with a supported GPU. The program we'll write checks for a GPU at run time and exits if none is available.

1. Create a Mojo project with magic

We'll start by using the magic CLI to create a virtual environment and generate our initial project directory.

  1. If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:

    curl -ssL https://magic.modular.com/ | bash

    Then run the source command that's printed in your terminal.

  2. Navigate to the directory in which you want to create the project and execute:

    magic init gpu-intro --format mojoproject

    This creates a project directory named gpu-intro.

  3. Let's go into the directory and verify the project is configured correctly by checking the version of Mojo that's installed within our project's virtual environment:

    cd gpu-intro
    magic run mojo --version

    You should see the version of Mojo installed in the project's virtual environment, which by default is the latest nightly version. Because we used the --format mojoproject option when creating the project, magic automatically added the max package as a dependency, which includes Mojo and the MAX libraries.

  4. Activate the project's virtual environment:

    magic shell

    Later on, when you want to exit the virtual environment, just type exit.

2. Get references to the CPU and GPU

When using the MAX Driver, the Device class represents a logical instance of a device, for example, a CPU or GPU. The cpu() and accelerator() functions return a Device reference to the CPU and GPU, respectively. If no GPU is available, the accelerator() function raises an error. You can use the has_accelerator() function to check if a GPU is available.

So let's start by writing a program that checks if a GPU is available and then creates a CPU and GPU device. Using any editor, create a file named vector_addition.mojo with the following code:

vector_addition.mojo
from max.driver import accelerator, cpu
from sys import exit, has_accelerator

def main():
    if not has_accelerator():
        print("A GPU is required to run this program")
        exit()

    host_device = cpu()
    print("Found the CPU device")
    gpu_device = accelerator()
    print("Found the GPU device")

Save the file and run it using the mojo CLI:

mojo vector_addition.mojo

You should see the following output:

Found the CPU device
Found the GPU device

3. Define a simple kernel

A GPU kernel is simply a function that runs on a GPU, executing a specific computation on a large dataset in parallel across thousands or millions of threads. You might already be familiar with threads when programming for a CPU, but GPU threads are different. On a CPU, threads are managed by the operating system and can perform completely independent tasks, such as managing a user interface, fetching data from a database, and so on. But on a GPU, threads are managed by the GPU itself. All the threads on a GPU execute the same kernel function, but they each work on a different part of the data.

When you run a kernel, you need to specify the number of threads you want to use. The number of threads you specify depends on the size of the data you want to process and the amount of parallelism you want to achieve. A common strategy is to use one thread per element of data in the result. So if you're performing an elementwise addition of two 1,024-element vectors, you'd use 1,024 threads.
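A quick bit of arithmetic shows how the thread count translates into a launch configuration once threads are grouped into blocks (a concept we introduce next). Here's a minimal sketch, using the ceildiv() function from the math module (which we'll also use later in this tutorial) to round up when the vector size isn't an exact multiple of the block size. The variable names here are ours, not part of any API:

from math import ceildiv

def main():
    alias vector_size = 1024
    alias threads_per_block = 128
    # One thread per element: round up so that every element gets a thread.
    num_blocks = ceildiv(vector_size, threads_per_block)
    print(num_blocks, "blocks x", threads_per_block, "threads per block")  # 8 x 128 = 1024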

A grid is the top-level organizational structure for the threads executing a kernel function. A grid consists of multiple thread blocks, which are further divided into individual threads that execute the kernel function concurrently. The GPU assigns a unique block index to each thread block, and a unique thread index to each thread within a block. Threads within the same thread block can share data through shared memory and synchronize using built-in mechanisms, but they cannot directly communicate with threads in other blocks. For this tutorial, we won't get into the details of why or how to do this, but it's an important concept to keep in mind when you're writing more complex kernels.

To better understand how grids, thread blocks, and threads work together, let's write a simple kernel function that prints the thread block and thread indices. Add the following code to your vector_addition.mojo file:

vector_addition.mojo
from gpu.id import block_idx, thread_idx

fn print_threads():
    """Print thread IDs."""
    print("Block index: [",
        block_idx.x,
        "]\tThread index: [",
        thread_idx.x,
        "]"
    )

4. Compile and run the kernel

Next, we need to update the main() function to compile the kernel function for our GPU and then run it, specifying the number of thread blocks in the grid and the number of threads per thread block. For this initial example, let's define a grid consisting of 2 thread blocks, each with 64 threads. Update the imports and modify the main() function so that your program looks like this:

vector_addition.mojo
from gpu.host import Dim
from gpu.id import block_idx, thread_idx
from max.driver import Accelerator, Device, accelerator, cpu
from sys import exit, has_accelerator

fn print_threads():
    """Print thread IDs."""
    print("Block index: [",
        block_idx.x,
        "]\tThread index: [",
        thread_idx.x,
        "]"
    )

def main():
    if not has_accelerator():
        print("A GPU is required to run this program")
        exit()

    host_device = cpu()
    print("Found the CPU device")
    gpu_device = accelerator()
    print("Found the GPU device")

    print_threads_gpu = Accelerator.compile[print_threads](gpu_device)

    print_threads_gpu(gpu_device, grid_dim=Dim(2), block_dim=Dim(64))

    # Required for now to keep the main thread alive until the GPU is done
    Device.wait_for(gpu_device)
    print("Program finished")

Save the file and run it:

mojo vector_addition.mojo

You should see something like the following output (which is abbreviated here):

Found the CPU device
Found the GPU device
Block index: [ 1 ] Thread index: [ 0 ]
Block index: [ 1 ] Thread index: [ 1 ]
Block index: [ 1 ] Thread index: [ 2 ]
...
Block index: [ 0 ] Thread index: [ 30 ]
Block index: [ 0 ] Thread index: [ 31 ]
Program finished

The Accelerator.compile() function compiles a kernel function so that it can run on a particular GPU architecture. You specify the name of the kernel function as a compile-time Mojo parameter and the target GPU device as a run-time Mojo argument. (See the Functions section of the Mojo Manual for more information on Mojo function arguments and the Parameters section for more information on Mojo compile-time parameters and metaprogramming.)

When you invoke the compiled kernel function, the MAX Driver executes it asynchronously on the GPU. You must provide the following arguments in this order:

  • The target GPU device
  • Any additional arguments specified by the kernel function definition (none, in this case)
  • The grid dimensions, using the grid_dim keyword argument
  • The thread block dimensions, using the block_dim keyword argument

We're invoking the compiled kernel function with grid_dim=Dim(2) and block_dim=Dim(64), which means we're using a grid of 2 thread blocks, with 64 threads each, for a total of 128 threads in the grid.

When you run a kernel, the GPU assigns each thread block within the grid to a streaming multiprocessor for execution. A streaming multiprocessor (SM) is the fundamental processing unit of a GPU, designed to execute multiple parallel workloads efficiently. Each SM contains several cores, which perform the actual computations of the threads executing on the SM, along with shared resources like registers, shared memory, and control mechanisms to coordinate the execution of threads. The number of SMs and the number of cores on a GPU depend on its architecture. For example, the NVIDIA H100 PCIe contains 114 SMs, with 128 32-bit floating point cores per SM.

Additionally, when an SM is assigned a thread block, it divides the block into multiple warps, which are groups of 32 or 64 threads, depending on the GPU architecture. These threads execute the same instruction simultaneously in a single instruction, multiple threads (SIMT) model. The SM's warp scheduler coordinates the execution of warps on an SM's cores.

Warps are used to efficiently utilize GPU hardware by maximizing throughput and minimizing control overhead. Since GPUs are designed for high-performance parallel processing, grouping threads into warps allows for streamlined instruction scheduling and execution, reducing the complexity of managing individual threads. Multiple warps from multiple thread blocks can be active within an SM at any given time, enabling the GPU to keep execution units busy. For example, if the threads of a particular warp are blocked waiting for data from memory, the warp scheduler can immediately switch execution to another warp that's ready to run.

5. Manage grid dimensions

The grid in the previous step consisted of a one-dimensional grid of 2 thread blocks with 64 threads in each block. However, you can also organize the thread blocks in a two- or even a three-dimensional grid. Similarly, you can arrange the threads in a thread block across one, two, or three dimensions. Typically, you determine the dimensions of the grid and thread blocks based on the dimensionality of the data to process. For example, you might choose a 1-dimensional grid for processing large vectors, a 2-dimensional grid for processing matrices, and a 3-dimensional grid for processing the frames of a video.

To better understand how grids, thread blocks, and threads work together, let's modify our print_threads() kernel function to print the x, y, and z components of the thread block and thread indices for each thread.

vector_addition.mojo
fn print_threads():
    """Print thread IDs."""
    print("Block index: [",
        block_idx.x, block_idx.y, block_idx.z,
        "]\tThread index: [",
        thread_idx.x, thread_idx.y, thread_idx.z,
        "]"
    )

Then, update the main() function to invoke the compiled kernel function with a 2x2x1 grid of thread blocks and a 16x4x2 arrangement of threads within each thread block:

vector_addition.mojo
    print_threads_gpu(
        gpu_device,
        grid_dim=Dim(2, 2, 1),
        block_dim=Dim(16, 4, 2)
    )

Save the file and run it again:

mojo vector_addition.mojo

You should see something like the following output (which is abbreviated here):

Found the CPU device
Found the GPU device
Block index: [ 0 1 0 ] Thread index: [ 0 2 0 ]
Block index: [ 0 1 0 ] Thread index: [ 1 2 0 ]
Block index: [ 0 1 0 ] Thread index: [ 2 2 0 ]
...
Block index: [ 1 1 0 ] Thread index: [ 14 3 0 ]
Block index: [ 1 1 0 ] Thread index: [ 15 3 0 ]
Program finished

Try changing the grid and thread block dimensions to see how the output changes.
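For instance, here's one variation to try (a hedged example; any small dimensions work for this demo, as long as the total threads per block stays within your GPU's limit): a one-dimensional grid of 3 thread blocks, each with an 8x8 arrangement of threads:

    print_threads_gpu(
        gpu_device,
        grid_dim=Dim(3),
        block_dim=Dim(8, 8)
    )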

6. Model our data using Tensor

Now that you understand how to manage grid dimensions, let's get ready to create a kernel that performs a simple elementwise addition of two vectors of floating point numbers.

We'll start by determining how to represent our data. Although we're going to be using one-dimensional data, we'll use a data type called a tensor that's capable of representing multi-dimensional data — basically, a multi-dimensional array. Because we have only a single dimension of data, this will be a rank-1 tensor.

We'll use the Tensor class from the max.driver package to represent our data. This Tensor class is a convenience class that can allocate memory for the tensor on either the CPU or the GPU. It also includes methods for moving and copying the data between the CPU and the GPU.

Let's add Tensor to the list of imports and then update main() to create two input tensors on the CPU. We won't need the print_threads() kernel function anymore, so we can remove it and the code to compile and invoke it. So after all that, your vector_addition.mojo file should look like this:

vector_addition.mojo
from gpu.host import Dim
from gpu.id import block_idx, thread_idx
from max.driver import Accelerator, Tensor, accelerator, cpu
from sys import exit, has_accelerator

alias float_dtype = DType.float32
alias tensor_rank = 1
alias vector_size = 100

def main():
    if not has_accelerator():
        print("A GPU is required to run this program")
        exit()

    host_device = cpu()
    gpu_device = accelerator()

    # Allocate the two input tensors on the host.
    lhs_tensor = Tensor[float_dtype, tensor_rank](vector_size, host_device)
    rhs_tensor = Tensor[float_dtype, tensor_rank](vector_size, host_device)

    # Fill them with initial values.
    for i in range(vector_size):
        lhs_tensor[i] = Float32(i)
        rhs_tensor[i] = Float32(i * 0.5)

    print("lhs_tensor:", lhs_tensor)
    print("rhs_tensor:", rhs_tensor)

The program starts by defining some compile-time aliases for the data type and the size of the vector we're going to process. Then, the main() function initializes two input tensors on the CPU.

The Tensor constructor has two compile-time parameters:

  • type: The element data type. We're specifying float_dtype, our alias for DType.float32.

  • rank: The rank of the tensor. We're specifying tensor_rank, our alias for 1.

We're also passing two run-time arguments to the constructor:

  • shape: The shape of the tensor, which is the size of each dimension of the tensor. You can provide an instance of the TensorShape struct, a Mojo tuple of integers, or for a rank-1 tensor, a single integer. Here, we're providing a single integer, vector_size, which is our alias for 100. (See the sketch after this list.)

  • device: The device on which to allocate the tensor. If you omit this argument, the tensor is allocated on the host by default. Here, we're explicitly specifying host_device to make our intent clear.
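To make the shape argument concrete, here's a hedged sketch of two of the forms described above: the rank-1 integer shorthand this tutorial uses and the general tuple form. (We omit a TensorShape example because constructing one isn't covered in this tutorial.)

    # Rank-1 shorthand: a single integer for the one dimension.
    vec = Tensor[float_dtype, 1](vector_size, host_device)
    # General form: a tuple with one size per dimension (a 10x10 matrix here).
    mat = Tensor[float_dtype, 2]((10, 10), host_device)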

After allocating the tensors, we fill them with initial values and then print them.

Now let's run the program to verify that everything is working so far.

mojo vector_addition.mojo

You should see the following output:

lhs_tensor: Tensor([[0.0, 1.0, 2.0, ..., 97.0, 98.0, 99.0]], dtype=float32, shape=100)
rhs_tensor: Tensor([[0.0, 0.5, 1.0, ..., 48.5, 49.0, 49.5]], dtype=float32, shape=100)

7. Move the input tensors to the GPU and allocate an output tensor

Now that we have our input tensors allocated and initialized on the CPU, let's move them to the GPU so that they'll be available for the kernel function to use.

Add the following code to the end of the main() function:

vector_addition.mojo
    # Move the input tensors to the accelerator.
    lhs_tensor = lhs_tensor.move_to(gpu_device)
    rhs_tensor = rhs_tensor.move_to(gpu_device)

The move_to() method returns a new Tensor object that is allocated on the specified device. It also implicitly calls the destructor on the original Tensor object, freeing the memory associated with it. The Mojo compiler would report an error if you tried to use the original Tensor object after moving it to the GPU. We could declare new variables to hold the moved tensors, but in this example we'll just reuse the original names.
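Here's a minimal sketch of what that rule looks like in practice. The commented-out line and the error wording are illustrative only; the exact compiler diagnostic may differ:

    # Reusing the original name keeps the code simple:
    lhs_tensor = lhs_tensor.move_to(gpu_device)
    # If you instead bind the result to a new name...
    gpu_tensor = rhs_tensor.move_to(gpu_device)
    # ...the original name can no longer be used:
    # print(rhs_tensor[0])  # error: `rhs_tensor` used after being consumed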

Next, let's allocate an output tensor on the GPU to hold the result of the kernel function. Add the following code to the end of the main() function:

vector_addition.mojo
    # Allocate the output tensor on the accelerator.
    out_tensor = Tensor[float_dtype, tensor_rank](vector_size, gpu_device)

8. Create LayoutTensor views

One last step before writing the kernel function is that we're going to create a LayoutTensor view for each of the tensors. LayoutTensor provides a powerful abstraction for multi-dimensional data with precise control over memory organization. It supports various memory layouts (row-major, column-major, tiled), hardware-specific optimizations, and efficient parallel access patterns. We won't go into the details of using LayoutTensor in this tutorial, but in more complex kernels it's a useful tool for manipulating your data.

All we need to do is add the following code to the end of the main() function:

vector_addition.mojo
    # Create a LayoutTensor for each tensor.
    lhs_layout_tensor = lhs_tensor.to_layout_tensor()
    rhs_layout_tensor = rhs_tensor.to_layout_tensor()
    out_layout_tensor = out_tensor.to_layout_tensor()

9. Define and compile the vector addition kernel function

Now we're ready to write the kernel function. Add the following code to vector_addition.mojo:

vector_addition.mojo
from gpu.id import block_dim, block_idx, thread_idx
from layout import LayoutTensor, Layout

...

alias block_size = 32

fn vector_addition[
    lhs_layout: Layout,
    rhs_layout: Layout,
    out_layout: Layout,
](
    lhs: LayoutTensor[float_dtype, lhs_layout, MutableAnyOrigin],
    rhs: LayoutTensor[float_dtype, rhs_layout, MutableAnyOrigin],
    out: LayoutTensor[float_dtype, out_layout, MutableAnyOrigin],
):
    """The calculation to perform across the vector on the GPU."""

    alias size = out_layout.size()  # Force compile-time evaluation.
    tid = block_dim.x * block_idx.x + thread_idx.x
    if tid < size:
        out[tid] = lhs[tid] + rhs[tid]

Our vector_addition() kernel function accepts the two input tensors and the output tensor as arguments. It also accepts compile-time Layout parameters for each of the tensors. (For example, lhs_layout is the inferred layout for the lhs argument.) A Layout is a representation of memory layouts using shape and stride information, and it maps between logical coordinates and linear memory indices. In our kernel function, we use only the layout of the out tensor to determine the size of the vector.
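As a hedged aside, here's roughly what that mapping means for our rank-1 case, assuming the Layout.row_major() factory from the layout package (which this tutorial doesn't otherwise use):

from layout import Layout

# A row-major rank-1 layout of 100 elements: shape (100,), stride (1,),
# so logical index i maps straight to linear index i.
alias vec_layout = Layout.row_major(100)
# vec_layout.size() evaluates to 100 at compile time, which is exactly how
# the kernel recovers the vector size from out_layout.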

It's important to know the size of the vector because it might not be a multiple of the block size. In fact, in this example the size of the vector is 100, which is not a multiple of our block size of 32: we'll launch ceildiv(100, 32) = 4 blocks, for a total of 128 threads, so the last 28 threads have no element to process. As we assign threads to read elements from the tensor, we need to make sure those extra threads don't overrun the bounds of the tensor.

The body of the kernel function starts by calculating the linear index of the tensor element that a particular thread is responsible for. The block_dim object (which we added to the list of imports) contains the dimensions of the thread blocks as x, y, and z values. Because we're using a one-dimensional grid of thread blocks, we need only the x dimension. We can then calculate tid, the unique "global" index of the thread within the output tensor, as block_dim.x * block_idx.x + thread_idx.x. For example, the tid values for the threads in the first thread block range from 0 to 31, the tid values for the threads in the second thread block range from 32 to 63, and so on.

The function then checks if the calculated tid is less than the size of the output tensor. If it is, the thread reads the corresponding elements from the lhs and rhs tensors, adds them together, and stores the result in the corresponding element of the out tensor.

Now that we've written the kernel function, we can compile it by adding the following code to the end of the main() function:

vector_addition.mojo
    # Compile the kernel function to run on the GPU.
    gpu_function = Accelerator.compile[
        vector_addition[
            lhs_layout_tensor.layout,
            rhs_layout_tensor.layout,
            out_layout_tensor.layout,
        ]
    ](gpu_device)

10. Invoke the kernel function and move the output back to the CPU

The last step is to invoke the kernel function and move the output back to the CPU. Add this line to the list of imports at the top of the file:

vector_addition.mojo
from math import ceildiv

Then, add the following code to the end of the main() function:

vector_addition.mojo
    # Calculate the number of thread blocks needed by dividing the vector size
    # by the block size and rounding up.
    num_blocks = ceildiv(vector_size, block_size)

    # Invoke the kernel function.
    gpu_function(
        gpu_device,
        lhs_layout_tensor,
        rhs_layout_tensor,
        out_layout_tensor,
        grid_dim=Dim(num_blocks),
        block_dim=Dim(block_size),
    )

    # Move the output tensor back onto the CPU so that we can read the results.
    out_tensor = out_tensor.move_to(host_device)

    print("out_tensor:", out_tensor)

First we calculate the number of thread blocks needed by dividing the vector size by the block size and rounding up. Then we can invoke the kernel function.

After that, we move the output tensor back onto the CPU so that we can read the results. This call blocks on the CPU until the kernel function has populated the output tensor and returned. The move also has the side effect of invoking the destructor for the original out_tensor object and freeing its allocated memory on the GPU. As for the input tensors, we don't need to move them back to the CPU. Also, the Mojo compiler determines that the lhs_tensor and rhs_tensor objects are no longer needed after the kernel function has returned, and so it automatically invokes their destructors to free their allocated memory on the GPU. (For a detailed explanation of object lifetime and destruction in Mojo, see the Death of a value section of the Mojo Manual.)

So it's finally time to run the program to see the results of our hard work.

mojo vector_addition.mojo

You should see the following output:

lhs_tensor: Tensor([[0.0, 1.0, 2.0, ..., 97.0, 98.0, 99.0]], dtype=float32, shape=100)
rhs_tensor: Tensor([[0.0, 0.5, 1.0, ..., 48.5, 49.0, 49.5]], dtype=float32, shape=100)
out_tensor: Tensor([[0.0, 1.5, 3.0, ..., 145.5, 147.0, 148.5]], dtype=float32, shape=100)

And now that you're done with the tutorial, exit your project's virtual environment:

exit

Summary

In this tutorial, we've learned how to use the MAX Driver to write a simple kernel function that performs an elementwise addition of two vectors. We covered:

  • Understanding basic GPU concepts like devices, grids, and thread blocks
  • Moving data between CPU and GPU memory using tensors
  • Writing and compiling a GPU kernel function
  • Executing parallel computations on the GPU
  • Managing memory and object lifetimes across devices

Now that you understand the basics of GPU programming with Mojo, here are some suggested next steps:

  • Check out more examples of GPU programming with Mojo and the MAX Driver in the public MAX GitHub repository.

  • Try implementing other parallel algorithms like matrix multiplication or convolutions.

  • Explore the MAX Driver API documentation to discover more advanced GPU programming features.

  • Learn more about other features of the MAX platform for building and deploying high-performance AI endpoints.

  • Read the GPU basics section of the Mojo Manual to get a taste of the low-level GPU programming APIs available in the gpu package.

  • Check out the Mojo Manual for more information on the Mojo language.
