Get started with GPU programming with Mojo and the MAX Driver
This tutorial introduces you to GPU programming using Mojo and the MAX Driver. You'll learn how to write a simple program that performs vector addition on a GPU, exploring fundamental concepts of GPU programming along the way.
By the end of this tutorial, you will:
- Understand basic GPU programming concepts like grids and thread blocks
- Learn how to move data between CPU and GPU memory
- Write and compile a simple GPU kernel function
- Execute parallel computations on the GPU
We'll build everything step-by-step, starting with the basics and gradually adding more complexity. The concepts you learn here will serve as a foundation for more advanced GPU programming with Mojo.
System requirements: macOS, Linux, or Windows Subsystem for Linux (WSL), with a supported GPU.
1. Create a Mojo project with magic
We'll start by using the magic CLI to create a virtual environment and generate our initial project directory.

- If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:

  curl -ssL https://magic.modular.com/ | bash

  Then run the source command that's printed in your terminal.
- Navigate to the directory in which you want to create the project and execute:

  magic init gpu-intro --format mojoproject

  This creates a project directory named gpu-intro.
- Let's go into the directory and verify the project is configured correctly by checking the version of Mojo that's installed within our project's virtual environment:

  cd gpu-intro
  magic run mojo --version

  You should see a version string indicating the version of Mojo installed, which by default should be the latest nightly version. Because we used the --format mojoproject option when creating the project, magic automatically added the max package as a dependency, which includes Mojo and the MAX libraries.
- Activate the project's virtual environment:

  magic shell

  Later on, when you want to exit the virtual environment, just type exit.
2. Get references to the CPU and GPU
When using the MAX Driver, the Device class represents a logical instance of a device, for example, a CPU or GPU. The cpu() and accelerator() functions return a Device reference to the CPU and GPU, respectively. If no GPU is available, the accelerator() function raises an error. You can use the has_accelerator() function to check if a GPU is available.

So let's start by writing a program that checks if a GPU is available and then creates a CPU and GPU device. Using any editor, create a file named vector_addition.mojo with the following code:
from max.driver import accelerator, cpu
from sys import exit, has_accelerator

def main():
    if not has_accelerator():
        print("A GPU is required to run this program")
        exit()
    host_device = cpu()
    print("Found the CPU device")
    gpu_device = accelerator()
    print("Found the GPU device")
Save the file and run it using the mojo CLI:

mojo vector_addition.mojo
You should see the following output:
Found the CPU device
Found the GPU device
3. Define a simple kernel
A GPU kernel is simply a function that runs on a GPU, executing a specific computation on a large dataset in parallel across thousands or millions of threads. You might already be familiar with threads when programming for a CPU, but GPU threads are different. On a CPU, threads are managed by the operating system and can perform completely independent tasks, such as managing a user interface, fetching data from a database, and so on. But on a GPU, threads are managed by the GPU itself. All the threads on a GPU execute the same kernel function, but they each work on a different part of the data.
When you run a kernel, you need to specify the number of threads you want to use. The number of threads you specify depends on the size of the data you want to process and the amount of parallelism you want to achieve. A common strategy is to use one thread per element of data in the result. So if you're performing an elementwise addition of two 1,024-element vectors, you'd use 1,024 threads.
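To make that arithmetic concrete, here's a minimal standalone sketch. The element count and block size are example numbers of our own; ceildiv comes from Mojo's math module and appears again later in this tutorial:

from math import ceildiv

def main():
    num_elements = 1024
    threads_per_block = 256
    # One thread per output element; round up so no element is left uncovered.
    num_blocks = ceildiv(num_elements, threads_per_block)
    print("Launch", num_blocks, "blocks of", threads_per_block, "threads each")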
A grid is the top-level organizational structure for the threads executing a kernel function. A grid consists of multiple thread blocks, which are further divided into individual threads that execute the kernel function concurrently. The GPU assigns a unique block index to each thread block, and a unique thread index to each thread within a block. Threads within the same thread block can share data through shared memory and synchronize using built-in mechanisms, but they cannot directly communicate with threads in other blocks. For this tutorial, we won't get into the details of why or how to do this, but it's an important concept to keep in mind when you're writing more complex kernels.
To better understand how grids, thread blocks, and threads work together, let's write a simple kernel function that prints the thread block and thread indices. Add the following code to your vector_addition.mojo file:
from gpu.id import block_idx, thread_idx

fn print_threads():
    """Print thread IDs."""
    print("Block index: [",
        block_idx.x,
        "]\tThread index: [",
        thread_idx.x,
        "]"
    )
4. Compile and run the kernel
Next, we need to update the main() function to compile the kernel function for our GPU and then run it, specifying the number of thread blocks in the grid and the number of threads per thread block. For this initial example, let's define a grid consisting of 2 thread blocks, each with 64 threads. Update the imports and modify the main() function so that your program looks like this:
from gpu.host import Dim
from gpu.id import block_idx, thread_idx
from max.driver import Accelerator, Device, accelerator, cpu
from sys import exit, has_accelerator

fn print_threads():
    """Print thread IDs."""
    print("Block index: [",
        block_idx.x,
        "]\tThread index: [",
        thread_idx.x,
        "]"
    )

def main():
    if not has_accelerator():
        print("A GPU is required to run this program")
        exit()
    host_device = cpu()
    print("Found the CPU device")
    gpu_device = accelerator()
    print("Found the GPU device")

    print_threads_gpu = Accelerator.compile[print_threads](gpu_device)
    print_threads_gpu(gpu_device, grid_dim=Dim(2), block_dim=Dim(64))

    # Required for now to keep the main thread alive until the GPU is done
    Device.wait_for(gpu_device)
    print("Program finished")
Save the file and run it:
mojo vector_addition.mojo
You should see something like the following output (which is abbreviated here):
Found the CPU device
Found the GPU device
Block index: [ 1 ] Thread index: [ 0 ]
Block index: [ 1 ] Thread index: [ 1 ]
Block index: [ 1 ] Thread index: [ 2 ]
...
Block index: [ 0 ] Thread index: [ 30 ]
Block index: [ 0 ] Thread index: [ 31 ]
Program finished
The Accelerator.compile() function compiles a kernel function so that it can run on a particular GPU architecture. You specify the name of the kernel function as a compile-time Mojo parameter and the target GPU device as a run-time Mojo argument. (See the Functions section of the Mojo Manual for more information on Mojo function arguments and the Parameters section for more information on Mojo compile-time parameters and metaprogramming.)
When you invoke the compiled kernel function, the MAX Driver executes it asynchronously on the GPU. You must provide the following arguments in this order:
- The target GPU device
- Any additional arguments specified by the kernel function definition (none, in this case)
- The grid dimensions, using the grid_dim keyword argument
- The thread block dimensions, using the block_dim keyword argument
We're invoking the compiled kernel function with grid_dim=Dim(2) and block_dim=Dim(64), which means we're using a grid of 2 thread blocks, with 64 threads each, for a total of 128 threads in the grid.
When you run a kernel, the GPU assigns each thread block within the grid to a streaming multiprocessor for execution. A streaming multiprocessor (SM) is the fundamental processing unit of a GPU, designed to execute multiple parallel workloads efficiently. Each SM contains several cores, which perform the actual computations of the threads executing on the SM, along with shared resources like registers, shared memory, and control mechanisms to coordinate the execution of threads. The number of SMs and the number of cores on a GPU depends on its architecture. For example, the NVIDIA H100 PCIe contains 114 SMs, with 128 32-bit floating point cores per SM.
Additionally, when an SM is assigned a thread block, it divides the block into multiple warps, which are groups of 32 or 64 threads, depending on the GPU architecture. These threads execute the same instruction simultaneously in a single instruction, multiple threads (SIMT) model. The SM's warp scheduler coordinates the execution of warps on an SM's cores.
Warps are used to efficiently utilize GPU hardware by maximizing throughput and minimizing control overhead. Since GPUs are designed for high-performance parallel processing, grouping threads into warps allows for streamlined instruction scheduling and execution, reducing the complexity of managing individual threads. Multiple warps from multiple thread blocks can be active within an SM at any given time, enabling the GPU to keep execution units busy. For example, if the threads of a particular warp are blocked waiting for data from memory, the warp scheduler can immediately switch execution to another warp that's ready to run.
5. Manage grid dimensions
The grid in the previous step consisted of a one-dimensional grid of 2 thread blocks with 64 threads in each block. However, you can also organize the thread blocks in a two- or even a three-dimensional grid. Similarly, you can arrange the threads in a thread block across one, two, or three dimensions. Typically, you determine the dimensions of the grid and thread blocks based on the dimensionality of the data to process. For example, you might choose a one-dimensional grid for processing large vectors, a two-dimensional grid for processing matrices, and a three-dimensional grid for processing the frames of a video.
To better understand how grids, thread blocks, and threads work together, let's modify our print_threads() kernel function to print the x, y, and z components of the thread block and thread indices for each thread.
fn print_threads():
    """Print thread IDs."""
    print("Block index: [",
        block_idx.x, block_idx.y, block_idx.z,
        "]\tThread index: [",
        thread_idx.x, thread_idx.y, thread_idx.z,
        "]"
    )
Then, update the main() function to invoke the compiled kernel function with a 2x2x1 grid of thread blocks and a 16x4x2 arrangement of threads within each thread block:
print_threads_gpu(
    gpu_device,
    grid_dim=Dim(2, 2, 1),
    block_dim=Dim(16, 4, 2)
)
Save the file and run it again:
mojo vector_addition.mojo
You should see something like the following output (which is abbreviated here):
Found the CPU device
Found the GPU device
Block index: [ 0 1 0 ] Thread index: [ 0 2 0 ]
Block index: [ 0 1 0 ] Thread index: [ 1 2 0 ]
Block index: [ 0 1 0 ] Thread index: [ 2 2 0 ]
...
Block index: [ 1 1 0 ] Thread index: [ 14 3 0 ]
Block index: [ 1 1 0 ] Thread index: [ 15 3 0 ]
Program finished
Try changing the grid and thread block dimensions to see how the output changes.
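For example, you could launch a cubic grid of 8 thread blocks with a cubic arrangement of 64 threads in each block. (These dimensions are our own, chosen purely for illustration; the product of the block dimensions must stay within your GPU's per-block thread limit.)

print_threads_gpu(
    gpu_device,
    grid_dim=Dim(2, 2, 2),
    block_dim=Dim(4, 4, 4)
)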
6. Model our data using Tensor
Now that you understand how to manage grid dimensions, let's get ready to create a kernel that performs a simple elementwise addition of two vectors of floating point numbers.
We'll start by determining how to represent our data. Although we're going to be using one-dimensional data, we'll use a data type called a tensor that's capable of representing multi-dimensional data — basically, a multi-dimensional array. Because we have only a single dimension of data, this will be a rank-1 tensor.
We'll use the Tensor class from the max.driver package to represent our data. This Tensor class is a convenience class that can allocate memory for the tensor on either the CPU or the GPU. It also includes methods for moving and copying the data between the CPU and the GPU.
Let's add Tensor to the list of imports and then update main() to create two input tensors on the CPU. We won't need the print_threads() kernel function anymore, so we can remove it and the code to compile and invoke it. So after all that, your vector_addition.mojo file should look like this:
from gpu.host import Dim
from gpu.id import block_idx, thread_idx
from max.driver import Accelerator, Tensor, accelerator, cpu
from sys import exit, has_accelerator

alias float_dtype = DType.float32
alias tensor_rank = 1
alias vector_size = 100

def main():
    if not has_accelerator():
        print("A GPU is required to run this program")
        exit()
    host_device = cpu()
    gpu_device = accelerator()

    # Allocate the two input tensors on the host.
    lhs_tensor = Tensor[float_dtype, tensor_rank](vector_size, host_device)
    rhs_tensor = Tensor[float_dtype, tensor_rank](vector_size, host_device)

    # Fill them with initial values.
    for i in range(vector_size):
        lhs_tensor[i] = Float32(i)
        rhs_tensor[i] = Float32(i * 0.5)

    print("lhs_tensor:", lhs_tensor)
    print("rhs_tensor:", rhs_tensor)
The program starts by defining some compile-time aliases for the data type and the size of the vector we're going to process. Then, the main() function initializes two input tensors on the CPU.
The Tensor constructor has two compile-time parameters:
- type: The element data type. We're specifying float_dtype, our alias for DType.float32.
- rank: The rank of the tensor. We're specifying tensor_rank, our alias for 1.
We're also passing two run-time arguments to the constructor:
- shape: The shape of the tensor, which is the size of each dimension of the tensor. You can provide an instance of the TensorShape struct, a Mojo tuple of integers, or, for a rank-1 tensor, a single integer. Here, we're providing a single integer, vector_size, which is our alias for 100. (See the sketch after this list for a higher-rank example.)
- device: The device on which to allocate the tensor. If you omit this argument, the tensor is allocated on the host by default. Here, we're explicitly specifying host_device to make our intent clear.
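For instance, here's a sketch of how a higher-rank allocation might look, based on the tuple form of the shape argument described above. The 10x10 shape and the matrix_tensor name are our own, purely for illustration:

alias matrix_rank = 2

# A hypothetical rank-2, 10x10 tensor of float32 values, allocated on the host.
matrix_tensor = Tensor[float_dtype, matrix_rank]((10, 10), host_device)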
After allocating the tensors, we fill them with initial values and then print them.
Now let's run the program to verify that everything is working so far.
mojo vector_addition.mojo
You should see the following output:
lhs_tensor: Tensor([[0.0, 1.0, 2.0, ..., 97.0, 98.0, 99.0]], dtype=float32, shape=100)
rhs_tensor: Tensor([[0.0, 0.5, 1.0, ..., 48.5, 49.0, 49.5]], dtype=float32, shape=100)
7. Move the input tensors to the GPU and allocate an output tensor
Now that we have our input tensors allocated and initialized on the CPU, let's move them to the GPU so that they'll be available for the kernel function to use.
Add the following code to the end of the main() function:
# Move the input tensors to the accelerator.
lhs_tensor = lhs_tensor.move_to(gpu_device)
rhs_tensor = rhs_tensor.move_to(gpu_device)
The move_to() method returns a new Tensor object that is allocated on the specified device. It also implicitly calls the destructor on the original Tensor object, freeing the memory associated with it. The Mojo compiler would report an error if you tried to use the original Tensor object after moving it to the GPU. We could declare new variables to hold the moved tensors, but in this example we'll just reuse the original names.
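If you'd rather make the ownership transfer explicit with distinct names, an equivalent sketch looks like this (the lhs_gpu and rhs_gpu names are our own):

lhs_gpu = lhs_tensor.move_to(gpu_device)
rhs_gpu = rhs_tensor.move_to(gpu_device)
# Any later use of lhs_tensor or rhs_tensor would now be a compile-time error,
# because the moves destroyed the original objects.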
Next, let's allocate an output tensor on the GPU to hold the result of the kernel function. Add the following code to the end of the main() function:
# Allocate the output tensor on the accelerator.
out_tensor = Tensor[float_dtype, tensor_rank](vector_size, gpu_device)
8. Create LayoutTensor views
One last step before writing the kernel function is to create a LayoutTensor view for each of the tensors. LayoutTensor provides a powerful abstraction for multi-dimensional data with precise control over memory organization. It supports various memory layouts (row-major, column-major, tiled), hardware-specific optimizations, and efficient parallel access patterns. We won't go into the details of using LayoutTensor in this tutorial, but in more complex kernels it's a useful tool for manipulating your data.
All we need to do is add the following code to the end of the main() function:
# Create a LayoutTensor for each tensor.
lhs_layout_tensor = lhs_tensor.to_layout_tensor()
rhs_layout_tensor = rhs_tensor.to_layout_tensor()
out_layout_tensor = out_tensor.to_layout_tensor()
9. Define and compile the vector addition kernel function
Now we're ready to write the kernel function. Add the following code to vector_addition.mojo:
from gpu.id import block_dim, block_idx, thread_idx
from layout import LayoutTensor, Layout

...

alias block_size = 32

fn vector_addition[
    lhs_layout: Layout,
    rhs_layout: Layout,
    out_layout: Layout,
](
    lhs: LayoutTensor[float_dtype, lhs_layout, MutableAnyOrigin],
    rhs: LayoutTensor[float_dtype, rhs_layout, MutableAnyOrigin],
    out: LayoutTensor[float_dtype, out_layout, MutableAnyOrigin],
):
    """The calculation to perform across the vector on the GPU."""
    alias size = out_layout.size()  # Force compile-time evaluation.
    tid = block_dim.x * block_idx.x + thread_idx.x
    if tid < size:
        out[tid] = lhs[tid] + rhs[tid]
Our vector_addition() kernel function accepts the two input tensors and the output tensor as arguments. It also accepts compile-time Layout parameters for each of the tensors. (For example, lhs_layout is the inferred layout for the lhs argument.) A Layout is a representation of memory layouts using shape and stride information, and it maps between logical coordinates and linear memory indices. In our kernel function, we use only the layout of the out tensor to determine the size of the vector.
It's important to know the size of the vector because it might not be a multiple of the block size. In fact, in this example the size of the vector is 100, which is not a multiple of our block size of 32: covering all 100 elements requires 4 thread blocks of 32 threads, for a total of 128 threads, leaving the last 28 threads with no element to process. So as we assign our threads to read elements from the tensor, we need to make sure we don't overrun its bounds.
The body of the kernel function starts by calculating the linear index of the tensor element that a particular thread is responsible for. The block_dim object (which we added to the list of imports) contains the dimensions of the thread blocks as x, y, and z values. Because we're going to use a one-dimensional grid of thread blocks, we need only the x dimension. We can then calculate tid, the unique "global" index of the thread within the output tensor, as block_dim.x * block_idx.x + thread_idx.x. For example, the tid values for the threads in the first thread block range from 0 to 31. The tid values for the threads in the second thread block range from 32 to 63, and so on.
The function then checks whether the calculated tid is less than the size of the output tensor. If it is, the thread reads the corresponding elements from the lhs and rhs tensors, adds them together, and stores the result in the corresponding element of the out tensor.
Now that we've written the kernel function, we can compile it by adding the following code to the end of the main() function:
# Compile the kernel function to run on the GPU.
gpu_function = Accelerator.compile[
    vector_addition[
        lhs_layout_tensor.layout,
        rhs_layout_tensor.layout,
        out_layout_tensor.layout,
    ]
](gpu_device)
10. Invoke the kernel function and move the output back to the CPU
The last step is to invoke the kernel function and move the output back to the CPU. Add this line to the list of imports at the top of the file:
from math import ceildiv
Then, add the following code to the end of the main() function:
# Calculate the number of thread blocks needed by dividing the vector size
# by the block size and rounding up.
num_blocks = ceildiv(vector_size, block_size)

# Invoke the kernel function.
gpu_function(
    gpu_device,
    lhs_layout_tensor,
    rhs_layout_tensor,
    out_layout_tensor,
    grid_dim=Dim(num_blocks),
    block_dim=Dim(block_size),
)

# Move the output tensor back onto the CPU so that we can read the results.
out_tensor = out_tensor.move_to(host_device)
print("out_tensor:", out_tensor)
First we calculate the number of thread blocks needed by dividing the vector size by the block size and rounding up; here, ceildiv(100, 32) evaluates to 4. Then we can invoke the kernel function.
After that, we move the output tensor back onto the CPU so that we can read the results. This call blocks on the CPU until the kernel function has populated the output tensor and returned. The move also has the side effect of invoking the destructor for the original out_tensor object and freeing its allocated memory on the GPU. As for the input tensors, we don't need to move them back to the CPU. The Mojo compiler determines that the lhs_tensor and rhs_tensor objects are no longer needed after the kernel function has returned, and so it automatically invokes their destructors to free their allocated memory on the GPU. (For a detailed explanation of object lifetime and destruction in Mojo, see the Death of a value section of the Mojo Manual.)
So it's finally time to run the program to see the results of our hard work.
mojo vector_addition.mojo
You should see the following output:
lhs_tensor: Tensor([[0.0, 1.0, 2.0, ..., 97.0, 98.0, 99.0]], dtype=float32, shape=100)
rhs_tensor: Tensor([[0.0, 0.5, 1.0, ..., 48.5, 49.0, 49.5]], dtype=float32, shape=100)
out_tensor: Tensor([[0.0, 1.5, 3.0, ..., 145.5, 147.0, 148.5]], dtype=float32, shape=100)
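If you'd like to check the results programmatically rather than by eye, here's a quick sketch, assuming the same element indexing we used earlier to fill the input tensors also works for reads after the move back to the host:

# Verify each output element against the expected sum (a sketch, not part of
# the original tutorial program).
for i in range(vector_size):
    expected = Float32(i) + Float32(i * 0.5)
    if out_tensor[i] != expected:
        print("Mismatch at index", i)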
And now that you're done with the tutorial, exit your project's virtual environment:
exit
Summary
In this tutorial, we've learned how to use the MAX Driver to write a simple kernel function that performs an elementwise addition of two vectors. We covered:
- Understanding basic GPU concepts like devices, grids, and thread blocks
- Moving data between CPU and GPU memory using tensors
- Writing and compiling a GPU kernel function
- Executing parallel computations on the GPU
- Managing memory and object lifetimes across devices
Now that you understand the basics of GPU programming with Mojo, here are some suggested next steps:
- Check out more examples of GPU programming with Mojo and the MAX Driver in the public MAX GitHub repository.
- Try implementing other parallel algorithms like matrix multiplication or convolutions.
- Explore the MAX Driver API documentation to discover more advanced GPU programming features.
- Learn more about other features of the MAX platform for building and deploying high-performance AI endpoints.
- Read the GPU basics section of the Mojo Manual to get a taste of the low-level GPU programming APIs available in the gpu package.
- Check out the Mojo Manual for more information on the Mojo language.