Get started with GPU programming

This tutorial introduces you to GPU programming with Mojo. You'll learn how to write a simple program that performs vector addition on a GPU, exploring fundamental concepts of GPU programming along the way.

By the end of this tutorial, you will:

  • Understand basic GPU programming concepts like grids and thread blocks.
  • Learn how to move data between CPU and GPU memory.
  • Write and compile a simple GPU kernel function.
  • Execute parallel computations on the GPU.
  • Understand the asynchronous nature of GPU programming.

We'll build everything step-by-step, starting with the basics and gradually adding more complexity. The concepts you learn here will serve as a foundation for more advanced GPU programming with Mojo.

System requirements:

1. Create a Mojo project with magic

We'll start by using the magic CLI to create a virtual environment and generate our initial project directory.

  1. If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:

    curl -ssL https://magic.modular.com/ | bash

    Then run the source command that's printed in your terminal.

  2. Navigate to the directory in which you want to create the project and execute:

    magic init gpu-intro --format mojoproject

    This creates a project directory named gpu-intro.

  3. Let's go into the directory and verify the project is configured correctly by checking the version of Mojo that's installed within our project's virtual environment:

    cd gpu-intro
    magic run mojo --version

    You should see a version string indicating the version of Mojo installed, which by default should be the latest nightly version. Because we used the --format mojoproject option when creating the project, magic automatically added the max package as a dependency, which includes Mojo and the MAX libraries.

  4. Activate the project's virtual environment:

    magic shell

    Later on, when you want to exit the virtual environment, just type exit.

2. Get a reference to the GPU device

The DeviceContext type represents a logical instance of a GPU device. It provides methods for allocating memory on the device, copying data between the host CPU and the GPU, and compiling and running functions (also known as kernels) on the device.

Use the DeviceContext() constructor to get a reference to the GPU device. The constructor raises an error if no compatible GPU is available. You can use the has_accelerator() function to check if a compatible GPU is available.

So let's start by writing a program that checks if a GPU is available and then obtains a reference to the GPU device. Using any editor, create a file named vector_addition.mojo with the following code:

vector_addition.mojo
from gpu.host import DeviceContext
from sys import has_accelerator

def main():
    @parameter
    if not has_accelerator():
        print("No compatible GPU found")
    else:
        ctx = DeviceContext()
        print("Found GPU:", ctx.name())

Save the file and run it using the mojo CLI:

mojo vector_addition.mojo

You should see output like the following (depending on the type of GPU you have):

Found GPU: NVIDIA A10G

3. Define a simple kernel

A GPU kernel is simply a function that runs on a GPU, executing a specific computation on a large dataset in parallel across thousands or millions of threads. You might already be familiar with threads when programming for a CPU, but GPU threads are different. On a CPU, threads are managed by the operating system and can perform completely independent tasks, such as managing a user interface, fetching data from a database, and so on. But on a GPU, threads are managed by the GPU itself. All the threads on a GPU execute the same kernel function, but they each work on a different part of the data.

When you run a kernel, you need to specify the number of threads you want to use. The number of threads you specify depends on the size of the data you want to process and the amount of parallelism you want to achieve. A common strategy is to use one thread per element of data in the result. So if you're performing an element-wise addition of two 1,024-element vectors, you'd use 1,024 threads.
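For example, here's a minimal sketch of that sizing arithmetic, assuming a hypothetical element count of 1,000 and 256 threads per block (this is the same ceildiv() calculation we'll use for real in step 9):

    from math import ceildiv

    alias num_elements = 1000      # hypothetical data size
    alias threads_per_block = 256  # hypothetical block size

    # One thread per element: round up so that every element gets a thread.
    # ceildiv(1000, 256) == 4 blocks, or 1,024 threads in total; the extra
    # 24 threads simply have no element to process.
    alias num_blocks = ceildiv(num_elements, threads_per_block)

    def main():
        print("Thread blocks needed:", num_blocks)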

A grid is the top-level organizational structure for the threads executing a kernel function. A grid consists of multiple thread blocks, which are further divided into individual threads that execute the kernel function concurrently. The GPU assigns a unique block index to each thread block, and a unique thread index to each thread within a block. Threads within the same thread block can share data through shared memory and synchronize using built-in mechanisms, but they cannot directly communicate with threads in other blocks. For this tutorial, we won't get into the details of sharing data or synchronizing threads within a block, but it's an important concept to keep in mind when you write more complex kernels.

To better understand how grids, thread blocks, and threads are organized, let's write a simple kernel function that prints the thread block and thread indices. Add the following code to your vector_addition.mojo file:

vector_addition.mojo
from gpu.id import block_idx, thread_idx

fn print_threads():
    """Print thread IDs."""

    print("Block index: [",
        block_idx.x,
        "]\tThread index: [",
        thread_idx.x,
        "]"
    )

4. Compile and run the kernel

Next, we need to update the main() function to compile the kernel function for our GPU and then run it, specifying the number of thread blocks in the grid and the number of threads per thread block. For this initial example, let's define a grid consisting of 2 thread blocks, each with 64 threads. Modify the main() function so that your program looks like this:

vector_addition.mojo
from gpu.host import DeviceContext
from gpu.id import block_idx, thread_idx
from sys import has_accelerator

fn print_threads():
    """Print thread IDs."""
    print("Block index: [",
        block_idx.x,
        "]\tThread index: [",
        thread_idx.x,
        "]"
    )

def main():
    @parameter
    if not has_accelerator():
        print("No compatible GPU found")
    else:
        ctx = DeviceContext()
        ctx.enqueue_function[print_threads](grid_dim=2, block_dim=64)
        ctx.synchronize()
        print("Program finished")

Save the file and run it:

mojo vector_addition.mojo

You should see something like the following output (which is abbreviated here):

Block index: [ 1 ]	Thread index: [ 32 ]
Block index: [ 1 ] Thread index: [ 33 ]
Block index: [ 1 ] Thread index: [ 34 ]
...
Block index: [ 0 ] Thread index: [ 30 ]
Block index: [ 0 ] Thread index: [ 31 ]
Program finished

Typical CPU-GPU interaction is asynchronous, allowing the GPU to process tasks while the CPU is busy with other work. Each DeviceContext has an associated stream of queued operations to execute on the GPU. Operations within a stream execute in the order they are issued.

The enqueue_function() method compiles a kernel function and enqueues it to run on the given device. You must provide the name of the kernel function as a compile-time Mojo parameter, and the following arguments:

  • Any additional arguments specified by the kernel function definition (none, in this case).
  • The grid dimensions using the grid_dim keyword argument.
  • The thread block dimensions using the block_dim keyword argument.

(See the Functions section of the Mojo Manual for more information on Mojo function arguments and the Parameters section for more information on Mojo compile-time parameters and metaprogramming.)

We're invoking the compiled kernel function with grid_dim=2 and block_dim=64, which means we're using a grid of 2 thread blocks, with 64 threads each for a total of 128 threads in the grid.

When you run a kernel, the GPU assigns each thread block within the grid to a streaming multiprocessor for execution. A streaming multiprocessor (SM) is the fundamental processing unit of a GPU, designed to execute multiple parallel workloads efficiently. Each SM contains several cores, which perform the actual computations of the threads executing on the SM, along with shared resources like registers, shared memory, and control mechanisms to coordinate the execution of threads. The number of SMs and the number of cores on a GPU depend on its architecture. For example, the NVIDIA H100 PCIe contains 114 SMs, with 128 32-bit floating point cores per SM.

Additionally, when an SM is assigned a thread block, it divides the block into multiple warps, which are groups of 32 or 64 threads, depending on the GPU architecture. These threads execute the same instruction simultaneously in a single instruction, multiple threads (SIMT) model. The SM's warp scheduler coordinates the execution of warps on an SM's cores.

Warps are used to efficiently utilize GPU hardware by maximizing throughput and minimizing control overhead. Since GPUs are designed for high-performance parallel processing, grouping threads into warps allows for streamlined instruction scheduling and execution, reducing the complexity of managing individual threads. Multiple warps from multiple thread blocks can be active within an SM at any given time, enabling the GPU to keep execution units busy. For example, if the threads of a particular warp are blocked waiting for data from memory, the warp scheduler can immediately switch execution to another warp that's ready to run.

After enqueuing the kernel function, we want to ensure that the CPU waits for it to finish execution before exiting the program. We do this by calling the synchronize() method of the DeviceContext object, which blocks until the device completes all operations in its queue.

5. Manage grid dimensions

The grid in the previous step consisted of a one-dimensional grid of 2 thread blocks with 64 threads in each block. However, you can also organize the thread blocks in a two- or even a three-dimensional grid. Similarly, you can arrange the threads in a thread block across one, two, or three dimensions. Typically, you determine the dimensions of the grid and thread blocks based on the dimensionality of the data to process. For example, you might choose a 1-dimensional grid for processing large vectors, a 2-dimensional grid for processing matrices, and a 3-dimensional grid for processing the frames of a video.
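For instance, a kernel that processes a matrix typically combines its block and thread indices along both the x and y dimensions to find its element. The following is a minimal, self-contained sketch of that pattern; the kernel name, the 48x32 matrix shape, and the 16x16 block shape are all hypothetical, and block_dim (the dimensions of the thread block) is introduced formally in step 9:

    from gpu.host import DeviceContext
    from gpu.id import block_dim, block_idx, thread_idx
    from sys import has_accelerator

    # Hypothetical 48x32 matrix covered by 16x16 thread blocks
    alias num_rows = 48
    alias num_cols = 32

    fn print_matrix_coordinates():
        """Print the matrix coordinates assigned to each thread."""
        # Combine the block and thread indices along both x and y
        var row = block_idx.y * block_dim.y + thread_idx.y
        var col = block_idx.x * block_dim.x + thread_idx.x

        # Skip threads that fall outside the matrix
        if row < num_rows and col < num_cols:
            print("row:", row, "\tcol:", col)

    def main():
        @parameter
        if not has_accelerator():
            print("No compatible GPU found")
        else:
            ctx = DeviceContext()
            # 2 blocks of 16 columns cover 32 columns;
            # 3 blocks of 16 rows cover 48 rows
            ctx.enqueue_function[print_matrix_coordinates](
                grid_dim=(2, 3, 1),
                block_dim=(16, 16, 1)
            )
            ctx.synchronize()

The idea is the same as the one-dimensional tid calculation we'll use in step 9, just applied once per dimension.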

To better understand how grids, thread blocks, and threads work together, let's modify our print_threads() kernel function to print the x, y, and z components of the thread block and thread indices for each thread.

vector_addition.mojo
fn print_threads():
    """Print thread IDs."""

    print("Block index: [",
        block_idx.x, block_idx.y, block_idx.z,
        "]\tThread index: [",
        thread_idx.x, thread_idx.y, thread_idx.z,
        "]"
    )

Then, update main() to enqueue the kernel function with a 2x2x1 grid of thread blocks and a 16x4x2 arrangement of threads within each thread block:

vector_addition.mojo
        ctx.enqueue_function[print_threads](
            grid_dim=(2, 2, 1),
            block_dim=(16, 4, 2)
        )

Save the file and run it again:

mojo vector_addition.mojo

You should see something like the following output (which is abbreviated here):

Block index: [ 1 1 0 ]	Thread index: [ 0 2 0 ]
Block index: [ 1 1 0 ] Thread index: [ 1 2 0 ]
Block index: [ 1 1 0 ] Thread index: [ 2 2 0 ]
...
Block index: [ 0 0 0 ] Thread index: [ 14 1 0 ]
Block index: [ 0 0 0 ] Thread index: [ 15 1 0 ]
Program finished

Try changing the grid and thread block dimensions to see how the output changes.

Now that you understand how to manage grid dimensions, let's get ready to create a kernel that performs a simple element-wise addition of two vectors of floating point numbers.

6. Allocate host memory for the input vectors

Before creating the two input vectors for our kernel function, we need to understand the distinction between host memory and device memory. Host memory is dynamic random-access memory (DRAM) accessible by the CPU, whereas device memory is DRAM accessible by the GPU. If you have data in host memory, you must explicitly copy it to device memory before you can use it in a kernel function. Similarly, if your kernel function produces data that you want the CPU to use later, you must explicitly copy it back to host memory.

For this tutorial, we'll use the HostBuffer type to represent our vectors on the host. A HostBuffer is a block of host memory associated with a particular DeviceContext. It supports methods for transferring data between host and device memory, as well as a basic set of methods for accessing data elements by index and for printing the buffer.

Let's update main() to create two HostBuffers for our input vectors and initialize them with values. You won't need the print_threads() kernel function anymore, so you can remove it along with the code that compiles and invokes it. After those changes, your vector_addition.mojo file should look like this:

vector_addition.mojo
from gpu.host import DeviceContext
from gpu.id import block_idx, thread_idx
from sys import has_accelerator

# Vector data type and size
alias float_dtype = DType.float32
alias vector_size = 1000


def main():
    @parameter
    if not has_accelerator():
        print("No compatible GPU found")
    else:
        # Get the context for the attached GPU
        ctx = DeviceContext()

        # Create HostBuffers for input vectors
        lhs_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](
            vector_size
        )
        rhs_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](
            vector_size
        )
        ctx.synchronize()

        # Initialize the input vectors
        for i in range(vector_size):
            lhs_host_buffer[i] = Float32(i)
            rhs_host_buffer[i] = Float32(i * 0.5)

        print("LHS buffer: ", lhs_host_buffer)
        print("RHS buffer: ", rhs_host_buffer)

The enqueue_create_host_buffer() method accepts the data type as a compile-time parameter and the size of the buffer as a run-time argument and returns a HostBuffer. As with all DeviceContext methods whose name starts with enqueue_, the method is asynchronous and returns immediately, adding the operation to the queue to be executed by the DeviceContext. Therefore, we need to call the synchronize() method to ensure that the operation has completed before we use the HostBuffer object. Then we can initialize the input vectors with values and print them.

Now let's run the program to verify that everything is working so far.

mojo vector_addition.mojo

You should see the following output:

LHS buffer:  HostBuffer([0.0, 1.0, 2.0, ..., 997.0, 998.0, 999.0])
RHS buffer: HostBuffer([0.0, 0.5, 1.0, ..., 498.5, 499.0, 499.5])

7. Copy the input vectors to GPU memory and allocate an output vector

Now that we have our input vectors allocated and initialized on the CPU, let's copy them to the GPU so that they'll be available for the kernel function to use. While we're at it, we'll also allocate memory on the GPU for the output vector that will hold the result of the kernel function.

Add the following code to the end of the main() function:

vector_addition.mojo
        # Create DeviceBuffers for the input vectors
        lhs_device_buffer = ctx.enqueue_create_buffer[float_dtype](vector_size)
        rhs_device_buffer = ctx.enqueue_create_buffer[float_dtype](vector_size)

        # Copy the input vectors from the HostBuffers to the DeviceBuffers
        ctx.enqueue_copy(dst_buf=lhs_device_buffer, src_buf=lhs_host_buffer)
        ctx.enqueue_copy(dst_buf=rhs_device_buffer, src_buf=rhs_host_buffer)

        # Create a DeviceBuffer for the result vector
        result_device_buffer = ctx.enqueue_create_buffer[float_dtype](
            vector_size
        )

The DeviceBuffer type is analogous to the HostBuffer type, but represents a block of device memory associated with a particular DeviceContext. Specifically, the buffer is located in the device's global memory space, which is accessible by all threads executing on the device. As with a HostBuffer, a DeviceBuffer is subject to Mojo's standard ownership and lifecycle mechanisms. It persists until it is no longer referenced in the program or until the DeviceContext itself is destroyed.

The enqueue_create_buffer() method accepts the data type as a compile-time parameter and the size of the buffer as a run-time argument and returns a DeviceBuffer. The operation is asynchronous, but we don't need to call the synchronize() method yet because we have more operations to add to the queue.

The enqueue_copy() method is overloaded to support copying from host to device, device to host, or even device to device for systems that have multiple GPUs. In this example, we use it to copy the data in our HostBuffers to the DeviceBuffers.

8. Create LayoutTensor views

One last step before writing the kernel function: we'll create a LayoutTensor view for each of the vectors. LayoutTensor provides a powerful abstraction for multi-dimensional data with precise control over memory organization. It supports various memory layouts (row-major, column-major, tiled), hardware-specific optimizations, and efficient parallel access patterns. We don't need all of these features for this tutorial, but we'll use LayoutTensor anyway because it's a useful tool for manipulating data in more complex kernels and it's good to get familiar with it.

First add the following import to the top of the file:

vector_addition.mojo
from layout import Layout, LayoutTensor

A Layout is a representation of memory layouts using shape and stride information, and it maps between logical coordinates and linear memory indices. We'll need to use the same Layout definition multiple times, so add the following alias to the top of the file after the other aliases:

vector_addition.mojo
alias layout = Layout.row_major(vector_size)
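To get a feel for what a layout actually describes, here's a short, hypothetical sketch contrasting the one-dimensional layout we just defined with a two-dimensional row-major layout. The 3x4 matrix shape is purely illustrative, and passing two dimensions to row_major() is our assumption about the API (this tutorial itself only needs the one-dimensional form); the index arithmetic in the comments is just the standard row-major mapping:

    from layout import Layout

    # The 1-D layout used in this tutorial: element i maps to linear index i.
    alias vector_layout = Layout.row_major(1000)

    # A hypothetical 3x4 row-major layout: element (i, j) maps to linear
    # index i * 4 + j, so (1, 2) maps to 6.
    alias matrix_layout = Layout.row_major(3, 4)

    def main():
        # Hand-computed example of the row-major mapping described above
        print("(1, 2) in a 3x4 row-major layout maps to index", 1 * 4 + 2)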

And finally add the following code to the end of the main() function to create LayoutTensor views for each of the vectors:

vector_addition.mojo
        # Wrap the DeviceBuffers in LayoutTensors
        lhs_tensor = LayoutTensor[float_dtype, layout](lhs_device_buffer)
        rhs_tensor = LayoutTensor[float_dtype, layout](rhs_device_buffer)
        result_tensor = LayoutTensor[float_dtype, layout](result_device_buffer)

9. Define the vector addition kernel function

Now we're ready to write the kernel function. First add the following imports (note that we've added block_dim to the list of imports from gpu.id):

vector_addition.mojo
from gpu.id import block_dim, block_idx, thread_idx
from math import ceildiv

Then, add the following code to vector_addition.mojo just before the main() function:

vector_addition.mojo
# Calculate the number of thread blocks needed by dividing the vector size
# by the block size and rounding up.
alias block_size = 256
alias num_blocks = ceildiv(vector_size, block_size)


fn vector_addition(
    lhs_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
    rhs_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
    out_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
):
    """Calculate the element-wise sum of two vectors on the GPU."""

    # Calculate the index of the vector element for the thread to process
    var tid = block_idx.x * block_dim.x + thread_idx.x

    # Don't process out of bounds elements
    if tid < vector_size:
        out_tensor[tid] = lhs_tensor[tid] + rhs_tensor[tid]

Our vector_addition() kernel function accepts the two input tensors and the output tensor as arguments. We also need to know the size of the vector (which we've defined with the alias vector_size) because it might not be a multiple of the block size. In fact, in this example the vector size is 1,000, which is not a multiple of our block size of 256: ceildiv(1000, 256) rounds up to 4 thread blocks, or 1,024 threads in total, so the last 24 threads have no element to process. As we assign threads to read elements from the tensors, we need to make sure we don't overrun the bounds of the tensors.

The body of the kernel function starts by calculating the linear index of the tensor element that a particular thread is responsible for. The block_dim object (which we added to the list of imports) contains the dimensions of the thread blocks as x, y, and z values. Because we're using a one-dimensional grid of thread blocks, we need only the x dimension. We can then calculate tid, the unique "global" index of the thread within the output tensor, as block_idx.x * block_dim.x + thread_idx.x. For example, the tid values for the threads in the first thread block range from 0 to 255, the tid values for the threads in the second thread block range from 256 to 511, and so on.

The function then checks if the calculated tid is less than the size of the output tensor. If it is, the thread reads the corresponding elements from the lhs_tensor and rhs_tensor tensors, adds them together, and stores the result in the corresponding element of the out_tensor tensor.

10. Invoke the kernel function and copy the output back to the CPU

The last step is to compile and invoke the kernel function, then copy the output back to the CPU. To do so, add the following code to the end of the main() function:

vector_addition.mojo
        # Compile and enqueue the kernel
        ctx.enqueue_function[vector_addition](
            lhs_tensor,
            rhs_tensor,
            result_tensor,
            grid_dim=num_blocks,
            block_dim=block_size,
        )

        # Create a HostBuffer for the result vector
        result_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](
            vector_size
        )

        # Copy the result vector from the DeviceBuffer to the HostBuffer
        ctx.enqueue_copy(
            dst_buf=result_host_buffer, src_buf=result_device_buffer
        )

        # Finally, synchronize the DeviceContext to run all enqueued operations
        ctx.synchronize()

        print("Result vector:", result_host_buffer)

The enqueue_function() method enqueues the compilation and invocation of the vector_addition() kernel function, passing the input and output tensors as arguments. The grid_dim and block_dim arguments use the num_blocks and block_size aliases we defined in the previous step.

After the kernel function has been compiled and enqueued, we create a HostBuffer to hold the result vector. Then we copy the result vector from the DeviceBuffer to the HostBuffer. Finally, we synchronize the DeviceContext to run all enqueued operations. After synchronizing, we can print the result vector to the console.

At this point, the Mojo compiler determines that the DeviceContext, the DeviceBuffers, the HostBuffers, and the LayoutTensors are no longer used and so it automatically invokes their destructors to free their allocated memory. (For a detailed explanation of object lifetime and destruction in Mojo, see the Death of a value section of the Mojo Manual.)

So it's finally time to run the program to see the results of our hard work.

mojo vector_addition.mojo

You should see the following output:

LHS buffer:  HostBuffer([0.0, 1.0, 2.0, ..., 997.0, 998.0, 999.0])
RHS buffer: HostBuffer([0.0, 0.5, 1.0, ..., 498.5, 499.0, 499.5])
Result vector: HostBuffer([0.0, 1.5, 3.0, ..., 1495.5, 1497.0, 1498.5])

And now that you're done with the tutorial, exit your project's virtual environment:

exit

Summary

In this tutorial, we've learned how to use Mojo's gpu.host package to write a simple kernel function that performs an element-wise addition of two vectors. We covered:

  • Understanding basic GPU concepts like devices, grids, and thread blocks.
  • Moving data between CPU and GPU memory.
  • Writing and compiling a GPU kernel function.
  • Executing parallel computations on the GPU.

Next steps

Now that you understand the basics of GPU programming with Mojo, here are some suggested next steps:

  • Check out more examples of GPU programming with Mojo in the public Modular GitHub repository.

  • Read the GPU basics section of the Mojo Manual to find out more about GPU programming in Mojo.

  • Read the Introduction to layouts section of the Mojo Manual to learn more about the layout package and managing layouts.

  • Check out the Mojo Manual for more information on the Mojo language.

  • Learn more about other features of the Modular platform for building and deploying high-performance AI endpoints.
