Get started with GPU programming
This tutorial introduces you to GPU programming with Mojo. You'll learn how to write a simple program that performs vector addition on a GPU, exploring fundamental concepts of GPU programming along the way.
By the end of this tutorial, you will:
- Understand basic GPU programming concepts like grids and thread blocks.
- Learn how to move data between CPU and GPU memory.
- Write and compile a simple GPU kernel function.
- Execute parallel computations on the GPU.
- Understand the asynchronous nature of GPU programming.
We'll build everything step-by-step, starting with the basics and gradually adding more complexity. The concepts you learn here will serve as a foundation for more advanced GPU programming with Mojo.
System requirements: macOS, Linux, or Windows via WSL, with a supported GPU.
1. Create a Mojo project with magic
We'll start by using the `magic` CLI to create a virtual environment and generate our initial project directory.

1. If you don't have the `magic` CLI yet, you can install it on macOS and Ubuntu Linux with this command:

   ```bash
   curl -ssL https://magic.modular.com/ | bash
   ```

   Then run the `source` command that's printed in your terminal.

2. Navigate to the directory in which you want to create the project and execute:

   ```bash
   magic init gpu-intro --format mojoproject
   ```

   This creates a project directory named `gpu-intro`.

3. Let's go into the directory and verify the project is configured correctly by checking the version of Mojo that's installed within our project's virtual environment:

   ```bash
   cd gpu-intro
   magic run mojo --version
   ```

   You should see a version string indicating the version of Mojo installed, which by default should be the latest nightly version. Because we used the `--format mojoproject` option when creating the project, `magic` automatically added the `max` package as a dependency, which includes Mojo and the MAX libraries.

4. Activate the project's virtual environment:

   ```bash
   magic shell
   ```

   Later on, when you want to exit the virtual environment, just type `exit`.
2. Get a reference to the GPU device
The `DeviceContext` type represents a logical instance of a GPU device. It provides methods for allocating memory on the device, copying data between the host CPU and the GPU, and compiling and running functions (also known as kernels) on the device.

Use the `DeviceContext()` constructor to get a reference to the GPU device. The constructor raises an error if no compatible GPU is available. You can use the `has_accelerator()` function to check whether a compatible GPU is available.

So let's start by writing a program that checks if a GPU is available and then obtains a reference to the GPU device. Using any editor, create a file named `vector_addition.mojo` with the following code:
```mojo
from gpu.host import DeviceContext
from sys import has_accelerator

def main():
    @parameter
    if not has_accelerator():
        print("No compatible GPU found")
    else:
        ctx = DeviceContext()
        print("Found GPU:", ctx.name())
```
Save the file and run it using the `mojo` CLI:

```bash
mojo vector_addition.mojo
```

You should see output like the following (depending on the type of GPU you have):

```output
Found GPU: NVIDIA A10G
```
3. Define a simple kernel
A GPU kernel is simply a function that runs on a GPU, executing a specific computation on a large dataset in parallel across thousands or millions of threads. You might already be familiar with threads when programming for a CPU, but GPU threads are different. On a CPU, threads are managed by the operating system and can perform completely independent tasks, such as managing a user interface, fetching data from a database, and so on. But on a GPU, threads are managed by the GPU itself. All the threads on a GPU execute the same kernel function, but they each work on a different part of the data.
When you run a kernel, you need to specify the number of threads you want to use. The number of threads you specify depends on the size of the data you want to process and the amount of parallelism you want to achieve. A common strategy is to use one thread per element of data in the result. So if you're performing an element-wise addition of two 1,024-element vectors, you'd use 1,024 threads.
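To make the arithmetic concrete, here's a minimal sketch of that launch-size calculation (the `num_elements` and `threads_per_block` aliases are illustrative, not part of this tutorial's program). It uses `ceildiv()` from Mojo's `math` module, which we'll also use later in this tutorial, to round up so that every element is covered even when the size isn't an exact multiple of the block size:

```mojo
from math import ceildiv

alias num_elements = 1024
alias threads_per_block = 256

# ceildiv(1024, 256) == 4 thread blocks, so 4 * 256 = 1,024 threads:
# one thread per element of the result.
alias num_blocks = ceildiv(num_elements, threads_per_block)
```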
A grid is the top-level organizational structure for the threads executing a kernel function. A grid consists of multiple thread blocks, which are further divided into individual threads that execute the kernel function concurrently. The GPU assigns a unique block index to each thread block, and a unique thread index to each thread within a block. Threads within the same thread block can share data through shared memory and synchronize using built-in mechanisms, but they cannot directly communicate with threads in other blocks. For this tutorial, we won't get into the details of why or how to do this, but it's an important concept to keep in mind when you're writing more complex kernels.
To better understand how grids, thread blocks, and threads are organized, let's write a simple kernel function that prints the thread block and thread indices. Add the following code to your `vector_addition.mojo` file:
```mojo
from gpu.id import block_idx, thread_idx

fn print_threads():
    """Print thread IDs."""
    print("Block index: [",
        block_idx.x,
        "]\tThread index: [",
        thread_idx.x,
        "]"
    )
```
4. Compile and run the kernel
Next, we need to update the `main()` function to compile the kernel function for our GPU and then run it, specifying the number of thread blocks in the grid and the number of threads per thread block. For this initial example, let's define a grid consisting of 2 thread blocks, each with 64 threads. Modify the `main()` function so that your program looks like this:
```mojo
from gpu.host import DeviceContext
from gpu.id import block_idx, thread_idx
from sys import has_accelerator

fn print_threads():
    """Print thread IDs."""
    print("Block index: [",
        block_idx.x,
        "]\tThread index: [",
        thread_idx.x,
        "]"
    )

def main():
    @parameter
    if not has_accelerator():
        print("No compatible GPU found")
    else:
        ctx = DeviceContext()
        ctx.enqueue_function[print_threads](grid_dim=2, block_dim=64)
        ctx.synchronize()
        print("Program finished")
```
Save the file and run it:

```bash
mojo vector_addition.mojo
```
You should see something like the following output (which is abbreviated here):
```output
Block index: [ 1 ]	Thread index: [ 32 ]
Block index: [ 1 ]	Thread index: [ 33 ]
Block index: [ 1 ]	Thread index: [ 34 ]
...
Block index: [ 0 ]	Thread index: [ 30 ]
Block index: [ 0 ]	Thread index: [ 31 ]
Program finished
```
Typical CPU-GPU interaction is asynchronous, allowing the GPU to process tasks while the CPU is busy with other work. Each `DeviceContext` has an associated stream of queued operations to execute on the GPU. Operations within a stream execute in the order they are issued.
The `enqueue_function()` method compiles a kernel function and enqueues it to run on the given device. You must provide the kernel function as a compile-time Mojo parameter, along with the following arguments:

- Any additional arguments specified by the kernel function definition (none, in this case).
- The grid dimensions, using the `grid_dim` keyword argument.
- The thread block dimensions, using the `block_dim` keyword argument.
(See the Functions section of the Mojo Manual for more information on Mojo function arguments and the Parameters section for more information on Mojo compile-time parameters and metaprogramming.)
We're invoking the compiled kernel function with `grid_dim=2` and `block_dim=64`, which means we're using a grid of 2 thread blocks with 64 threads each, for a total of 128 threads in the grid.
When you run a kernel, the GPU assigns each thread block within the grid to a streaming multiprocessor for execution. A streaming multiprocessor (SM) is the fundamental processing unit of a GPU, designed to execute multiple parallel workloads efficiently. Each SM contains several cores, which perform the actual computations of the threads executing on the SM, along with shared resources like registers, shared memory, and control mechanisms to coordinate the execution of threads. The number of SMs and the number of cores on a GPU depend on its architecture. For example, the NVIDIA H100 PCIe contains 114 SMs, with 128 32-bit floating point cores per SM.
Additionally, when an SM is assigned a thread block, it divides the block into multiple warps, which are groups of 32 or 64 threads, depending on the GPU architecture. These threads execute the same instruction simultaneously in a single instruction, multiple threads (SIMT) model. The SM's warp scheduler coordinates the execution of warps on an SM's cores.
Warps are used to efficiently utilize GPU hardware by maximizing throughput and minimizing control overhead. Since GPUs are designed for high-performance parallel processing, grouping threads into warps allows for streamlined instruction scheduling and execution, reducing the complexity of managing individual threads. Multiple warps from multiple thread blocks can be active within an SM at any given time, enabling the GPU to keep execution units busy. For example, if the threads of a particular warp are blocked waiting for data from memory, the warp scheduler can immediately switch execution to another warp that's ready to run.
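As a quick sketch of the arithmetic (assuming a warp size of 32, as on NVIDIA GPUs; the alias names are illustrative only, not part of this tutorial's program):

```mojo
alias warp_size = 32          # 32 on NVIDIA GPUs; some architectures use 64
alias threads_per_block = 64  # the block size from our earlier example

# Each 64-thread block is divided into 64 // 32 = 2 warps, which the SM's
# warp scheduler interleaves across its cores.
alias warps_per_block = threads_per_block // warp_size
```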
After enqueuing the kernel function, we want to ensure that the CPU waits for it to finish execution before exiting the program. We do this by calling the `synchronize()` method of the `DeviceContext` object, which blocks until the device completes all operations in its queue.
5. Manage grid dimensions
The grid in the previous step consisted of a one-dimensional grid of 2 thread blocks with 64 threads in each block. However, you can also organize the thread blocks in a two- or even a three-dimensional grid. Similarly, you can arrange the threads in a thread block across one, two, or three dimensions. Typically, you determine the dimensions of the grid and thread blocks based on the dimensionality of the data to process. For example, you might choose a one-dimensional grid for processing large vectors, a two-dimensional grid for processing matrices, and a three-dimensional grid for processing the frames of a video.
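For instance, a kernel processing a matrix with a two-dimensional grid would typically combine the block and thread indices in both dimensions to locate its element. Here's a hedged sketch for illustration only (the kernel and the `rows` and `cols` aliases are hypothetical, not part of this tutorial's program):

```mojo
from gpu.id import block_dim, block_idx, thread_idx

# Hypothetical matrix dimensions for this sketch
alias rows = 64
alias cols = 64

fn process_matrix_element():
    # Combine the 2D block and thread indices to find this thread's element
    var row = block_idx.y * block_dim.y + thread_idx.y
    var col = block_idx.x * block_dim.x + thread_idx.x

    # Guard against threads that fall outside the matrix bounds
    if row < rows and col < cols:
        # ... process element (row, col) ...
        pass
```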
To better understand how grids, thread blocks, and threads work together, let's modify our `print_threads()` kernel function to print the `x`, `y`, and `z` components of the thread block and thread indices for each thread.
```mojo
fn print_threads():
    """Print thread IDs."""
    print("Block index: [",
        block_idx.x, block_idx.y, block_idx.z,
        "]\tThread index: [",
        thread_idx.x, thread_idx.y, thread_idx.z,
        "]"
    )
```
Then, update `main()` to enqueue the kernel function with a 2x2x1 grid of thread blocks and a 16x4x2 arrangement of threads within each thread block:
```mojo
ctx.enqueue_function[print_threads](
    grid_dim=(2, 2, 1),
    block_dim=(16, 4, 2)
)
```
Save the file and run it again:
```bash
mojo vector_addition.mojo
```
You should see something like the following output (which is abbreviated here):
```output
Block index: [ 1 1 0 ]	Thread index: [ 0 2 0 ]
Block index: [ 1 1 0 ]	Thread index: [ 1 2 0 ]
Block index: [ 1 1 0 ]	Thread index: [ 2 2 0 ]
...
Block index: [ 0 0 0 ]	Thread index: [ 14 1 0 ]
Block index: [ 0 0 0 ]	Thread index: [ 15 1 0 ]
Program finished
```
Try changing the grid and thread block dimensions to see how the output changes.
Now that you understand how to manage grid dimensions, let's get ready to create a kernel that performs a simple element-wise addition of two vectors of floating point numbers.
6. Allocate host memory for the input vectors
Before creating the two input vectors for our kernel function, we need to understand the distinction between host memory and device memory. Host memory is dynamic random-access memory (DRAM) accessible by the CPU, whereas device memory is DRAM accessible by the GPU. If you have data in host memory, you must explicitly copy it to device memory before you can use it in a kernel function. Similarly, if your kernel function produces data that you want the CPU to use later, you must explicitly copy it back to host memory.
For this tutorial, we'll use the `HostBuffer` type to represent our vectors on the host. A `HostBuffer` is a block of host memory associated with a particular `DeviceContext`. It supports methods for transferring data between host and device memory, as well as a basic set of methods for accessing data elements by index and for printing the buffer.

Let's update `main()` to create two `HostBuffer`s for our input vectors and initialize them with values. You won't need the `print_threads()` kernel function anymore, so you can remove it, along with the code that compiles and invokes it. After all that, your `vector_addition.mojo` file should look like this:
```mojo
from gpu.host import DeviceContext
from gpu.id import block_idx, thread_idx
from sys import has_accelerator

# Vector data type and size
alias float_dtype = DType.float32
alias vector_size = 1000

def main():
    @parameter
    if not has_accelerator():
        print("No compatible GPU found")
    else:
        # Get the context for the attached GPU
        ctx = DeviceContext()

        # Create HostBuffers for input vectors
        lhs_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](
            vector_size
        )
        rhs_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](
            vector_size
        )
        ctx.synchronize()

        # Initialize the input vectors
        for i in range(vector_size):
            lhs_host_buffer[i] = Float32(i)
            rhs_host_buffer[i] = Float32(i * 0.5)

        print("LHS buffer: ", lhs_host_buffer)
        print("RHS buffer: ", rhs_host_buffer)
```
The `enqueue_create_host_buffer()` method accepts the data type as a compile-time parameter and the size of the buffer as a run-time argument, and returns a `HostBuffer`. As with all `DeviceContext` methods whose names start with `enqueue_`, the method is asynchronous: it returns immediately after adding the operation to the queue to be executed by the `DeviceContext`. Therefore, we need to call the `synchronize()` method to ensure that the operation has completed before we use the `HostBuffer` object. Then we can initialize the input vectors with values and print them.
Now let's run the program to verify that everything is working so far:

```bash
mojo vector_addition.mojo
```

You should see the following output:

```output
LHS buffer: HostBuffer([0.0, 1.0, 2.0, ..., 997.0, 998.0, 999.0])
RHS buffer: HostBuffer([0.0, 0.5, 1.0, ..., 498.5, 499.0, 499.5])
```
7. Copy the input vectors to GPU memory and allocate an output vector
Now that we have our input vectors allocated and initialized on the CPU, let's copy them to the GPU so that they'll be available for the kernel function to use. While we're at it, we'll also allocate memory on the GPU for the output vector that will hold the result of the kernel function.
Add the following code to the end of the `main()` function:
```mojo
        # Create DeviceBuffers for the input vectors
        lhs_device_buffer = ctx.enqueue_create_buffer[float_dtype](vector_size)
        rhs_device_buffer = ctx.enqueue_create_buffer[float_dtype](vector_size)

        # Copy the input vectors from the HostBuffers to the DeviceBuffers
        ctx.enqueue_copy(dst_buf=lhs_device_buffer, src_buf=lhs_host_buffer)
        ctx.enqueue_copy(dst_buf=rhs_device_buffer, src_buf=rhs_host_buffer)

        # Create a DeviceBuffer for the result vector
        result_device_buffer = ctx.enqueue_create_buffer[float_dtype](
            vector_size
        )
```
The `DeviceBuffer` type is analogous to the `HostBuffer` type, but represents a block of device memory associated with a particular `DeviceContext`. Specifically, the buffer is located in the device's global memory space, which is accessible by all threads executing on the device. As with a `HostBuffer`, a `DeviceBuffer` is subject to Mojo's standard ownership and lifecycle mechanisms. It persists until it is no longer referenced in the program or until the `DeviceContext` itself is destroyed.
The `enqueue_create_buffer()` method accepts the data type as a compile-time parameter and the size of the buffer as a run-time argument, and returns a `DeviceBuffer`. The operation is asynchronous, but we don't need to call the `synchronize()` method yet because we have more operations to add to the queue.
The `enqueue_copy()` method is overloaded to support copying from host to device, device to host, or even device to device for systems that have multiple GPUs. In this example, we use it to copy the data in our `HostBuffer`s to the `DeviceBuffer`s.
8. Create LayoutTensor views

One last step before writing the kernel function: we're going to create a `LayoutTensor` view for each of the vectors. `LayoutTensor` provides a powerful abstraction for multi-dimensional data with precise control over memory organization. It supports various memory layouts (row-major, column-major, tiled), hardware-specific optimizations, and efficient parallel access patterns.

We don't need all of these features for this tutorial, but in more complex kernels it's a useful tool for manipulating data. So even though it isn't strictly necessary for this example, we'll use `LayoutTensor` because you'll see it in more complex examples and it's good to get familiar with it.
First, add the following import to the top of the file:

```mojo
from layout import Layout, LayoutTensor
```
A `Layout` is a representation of memory layouts using shape and stride information; it maps between logical coordinates and linear memory indices. We'll need to use the same `Layout` definition multiple times, so add the following alias to the top of the file, after the other aliases:

```mojo
alias layout = Layout.row_major(vector_size)
```
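For intuition, here's a sketch of how shape and stride information maps coordinates to linear indices (the two-dimensional `matrix_layout` alias is illustrative only; our program uses the one-dimensional `layout` alias above):

```mojo
from layout import Layout

# A 3x4 row-major layout has shape (3, 4) and strides (4, 1), so logical
# coordinate (i, j) maps to linear memory index i*4 + j.
# For example, (1, 2) maps to 1*4 + 2 = 6.
alias matrix_layout = Layout.row_major(3, 4)
```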
And finally, add the following code to the end of the `main()` function to create `LayoutTensor` views for each of the vectors:
```mojo
        # Wrap the DeviceBuffers in LayoutTensors
        lhs_tensor = LayoutTensor[float_dtype, layout](lhs_device_buffer)
        rhs_tensor = LayoutTensor[float_dtype, layout](rhs_device_buffer)
        result_tensor = LayoutTensor[float_dtype, layout](result_device_buffer)
```
9. Define the vector addition kernel function
Now we're ready to write the kernel function. First, add the following imports (note that we've added `block_dim` to the list of imports from `gpu.id`):
```mojo
from gpu.id import block_dim, block_idx, thread_idx
from math import ceildiv
```
Then, add the following code to `vector_addition.mojo` just before the `main()` function:
```mojo
# Calculate the number of thread blocks needed by dividing the vector size
# by the block size and rounding up.
alias block_size = 256
alias num_blocks = ceildiv(vector_size, block_size)

fn vector_addition(
    lhs_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
    rhs_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
    out_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
):
    """Calculate the element-wise sum of two vectors on the GPU."""

    # Calculate the index of the vector element for the thread to process
    var tid = block_idx.x * block_dim.x + thread_idx.x

    # Don't process out of bounds elements
    if tid < vector_size:
        out_tensor[tid] = lhs_tensor[tid] + rhs_tensor[tid]
```
Our `vector_addition()` kernel function accepts the two input tensors and the output tensor as arguments. We also need to know the size of the vector (which we've defined with the alias `vector_size`) because it might not be a multiple of the block size. In fact, in this example the vector size is 1,000, which is not a multiple of our block size of 256: `ceildiv(1000, 256)` rounds up to 4 thread blocks, for a total of 1,024 threads, leaving the last 24 threads with no element to process. So as we assign our threads to read elements from the tensor, we need to make sure we don't overrun the bounds of the tensor.
The body of the kernel function starts by calculating the linear index of the tensor element that a particular thread is responsible for. The `block_dim` object (which we added to the list of imports) contains the dimensions of the thread blocks as `x`, `y`, and `z` values. Because we're using a one-dimensional grid of thread blocks, we need only the `x` dimension. We can then calculate `tid`, the unique "global" index of the thread within the output tensor, as `block_idx.x * block_dim.x + thread_idx.x`. For example, the `tid` values for the threads in the first thread block range from 0 to 255, the `tid` values for the threads in the second thread block range from 256 to 511, and so on.
The function then checks whether the calculated `tid` is less than the size of the output tensor. If it is, the thread reads the corresponding elements from `lhs_tensor` and `rhs_tensor`, adds them together, and stores the result in the corresponding element of `out_tensor`.
10. Invoke the kernel function and copy the output back to the CPU
The last step is to compile and invoke the kernel function, then copy the output back to the CPU. To do so, add the following code to the end of the `main()` function:
```mojo
        # Compile and enqueue the kernel
        ctx.enqueue_function[vector_addition](
            lhs_tensor,
            rhs_tensor,
            result_tensor,
            grid_dim=num_blocks,
            block_dim=block_size,
        )

        # Create a HostBuffer for the result vector
        result_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](
            vector_size
        )

        # Copy the result vector from the DeviceBuffer to the HostBuffer
        ctx.enqueue_copy(
            dst_buf=result_host_buffer, src_buf=result_device_buffer
        )

        # Finally, synchronize the DeviceContext to run all enqueued operations
        ctx.synchronize()

        print("Result vector:", result_host_buffer)
```
The `enqueue_function()` method enqueues the compilation and invocation of the `vector_addition()` kernel function, passing the input and output tensors as arguments. The `grid_dim` and `block_dim` arguments use the `num_blocks` and `block_size` aliases we defined in the previous step.
After the kernel function has been compiled and enqueued, we create a `HostBuffer` to hold the result vector. Then we copy the result vector from the `DeviceBuffer` to the `HostBuffer`. Finally, we synchronize the `DeviceContext` to run all enqueued operations. After synchronizing, we can print the result vector to the console.
At this point, the Mojo compiler determines that the `DeviceContext`, the `DeviceBuffer`s, the `HostBuffer`s, and the `LayoutTensor`s are no longer used, and so it automatically invokes their destructors to free their allocated memory. (For a detailed explanation of object lifetime and destruction in Mojo, see the Death of a value section of the Mojo Manual.)
So it's finally time to run the program to see the results of our hard work:

```bash
mojo vector_addition.mojo
```
You should see the following output:

```output
LHS buffer: HostBuffer([0.0, 1.0, 2.0, ..., 997.0, 998.0, 999.0])
RHS buffer: HostBuffer([0.0, 0.5, 1.0, ..., 498.5, 499.0, 499.5])
Result vector: HostBuffer([0.0, 1.5, 3.0, ..., 1495.5, 1497.0, 1498.5])
```
And now that you're done with the tutorial, exit your project's virtual environment:

```bash
exit
```
Summary
In this tutorial, we've learned how to use Mojo's `gpu.host` package to write a simple kernel function that performs an element-wise addition of two vectors. We covered:
- Understanding basic GPU concepts like devices, grids, and thread blocks.
- Moving data between CPU and GPU memory.
- Writing and compiling a GPU kernel function.
- Executing parallel computations on the GPU.
Next steps
Now that you understand the basics of GPU programming with Mojo, here are some suggested next steps:
- Check out more examples of GPU programming with Mojo in the public Modular GitHub repository.
- Read the GPU basics section of the Mojo Manual to find out more about GPU programming in Mojo.
- Read the Introduction to layouts section of the Mojo Manual to learn more about the `layout` package and managing layouts.
- Check out the Mojo Manual for more information on the Mojo language.
- Learn more about other features of the Modular platform for building and deploying high-performance AI endpoints.