Get started with GPU programming with Mojo and the MAX Driver
This tutorial introduces you to GPU programming using Mojo and the MAX Driver. You'll learn how to write a simple program that performs vector addition on a GPU, exploring fundamental concepts of GPU programming along the way.
By the end of this tutorial, you will:
- Understand basic GPU programming concepts like grids and thread blocks
- Learn how to move data between CPU and GPU memory
- Write and compile a simple GPU kernel function
- Execute parallel computations on the GPU
We'll build everything step-by-step, starting with the basics and gradually adding more complexity. The concepts you learn here will serve as a foundation for more advanced GPU programming with Mojo.
System requirements: macOS, Linux, or Windows Subsystem for Linux (WSL), with a supported GPU.
1. Create a Mojo project with magic
We'll start by using the magic CLI to create a virtual environment and generate our initial project directory.

- If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:

  curl -ssL https://magic.modular.com/ | bash

  Then run the source command that's printed in your terminal.
- Navigate to the directory in which you want to create the project and execute:

  magic init gpu-intro --format mojoproject

  This creates a project directory named gpu-intro.
- Let's go into the directory and verify the project is configured correctly by checking the version of Mojo that's installed within our project's virtual environment:

  cd gpu-intro
  magic run mojo --version

  You should see a version string indicating the version of Mojo installed, which by default should be the latest nightly version. Because we used the --format mojoproject option when creating the project, magic automatically added the max package as a dependency, which includes Mojo and the MAX libraries.
- Activate the project's virtual environment:

  magic shell

  Later on, when you want to exit the virtual environment, just type exit.
2. Get references to the CPU and GPU
When using the MAX Driver, the Device class represents a logical instance of a device, for example, a CPU or GPU. The cpu() and accelerator() functions return a Device reference to the CPU and GPU, respectively. If no GPU is available, the accelerator() function raises an error. You can use the has_accelerator() function to check if a GPU is available.

So let's start by writing a program that checks if a GPU is available and then creates a CPU and GPU device. Using any editor, create a file named vector_addition.mojo with the following code:
from max.driver import accelerator, cpu
from sys import exit, has_accelerator

def main():
    if not has_accelerator():
        print("A GPU is required to run this program")
        exit()
    host_device = cpu()
    print("Found the CPU device")
    gpu_device = accelerator()
    print("Found the GPU device")
Save the file and run it using the mojo CLI:

mojo vector_addition.mojo
You should see the following output:
Found the CPU device
Found the GPU device
3. Define a simple kernel
A GPU kernel is simply a function that runs on a GPU, executing a specific computation on a large dataset in parallel across thousands or millions of threads. You might already be familiar with threads when programming for a CPU, but GPU threads are different. On a CPU, threads are managed by the operating system and can perform completely independent tasks, such as managing a user interface, fetching data from a database, and so on. But on a GPU, threads are managed by the GPU itself. All the threads on a GPU execute the same kernel function, but they each work on a different part of the data.
When you run a kernel, you need to specify the number of threads you want to use. The number of threads you specify depends on the size of the data you want to process and the amount of parallelism you want to achieve. A common strategy is to use one thread per element of data in the result. So if you're performing an elementwise addition of two 1,024-element vectors, you'd use 1,024 threads.
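To make that arithmetic concrete, here's a minimal standalone sketch. The element count and block size are example numbers of our own; ceildiv comes from Mojo's math module and appears again later in this tutorial:

from math import ceildiv

def main():
    num_elements = 1024
    threads_per_block = 256
    # One thread per output element; round up so no element is left uncovered.
    num_blocks = ceildiv(num_elements, threads_per_block)
    print("Launch", num_blocks, "blocks of", threads_per_block, "threads each")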
A grid is the top-level organizational structure for the threads executing a kernel function. A grid consists of multiple thread blocks, which are further divided into individual threads that execute the kernel function concurrently. The GPU assigns a unique block index to each thread block, and a unique thread index to each thread within a block. Threads within the same thread block can share data through shared memory and synchronize using built-in mechanisms, but they cannot directly communicate with threads in other blocks. For this tutorial, we won't get into the details of why or how to do this, but it's an important concept to keep in mind when you're writing more complex kernels.
To better understand how grids, thread blocks, and threads work together, let's write a simple kernel function that prints the thread block and thread indices. Add the following code to your vector_addition.mojo file:
from gpu.id import block_idx, thread_idx

fn print_threads():
    """Print thread IDs."""
    print("Block index: [",
        block_idx.x,
        "]\tThread index: [",
        thread_idx.x,
        "]"
    )
4. Compile and run the kernel
Next, we need to update the main() function to compile the kernel function for our GPU and then run it, specifying the number of thread blocks in the grid and the number of threads per thread block. For this initial example, let's define a grid consisting of 2 thread blocks, each with 64 threads. Update the imports and modify the main() function so that your program looks like this:
from gpu.host import Dim
from gpu.id import block_idx, thread_idx
from max.driver import Accelerator, Device, accelerator, cpu
from sys import exit, has_accelerator

fn print_threads():
    """Print thread IDs."""
    print("Block index: [",
        block_idx.x,
        "]\tThread index: [",
        thread_idx.x,
        "]"
    )

def main():
    if not has_accelerator():
        print("A GPU is required to run this program")
        exit()
    host_device = cpu()
    print("Found the CPU device")
    gpu_device = accelerator()
    print("Found the GPU device")

    print_threads_gpu = Accelerator.compile[print_threads](gpu_device)
    print_threads_gpu(gpu_device, grid_dim=Dim(2), block_dim=Dim(64))

    # Required for now to keep the main thread alive until the GPU is done
    Device.wait_for(gpu_device)
    print("Program finished")
Save the file and run it:
mojo vector_addition.mojo
You should see something like the following output (which is abbreviated here):
Found the CPU device
Found the GPU device
Block index: [ 1 ] Thread index: [ 0 ]
Block index: [ 1 ] Thread index: [ 1 ]
Block index: [ 1 ] Thread index: [ 2 ]
...
Block index: [ 0 ] Thread index: [ 30 ]
Block index: [ 0 ] Thread index: [ 31 ]
Program finished
The Accelerator.compile() function compiles a kernel function so that it can run on a particular GPU architecture. You specify the name of the kernel function as a compile-time Mojo parameter and the target GPU device as a run-time Mojo argument. (See the Functions section of the Mojo Manual for more information on Mojo function arguments and the Parameters section for more information on Mojo compile-time parameters and metaprogramming.)
When you invoke the compiled kernel function, the MAX Driver executes it asynchronously on the GPU. You must provide the following arguments in this order:
- The target GPU device
- Any additional arguments specified by the kernel function definition (none, in this case)
- The grid dimensions, using the grid_dim keyword argument
- The thread block dimensions, using the block_dim keyword argument
We're invoking the compiled kernel function with grid_dim=Dim(2) and block_dim=Dim(64), which means we're using a grid of 2 thread blocks, with 64 threads each, for a total of 128 threads in the grid.
When you run a kernel, the GPU assigns each thread block within the grid to a streaming multiprocessor for execution. A streaming multiprocessor (SM) is the fundamental processing unit of a GPU, designed to execute multiple parallel workloads efficiently. Each SM contains several cores, which perform the actual computations of the threads executing on the SM, along with shared resources like registers, shared memory, and control mechanisms to coordinate the execution of threads. The number of SMs and the number of cores on a GPU depends on its architecture. For example, the NVIDIA H100 PCIe contains 114 SMs, with 128 32-bit floating point cores per SM.
Additionally, when an SM is assigned a thread block, it divides the block into multiple warps, which are groups of 32 or 64 threads, depending on the GPU architecture. These threads execute the same instruction simultaneously in a single instruction, multiple threads (SIMT) model. The SM's warp scheduler coordinates the execution of warps on an SM's cores.
Warps are used to efficiently utilize GPU hardware by maximizing throughput and minimizing control overhead. Since GPUs are designed for high-performance parallel processing, grouping threads into warps allows for streamlined instruction scheduling and execution, reducing the complexity of managing individual threads. Multiple warps from multiple thread blocks can be active within an SM at any given time, enabling the GPU to keep execution units busy. For example, if the threads of a particular warp are blocked waiting for data from memory, the warp scheduler can immediately switch execution to another warp that's ready to run.
5. Manage grid dimensions
The grid in the previous step consisted of a one-dimensional grid of 2 thread blocks with 64 threads in each block. However, you can also organize the thread blocks in a two- or even a three-dimensional grid. Similarly, you can arrange the threads in a thread block across one, two, or three dimensions. Typically, you determine the dimensions of the grid and thread blocks based on the dimensionality of the data to process. For example, you might choose a one-dimensional grid for processing large vectors, a two-dimensional grid for processing matrices, and a three-dimensional grid for processing the frames of a video.
To better understand how grids, thread blocks, and threads work together, let's modify our print_threads() kernel function to print the x, y, and z components of the thread block and thread indices for each thread.
fn print_threads():
    """Print thread IDs."""
    print("Block index: [",
        block_idx.x, block_idx.y, block_idx.z,
        "]\tThread index: [",
        thread_idx.x, thread_idx.y, thread_idx.z,
        "]"
    )
Then, update the main() function to invoke the compiled kernel function with a 2x2x1 grid of thread blocks and a 16x4x2 arrangement of threads within each thread block:
print_threads_gpu(
    gpu_device,
    grid_dim=Dim(2, 2, 1),
    block_dim=Dim(16, 4, 2)
)
Save the file and run it again:
mojo vector_addition.mojo
You should see something like the following output (which is abbreviated here):
Found the CPU device
Found the GPU device
Block index: [ 0 1 0 ] Thread index: [ 0 2 0 ]
Block index: [ 0 1 0 ] Thread index: [ 1 2 0 ]
Block index: [ 0 1 0 ] Thread index: [ 2 2 0 ]
...
Block index: [ 1 1 0 ] Thread index: [ 14 3 0 ]
Block index: [ 1 1 0 ] Thread index: [ 15 3 0 ]
Program finished
Try changing the grid and thread block dimensions to see how the output changes.
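For example, you could launch a cubic grid of 8 thread blocks with a cubic arrangement of 64 threads in each block. (These dimensions are our own, chosen purely for illustration; the product of the block dimensions must stay within your GPU's per-block thread limit.)

print_threads_gpu(
    gpu_device,
    grid_dim=Dim(2, 2, 2),
    block_dim=Dim(4, 4, 4)
)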
6. Model our data using Tensor
Now that you understand how to manage grid dimensions, let's get ready to create a kernel that performs a simple elementwise addition of two vectors of floating point numbers.
We'll start by determining how to represent our data. Although we're going to be using one-dimensional data, we'll use a data type called a tensor that's capable of representing multi-dimensional data — basically, a multi-dimensional array. Because we have only a single dimension of data, this will be a rank-1 tensor.
We'll use the Tensor class from the max.driver package to represent our data. This Tensor class is a convenience class that can allocate memory for the tensor on either the CPU or the GPU. It also includes methods for moving and copying the data between the CPU and the GPU.
Let's add Tensor to the list of imports and then update main() to create two input tensors on the CPU. We won't need the print_threads() kernel function anymore, so we can remove it and the code to compile and invoke it. So after all that, your vector_addition.mojo file should look like this:
from gpu.host import Dim
from gpu.id import block_idx, thread_idx
from max.driver import Accelerator, Tensor, accelerator, cpu
from sys import exit, has_accelerator

alias float_dtype = DType.float32
alias tensor_rank = 1
alias vector_size = 100

def main():
    if not has_accelerator():
        print("A GPU is required to run this program")
        exit()
    host_device = cpu()
    gpu_device = accelerator()

    # Allocate the two input tensors on the host.
    lhs_tensor = Tensor[float_dtype, tensor_rank](vector_size, host_device)
    rhs_tensor = Tensor[float_dtype, tensor_rank](vector_size, host_device)

    # Fill them with initial values.
    for i in range(vector_size):
        lhs_tensor[i] = Float32(i)
        rhs_tensor[i] = Float32(i * 0.5)

    print("lhs_tensor:", lhs_tensor)
    print("rhs_tensor:", rhs_tensor)
The program starts by defining some compile-time aliases for the data type and the size of the vector we're going to process. Then, the main() function initializes two input tensors on the CPU.
The Tensor constructor has two compile-time parameters:
- type: The element data type. We're specifying float_dtype, our alias for DType.float32.
- rank: The rank of the tensor. We're specifying tensor_rank, our alias for 1.
We're also passing two run-time arguments to the constructor:
- shape: The shape of the tensor, which is the size of each dimension of the tensor. You can provide an instance of the TensorShape struct, a Mojo tuple of integers, or, for a rank-1 tensor, a single integer. Here, we're providing a single integer, vector_size, which is our alias for 100. (See the sketch after this list for a higher-rank example.)
- device: The device on which to allocate the tensor. If you omit this argument, the tensor is allocated on the host by default. Here, we're explicitly specifying host_device to make our intent clear.
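For instance, here's a sketch of how a higher-rank allocation might look, based on the tuple form of the shape argument described above. The 10x10 shape and the matrix_tensor name are our own, purely for illustration:

alias matrix_rank = 2

# A hypothetical rank-2, 10x10 tensor of float32 values, allocated on the host.
matrix_tensor = Tensor[float_dtype, matrix_rank]((10, 10), host_device)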
After allocating the tensors, we fill them with initial values and then print them.
Now let's run the program to verify that everything is working so far.
mojo vector_addition.mojo
You should see the following output:
lhs_tensor: Tensor([[0.0, 1.0, 2.0, ..., 97.0, 98.0, 99.0]], dtype=float32, shape=100)
rhs_tensor: Tensor([[0.0, 0.5, 1.0, ..., 48.5, 49.0, 49.5]], dtype=float32, shape=100)
7. Move the input tensors to the GPU and allocate an output tensor
Now that we have our input tensors allocated and initialized on the CPU, let's move them to the GPU so that they'll be available for the kernel function to use.
Add the following code to the end of the main() function:
# Move the input tensors to the accelerator.
lhs_tensor = lhs_tensor.move_to(gpu_device)
rhs_tensor = rhs_tensor.move_to(gpu_device)
The move_to() method returns a new Tensor object that is allocated on the specified device. It also implicitly calls the destructor on the original Tensor object, freeing the memory associated with it. The Mojo compiler would report an error if you tried to use the original Tensor object after moving it to the GPU. We could declare new variables to hold the moved tensors, but in this example we'll just reuse the original names.
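If you'd rather make the ownership transfer explicit with distinct names, an equivalent sketch looks like this (the lhs_gpu and rhs_gpu names are our own):

lhs_gpu = lhs_tensor.move_to(gpu_device)
rhs_gpu = rhs_tensor.move_to(gpu_device)
# Any later use of lhs_tensor or rhs_tensor would now be a compile-time error,
# because the moves destroyed the original objects.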
Next, let's allocate an output tensor on the GPU to hold the result of the kernel function. Add the following code to the end of the main() function:
# Allocate the output tensor on the accelerator.
out_tensor = Tensor[float_dtype, tensor_rank](vector_size, gpu_device)
8. Create LayoutTensor views
One last step before writing the kernel function is to create a LayoutTensor view for each of the tensors. LayoutTensor provides a powerful abstraction for multi-dimensional data with precise control over memory organization. It supports various memory layouts (row-major, column-major, tiled), hardware-specific optimizations, and efficient parallel access patterns. We won't go into the details of using LayoutTensor in this tutorial, but in more complex kernels it's a useful tool for manipulating your data.
All we need to do is add the following code to the end of the main() function:
# Create a LayoutTensor for each tensor.
lhs_layout_tensor = lhs_tensor.to_layout_tensor()
rhs_layout_tensor = rhs_tensor.to_layout_tensor()
out_layout_tensor = out_tensor.to_layout_tensor()
9. Define and compile the vector addition kernel function
Now we're ready to write the kernel function. Add the following code to vector_addition.mojo:
from gpu.id import block_dim, block_idx, thread_idx
from layout import LayoutTensor, Layout

...

alias block_size = 32

fn vector_addition[
    lhs_layout: Layout,
    rhs_layout: Layout,
    out_layout: Layout,
](
    lhs: LayoutTensor[float_dtype, lhs_layout, MutableAnyOrigin],
    rhs: LayoutTensor[float_dtype, rhs_layout, MutableAnyOrigin],
    out: LayoutTensor[float_dtype, out_layout, MutableAnyOrigin],
):
    """The calculation to perform across the vector on the GPU."""
    alias size = out_layout.size()  # Force compile-time evaluation.
    tid = block_dim.x * block_idx.x + thread_idx.x
    if tid < size:
        out[tid] = lhs[tid] + rhs[tid]
Our vector_addition() kernel function accepts the two input tensors and the output tensor as arguments. It also accepts compile-time Layout parameters for each of the tensors. (For example, lhs_layout is the inferred layout for the lhs argument.) A Layout is a representation of memory layouts using shape and stride information, and it maps between logical coordinates and linear memory indices. In our kernel function, we use only the layout of the out tensor to determine the size of the vector.
It's important to know the size of the vector because it might not be a multiple of the block size. In fact, in this example the size of the vector is 100, which is not a multiple of our block size of 32: covering all 100 elements requires 4 thread blocks of 32 threads, for a total of 128 threads, leaving the last 28 threads with no element to process. So as we assign our threads to read elements from the tensor, we need to make sure we don't overrun its bounds.
The body of the kernel function starts by calculating the linear index of the tensor element that a particular thread is responsible for. The block_dim object (which we added to the list of imports) contains the dimensions of the thread blocks as x, y, and z values. Because we're going to use a one-dimensional grid of thread blocks, we need only the x dimension. We can then calculate tid, the unique "global" index of the thread within the output tensor, as block_dim.x * block_idx.x + thread_idx.x. For example, the tid values for the threads in the first thread block range from 0 to 31. The tid values for the threads in the second thread block range from 32 to 63, and so on.
The function then checks whether the calculated tid is less than the size of the output tensor. If it is, the thread reads the corresponding elements from the lhs and rhs tensors, adds them together, and stores the result in the corresponding element of the out tensor.
Now that we've written the kernel function, we can compile it by adding the following code to the end of the main() function:
# Compile the kernel function to run on the GPU.
gpu_function = Accelerator.compile[
    vector_addition[
        lhs_layout_tensor.layout,
        rhs_layout_tensor.layout,
        out_layout_tensor.layout,
    ]
](gpu_device)
10. Invoke the kernel function and move the output back to the CPU
The last step is to invoke the kernel function and move the output back to the CPU. Add this line to the list of imports at the top of the file:
from math import ceildiv
Then, add the following code to the end of the main() function:
# Calculate the number of thread blocks needed by dividing the vector size
# by the block size and rounding up.
num_blocks = ceildiv(vector_size, block_size)

# Invoke the kernel function.
gpu_function(
    gpu_device,
    lhs_layout_tensor,
    rhs_layout_tensor,
    out_layout_tensor,
    grid_dim=Dim(num_blocks),
    block_dim=Dim(block_size),
)

# Move the output tensor back onto the CPU so that we can read the results.
out_tensor = out_tensor.move_to(host_device)
print("out_tensor:", out_tensor)
First we calculate the number of thread blocks needed by dividing the vector size by the block size and rounding up; here, ceildiv(100, 32) evaluates to 4. Then we can invoke the kernel function.
After that, we move the output tensor back onto the CPU so that we can read the results. This call blocks on the CPU until the kernel function has populated the output tensor and returned. The move also has the side effect of invoking the destructor for the original out_tensor object and freeing its allocated memory on the GPU. As for the input tensors, we don't need to move them back to the CPU. The Mojo compiler determines that the lhs_tensor and rhs_tensor objects are no longer needed after the kernel function has returned, and so it automatically invokes their destructors to free their allocated memory on the GPU. (For a detailed explanation of object lifetime and destruction in Mojo, see the Death of a value section of the Mojo Manual.)
So it's finally time to run the program to see the results of our hard work.
mojo vector_addition.mojo
You should see the following output:
lhs_tensor: Tensor([[0.0, 1.0, 2.0, ..., 97.0, 98.0, 99.0]], dtype=float32, shape=100)
rhs_tensor: Tensor([[0.0, 0.5, 1.0, ..., 48.5, 49.0, 49.5]], dtype=float32, shape=100)
out_tensor: Tensor([[0.0, 1.5, 3.0, ..., 145.5, 147.0, 148.5]], dtype=float32, shape=100)
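If you'd like to check the results programmatically rather than by eye, here's a quick sketch, assuming the same element indexing we used earlier to fill the input tensors also works for reads after the move back to the host:

# Verify each output element against the expected sum (a sketch, not part of
# the original tutorial program).
for i in range(vector_size):
    expected = Float32(i) + Float32(i * 0.5)
    if out_tensor[i] != expected:
        print("Mismatch at index", i)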
And now that you're done with the tutorial, exit your project's virtual environment:
exit
Summary
In this tutorial, we've learned how to use the MAX Driver to write a simple kernel function that performs an elementwise addition of two vectors. We covered:
- Understanding basic GPU concepts like devices, grids, and thread blocks
- Moving data between CPU and GPU memory using tensors
- Writing and compiling a GPU kernel function
- Executing parallel computations on the GPU
- Managing memory and object lifetimes across devices
Now that you understand the basics of GPU programming with Mojo, here are some suggested next steps:
- Check out more examples of GPU programming with Mojo and the MAX Driver in the public MAX GitHub repository.
- Try implementing other parallel algorithms like matrix multiplication or convolutions.
- Explore the MAX Driver API documentation to discover more advanced GPU programming features.
- Learn more about other features of the MAX platform for building and deploying high-performance AI endpoints.
- Read the GPU basics section of the Mojo Manual to get a taste of the low-level GPU programming APIs available in the gpu package.
- Check out the Mojo Manual for more information on the Mojo language.