Mojo package
gpu
The GPU package provides low-level programming constructs for working with GPUs. These low-level constructs allow you to write code that runs on the GPU in the traditional GPU programming style: partitioning work across threads that are mapped onto 1-, 2-, or 3-dimensional blocks. The thread blocks can in turn be grouped into a grid of thread blocks.
A kernel is a function that runs on the GPU in parallel across many threads.
Currently, the DeviceContext struct provides the interface for compiling and launching GPU kernels inside MAX custom operations.
The gpu.host package includes APIs to manage interaction between the host (that is, the CPU) and the device (that is, the GPU or accelerator).
See the gpu.id module for a list of aliases you can use to access information about the grid and the current thread, including block dimensions, block index within the grid, and thread index.
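For example, a simple elementwise kernel can combine these aliases to compute each thread's global index. The following is a minimal sketch; the kernel name, signature, and bounds-check style are illustrative, not part of this package's API:

```mojo
from gpu.id import block_dim, block_idx, thread_idx
from memory import UnsafePointer

# Hypothetical kernel: scales each element of a buffer in place.
fn scale_kernel(data: UnsafePointer[Float32], factor: Float32, size: Int):
    # Global index of this thread within a 1-dimensional grid.
    var i = Int(block_idx.x * block_dim.x + thread_idx.x)
    # Guard against threads launched past the end of the data.
    if i < size:
        data[i] = data[i] * factor
```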
The sync module provides functions for synchronizing threads.
For an example of launching a GPU kernel from a MAX custom operation, see the vector addition example in the MAX repo.
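As a rough host-side sketch, a kernel like the hypothetical scale_kernel above can be compiled and launched through a DeviceContext. Exact method signatures may differ between releases, and the grid and block sizes here are arbitrary:

```mojo
from gpu.host import DeviceContext

def main():
    # Create a context for the default GPU device.
    var ctx = DeviceContext()
    alias size = 1024

    # Allocate a buffer in device memory (left uninitialized in this sketch).
    var buf = ctx.enqueue_create_buffer[DType.float32](size)

    # Launch the kernel: 4 blocks of 256 threads covers 1024 elements.
    ctx.enqueue_function[scale_kernel](
        buf.unsafe_ptr(), Float32(2.0), size,
        grid_dim=4, block_dim=256,
    )

    # Block until all enqueued GPU work has completed.
    ctx.synchronize()
```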
Packages
- host: Implements the gpu host package.
Modules
- all_reduce
- globals: This module includes NVIDIA GPU global constants.
- id: This module includes grid-related aliases and functions. Most of these are generic; a few are specific to NVIDIA GPUs.
- intrinsics: This module includes NVIDIA GPU intrinsic operations.
- memory: This module includes NVIDIA GPU memory operations.
- mma: This module includes utilities for working with the warp-matrix-matrix-multiplication (wmma) instructions.
- mma_util: This module provides abstractions for doing matrix-multiply-accumulate (mma) using tensor cores. PTX documentation: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-fragment-mma-1688 AMD documentation: https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/
- profiler: This module includes a simple GPU profiler.
- random: Implements a basic RNG using the Philox algorithm.
- semaphore: Implementation of a CTA-wide semaphore for inter-CTA synchronization.
- shuffle: This module includes intrinsics for NVIDIA GPU shuffle instructions.
- sync: This module includes intrinsics for NVIDIA GPU sync instructions.
- tensor_ops