Mojo struct

DeviceContext

@register_passable
struct DeviceContext

Represents a single stream of execution on a particular accelerator (GPU).

A DeviceContext serves as the low-level interface to the accelerator inside a MAX custom operation. It provides methods for allocating buffers on the device, copying data between host and device, and compiling and running functions (also known as kernels) on the device.
The device context can be used as a context manager. For example:

```mojo
from gpu.host import DeviceContext
from gpu import thread_idx

fn kernel():
    print("hello from thread:", thread_idx.x, thread_idx.y, thread_idx.z)

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=(2, 2, 2))
    ctx.synchronize()
```
A custom operation receives an opaque DeviceContextPtr, which provides a get_device_context() method to retrieve the device context:

```mojo
from runtime.asyncrt import DeviceContextPtr

@register("custom_op")
struct CustomOp:
    @staticmethod
    fn execute(ctx_ptr: DeviceContextPtr) raises:
        var ctx = ctx_ptr.get_device_context()
        ctx.enqueue_function[kernel](grid_dim=1, block_dim=(2, 2, 2))
        ctx.synchronize()
```
Aliases
- device_info: The gpu.info.Info object for the default accelerator.
- device_api: The device API for the default accelerator (for example, "cuda" or "hip").
Implemented traits
AnyType, CollectionElement, Copyable, Movable, UnknownDestructibility
Methods
__init__
__init__(out self, device_id: Int = 0, *, api: String = String(device_api))

Constructs a DeviceContext for the specified device.
This initializer creates a new device context for the specified accelerator device. The device context provides an interface for interacting with the GPU, including memory allocation, data transfer, and kernel execution.
Example:

```mojo
from gpu.host import DeviceContext

# Create a context for the default GPU
var ctx = DeviceContext()

# Create a context for a specific GPU (device 1)
var ctx2 = DeviceContext(1)
```
Args:

- device_id (Int): ID of the accelerator device. If not specified, uses the default accelerator (device 0).
- api (String): Requested device API (for example, "cuda" or "hip"). Defaults to the device API specified by the DeviceContext class.
Raises:
If device initialization fails or the specified device is not available.
__copyinit__
__copyinit__(existing: Self) -> Self
Creates a copy of an existing device context by incrementing its reference count.
This copy constructor creates a new reference to the same underlying device context by incrementing the reference count of the native context object. Both the original and the copy will refer to the same device context.
Args:

- existing (Self): The device context to copy.
__del__
__del__(owned self)
Releases resources associated with this device context.
This destructor decrements the reference count of the native device context. When the reference count reaches zero, the underlying resources are released, including any cached memory buffers and compiled device functions.
copy
copy(self) -> Self
Explicitly constructs a copy of this device context.
This method creates a new reference to the same underlying device context by incrementing the reference count of the native context object.
Returns:
A copy of this device context that refers to the same underlying context.
__enter__
__enter__(owned self) -> Self
Enables the use of DeviceContext in a 'with' statement context manager.
This method allows DeviceContext to be used with Python-style context managers, which ensures proper resource management and cleanup when the context exits.
Example:

```mojo
from gpu.host import DeviceContext

# Using DeviceContext as a context manager
with DeviceContext() as ctx:
    # Perform GPU operations
    # Resources are automatically released when exiting the block
    pass
```
Returns:
The DeviceContext instance to be used within the context manager block.
name
name(self) -> String
Returns the device name, an ASCII string identifying this device, defined by the native device API.
This method queries the underlying GPU device for its name, which typically includes the model and other identifying information. This can be useful for logging, debugging, or making runtime decisions based on the specific GPU hardware.
Example:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
print("Running on device:", ctx.name())
```
Returns:
A string containing the device name.
api
api(self) -> String
Returns the name of the API used to program the device.
This method queries the underlying device context to determine which GPU programming API is being used for the current device. This information is useful for writing code that can adapt to different GPU architectures and programming models.
Possible values are:
- "cpu": Generic host device (CPU).
- "cuda": NVIDIA GPUs.
- "hip": AMD GPUs.
Example:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
var api_name = ctx.api()
print("Using device API:", api_name)

# Conditionally execute code based on the API
if api_name == "cuda":
    print("Running on NVIDIA GPU")
elif api_name == "hip":
    print("Running on AMD GPU")
```
Returns:
A string identifying the device API.
enqueue_create_buffer
enqueue_create_buffer[type: DType](self, size: Int) -> DeviceBuffer[type]
Enqueues a buffer creation using the DeviceBuffer constructor.
For GPU devices, the space is allocated in the device's global memory.
Parameters:

- type (DType): The data type to be stored in the allocated memory.

Args:

- size (Int): The number of elements of type to allocate memory for.
Returns:
The allocated buffer.
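
For example, a minimal sketch (names are illustrative) that allocates a device buffer and zero-initializes it with enqueue_memset(), which is described later on this page:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    # The allocation is enqueued on this context's stream.
    var dev_buf = ctx.enqueue_create_buffer[DType.float32](1024)
    # Zero-initialize the buffer before use.
    ctx.enqueue_memset(dev_buf, 0.0)
    ctx.synchronize()
```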
create_buffer_sync
create_buffer_sync[type: DType](self, size: Int) -> DeviceBuffer[type]
Creates a buffer synchronously using the DeviceBuffer constructor.
Parameters:

- type (DType): The data type to be stored in the allocated memory.

Args:

- size (Int): The number of elements of type to allocate memory for.
Returns:
The allocated buffer.
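
A minimal sketch; unlike enqueue_create_buffer(), the buffer is usable as soon as the call returns:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
# Blocks until the allocation is complete.
var buf = ctx.create_buffer_sync[DType.int32](256)
```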
enqueue_create_host_buffer
enqueue_create_host_buffer[type: DType](self, size: Int) -> HostBuffer[type]
Enqueues the creation of a HostBuffer.
This function allocates memory on the host that is accessible by the device. The memory is page-locked (pinned) for efficient data transfer between host and device.
Pinned memory is guaranteed to remain resident in the host's RAM and is never paged or swapped out to disk. Memory allocated normally (for example, using UnsafePointer.alloc()) is pageable: individual pages of memory can be moved to secondary storage (disk/SSD) when main memory fills up.
Using pinned memory allows devices to make fast transfers between host memory and device memory, because they can use direct memory access (DMA) to transfer data without relying on the CPU.
Allocating too much pinned memory can cause performance issues, since it reduces the amount of memory available for other processes.
Example:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    # Allocate host memory accessible by the device
    var host_buffer = ctx.enqueue_create_host_buffer[DType.float32](1024)
    # Use the host buffer for device operations
    # ...
```
Parameters:

- type (DType): The data type to be stored in the allocated memory.

Args:

- size (Int): The number of elements of type to allocate memory for.

Returns:

A HostBuffer object that wraps the allocated host memory.
Raises:
If memory allocation fails or if the device context is invalid.
compile_function
compile_function[func_type: AnyTrivialRegType, //, func: func_type, *, dump_asm: Variant[Bool, Path, StringLiteral, fn() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringLiteral, fn() capturing -> Path] = False](self, *, func_attribute: OptionalReg[FuncAttribute] = None, out result: DeviceFunction[func, target=device_info.target()])
Compiles the provided function for execution on this device.
Parameters:

- func_type (AnyTrivialRegType): Type of the function.
- func (func_type): The function to compile.
- dump_asm (Variant[Bool, Path, StringLiteral, fn() capturing -> Path]): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
- dump_llvm (Variant[Bool, Path, StringLiteral, fn() capturing -> Path]): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
Args:

- func_attribute (OptionalReg[FuncAttribute]): An attribute to use when compiling the code (such as maximum shared memory size).
Returns:
The compiled function.
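
For example, a sketch that compiles a kernel once and dumps the generated assembly while doing so (passing a Bool here; a file path would also work):

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    # Compile once and dump the generated assembly to stdout.
    var compiled = ctx.compile_function[kernel, dump_asm=True]()
    ctx.enqueue_function(compiled, grid_dim=1, block_dim=1)
    ctx.synchronize()
```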
compile_function[func_type: AnyTrivialRegType, //, func: func_type, *, dump_asm: Variant[Bool, Path, StringLiteral, fn() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringLiteral, fn() capturing -> Path] = False, _dump_sass: Variant[Bool, Path, StringLiteral, fn() capturing -> Path] = False, _ptxas_info_verbose: Bool = False, _target: target = device_info.target()](self, *, func_attribute: OptionalReg[FuncAttribute] = None, out result: DeviceFunction[func, target=_target, _ptxas_info_verbose=_ptxas_info_verbose])
Compiles the provided function for execution on this device.
Parameters:

- func_type (AnyTrivialRegType): Type of the function.
- func (func_type): The function to compile.
- dump_asm (Variant[Bool, Path, StringLiteral, fn() capturing -> Path]): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
- dump_llvm (Variant[Bool, Path, StringLiteral, fn() capturing -> Path]): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
- _dump_sass (Variant[Bool, Path, StringLiteral, fn() capturing -> Path]): Only runs on NVIDIA targets, and requires the CUDA Toolkit to be installed. Pass True, or a file path to dump to, or a function returning a file path.
- _ptxas_info_verbose (Bool): Only runs on NVIDIA targets, and requires the CUDA Toolkit to be installed. Changes dump_asm to output verbose PTX assembly (default False).
- _target (target): Change the target to a different device type than the one associated with this DeviceContext.
Args:

- func_attribute (OptionalReg[FuncAttribute]): An attribute to use when compiling the code (such as maximum shared memory size).
Returns:
The compiled function.
load_function
load_function[func_type: AnyTrivialRegType, //, func: func_type](self, *, function_name: StringSlice[origin], asm: StringSlice[origin], func_attribute: OptionalReg[FuncAttribute] = None, out result: DeviceExternalFunction)
Loads a pre-compiled device function from assembly code.
This method loads an external GPU function from provided assembly code (PTX/SASS) rather than compiling it from Mojo source. This is useful for integrating with existing CUDA/HIP code or for using specialized assembly optimizations.
Example:

```mojo
from gpu.host import DeviceContext

fn kernel():
    # Mojo-side declaration matching the external kernel.
    pass

var ctx = DeviceContext()
var ptx_code = "..."  # PTX assembly code
# The loaded function is returned through the `out result` argument.
var ext_func = ctx.load_function[kernel](
    function_name="my_kernel",
    asm=ptx_code,
)
```
Parameters:

- func_type (AnyTrivialRegType): The type of the function to load.
- func (func_type): The function reference.
Args:

- function_name (StringSlice[origin]): The name of the function in the assembly code.
- asm (StringSlice[origin]): The assembly code (PTX/SASS) containing the function.
- func_attribute (OptionalReg[FuncAttribute]): Optional attribute to apply to the function (such as maximum shared memory size).
Returns:

The loaded function, stored in the result parameter.
Raises:
If loading the function fails or the assembly code is invalid.
enqueue_function
enqueue_function[func_type: AnyTrivialRegType, //, func: func_type, *Ts: AnyType, *, dump_asm: Variant[Bool, Path, StringLiteral, fn() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringLiteral, fn() capturing -> Path] = False, _dump_sass: Variant[Bool, Path, StringLiteral, fn() capturing -> Path] = False, _ptxas_info_verbose: Bool = False](self, *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, owned attributes: List[LaunchAttribute] = List(), owned constant_memory: List[ConstantMemoryMapping] = List(), func_attribute: OptionalReg[FuncAttribute] = None)
Compiles and enqueues a kernel for execution on this device.
You can pass the function directly to enqueue_function without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```
If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile it first to remove the overhead:
```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```
Parameters:

- func_type (AnyTrivialRegType): The type of the function to launch.
- func (func_type): The function to launch.
- *Ts (AnyType): The types of the arguments being passed to the function.
- dump_asm (Variant[Bool, Path, StringLiteral, fn() capturing -> Path]): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
- dump_llvm (Variant[Bool, Path, StringLiteral, fn() capturing -> Path]): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
- _dump_sass (Variant[Bool, Path, StringLiteral, fn() capturing -> Path]): Only runs on NVIDIA targets, and requires the CUDA Toolkit to be installed. Pass True, or a file path to dump to, or a function returning a file path.
- _ptxas_info_verbose (Bool): Only runs on NVIDIA targets, and requires the CUDA Toolkit to be installed. Changes dump_asm to output verbose PTX assembly (default False).
Args:

- *args (*Ts): Variadic arguments which are passed to the func.
- grid_dim (Dim): The grid dimensions.
- block_dim (Dim): The block dimensions.
- cluster_dim (OptionalReg[Dim]): The cluster dimensions.
- shared_mem_bytes (OptionalReg[Int]): Amount of shared memory per thread block.
- attributes (List[LaunchAttribute]): A List of launch attributes.
- constant_memory (List[ConstantMemoryMapping]): A List of constant memory mappings.
- func_attribute (OptionalReg[FuncAttribute]): CUfunction_attribute enum.
enqueue_function[*Ts: AnyType](self, f: DeviceFunction[func, target=target, _ptxas_info_verbose=_ptxas_info_verbose], *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, owned attributes: List[LaunchAttribute] = List(), owned constant_memory: List[ConstantMemoryMapping] = List())
Enqueues a compiled function for execution on this device.
You can pass the function directly to enqueue_function without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```
If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile the function first to remove the overhead:
```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```
Parameters:

- *Ts (AnyType): Argument types.
Args:

- f (DeviceFunction[func, target=target, _ptxas_info_verbose=_ptxas_info_verbose]): The compiled function to execute.
- *args (*Ts): Arguments to pass to the function.
- grid_dim (Dim): Dimensions of the compute grid, made up of thread blocks.
- block_dim (Dim): Dimensions of each thread block in the grid.
- cluster_dim (OptionalReg[Dim]): Dimensions of clusters (if the thread blocks are grouped into clusters).
- shared_mem_bytes (OptionalReg[Int]): Amount of shared memory per thread block.
- attributes (List[LaunchAttribute]): Launch attributes.
- constant_memory (List[ConstantMemoryMapping]): Constant memory mapping.
enqueue_function[*Ts: AnyType](self, f: DeviceExternalFunction, *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, owned attributes: List[LaunchAttribute] = List(), owned constant_memory: List[ConstantMemoryMapping] = List())
Enqueues an external device function for asynchronous execution on the GPU.
This method schedules an external device function to be executed on the GPU with the specified execution configuration. The function and its arguments are passed to the underlying GPU runtime, which will execute them when resources are available.
Example:

```mojo
from gpu.host import DeviceContext, Dim
from gpu.host.device_context import DeviceExternalFunction

# Create a device context and load an external function
with DeviceContext() as ctx:
    var ext_func = DeviceExternalFunction("my_kernel")

    # Enqueue the external function with execution configuration
    ctx.enqueue_function(
        ext_func,
        grid_dim=Dim(16),
        block_dim=Dim(256),
    )

    # Wait for completion
    ctx.synchronize()
```
Parameters:

- *Ts (AnyType): The types of the arguments to be passed to the device function.
Args:

- f (DeviceExternalFunction): The external device function to execute.
- *args (*Ts): The arguments to pass to the device function.
- grid_dim (Dim): The dimensions of the grid (number of thread blocks).
- block_dim (Dim): The dimensions of each thread block (number of threads per block).
- cluster_dim (OptionalReg[Dim]): Optional dimensions for thread block clusters (for newer GPU architectures).
- shared_mem_bytes (OptionalReg[Int]): Optional amount of dynamic shared memory to allocate per block.
- attributes (List[LaunchAttribute]): Optional list of launch attributes for fine-grained control.
- constant_memory (List[ConstantMemoryMapping]): Optional list of constant memory mappings to use during execution.
Raises:
If there's an error enqueuing the function or if the function execution fails.
execution_time
execution_time[: origin.set, //, func: fn(DeviceContext) raises capturing -> None](self, num_iters: Int) -> Int
Measures the execution time of a function that takes a DeviceContext parameter.
This method times the execution of a provided function that requires the DeviceContext as a parameter. It runs the function for the specified number of iterations and returns the total elapsed time in nanoseconds.
Example:

```mojo
from gpu.host import DeviceContext

fn gpu_operation(ctx: DeviceContext) raises capturing [_] -> None:
    # Perform some GPU operation using ctx
    pass

with DeviceContext() as ctx:
    # Measure execution time of a function that uses the context
    var time_ns = ctx.execution_time[gpu_operation](10)
    print("Execution time for 10 iterations:", time_ns, "ns")
```
Parameters:

- func (fn(DeviceContext) raises capturing -> None): A function that takes a DeviceContext parameter to execute and time.
Args:

- num_iters (Int): The number of iterations to run the function.
Returns:
The total elapsed time in nanoseconds for all iterations.
Raises:
If the timer operations fail or if the function raises an exception.
execution_time[: origin.set, //, func: fn() raises capturing -> None](self, num_iters: Int) -> Int
Measures the execution time of a function over multiple iterations.
This method times the execution of a provided function that doesn't require the DeviceContext as a parameter. It runs the function for the specified number of iterations and returns the total elapsed time in nanoseconds.
Example:

```mojo
from gpu.host import DeviceContext

fn some_gpu_operation() raises capturing [_] -> None:
    # Perform some GPU operation
    pass

with DeviceContext() as ctx:
    # Measure execution time of a function over 10 iterations
    var time_ns = ctx.execution_time[some_gpu_operation](10)
    print("Execution time:", time_ns, "ns")
```
Parameters:

- func (fn() raises capturing -> None): A function with no parameters to execute and time.

Args:

- num_iters (Int): The number of iterations to run the function.
Returns:
The total elapsed time in nanoseconds for all iterations.
Raises:
If the timer operations fail or if the function raises an exception.
execution_time_iter
execution_time_iter[: origin.set, //, func: fn(DeviceContext, Int) raises capturing -> None](self, num_iters: Int) -> Int
Measures the execution time of a function that takes iteration index as input.
This method times the execution of a provided function that requires both the DeviceContext and the current iteration index as parameters. It runs the function for the specified number of iterations, passing the iteration index to each call, and returns the total elapsed time in nanoseconds.
Example:

```mojo
from gpu.host import DeviceContext, Dim

var my_kernel = DeviceFunction(...)  # A previously compiled kernel

fn benchmark_kernel(ctx: DeviceContext, i: Int) raises capturing [_] -> None:
    # Run kernel with different parameters based on iteration
    ctx.enqueue_function(my_kernel, grid_dim=Dim(i), block_dim=Dim(256))

with DeviceContext() as ctx:
    # Measure execution time with iteration awareness
    var time_ns = ctx.execution_time_iter[benchmark_kernel](10)
    print("Total execution time:", time_ns, "ns")
```
Parameters:

- func (fn(DeviceContext, Int) raises capturing -> None): A function that takes the DeviceContext and an iteration index.

Args:

- num_iters (Int): The number of iterations to run the function.
Returns:
The total elapsed time in nanoseconds for all iterations.
Raises:
If the timer operations fail or if the function raises an exception.
enqueue_copy
enqueue_copy[type: DType](self, dst_buf: DeviceBuffer[type, address_space, mut, origin], src_ptr: UnsafePointer[SIMD[type, 1]])
Enqueues an async copy from the host to the provided device buffer. The number of bytes copied is determined by the size of the device buffer.
Parameters:

- type (DType): Type of the data being copied.

Args:

- dst_buf (DeviceBuffer[type, address_space, mut, origin]): Device buffer to copy to.
- src_ptr (UnsafePointer[SIMD[type, 1]]): Host pointer to copy from.
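
For example, a minimal host-to-device sketch using this overload (sizes and values are illustrative):

```mojo
from gpu.host import DeviceContext
from memory import UnsafePointer

with DeviceContext() as ctx:
    alias n = 4
    var host_data = UnsafePointer[Float32].alloc(n)
    for i in range(n):
        host_data[i] = Float32(i)

    var dev_buf = ctx.enqueue_create_buffer[DType.float32](n)
    # Host pointer -> device buffer; the copy size comes from dev_buf.
    ctx.enqueue_copy(dev_buf, host_data)
    ctx.synchronize()
    host_data.free()
```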
enqueue_copy[type: DType](self, dst_buf: HostBuffer[type, address_space, mut, origin], src_ptr: UnsafePointer[SIMD[type, 1]])

Enqueues an async copy from a host pointer to the provided host buffer. The number of bytes copied is determined by the size of the buffer.

Parameters:

- type (DType): Type of the data being copied.

Args:

- dst_buf (HostBuffer[type, address_space, mut, origin]): Host buffer to copy to.
- src_ptr (UnsafePointer[SIMD[type, 1]]): Host pointer to copy from.
enqueue_copy[type: DType](self, dst_ptr: UnsafePointer[SIMD[type, 1]], src_buf: DeviceBuffer[type, address_space, mut, origin])
Enqueues an async copy from the device to the host. The number of bytes copied is determined by the size of the device buffer.
Parameters:

- type (DType): Type of the data being copied.

Args:

- dst_ptr (UnsafePointer[SIMD[type, 1]]): Host pointer to copy to.
- src_buf (DeviceBuffer[type, address_space, mut, origin]): Device buffer to copy from.
enqueue_copy[type: DType](self, dst_ptr: UnsafePointer[SIMD[type, 1]], src_buf: HostBuffer[type, address_space, mut, origin])

Enqueues an async copy from the provided host buffer to a host pointer. The number of bytes copied is determined by the size of the buffer.

Parameters:

- type (DType): Type of the data being copied.

Args:

- dst_ptr (UnsafePointer[SIMD[type, 1]]): Host pointer to copy to.
- src_buf (HostBuffer[type, address_space, mut, origin]): Host buffer to copy from.
enqueue_copy[type: DType](self, dst_ptr: UnsafePointer[SIMD[type, 1]], src_ptr: UnsafePointer[SIMD[type, 1]], size: Int)
Enqueues an async copy of size elements from a device pointer to another device pointer.

Parameters:

- type (DType): Type of the data being copied.

Args:

- dst_ptr (UnsafePointer[SIMD[type, 1]]): Device pointer to copy to.
- src_ptr (UnsafePointer[SIMD[type, 1]]): Device pointer to copy from.
- size (Int): Number of elements (of the specified DType) to copy.
enqueue_copy[type: DType](self, dst_buf: DeviceBuffer[type, address_space, mut, origin], src_buf: DeviceBuffer[type, address_space, mut, origin])
Enqueues an async copy from one device buffer to another. The amount of data transferred is determined by the size of the destination buffer.
Parameters:

- type (DType): Type of the data being copied.

Args:

- dst_buf (DeviceBuffer[type, address_space, mut, origin]): Device buffer to copy to.
- src_buf (DeviceBuffer[type, address_space, mut, origin]): Device buffer to copy from. Must be at least as large as dst.
enqueue_copy[type: DType](self, dst_buf: DeviceBuffer[type, address_space, mut, origin], src_buf: HostBuffer[type, address_space, mut, origin])

Enqueues an async copy from a host buffer to a device buffer. The amount of data transferred is determined by the size of the destination buffer.

Parameters:

- type (DType): Type of the data being copied.

Args:

- dst_buf (DeviceBuffer[type, address_space, mut, origin]): Device buffer to copy to.
- src_buf (HostBuffer[type, address_space, mut, origin]): Host buffer to copy from. Must be at least as large as dst.
enqueue_copy[type: DType](self, dst_buf: HostBuffer[type, address_space, mut, origin], src_buf: DeviceBuffer[type, address_space, mut, origin])

Enqueues an async copy from a device buffer to a host buffer. The amount of data transferred is determined by the size of the destination buffer.

Parameters:

- type (DType): Type of the data being copied.

Args:

- dst_buf (HostBuffer[type, address_space, mut, origin]): Host buffer to copy to.
- src_buf (DeviceBuffer[type, address_space, mut, origin]): Device buffer to copy from. Must be at least as large as dst.
enqueue_copy[type: DType](self, dst_buf: HostBuffer[type, address_space, mut, origin], src_buf: HostBuffer[type, address_space, mut, origin])

Enqueues an async copy from one host buffer to another. The amount of data transferred is determined by the size of the destination buffer.

Parameters:

- type (DType): Type of the data being copied.

Args:

- dst_buf (HostBuffer[type, address_space, mut, origin]): Host buffer to copy to.
- src_buf (HostBuffer[type, address_space, mut, origin]): Host buffer to copy from. Must be at least as large as dst.
enqueue_memset
enqueue_memset[type: DType](self, dst: DeviceBuffer[type, address_space, mut, origin], val: SIMD[type, 1])
Enqueues an async memset operation, setting all of the elements in the destination device buffer to the specified value.
Parameters:

- type (DType): Type of the data stored in the buffer.

Args:

- dst (DeviceBuffer[type, address_space, mut, origin]): Destination buffer.
- val (SIMD[type, 1]): Value to set all elements of dst to.
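
For example, a sketch that fills a device buffer with a constant and copies it back to verify (sizes and values are illustrative):

```mojo
from gpu.host import DeviceContext
from memory import UnsafePointer

with DeviceContext() as ctx:
    var dev_buf = ctx.enqueue_create_buffer[DType.int32](16)
    # Set every element to 42 asynchronously.
    ctx.enqueue_memset(dev_buf, 42)

    # Copy back to the host to inspect the result.
    var host_ptr = UnsafePointer[Int32].alloc(16)
    ctx.enqueue_copy(host_ptr, dev_buf)
    ctx.synchronize()
    print(host_ptr[0])  # Prints 42
    host_ptr.free()
```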
enqueue_memset[type: DType](self, dst: HostBuffer[type, address_space, mut, origin], val: SIMD[type, 1])
Enqueues an async memset operation, setting all of the elements in the destination host buffer to the specified value.

Parameters:

- type (DType): Type of the data stored in the buffer.

Args:

- dst (HostBuffer[type, address_space, mut, origin]): Destination buffer.
- val (SIMD[type, 1]): Value to set all elements of dst to.
memset
memset[type: DType](self, dst: DeviceBuffer[type], val: SIMD[type, 1])
Enqueues an async memset operation, setting all of the elements in the destination device buffer to the specified value.
Parameters:

- type (DType): Type of the data stored in the buffer.

Args:

- dst (DeviceBuffer[type]): Destination buffer.
- val (SIMD[type, 1]): Value to set all elements of dst to.
synchronize
synchronize(self)
Blocks until all asynchronous calls on the stream associated with this device context have completed.
This should never be necessary when writing a custom operation.
enqueue_wait_for
enqueue_wait_for(self, other: Self)
Enqueues a wait operation for another device context to complete its work.
This method creates a dependency between two device contexts, ensuring that operations in the current context will not begin execution until all previously enqueued operations in the other context have completed. This is useful for synchronizing work across multiple devices or streams.
Example:

```mojo
from gpu.host import DeviceContext

# Create two device contexts
var ctx1 = DeviceContext(0)  # First GPU
var ctx2 = DeviceContext(1)  # Second GPU

# Enqueue operations on ctx1
# ...

# Make ctx2 wait for ctx1 to complete before proceeding
ctx2.enqueue_wait_for(ctx1)

# Enqueue operations on ctx2 that depend on ctx1's completion
# ...
```
Args:

- other (Self): The device context whose operations must complete before operations in this context can proceed.
Raises:
If there's an error enqueuing the wait operation or if the operation is not supported by the underlying device API.
get_driver_version
get_driver_version(self) -> Int
Returns the driver version associated with this device.
This method retrieves the version number of the GPU driver currently installed on the system for the device associated with this context. The version is returned as an integer that can be used to check compatibility with specific features or to troubleshoot driver-related issues.
Example:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    # Get the driver version
    var driver_version = ctx.get_driver_version()
    print("GPU driver version:", driver_version)
```
Returns:
An integer representing the driver version.
Raises:
If the driver version cannot be retrieved or if the device context is invalid.
get_attribute
get_attribute(self, attr: DeviceAttribute) -> Int
Returns the specified attribute for this device.
Args:

- attr (DeviceAttribute): The device attribute to query.
Returns:

The value for attr on this device.
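
For example, a sketch querying the maximum number of threads per block. This assumes DeviceAttribute is importable from gpu.host and exposes a MAX_THREADS_PER_BLOCK value, mirroring the CUDA-style attribute set:

```mojo
from gpu.host import DeviceContext, DeviceAttribute

var ctx = DeviceContext()
# MAX_THREADS_PER_BLOCK is assumed here for illustration.
var max_threads = ctx.get_attribute(DeviceAttribute.MAX_THREADS_PER_BLOCK)
print("Max threads per block:", max_threads)
```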
is_compatible
is_compatible(self)
Returns True if this device is compatible with MAX.
This method checks whether the current device is compatible with the Modular Accelerated Execution (MAX) runtime. It's useful for validating that the device can execute the compiled code before attempting operations.
Example:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
try:
    ctx.is_compatible()  # Verify compatibility
    # Continue with device operations
except:
    print("Device is not compatible with MAX")
```
Raises:
If the device is not compatible with MAX.
id
id(self) -> SIMD[int64, 1]
Returns the ID associated with this device.
This method retrieves the unique identifier for the current device. Device IDs are used to distinguish between multiple devices in a system and are often needed for multi-GPU programming.
Example:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
try:
    var device_id = ctx.id()
    print("Using device with ID:", device_id)
except:
    print("Failed to get device ID")
```
Returns:
The unique device ID as an Int64.
Raises:
If there's an error retrieving the device ID.
get_memory_info
get_memory_info(self) -> Tuple[UInt, UInt]
Returns the free and total memory size for this device.
This method queries the current state of device memory, providing information about how much memory is available and the total memory capacity of the device. This is useful for memory management and determining if there's enough space for planned operations.
Example:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
try:
    (free, total) = ctx.get_memory_info()
    print("Free memory:", free / (1024 * 1024), "MB")
    print("Total memory:", total / (1024 * 1024), "MB")
except:
    print("Failed to get memory information")
```
Returns:
A tuple of (free memory, total memory) in bytes.
Raises:
If there's an error retrieving the memory information.
can_access
can_access(self, peer: Self) -> Bool
Returns True if this device can access the identified peer device.
This method checks whether the current device can directly access memory on the specified peer device. Peer-to-peer access allows for direct memory transfers between devices without going through host memory, which can significantly improve performance in multi-GPU scenarios.
Example:

```mojo
from gpu.host import DeviceContext

var ctx1 = DeviceContext(0)  # First GPU
var ctx2 = DeviceContext(1)  # Second GPU
try:
    if ctx1.can_access(ctx2):
        print("Direct peer access is possible")
        ctx1.enable_peer_access(ctx2)
    else:
        print("Direct peer access is not supported")
except:
    print("Failed to check peer access capability")
```
Args:

- peer (Self): The peer device to check for accessibility.
Returns:
True if the current device can access the peer device, False otherwise.
Raises:
If there's an error checking peer access capability.
enable_peer_access
enable_peer_access(self, peer: Self)
Enables direct memory access to the peer device.
This method establishes peer-to-peer access from the current device to the specified peer device. Once enabled, the current device can directly read from and write to memory allocated on the peer device without going through host memory, which can significantly improve performance for multi-GPU operations.
Notes:

- It's recommended to call can_access() first to check if peer access is possible.
- Peer access is not always symmetric; you may need to enable access in both directions.
Example:

```mojo
from gpu.host import DeviceContext

var ctx1 = DeviceContext(0)  # First GPU
var ctx2 = DeviceContext(1)  # Second GPU
try:
    if ctx1.can_access(ctx2):
        ctx1.enable_peer_access(ctx2)
        print("Peer access enabled from device 0 to device 1")

        # For bidirectional access
        if ctx2.can_access(ctx1):
            ctx2.enable_peer_access(ctx1)
            print("Peer access enabled from device 1 to device 0")
    else:
        print("Peer access not supported between these devices")
except:
    print("Failed to enable peer access")
```
Args:

- peer (Self): The peer device to enable access to.
Raises:
If there's an error enabling peer access or if peer access is not supported between the devices.
supports_multicast
supports_multicast(self) -> Bool
Returns True if this device supports multicast memory mappings.
Returns:
True if the current device supports multicast memory, False otherwise.
Raises:

If there's an error checking multicast support.
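
A short sketch of the capability check:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
if ctx.supports_multicast():
    print("Multicast memory mappings are supported")
```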
number_of_devices
static number_of_devices(*, api: String = String(device_api)) -> Int
Returns the number of devices available that support the specified API.
This function queries the system for available devices that support the requested API (such as CUDA or HIP). It's useful for determining how many accelerators are available before allocating resources or distributing work.
Example:

```mojo
from gpu.host import DeviceContext

# Get number of CUDA devices
var num_cuda_devices = DeviceContext.number_of_devices(api="cuda")

# Get number of devices for the default API
var num_devices = DeviceContext.number_of_devices()
```
Args:

- api (String): Requested device API (for example, "cuda" or "hip"). Defaults to the device API specified by the DeviceContext class.
Returns:
The number of available devices supporting the specified API.