Mojo struct
struct DeviceStream
Represents a CUDA/HIP stream for asynchronous GPU operations.
A DeviceStream provides a queue for GPU operations that can execute concurrently with operations in other streams. Operations within a single stream execute in the order they are issued, but operations in different streams may execute in any relative order or concurrently.
This abstraction allows for better utilization of GPU resources by enabling overlapping of computation and data transfers.
Example:
from gpu.host import DeviceContext, DeviceStream
var ctx = DeviceContext(0) # Select first GPU
var stream = DeviceStream(ctx)
# Launch operations on the stream
# ...
# Wait for all operations in the stream to complete
stream.synchronize()
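For example, work issued to two different streams may overlap. The following is a minimal sketch (it assumes ctx.stream() and ctx.create_stream(), which appear in the record_event example below):
from gpu.host import DeviceContext

var ctx = DeviceContext()
var stream_a = ctx.stream()         # the context's default stream
var stream_b = ctx.create_stream()  # an independent stream

# Operations enqueued on stream_a and stream_b may execute
# concurrently; within each stream they still run in issue order.
# ... enqueue kernels or copies on stream_a ...
# ... enqueue kernels or copies on stream_b ...

stream_a.synchronize()
stream_b.synchronize()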
Implemented traits
AnyType, Copyable, ExplicitlyCopyable, Movable, UnknownDestructibility
Aliases
__copyinit__is_trivial
alias __copyinit__is_trivial = False
__del__is_trivial
alias __del__is_trivial = False
__moveinit__is_trivial
alias __moveinit__is_trivial = False
Methods
synchronize
synchronize(self)
Blocks the calling CPU thread until all operations in this stream complete.
This function waits until all previously issued commands in this stream have completed execution. It provides a synchronization point between host and device code.
Example:
from gpu.host import DeviceContext

var ctx = DeviceContext()
var stream = ctx.stream()
# Launch kernel or memory operations on the stream
# ...
# Wait for completion
stream.synchronize()
# Now it's safe to use results on the host
Raises:
If synchronization fails.
enqueue_wait_for
enqueue_wait_for(self, event: DeviceEvent)
Makes this stream wait for the specified event.
This function inserts a wait operation into this stream that will block all subsequent operations in the stream until the specified event has been recorded and completed.
Args:
- event (DeviceEvent): The event to wait for.
Raises:
If the wait operation fails.
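Example:
A minimal sketch of a cross-stream dependency, mirroring the record_event example below (it assumes the create_stream() and create_event() helpers shown there):
from gpu.host import DeviceContext

var ctx = DeviceContext()
var producer = ctx.stream()
var consumer = ctx.create_stream()

# ... enqueue work on `producer` that the consumer depends on ...
var event = producer.create_event()

# Nothing enqueued on `consumer` after this call starts until
# `event` has been recorded and completed.
consumer.enqueue_wait_for(event)
# ... enqueue dependent work on `consumer` ...

producer.record_event(event)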
record_event
record_event(self, event: DeviceEvent)
Records an event in this stream.
This function records the given event at the current point in this stream. All operations in the stream that were enqueued before this call will complete before the event is triggered.
Example:
from gpu.host import DeviceContext
var ctx = DeviceContext()
var default_stream = ctx.stream()
var new_stream = ctx.create_stream()
# Create an event on the default stream
var event = default_stream.create_event()
# Make the new stream wait for the event
new_stream.enqueue_wait_for(event)
# Record the event; once it completes, new_stream can continue
default_stream.record_event(event)
Args:
- event (DeviceEvent): The event to record.
Raises:
If event recording fails.
enqueue_function
enqueue_function[*Ts: AnyType](self, f: DeviceFunction[func, declared_arg_types, target=target, compile_options=compile_options, _ptxas_info_verbose=_ptxas_info_verbose], *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List[LaunchAttribute](), var constant_memory: List[ConstantMemoryMapping] = List[ConstantMemoryMapping]())
Enqueues a compiled function for execution on this stream.
You can pass the function directly to enqueue_function without compiling it first:
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile the function first to remove the overhead:
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
Parameters:
- *Ts (AnyType): Argument dtypes.
Args:
- f (DeviceFunction): The compiled function to execute.
- *args (*Ts): Arguments to pass to the function.
- grid_dim (Dim): Dimensions of the compute grid, made up of thread blocks.
- block_dim (Dim): Dimensions of each thread block in the grid.
- cluster_dim (OptionalReg[Dim]): Dimensions of clusters (if the thread blocks are grouped into clusters).
- shared_mem_bytes (OptionalReg[Int]): Amount of shared memory per thread block.
- attributes (List[LaunchAttribute]): Launch attributes.
- constant_memory (List[ConstantMemoryMapping]): Constant memory mapping.
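Since the signature above takes a DeviceFunction, the precompiled pattern also works on a specific stream. A minimal sketch (assuming ctx.create_stream() as shown in the record_event example):
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    var stream = ctx.create_stream()
    var compiled_func = ctx.compile_function[kernel]()
    # Launch on this stream instead of the context's default stream.
    stream.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    stream.synchronize()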