Mojo struct

DeviceStream

struct DeviceStream

Represents a CUDA/HIP stream for asynchronous GPU operations.

A DeviceStream provides a queue for GPU operations that can execute concurrently with operations in other streams. Operations within a single stream execute in the order they are issued, but operations in different streams may execute in any relative order or concurrently.

This abstraction improves GPU utilization by allowing computation and data transfers to overlap.

Example:

from gpu.host import DeviceContext, DeviceStream
var ctx = DeviceContext(0)  # Select first GPU
var stream = DeviceStream(ctx)

# Launch operations on the stream
# ...

# Wait for all operations in the stream to complete
stream.synchronize()
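
Because operations in different streams may overlap, one stream can run kernels while another performs transfers. The following is a minimal sketch assuming two streams with independent work (the enqueued operations themselves are elided):

from gpu.host import DeviceContext

var ctx = DeviceContext(0)
var stream_a = ctx.create_stream()
var stream_b = ctx.create_stream()

# Enqueue independent work on each stream; the two streams
# may execute concurrently on the device
# ...

# Wait for both streams to drain
stream_a.synchronize()
stream_b.synchronize()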

Implemented traits

AnyType, Copyable, ExplicitlyCopyable, Movable, UnknownDestructibility

Aliases

__copyinit__is_trivial

alias __copyinit__is_trivial = False

__del__is_trivial

alias __del__is_trivial = False

__moveinit__is_trivial

alias __moveinit__is_trivial = False

Methods

synchronize

synchronize(self)

Blocks the calling CPU thread until all operations in this stream complete.

This function waits until all previously issued commands in this stream have completed execution. It provides a synchronization point between host and device code.

Example:

from gpu.host import DeviceContext

var ctx = DeviceContext()
var stream = ctx.stream()

# Launch kernel or memory operations on the stream
# ...

# Wait for completion
stream.synchronize()

# Now it's safe to use results on the host

Raises:

If synchronization fails.

enqueue_wait_for

enqueue_wait_for(self, event: DeviceEvent)

Makes this stream wait for the specified event.

This function inserts a wait operation into this stream that will block all subsequent operations in the stream until the specified event has been recorded and completed.
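
Example (a minimal sketch; it assumes the event is created and recorded on another stream, as in the record_event example below):

from gpu.host import DeviceContext

var ctx = DeviceContext()
var producer = ctx.stream()
var consumer = ctx.create_stream()

var event = producer.create_event()
# ... enqueue work on the producer stream ...
producer.record_event(event)

# Operations enqueued on consumer after this call will not
# begin until the event has completed
consumer.enqueue_wait_for(event)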

Args:

  • event (DeviceEvent): The event to wait for.

Raises:

If the wait operation fails.

record_event

record_event(self, event: DeviceEvent)

Records an event in this stream.

This function records the given event at the current point in this stream. All operations in the stream that were enqueued before this call will complete before the event is triggered.

Example:

from gpu.host import DeviceContext

var ctx = DeviceContext()

var default_stream = ctx.stream()
var new_stream = ctx.create_stream()

# Create an event on the default stream
var event = default_stream.create_event()

# Record the event after the work already enqueued on the
# default stream; the event completes when that work finishes
default_stream.record_event(event)

# The new stream waits here until the event has completed
new_stream.enqueue_wait_for(event)

Args:

  • event (DeviceEvent): The event to record.

Raises:

If event recording fails.

enqueue_function

enqueue_function[*Ts: AnyType](self, f: DeviceFunction[func, declared_arg_types, target=target, compile_options=compile_options, _ptxas_info_verbose=_ptxas_info_verbose], *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List[LaunchAttribute](), var constant_memory: List[ConstantMemoryMapping] = List[ConstantMemoryMapping]())

Enqueues a compiled function for execution on this stream.

You can pass the function directly to enqueue_function without compiling it first (the examples below use the equivalent method on DeviceContext):

from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()

If you reuse the same function and parameters multiple times, this form incurs 50-500 nanoseconds of overhead per enqueue, so compile the function once up front to remove that overhead:

from gpu.host import DeviceContext

with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.synchronize()

Parameters:

  • *Ts (AnyType): The types of the arguments being passed to the function.

Args:

  • f (DeviceFunction): The compiled function to execute.
  • *args (*Ts): Arguments to pass to the function.
  • grid_dim (Dim): Dimensions of the compute grid, made up of thread blocks.
  • block_dim (Dim): Dimensions of each thread block in the grid.
  • cluster_dim (OptionalReg): Dimensions of clusters (if the thread blocks are grouped into clusters).
  • shared_mem_bytes (OptionalReg): Amount of shared memory per thread block.
  • attributes (List): Launch attributes.
  • constant_memory (List): Constant memory mapping.
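
As a sketch of passing arguments, the following assumes a trivial kernel plus the DeviceContext buffer helpers (enqueue_create_buffer and unsafe_ptr); the kernel name and sizes are illustrative:

from gpu import thread_idx
from gpu.host import DeviceContext
from memory import UnsafePointer

fn fill(data: UnsafePointer[Float32], value: Float32):
    # Each thread writes the value at its own index
    data[thread_idx.x] = value

with DeviceContext() as ctx:
    var buf = ctx.enqueue_create_buffer[DType.float32](16)
    var compiled = ctx.compile_function[fill]()
    # Positional args follow the compiled function;
    # launch dimensions are keyword-only
    ctx.enqueue_function(compiled, buf.unsafe_ptr(), Float32(2.0),
                         grid_dim=1, block_dim=16)
    ctx.synchronize()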
