Mojo struct
SHMEMContext
struct SHMEMContext[tcp: Bool = False]
Usable as a context manager to run kernels on a GPU with SHMEM support. On exit, it finalizes SHMEM and cleans up resources.
Example:
from shmem import SHMEMContext
with SHMEMContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
Implemented traits
AnyType,
Copyable,
ImplicitlyCopyable,
ImplicitlyDestructible,
Movable
Methods
__init__
__init__(out self, team: Int32 = Int32(2)) where (tcp == False)
Initializes a device context with SHMEM support.
This constructor initializes MPI and SHMEM, and creates a device context for the current PE's assigned GPU device.
Warning: if you're not using this as a context manager, you must call SHMEMContext.finalize() manually.
Raises:
If initialization fails.
__init__(out self, ctx: DeviceContext) where (tcp == False)
Initializes a device context with SHMEM support, using one thread per GPU.
This constructor expects MPI to have already been initialized in the main thread. It then initializes SHMEM and creates a device context for the associated PE on this node.
Warning: if you're not using this as a context manager, you must call SHMEMContext.finalize() manually.
Raises:
If initialization fails.
__init__(out self, ctx: DeviceContext, node_id: Int = -1, total_nodes: Int = -1, gpus_per_node: Int = -1, server_ip: String = "-1", server_port: Int = -1) where tcp
Initializes a device context with SHMEM support, using one thread per GPU and TCP bootstrapping with a unique ID.
Warning: if you're not using this as a context manager, you must call SHMEMContext.finalize() manually.
Raises:
If initialization fails.
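The warning above can be illustrated with a minimal sketch that manages the context without a with statement. It uses only the no-argument constructor, barrier_all(), and finalize() as documented on this page. The main function, its raises annotation, and a launch environment in which MPI and SHMEM can initialize are assumptions, not part of the documented API.

```mojo
from shmem import SHMEMContext


def main() raises:
    # Assumption: the process is launched so that MPI and SHMEM can initialize.
    # `var` is needed because finalize() takes `mut self`.
    var ctx = SHMEMContext()

    # Wait until every PE has reached this point.
    ctx.barrier_all()

    # No `with` block is used here, so finalize SHMEM and MPI explicitly.
    ctx.finalize()
```

Prefer the with form when possible, since it removes the need to remember the explicit finalize() call.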
__del__
__del__(deinit self)
Context manager exit method.
Automatically finalizes SHMEM when exiting the context.
__enter__
__enter__(var self) -> Self
Context manager entry method.
Returns:
Self: Self for use in with statements.
finalize
finalize(mut self)
Finalizes the SHMEM runtime environment.
Cleans up SHMEM and MPI resources.
Raises:
If SHMEM or MPI finalization fails.
barrier_all
barrier_all(self)
Performs a barrier synchronization across all PEs.
All PEs must call this function before any PE can proceed past the barrier.
Raises:
If the barrier operation fails.
enqueue_create_buffer
enqueue_create_buffer[dtype: DType](self, size: Int) -> SHMEMBuffer[dtype]
Creates a SHMEM buffer that can be accessed by all PEs.
Parameters:
- dtype (DType): The data type of elements in the buffer.
Args:
- size (Int): Number of elements in the buffer.
Returns:
SHMEMBuffer[dtype]: A SHMEMBuffer instance for the allocated memory.
Raises:
String: If buffer creation fails.
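As a rough illustration of the signature above, the following sketch allocates a SHMEM buffer inside a with block and then synchronizes the PEs. Only enqueue_create_buffer, barrier_all, and synchronize from this page are used. The element type DType.float32, the element count 1024, and the main wrapper are arbitrary choices made for the example.

```mojo
from shmem import SHMEMContext


def main() raises:
    with SHMEMContext() as ctx:
        # dtype is a compile-time parameter; size is a count of elements,
        # not bytes (1024 float32 elements here, chosen for illustration).
        var buf = ctx.enqueue_create_buffer[DType.float32](1024)

        # Make sure every PE has reached this point before continuing.
        ctx.barrier_all()

        # Block until all work enqueued on this context's stream has finished.
        ctx.synchronize()
```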
enqueue_function
enqueue_function[declared_arg_types: TypeList[declared_arg_types.values], //, func: def(*args: *declared_arg_types) -> None, *actual_arg_types: DevicePassable, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path] = False, _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path] = False, _ptxas_info_verbose: Bool = False](self, *args: *actual_arg_types.values, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List(__list_literal__=NoneType(None)), var constant_memory: List[ConstantMemoryMapping] = List(__list_literal__=NoneType(None)), func_attribute: OptionalReg[FuncAttribute] = None)
Compiles and enqueues a kernel for execution on this device.
You can pass the function directly to enqueue_function without compiling it first:
from shmem import SHMEMContext
def kernel():
    print("hello from the GPU")
with SHMEMContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
Parameters:
- declared_arg_types (TypeList[declared_arg_types.values]): The declared argument types from the function signature (usually inferred).
- func (def(*args: *declared_arg_types) -> None): The function to launch.
- *actual_arg_types (DevicePassable): The types of the arguments being passed (usually inferred).
- dump_asm (Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path]): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
- dump_llvm (Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path]): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
- _dump_sass (Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path]): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass True, or a file path to dump to, or a function returning a file path.
- _ptxas_info_verbose (Bool): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes dump_asm to output verbose PTX assembly (default False).
Args:
- *args (*actual_arg_types.values): Variadic arguments which are passed to the func.
- grid_dim (Dim): The grid dimensions.
- block_dim (Dim): The block dimensions.
- cluster_dim (OptionalReg[Dim]): The cluster dimensions.
- shared_mem_bytes (OptionalReg[Int]): Amount of per-block shared memory, in bytes, shared among the threads of a block.
- attributes (List[LaunchAttribute]): A List of launch attributes.
- constant_memory (List[ConstantMemoryMapping]): A List of constant memory mappings.
- func_attribute (OptionalReg[FuncAttribute]): CUfunction_attribute enum.
enqueue_function_collective_checked
enqueue_function_collective_checked[declared_arg_types: TypeList[declared_arg_types.values], //, func: def(*args: *declared_arg_types) -> None, *actual_arg_types: DevicePassable, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path] = False, _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path] = False, _ptxas_info_verbose: Bool = False](self, *args: *actual_arg_types.values, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List(__list_literal__=NoneType(None)), var constant_memory: List[ConstantMemoryMapping] = List(__list_literal__=NoneType(None)), func_attribute: OptionalReg[FuncAttribute] = None)
Compiles and enqueues a kernel for execution on this device.
You can pass the function directly to enqueue_function without compiling it first:
from std.gpu.host import DeviceContext
def kernel():
    print("hello from the GPU")
with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
If you reuse the same function and parameters multiple times, each enqueue incurs 50-500 nanoseconds of overhead, which compiling the function first removes. The following snippet enqueues the same kernel twice:
with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
Parameters:
- declared_arg_types (TypeList[declared_arg_types.values]): The declared argument types from the function signature (usually inferred).
- func (def(*args: *declared_arg_types) -> None): The function to launch.
- *actual_arg_types (DevicePassable): The types of the arguments being passed (usually inferred).
- dump_asm (Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path]): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
- dump_llvm (Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path]): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
- _dump_sass (Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path]): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass True, or a file path to dump to, or a function returning a file path.
- _ptxas_info_verbose (Bool): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes dump_asm to output verbose PTX assembly (default False).
Args:
- *args (*actual_arg_types.values): Variadic arguments which are passed to the func.
- grid_dim (Dim): The grid dimensions.
- block_dim (Dim): The block dimensions.
- cluster_dim (OptionalReg[Dim]): The cluster dimensions.
- shared_mem_bytes (OptionalReg[Int]): Amount of per-block shared memory, in bytes, shared among the threads of a block.
- attributes (List[LaunchAttribute]): A List of launch attributes.
- constant_memory (List[ConstantMemoryMapping]): A List of constant memory mappings.
- func_attribute (OptionalReg[FuncAttribute]): CUfunction_attribute enum.
synchronize
synchronize(self)
Blocks until all asynchronous calls on the stream associated with this device context have completed.
Raises:
If synchronization fails.
get_device_context
get_device_context(self) -> DeviceContext
Returns the device context associated with this SHMEMContext.
Returns:
DeviceContext: The device context associated with this SHMEMContext.
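A minimal sketch of retrieving the underlying device context follows, for code that expects a plain DeviceContext. The main wrapper is an assumption for illustration, and the call to DeviceContext.synchronize() stands in for whatever DeviceContext operation the caller actually needs.

```mojo
from shmem import SHMEMContext


def main() raises:
    with SHMEMContext() as ctx:
        # Retrieve the DeviceContext that this SHMEMContext wraps.
        var device_ctx = ctx.get_device_context()

        # Use it like any other DeviceContext, for example to wait for
        # outstanding work on the device.
        device_ctx.synchronize()
```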
number_of_devices
static number_of_devices(*, api: String = DeviceContext.default_device_info.api) -> Int
Returns the number of devices available that support the specified API.
This function queries the system for available devices that support the requested API (such as CUDA or HIP). It's useful for determining how many accelerators are available before allocating resources or distributing work.
Args:
- api (String): Requested device API (for example, "cuda" or "hip"). Defaults to the device API specified by the current target accelerator.
Returns:
Int: The number of available devices supporting the specified API.