Mojo module

intrinsics

Provides low-level GPU intrinsic operations and memory access primitives.

Implements hardware-specific intrinsics that map directly to GPU assembly instructions, focusing on NVIDIA GPU architectures (a few entries, such as make_buffer_resource, target AMD GPUs). Includes:

Global memory load/store operations with cache control
Warp-level primitives and synchronization
Memory fence and barrier operations
Atomic operations and memory ordering primitives

Use these primitives with care: they correspond directly to hardware instructions, and using them correctly requires an understanding of the underlying GPU architecture.

Structs

Scope: Represents memory synchronization scope levels for GPU memory operations.

Functions

buffer_load: Loads data from global memory into a SIMD register.
buffer_load_store_lds: Loads four bytes from global memory and writes them to shared memory.
buffer_store: Stores a register variable to global memory.
byte_permute: Permutes bytes from two 32-bit integers based on a control mask.
ldg: Loads data from global memory through the non-coherent cache.
load_acquire: Performs an atomic load operation with acquire memory ordering semantics.
load_volatile: Performs a volatile load operation that cannot be optimized away.
lop: Performs an arbitrary logical operation on 3 inputs using a lookup table.
make_buffer_resource: Creates a 128-bit buffer resource descriptor for AMD GPU buffer operations.
mulhi: Calculates the most significant 32 bits of the product of two 16-bit unsigned integers.
mulwide: Performs a wide multiplication of two 32-bit unsigned integers.
store_release: Performs an atomic store with release memory ordering semantics.
store_volatile: Performs a volatile store operation that cannot be optimized away.
threadfence: Enforces ordering of memory operations across threads.
warpgroup_reg_alloc: Allocates additional registers for the executing warp group.
warpgroup_reg_dealloc: Deallocates additional registers for the executing warp group.
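
The one-line summaries of lop and byte_permute are terse, so the following host-side Python sketch models what such operations conventionally compute. It is not the Mojo API: the function names lop3_reference and byte_permute_reference are hypothetical, and the semantics are assumed to follow the NVIDIA LOP3 and PRMT (default mode) instructions. The sign-replication bit of each PRMT selector nibble is not modeled.

```python
MASK32 = 0xFFFFFFFF


def lop3_reference(a: int, b: int, c: int, lut: int) -> int:
    """Model of a 3-input lookup-table logic op (assumed LOP3 semantics).

    For every bit position, the bits of a, b, and c form a 3-bit index
    (a is the most significant bit), and the result bit is the bit of the
    8-bit lookup table `lut` at that index.
    """
    result = 0
    for bit in range(32):
        index = (
            (((a >> bit) & 1) << 2)
            | (((b >> bit) & 1) << 1)
            | ((c >> bit) & 1)
        )
        result |= ((lut >> index) & 1) << bit
    return result & MASK32


def byte_permute_reference(a: int, b: int, control: int) -> int:
    """Model of a byte permute over two 32-bit inputs (assumed PRMT default mode).

    The eight source bytes are numbered 0..3 from `a` and 4..7 from `b`,
    least significant byte first. Each of the four low nibbles of `control`
    selects the source byte for one result byte. Only the low 3 bits of each
    nibble are used here (assumption: sign replication is ignored).
    """
    source = [(a >> (8 * i)) & 0xFF for i in range(4)]
    source += [(b >> (8 * i)) & 0xFF for i in range(4)]
    result = 0
    for i in range(4):
        selector = (control >> (4 * i)) & 0x7
        result |= source[selector] << (8 * i)
    return result & MASK32


# A lookup table is built by applying the desired expression to the
# constants 0xF0, 0xCC, and 0xAA, which stand in for a, b, and c.
LUT_AND = 0xF0 & 0xCC & 0xAA  # a & b & c
LUT_XOR = 0xF0 ^ 0xCC ^ 0xAA  # a ^ b ^ c
```

Under these assumptions, lop3_reference(x, y, z, LUT_XOR) equals x ^ y ^ z for any 32-bit inputs, and byte_permute_reference(0x33221100, 0x77665544, 0x0123) reverses the bytes of the first operand. Check the actual Mojo signatures and the exact selector behavior against the individual function pages before relying on this model.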
