Mojo module
intrinsics
Provides low-level GPU intrinsic operations and memory access primitives.
Implements hardware-specific intrinsics that map directly to GPU assembly instructions. Most target NVIDIA GPU architectures; a few, such as AMDBufferResource and the permlane operations, are AMD-specific. Includes:
- Global memory load/store operations with cache control
- Warp-level primitives and synchronization
- Memory fence and barrier operations
- Atomic operations and memory ordering primitives
Use these primitives with care: they correspond directly to hardware instructions and require an understanding of the underlying GPU architecture.
Structs
- AMDBufferResource: 128-bit descriptor for a buffer resource on AMD GPUs.
- Scope: Represents memory synchronization scope levels for GPU memory operations.
Functions
- byte_permute: Permutes bytes from two 32-bit integers based on a control mask.
- ds_read_tr16_b64: Reads a 64-bit LDS transpose block using TR16 layout and returns SIMD[dtype, 4] of 16-bit types.
- get_ib_sts: Returns the IB status of the current thread.
- ldg: Load data from global memory through the non-coherent cache.
- load_acquire: Performs an atomic load operation with acquire memory ordering semantics.
- load_relaxed: Performs an atomic load with relaxed memory ordering semantics.
- load_volatile: Performs a volatile load operation that cannot be optimized away.
- lop: Performs an arbitrary logical operation on 3 inputs using a lookup table.
- mulhi: Calculates the most significant 32 bits of the product of two 16-bit unsigned integers.
- mulwide: Performs a wide multiplication of two 32-bit unsigned integers.
- permlane_shuffle: Shuffles SIMD values across lanes using AMD permlane operations.
- permlane_swap: Swaps values between lanes using AMD permlane swap instruction.
- store_relaxed: Performs an atomic store with relaxed memory ordering semantics.
- store_release: Performs an atomic store with release memory ordering semantics.
- store_volatile: Performs a volatile store operation that cannot be optimized away.
- threadfence: Enforces ordering of memory operations across threads.
- warpgroup_reg_alloc: Allocates additional registers for the executing warp group.
- warpgroup_reg_dealloc: Deallocates additional registers for the executing warp group.
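The bit-level behavior of byte_permute, lop, and mulwide can be easier to follow with a host-side reference model. The sketch below is plain Python, not the Mojo API: the function names (`byte_permute_model`, `lop_model`, `mulwide_model`) are hypothetical, and the semantics are assumptions based on the NVIDIA instructions these intrinsics most likely map to (`prmt`, `lop3`, and `mul.wide`). In particular, the byte-permute model assumes only the low 3 bits of each selector nibble are used, as the CUDA `__byte_perm` documentation describes, and the lookup-table model assumes the `lop3` convention in which the table index is built from the `a`, `b`, and `c` bits in that order.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def byte_permute_model(a: int, b: int, selector: int) -> int:
    """Assumed model: pick 4 bytes out of the 8 bytes formed by {b, a}."""
    # Bytes 0-3 come from a, bytes 4-7 come from b.
    pool = ((b & MASK32) << 32) | (a & MASK32)
    result = 0
    for i in range(4):
        # Each selector nibble chooses one source byte (low 3 bits only).
        index = (selector >> (4 * i)) & 0x7
        byte = (pool >> (8 * index)) & 0xFF
        result |= byte << (8 * i)
    return result


def lop_model(a: int, b: int, c: int, lut: int) -> int:
    """Assumed model: per-bit 3-input logic op driven by an 8-bit table."""
    result = 0
    for bit in range(32):
        # Index into the table with the (a, b, c) bits at this position.
        index = (
            (((a >> bit) & 1) << 2)
            | (((b >> bit) & 1) << 1)
            | ((c >> bit) & 1)
        )
        result |= ((lut >> index) & 1) << bit
    return result


def mulwide_model(a: int, b: int) -> int:
    """Assumed model: full 64-bit product of two unsigned 32-bit values."""
    return ((a & MASK32) * (b & MASK32)) & MASK64
```

Under these assumptions, a table value for `lop_model` can be derived by applying the desired expression to the constants 0xF0, 0xCC, and 0xAA in place of `a`, `b`, and `c`. For example, `0xF0 ^ 0xCC ^ 0xAA` gives 0x96, the table for a three-way XOR. Check the actual signatures and semantics against the individual function pages before relying on this model.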