Mojo module

intrinsics

This module provides low-level GPU intrinsic operations and memory access primitives.

The module implements hardware-specific intrinsic operations that map directly to GPU assembly instructions, with a focus on NVIDIA GPU architectures. It includes:

  • Global memory load/store operations with cache control
  • Warp-level primitives and synchronization
  • Memory fence and barrier operations
  • Atomic operations and memory ordering primitives

Use these primitives with care: they correspond directly to hardware instructions, so using them correctly requires an understanding of the underlying GPU architecture.

Structs

  • Scope: Represents memory synchronization scope levels for GPU memory operations.

Functions

  • buffer_load: Loads a register variable from global memory.
  • buffer_load_store_lds: Loads four bytes from global memory and writes them to shared memory.
  • buffer_store: Stores a register variable to global memory.
  • byte_permute: Permutes bytes from two 32-bit integers based on a control mask.
  • ldg: Loads a register variable from the global state space via the non-coherent cache.
  • load_acquire: Performs an atomic load operation with acquire memory ordering semantics.
  • load_volatile: Performs a volatile load operation that cannot be optimized away.
  • lop: Performs an arbitrary logical operation on 3 input values using a lookup table.
  • make_buffer_resource: Creates a 128-bit buffer constant for buffer IO.
  • mulhi: Calculates the most significant 32 bits of the product of two 16-bit unsigned integers.
  • mulwide: Performs a wide multiplication of two 32-bit unsigned integers.
  • store_release: Performs an atomic store operation with release memory ordering semantics.
  • store_volatile: Performs a volatile store operation that cannot be optimized away.
  • threadfence: Enforces ordering of memory operations across threads.
  • warpgroup_reg_alloc: Allocates additional registers for the executing warp group.
  • warpgroup_reg_dealloc: Deallocates additional registers for the executing warp group.
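
The bit-manipulation entries above (byte_permute, lop, mulwide) are easier to reason about with a small software model. The following is a minimal sketch in plain Python, not the Mojo API: the function names `byte_permute_model`, `lop3_model`, and `mulwide_model` are hypothetical, and the semantics are assumed to follow the NVIDIA PTX `prmt` (default mode), `lop3`, and `mul.wide.u32` instructions. The sign-replication bit of each `prmt` selector nibble is deliberately not modelled.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def byte_permute_model(a: int, b: int, control: int) -> int:
    """Model of a byte permute over two 32-bit values.

    Assumption: bytes 0-3 come from `a`, bytes 4-7 from `b`, and the
    low three bits of each 4-bit selector nibble in `control` pick the
    source byte for the corresponding result byte. The high bit of each
    nibble (sign replication in PTX `prmt`) is ignored here.
    """
    pool = ((b & MASK32) << 32) | (a & MASK32)  # 8 source bytes
    result = 0
    for i in range(4):
        selector = (control >> (4 * i)) & 0x7
        byte = (pool >> (8 * selector)) & 0xFF
        result |= byte << (8 * i)
    return result


def lop3_model(a: int, b: int, c: int, lut: int) -> int:
    """Model of a 3-input logical operation driven by an 8-bit lookup table.

    Assumption: for each bit position, the three input bits form an
    index (a is the most significant bit) into `lut`, and the selected
    table bit becomes the result bit.
    """
    result = 0
    for bit in range(32):
        index = (
            (((a >> bit) & 1) << 2)
            | (((b >> bit) & 1) << 1)
            | ((c >> bit) & 1)
        )
        result |= ((lut >> index) & 1) << bit
    return result


def mulwide_model(a: int, b: int) -> int:
    """Model of a widening multiply: two 32-bit unsigned inputs,
    one 64-bit unsigned result (assumed result width)."""
    return ((a & MASK32) * (b & MASK32)) & MASK64


# Reversing the bytes of `a` with selector nibbles 3, 2, 1, 0 (low to high).
assert byte_permute_model(0x33221100, 0x77665544, 0x0123) == 0x00112233

# A lookup table is built by applying the desired operation to the
# constants 0xF0, 0xCC, and 0xAA; here the table encodes a ^ b ^ c.
xor_lut = 0xF0 ^ 0xCC ^ 0xAA
assert lop3_model(0x12345678, 0x0F0F0F0F, 0xFFFF0000, xor_lut) == (
    0x12345678 ^ 0x0F0F0F0F ^ 0xFFFF0000
)

# The full 64-bit product is kept instead of being truncated to 32 bits.
assert mulwide_model(0xFFFFFFFF, 0xFFFFFFFF) == 0xFFFFFFFE00000001
```

The lookup-table construction in the second check is the usual way to derive the table value for any three-input Boolean function: write the function over the three constants and use the resulting byte.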