Mojo module
memory
This module provides GPU memory operations and utilities.
The module implements low-level memory operations for GPU programming, with a focus on:
- Memory address space abstractions (global, shared, constant)
- Cache control operations and policies
- Memory access patterns and optimizations
- Memory alignment and pointer manipulation
It provides a unified interface for memory operations across different GPU architectures, with specialized implementations for NVIDIA and AMD GPUs where needed.
The module is designed for performance-critical code and requires careful usage to achieve optimal memory access patterns and cache utilization.
Aliases
AddressSpace
alias AddressSpace = _GPUAddressSpace
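A common use of this alias is to parameterize pointer types on an address space. The sketch below is a hedged illustration: the `SHARED` and `GLOBAL` member names and the `address_space` parameter of `UnsafePointer` are assumptions about the current API, not confirmed by this page.

```mojo
from gpu.memory import AddressSpace
from memory import UnsafePointer

# Assumed sketch: pointer types annotated with a GPU address space.
# SHARED and GLOBAL are typical _GPUAddressSpace members; verify them
# against the AddressSpace docs before relying on these names.
alias SharedPtr = UnsafePointer[Float32, address_space = AddressSpace.SHARED]
alias GlobalPtr = UnsafePointer[Float32, address_space = AddressSpace.GLOBAL]
```

Encoding the address space in the pointer type lets the compiler select the correct load/store instructions for each memory region.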
Structs
- CacheEviction: Represents cache eviction policies for GPU memory operations.
- CacheOperation: Represents different GPU cache operation policies.
- Consistency: Represents memory consistency models for GPU memory operations.
- Fill: Represents memory fill patterns for GPU memory operations.
- ReduceOp: Represents reduction operations for parallel reduction algorithms.
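These policy types correspond closely to PTX cache hints. As a hedged sketch (the member name below mirrors the PTX `evict_first` hint and is an assumption; check the `CacheEviction` struct docs for the exact spelling), a policy value can be selected and passed to memory operations that accept one:

```mojo
from gpu.memory import CacheEviction

# Assumed member name, mirroring PTX eviction-priority hints.
# EVICT_FIRST marks data as the preferred candidate for eviction,
# which suits streaming data that will not be reused.
alias streaming_policy = CacheEviction.EVICT_FIRST
```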
Functions
- async_copy: Asynchronously copies data from global memory to shared memory.
- async_copy_commit_group: Commits all previously initiated but uncommitted cp.async instructions into a cp.async-group.
- async_copy_wait_all: Waits for completion of all committed cp.async-groups.
- async_copy_wait_group: Waits for the completion of the `n` most recently committed cp.async-groups.
- cp_async_bulk_tensor_global_shared_cta: Initiates an asynchronous copy operation to transfer tensor data from shared CTA memory to global memory using NVIDIA's Tensor Memory Access (TMA) mechanism.
- cp_async_bulk_tensor_reduce: Initiates an asynchronous reduction operation between shared CTA memory and global memory using NVIDIA's Tensor Memory Access (TMA) mechanism.
- cp_async_bulk_tensor_shared_cluster_global: Initiates an asynchronous bulk copy of tensor data from global memory to shared memory.
- cp_async_bulk_tensor_shared_cluster_global_multicast: Initiates an asynchronous multicast load using NVIDIA's Tensor Memory Access (TMA) mechanism to copy tensor data from global memory to the shared memory of multiple CTAs in a cluster.
- external_memory: Gets a pointer to dynamically allocated external memory.
- fence_async_view_proxy: Establishes a memory fence for shared memory view operations.
- fence_mbarrier_init: Creates a memory fence after mbarrier initialization.
- fence_proxy_tensormap_generic_sys_acquire: Acquires a system-wide memory fence for tensor map operations.
- fence_proxy_tensormap_generic_sys_release: Releases the system-wide memory fence for tensor map operations.
- load: Loads data from global memory into a SIMD vector.
- multimem_ld_reduce: Performs a vectorized load-reduce operation using NVIDIA's multimem feature.
- multimem_st: Stages an inline multimem.st instruction.
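The async-copy functions above are typically used together: issue one or more copies, commit them as a group, then wait before reading the shared-memory destination. The following sketch shows that pattern under stated assumptions: the `[size]` compile-time parameter on `async_copy`, the argument order, and the address-space annotations are assumptions about the current signatures, not confirmed by this page.

```mojo
from gpu.memory import (
    AddressSpace,
    async_copy,
    async_copy_commit_group,
    async_copy_wait_group,
)
from memory import UnsafePointer

fn load_tile(
    src: UnsafePointer[Float32, address_space = AddressSpace.GLOBAL],
    dst: UnsafePointer[Float32, address_space = AddressSpace.SHARED],
):
    # Issue a 16-byte asynchronous copy from global to shared memory
    # (lowered to a cp.async instruction on NVIDIA GPUs).
    async_copy[16](src, dst)
    # Batch all outstanding cp.async instructions into one cp.async-group.
    async_copy_commit_group()
    # Block until at most 0 groups remain in flight, i.e. until every
    # committed group has completed.
    async_copy_wait_group(0)
```

Overlapping these copies with computation (for example, double-buffering tiles and waiting with a nonzero group count) is the main way this API hides global-memory latency.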