Mojo module
memory
This module provides GPU memory operations and utilities.
The module implements low-level memory operations for GPU programming, with a focus on:
- Memory address space abstractions (global, shared, constant)
- Cache control operations and policies
- Memory access patterns and optimizations
- Memory alignment and pointer manipulation
It provides a unified interface for memory operations across different GPU architectures, with specialized implementations for NVIDIA and AMD GPUs where needed.
The module is designed for performance-critical code and requires careful usage to achieve optimal memory access patterns and cache utilization.
Aliases
- AddressSpace = _GPUAddressSpace
Structs
- CacheEviction: Represents cache eviction policies for GPU memory operations.
- CacheOperation: Represents different GPU cache operation policies.
- Consistency: Represents memory consistency models for GPU memory operations.
- Fill: Represents memory fill patterns for GPU memory operations.
- ReduceOp: Represents reduction operations for parallel reduction algorithms.
Functions
- async_copy: Asynchronously copies data from global memory to shared memory.
- async_copy_commit_group: Commits all prior initiated but uncommitted cp.async instructions into a cp.async-group.
- async_copy_wait_all: Waits for completion of all committed cp.async-groups.
- async_copy_wait_group: Waits for the completion of the n most recently committed cp.async-groups.
- cp_async_bulk_tensor_global_shared_cta: Initiates an asynchronous copy operation to transfer tensor data from shared CTA memory to global memory using NVIDIA's Tensor Memory Access (TMA) mechanism.
- cp_async_bulk_tensor_reduce: Initiates an asynchronous reduction operation between shared CTA memory and global memory using NVIDIA's Tensor Memory Access (TMA) mechanism.
- cp_async_bulk_tensor_shared_cluster_global: Initiates an asynchronous bulk copy operation of tensor data from global memory to shared memory.
- cp_async_bulk_tensor_shared_cluster_global_multicast: Initiates an asynchronous multicast load operation using NVIDIA's Tensor Memory Access (TMA) to copy tensor data from global memory to the shared memories of multiple CTAs in a cluster.
- external_memory: Gets a pointer to dynamically allocated external memory.
- fence_mbarrier_init: Creates a memory fence after mbarrier initialization.
- fence_proxy_tensormap_generic_sys_acquire: Acquires a system-wide memory fence for tensor map operations.
- fence_proxy_tensormap_generic_sys_release: Releases the system-wide memory fence for tensor map operations.
- load: Loads data from global memory into a SIMD vector.
- multimem_ld_reduce: Performs a vectorized load-reduce operation using NVIDIA's multimem feature.
- multimem_st: Stages an inline multimem.st instruction.
- tma_store_fence: Establishes a memory fence for shared memory stores in TMA operations.