Mojo module
memory
This module provides GPU memory operations and utilities.
The module implements low-level memory operations for GPU programming, with a focus on:
- Memory address space abstractions (global, shared, constant)
- Cache control operations and policies
- Memory access patterns and optimizations
- Memory alignment and pointer manipulation
It provides a unified interface for memory operations across different GPU architectures, with specialized implementations for NVIDIA and AMD GPUs where needed.
The module is designed for performance-critical code; achieving optimal memory access patterns and cache utilization requires careful use.
Aliases
AddressSpace
alias AddressSpace = _GPUAddressSpace
Structs
- CacheEviction: Represents cache eviction policies for GPU memory operations.
- CacheOperation: Represents different GPU cache operation policies.
- Consistency: Represents memory consistency models for GPU memory operations.
- Fill: Represents memory fill patterns for GPU memory operations.
- ReduceOp: Represents reduction operations for parallel reduction algorithms.
Functions
- async_copy: Asynchronously copies data from global memory to shared memory.
- async_copy_commit_group: Commits all prior initiated but uncommitted cp.async instructions into a cp.async-group.
- async_copy_wait_all: Waits for completion of all committed cp.async-groups.
- async_copy_wait_group: Waits for the completion of the n most recently committed cp.async-groups.
- cp_async_bulk_tensor_global_shared_cta: Initiates an asynchronous copy operation to transfer tensor data from shared CTA memory to global memory using NVIDIA's Tensor Memory Access (TMA) mechanism.
- cp_async_bulk_tensor_reduce: Initiates an asynchronous reduction operation between shared CTA memory and global memory using NVIDIA's Tensor Memory Access (TMA) mechanism.
- cp_async_bulk_tensor_shared_cluster_global: Initiates an asynchronous bulk copy operation of tensor data from global memory to shared memory.
- cp_async_bulk_tensor_shared_cluster_global_multicast: Initiates an asynchronous multicast load operation using NVIDIA's Tensor Memory Access (TMA) to copy tensor data from global memory to shared memories of multiple CTAs in a cluster.
- external_memory: Gets a pointer to dynamically allocated external memory.
- fence_async_view_proxy: Establishes a memory fence for shared memory view operations.
- fence_mbarrier_init: Creates a memory fence after mbarrier initialization.
- fence_proxy_tensormap_generic_sys_acquire: Acquires a system-wide memory fence for tensor map operations.
- fence_proxy_tensormap_generic_sys_release: Releases the system-wide memory fence for tensor map operations.
- load: Loads data from global memory into a SIMD vector.
- multimem_ld_reduce: Performs a vectorized load-reduce operation using NVIDIA's multimem feature.
- multimem_st: Stages an inline multimem.st instruction.
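The relationship between async_copy, async_copy_commit_group, and async_copy_wait_all in the list above can be pictured with a small host-side model. This is only a sketch written in plain Python, not Mojo and not this module's API: the class `AsyncCopyModel` and its methods are hypothetical names chosen for illustration, and the model assumes the usual cp.async bookkeeping in which copies are issued first, then bundled into a group by a commit, and only committed groups are covered by a wait.

```python
class AsyncCopyModel:
    """Toy model of cp.async-style bookkeeping (hypothetical, not the Mojo API)."""

    def __init__(self, global_mem, shared_size):
        self.global_mem = list(global_mem)
        self.shared_mem = [None] * shared_size
        self.uncommitted = []  # copies issued but not yet placed in a group
        self.groups = []       # committed groups, oldest first

    def async_copy(self, src_index, dst_index):
        # Record the request only; nothing is visible in shared memory yet.
        self.uncommitted.append((src_index, dst_index))

    def commit_group(self):
        # Bundle every issued-but-uncommitted copy into one group.
        self.groups.append(self.uncommitted)
        self.uncommitted = []

    def wait_all(self):
        # Complete every committed group, oldest first.
        for group in self.groups:
            for src_index, dst_index in group:
                self.shared_mem[dst_index] = self.global_mem[src_index]
        self.groups = []


model = AsyncCopyModel(global_mem=[10, 20, 30, 40], shared_size=4)
model.async_copy(0, 0)
model.async_copy(1, 1)
model.commit_group()    # the two copies above now form one group
model.async_copy(2, 2)  # issued but never committed
model.wait_all()        # covers committed groups only
print(model.shared_mem)  # [10, 20, None, None]
```

In the model, the third copy stays outside any group, so the wait gives no guarantee about it. This is why a commit normally follows each batch of asynchronous copies before any wait is issued.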
