Mojo module
memory
This module provides GPU memory operations and utilities.
The module implements low-level memory operations for GPU programming, with a focus on:
- Memory address space abstractions (global, shared, constant)
- Cache control operations and policies
- Memory access patterns and optimizations
- Memory alignment and pointer manipulation
It provides a unified interface for memory operations across different GPU architectures, with specialized implementations for NVIDIA and AMD GPUs where needed.
The module is designed for performance-critical code and requires careful usage to achieve optimal memory access patterns and cache utilization.
Aliases
- AddressSpace = _GPUAddressSpace
Structs
- CacheEviction: Represents cache eviction policies for GPU memory operations.
- CacheOperation: Represents different GPU cache operation policies.
- Consistency: Represents memory consistency models for GPU memory operations.
- Fill: Represents memory fill patterns for GPU memory operations.
- ReduceOp: Represents reduction operations for parallel reduction algorithms.
Functions
- async_copy: Asynchronously copies data from global memory to shared memory.
- async_copy_commit_group: Commits all prior initiated but uncommitted cp.async instructions into a cp.async-group.
- async_copy_wait_all: Waits for completion of all committed cp.async-groups.
- async_copy_wait_group: Waits for the completion of the n most recently committed cp.async-groups.
- cp_async_bulk_tensor_global_shared_cta: Initiates an asynchronous copy operation to transfer tensor data from shared CTA memory to global memory using NVIDIA's Tensor Memory Access (TMA) mechanism.
- cp_async_bulk_tensor_reduce: Initiates an asynchronous reduction operation between shared CTA memory and global memory using NVIDIA's Tensor Memory Access (TMA) mechanism.
- cp_async_bulk_tensor_shared_cluster_global: Initiates an asynchronous bulk copy operation of tensor data from global memory to shared memory.
- cp_async_bulk_tensor_shared_cluster_global_multicast: Initiates an asynchronous multicast load operation using NVIDIA's Tensor Memory Access (TMA) to copy tensor data from global memory to the shared memories of multiple CTAs in a cluster.
- external_memory: Gets a pointer to dynamically allocated external memory.
- fence_mbarrier_init: Creates a memory fence after mbarrier initialization.
- fence_proxy_tensormap_generic_sys_acquire: Acquires a system-wide memory fence for tensor map operations.
- fence_proxy_tensormap_generic_sys_release: Releases the system-wide memory fence for tensor map operations.
- load: Loads data from global memory into a SIMD vector.
- multimem_ld_reduce: Performs a vectorized load-reduce operation using NVIDIA's multimem feature.
- multimem_st: Stages an inline multimem.st instruction.
- tma_store_fence: Establishes a memory fence for shared memory stores in TMA operations.