Mojo module
cluster
This module provides low-level NVIDIA GPU cluster synchronization primitives for SM90+ architectures.
The module implements thread block cluster operations that enable efficient communication and synchronization between thread blocks (CTAs) within a cluster on NVIDIA Hopper architecture and newer GPUs.
All functions are constrained to NVIDIA SM90+ GPUs and will raise an error if used on unsupported hardware.
Note: These are low-level primitives that correspond directly to PTX/NVVM instructions and should be used with careful consideration of the underlying hardware synchronization mechanisms.
Functions
- block_rank_in_cluster: Returns the unique identifier (rank) for the current thread block within its cluster.
- cluster_arrive: Signals arrival at a cluster synchronization point with memory ordering guarantees.
- cluster_arrive_relaxed: Signals arrival at a cluster synchronization point with relaxed memory ordering.
- cluster_mask_base: Computes the base mask for a cluster. The base mask for an axis selects the first CTA in the cluster and all CTAs along that same axis. Example for cluster shape (4, 4, 1); note that CTA rank is contiguous along the first cluster axis.
- cluster_sync: Performs a full cluster synchronization with memory ordering guarantees.
- cluster_sync_acquire: Acquires the cluster sync proxy.
- cluster_sync_relaxed: Performs a full cluster synchronization with relaxed memory ordering.
- cluster_sync_release: Releases the cluster sync proxy.
- cluster_wait: Waits for all thread blocks in the cluster to arrive at the synchronization point.
- clusterlaunchcontrol_query_cancel_get_first_ctaid: Decodes the cancellation request.
- clusterlaunchcontrol_query_cancel_get_first_ctaid_v4: Decodes the cancellation request.
- clusterlaunchcontrol_query_cancel_is_canceled: Decodes the cancellation request.
- clusterlaunchcontrol_try_cancel: Requests to atomically cancel the cluster launch if it has not started running yet.
- elect_one_sync: Elects a single thread within a warp to perform an operation.
- elect_one_sync_with_mask: Elects a single thread within a warp to perform an operation.
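The cluster_mask_base description is easier to follow with the arithmetic written out. The following is a minimal host-side sketch in Python, not the Mojo API: `cluster_rank` and `base_mask` are hypothetical helper names, and the sketch assumes that "along axis a" means the coordinate on axis a varies while the other coordinates stay at zero. The real function may use the opposite axis convention or a different mask width, so treat the printed values as illustrating the idea only.

```python
def cluster_rank(x, y, shape):
    """Linear CTA rank, contiguous along the first cluster axis."""
    return x + y * shape[0]


def base_mask(shape, axis):
    """Bitmask with one bit set per CTA on `axis` through CTA (0, 0).

    Assumption: axis 0 walks x with y = 0, axis 1 walks y with x = 0.
    """
    if axis not in (0, 1):
        raise ValueError("this sketch only models axes 0 and 1")
    mask = 0
    for i in range(shape[axis]):
        x, y = (i, 0) if axis == 0 else (0, i)
        mask |= 1 << cluster_rank(x, y, shape)
    return mask


shape = (4, 4, 1)
mask_axis0 = base_mask(shape, 0)  # CTAs with ranks 0, 1, 2, 3
mask_axis1 = base_mask(shape, 1)  # CTAs with ranks 0, 4, 8, 12
print(f"{mask_axis0:#06x}")  # 0x000f
print(f"{mask_axis1:#06x}")  # 0x1111
```

Because rank is contiguous along the first axis, the axis-0 mask is a run of adjacent bits, while the axis-1 mask has one bit every `shape[0]` positions.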