Mojo function
multimem_ld_reduce
multimem_ld_reduce[type: DType, *, count: Int, reduction: ReduceOp, scope: Scope, consistency: Consistency, accum_type: DType = get_accum_type[::DType,::DType](), output_width: Int = 1](addr: UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)]) -> StaticTuple[SIMD[accum_type, output_width], count]
Performs a vectorized load-reduce operation using NVIDIA's multimem feature.
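This function loads `count` vector elements from a multimem address in global memory and reduces them in a single instruction. In the underlying `multimem.ld_reduce` instruction, the reduction is applied across the memory locations that the multimem address refers to, which is why the result still holds `count` vectors rather than a single value. The multimem feature is available on NVIDIA GPUs with compute capability SM90 or newer.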
Constraints:
- Only supported on SM90+ GPUs.
- Count must be 2 or 4.
- Type must be float32, float16, or bfloat16.
Parameters:
- type (DType): Data type for the operation (float32, float16, or bfloat16).
- count (Int): Number of elements to load and reduce (2 or 4).
- reduction (ReduceOp): Type of reduction operation to perform.
- scope (Scope): Memory scope for the operation.
- consistency (Consistency): Memory consistency model to use.
- accum_type (DType): Data type used for accumulation. Defaults to a wider type than the input (e.g. float32 for float16 inputs) to maintain precision during reduction.
- output_width (Int): Width of each output SIMD vector (default 1).
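The reason `accum_type` defaults to a wider type can be seen without a GPU. The following is a minimal CPU-only sketch, not taken from the library, that adds two float16 values once in float16 and once after widening to float32. The variable names are illustrative, and the exact printed formatting depends on the Mojo version.

```mojo
def main():
    # float16 has an 11-bit significand, so representable values
    # between 2048 and 4096 are spaced 2 apart; 2049 is not one of them.
    var a = Float16(2048.0)
    var b = Float16(1.0)

    # Accumulating in the input type: the exact sum 2049 must be rounded
    # back to a float16 value, so the contribution of `b` is lost.
    var narrow = a + b

    # Accumulating in float32: both inputs are widened first, and the
    # sum 2049 is represented exactly.
    var wide = a.cast[DType.float32]() + b.cast[DType.float32]()

    print("float16 accumulator:", narrow)
    print("float32 accumulator:", wide)
```

The same effect compounds when many half-precision values are reduced, which is the case the wider default is meant to cover.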
Args:
- addr (UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)]): Pointer to global memory where data will be loaded from.
Returns:
A StaticTuple containing 'count' SIMD vectors of width 'output_width' holding the results of the load-reduce operation.
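As a rough illustration of how the parameters and return value fit together, here is a hedged sketch of a device-side helper that calls `multimem_ld_reduce` for four float32 elements. It has not been compiled against a specific Mojo release. The import paths, the enum members `ReduceOp.ADD`, `Scope.GPU`, and `Consistency.RELAXED`, and the spelling `AddressSpace.GLOBAL` for `AddressSpace(1)` are assumptions, as is the helper name `load_reduce_f32x4`. The sketch also assumes the caller has already obtained a valid multimem address, which this page does not cover.

```mojo
# Assumed import locations; these may differ between Mojo releases.
from gpu.intrinsics import Scope
from gpu.memory import (
    AddressSpace,
    Consistency,
    ReduceOp,
    multimem_ld_reduce,
)
from memory import UnsafePointer


# Hypothetical helper, intended to be called from GPU kernel code on SM90+.
fn load_reduce_f32x4(
    multimem_addr: UnsafePointer[
        Float32, address_space = AddressSpace.GLOBAL
    ],
) -> SIMD[DType.float32, 4]:
    # count=4 requests four elements; accum_type is spelled out here so the
    # element type of the result is explicit rather than taken from the default.
    var parts = multimem_ld_reduce[
        DType.float32,
        count=4,
        reduction = ReduceOp.ADD,
        scope = Scope.GPU,
        consistency = Consistency.RELAXED,
        accum_type = DType.float32,
    ](multimem_addr)

    # `parts` is a StaticTuple of 4 SIMD[float32, 1] values (output_width
    # defaults to 1). Pack them into one 4-wide vector for the caller.
    return SIMD[DType.float32, 4](parts[0], parts[1], parts[2], parts[3])
```

Because `count` is restricted to 2 or 4, larger buffers would be processed by calling a helper like this in a loop over suitably offset addresses.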