Mojo function

multimem_ld_reduce

multimem_ld_reduce[type: DType, *, count: Int, reduction: ReduceOp, scope: Scope, consistency: Consistency, accum_type: DType = get_accum_type[::DType,::DType](), output_width: Int = 1](addr: UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)]) -> StaticTuple[SIMD[accum_type, output_width], count]

Performs a vectorized load-reduce operation using NVIDIA's multimem feature.

This function loads multiple values from global memory and performs a reduction operation across them in a single instruction. It utilizes NVIDIA's multimem feature available on SM90+ GPUs for improved performance.

Constraints:

  • Only supported on SM90+ GPUs.
  • count must be 2 or 4.
  • type must be float32, float16, or bfloat16.

Parameters:

  • type (DType): Data type for the operation (float32, float16, or bfloat16).
  • count (Int): Number of elements to load and reduce (2 or 4).
  • reduction (ReduceOp): Type of reduction operation to perform.
  • scope (Scope): Memory scope for the operation.
  • consistency (Consistency): Memory consistency model to use.
  • accum_type (DType): Data type used for accumulation. Defaults to a wider type than input (e.g. float32 for float16 inputs) to maintain precision during reduction.
  • output_width (Int): Width of each output SIMD vector (default 1).

Args:

  • addr (UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)]): Pointer to global memory where data will be loaded from.

Returns:

A StaticTuple containing 'count' SIMD vectors of width 'output_width' holding the results of the load-reduce operation.
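
Example:

The following is a minimal sketch of a call, not an authoritative usage pattern. It is written against the signature above and makes several assumptions that this page does not confirm: the import paths, the member names ReduceOp.ADD, Scope.GPU, and Consistency.RELAXED, and the helper name load_reduce_4xf32, which is hypothetical. It also assumes the caller has already prepared addr as a global-memory pointer suitable for multimem access on an SM90+ GPU; that setup is not shown.

```mojo
# Assumed import locations; check them against your Mojo/MAX version.
from gpu.intrinsics import Scope
from gpu.memory import AddressSpace, Consistency, ReduceOp, multimem_ld_reduce
from memory import UnsafePointer
from utils import StaticTuple


# Hypothetical device-side helper (the name is ours, not part of the API).
# The pointer type mirrors the addr argument in the signature above.
fn load_reduce_4xf32(
    addr: UnsafePointer[Float32, address_space = AddressSpace(1)]
) -> StaticTuple[SIMD[DType.float32, 1], 4]:
    # count=4 satisfies the "2 or 4" constraint; float32 is a supported type.
    # accum_type is passed explicitly so the return type does not depend on
    # the default chosen by get_accum_type.
    return multimem_ld_reduce[
        DType.float32,
        count=4,
        reduction = ReduceOp.ADD,  # assumed member name
        scope = Scope.GPU,  # assumed member name
        consistency = Consistency.RELAXED,  # assumed member name
        accum_type = DType.float32,
        output_width=1,
    ](addr)
```

With these parameters the result is a StaticTuple of four SIMD[DType.float32, 1] values, matching the 'count' and 'output_width' description in the Returns section.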