Mojo function

multimem_ld_reduce

multimem_ld_reduce[dtype: DType, *, count: Int, reduction: ReduceOp, scope: Scope, consistency: Consistency, accum_type: DType = get_accum_type[dtype](), output_width: Int = 1](addr: UnsafePointer[Scalar[dtype], address_space=AddressSpace(1), mut=False, origin=origin]) -> StaticTuple[SIMD[dtype, output_width], count]

Performs a vectorized load-reduce operation using NVIDIA's multimem feature.

This function loads multiple values from global memory and reduces them in a single instruction. It relies on NVIDIA's multimem feature, which is available on SM90+ GPUs.

Constraints:

Only supported on SM90+ GPUs.
Total bit width (count * output_width * size_of[dtype] * 8) must be 32, 64, or 128 bits.
Type must be a floating point type.
float64 requires count=1 (no .vec qualifier allowed).
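
The bit-width and float64 constraints above can be checked by hand. The following is a minimal Python sketch of that arithmetic only; it is not part of the Mojo API. The `DTYPE_BYTES` table and the `satisfies_constraints` helper are hypothetical names introduced for illustration, and the sketch does not model the SM90+ requirement or any other check the real function may perform.

```python
# Illustrative model of the documented constraints; not the Mojo implementation.
# Byte sizes for a subset of floating-point dtypes (hypothetical helper table).
DTYPE_BYTES = {"float16": 2, "bfloat16": 2, "float32": 4, "float64": 8}


def total_bits(dtype, count, output_width=1):
    """Total bit width: count * output_width * size_of[dtype] * 8."""
    return count * output_width * DTYPE_BYTES[dtype] * 8


def satisfies_constraints(dtype, count, output_width=1):
    """Return True if the documented constraints hold for this combination."""
    if dtype not in DTYPE_BYTES:
        return False  # only floating-point dtypes are modeled here
    if dtype == "float64" and count != 1:
        return False  # float64 requires count=1
    return total_bits(dtype, count, output_width) in (32, 64, 128)
```

For example, `bfloat16` with `count=4` and `output_width=2` gives 4 * 2 * 2 * 8 = 128 bits and passes, while a single unpacked `float16` gives 16 bits and does not.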

Parameters:

dtype (DType): Data type for the operation (must be a floating point type).
count (Int): Vector size for PTX (corresponds to .v2, .v4, .v8 qualifiers, or no .v for scalar).
reduction (ReduceOp): Type of reduction operation to perform.
scope (Scope): Memory scope for the operation.
consistency (Consistency): Memory consistency model to use.
accum_type (DType): Data type used for accumulation. Defaults to a wider type than the input (e.g. float32 for float16 inputs) to maintain precision during reduction.
output_width (Int): Number of elements packed into a single output register (e.g. bf16x2).

Args:

addr (UnsafePointer): Pointer to the global memory location to load from.

Returns:

StaticTuple: A StaticTuple containing 'count' SIMD vectors of width 'output_width' holding the results of the load-reduce operation.

multimem_ld_reduce[dtype: DType, *, simd_width: Int, reduction: ReduceOp, scope: Scope, consistency: Consistency, accum_type: DType = get_accum_type[dtype]()](addr: UnsafePointer[Scalar[dtype], address_space=AddressSpace(1), mut=False, origin=origin]) -> SIMD[dtype, simd_width]

Simplified multimem_ld_reduce that automatically calculates optimal packing.

This wrapper automatically determines the optimal output_width and count parameters based on the requested simd_width and data type, using 32-bit word packing for efficiency.
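
One way to picture the 32-bit word packing described above is sketched below in Python. The exact rule the wrapper uses is not given here, so this is an assumption: elements are packed so that each output register fills one 32-bit word where possible, and `count` is whatever remains. The helper name `assumed_packing` is hypothetical.

```python
# Assumed packing rule for illustration; the wrapper's actual logic may differ.
def assumed_packing(dtype_bytes, simd_width):
    """Derive (count, output_width) from simd_width under 32-bit word packing."""
    if simd_width not in (1, 2, 4, 8):
        raise ValueError("simd_width must be 1, 2, 4, or 8")
    # Elements per 32-bit word; types of 4 bytes or more are not packed.
    output_width = max(1, 4 // dtype_bytes)
    if simd_width % output_width != 0:
        raise ValueError("simd_width does not fill whole 32-bit words")
    count = simd_width // output_width
    # Re-check the documented total bit width constraint.
    bits = count * output_width * dtype_bytes * 8
    if bits not in (32, 64, 128):
        raise ValueError("total bit width must be 32, 64, or 128")
    return count, output_width
```

Under this assumed rule, a 2-byte type such as bfloat16 with `simd_width=8` maps to `count=4` and `output_width=2` (128 bits), and float32 with `simd_width=4` maps to `count=4` and `output_width=1`.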

Constraints:

Only supported on SM90+ GPUs.
simd_width must be 1, 2, 4, or 8.
Total bit width (count * output_width * size_of[dtype] * 8, using the count and output_width derived by the wrapper) must be 32, 64, or 128 bits.
Type must be a floating point type.
float64 requires count=1 (no .vec qualifier allowed).

Parameters:

dtype (DType): Data type for the operation (must be a floating point type).
simd_width (Int): Total number of elements to process.
reduction (ReduceOp): Type of reduction operation to perform.
scope (Scope): Memory scope for the operation.
consistency (Consistency): Memory consistency model to use.
accum_type (DType): Data type used for accumulation. Defaults to get_accum_type[dtype]().

Returns:

SIMD: A SIMD vector containing simd_width elements with the reduction results.