
Mojo function

multimem_st

multimem_st[dtype: DType, *, count: Int, scope: Scope, consistency: Consistency, width: Int = 1](addr: UnsafePointer[Scalar[dtype], origin, address_space=AddressSpace.GLOBAL], values: StaticTuple[SIMD[dtype, width], count])

Stages an inline multimem.st instruction.

This operation performs a store to all memory locations pointed to by the multimem address using the specified memory consistency model and scope.

Notes:

  • Requires SM90+ GPU architecture (PTX ISA 8.1+).
  • The address must be a valid multimem address.
  • The total store size (count * width * size_of[dtype] * 8) must be 32, 64, or 128 bits.
  • In PTX, memory semantics default to weak when none is specified; this function takes consistency as a required parameter.
  • Vector stores (.v2/.v4/.v8) require matching total size constraints.
  • float64 requires count=1 (no .vec qualifier allowed).
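
The size rule in the notes above can be checked by hand before choosing count and width. The following is a minimal host-side sketch, not part of the gpu.memory API: the helper names total_store_bits and is_supported_store_size are hypothetical, element sizes are passed in bytes, and the formula count * width * size_of[dtype] * 8 is assumed to apply here as it is stated in the constraints of the simplified overload.

```mojo
# Hypothetical helpers for illustration only; they are not part of gpu.memory.

fn total_store_bits(count: Int, width: Int, dtype_bytes: Int) -> Int:
    # count tuple entries, each holding width elements of dtype_bytes bytes.
    return count * width * dtype_bytes * 8


fn is_supported_store_size(count: Int, width: Int, dtype_bytes: Int) -> Bool:
    # Assumed rule: the total store size must be 32, 64, or 128 bits.
    var bits = total_store_bits(count, width, dtype_bytes)
    return bits == 32 or bits == 64 or bits == 128


fn main():
    # float32 (4 bytes) with count=2, width=1.
    print(total_store_bits(2, 1, 4), is_supported_store_size(2, 1, 4))
    # float16 (2 bytes) with count=4, width=2.
    print(total_store_bits(4, 2, 2), is_supported_store_size(4, 2, 2))
    # float16 (2 bytes) with count=1, width=1: a lone 16-bit element.
    print(total_store_bits(1, 1, 2), is_supported_store_size(1, 1, 2))
```

The first two combinations land on 64 and 128 bits and pass the check; a single unpacked float16 is only 16 bits wide and does not.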

Example:

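from gpu.memory.memory import *
from utils import StaticTuple

# addr is assumed to be a valid multimem pointer; val1 and val2 are float32 scalars.
# Store 2 float32 values (2 x 32 = 64 bits) to the multimem address.
multimem_st[DType.float32, count=2, scope=Scope.CTA, consistency=Consistency.RELAXED](
    addr, StaticTuple[SIMD[DType.float32, 1], 2](val1, val2)
)

# vec1 through vec4 are assumed to be SIMD[DType.float16, 2] values.
# Vector store of 4 float16x2 values (4 x 2 x 16 = 128 bits).
multimem_st[DType.float16, count=4, scope=Scope.CLUSTER, consistency=Consistency.RELEASE, width=2](
    addr, StaticTuple[SIMD[DType.float16, 2], 4](vec1, vec2, vec3, vec4)
)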

See Also: PTX ISA Documentation.

Parameters:

  • dtype (DType): The data type of elements to store (must be a floating point type).
  • count (Int): Vector size for PTX (corresponds to .v2, .v4, .v8 qualifiers, or no .v for scalar).
  • scope (Scope): Memory scope for visibility of the store operation (CTA/Cluster/GPU/System).
  • consistency (Consistency): Memory consistency semantics (weak/relaxed/release).
  • width (Int): Vector width modifier for packed data types (default 1).

Args:

  • addr (UnsafePointer): Multimem address in global address space pointing to multiple locations.
  • values (StaticTuple): Packed SIMD values to store; the tuple length must equal the count parameter, and each entry holds width elements.

multimem_st[dtype: DType, *, simd_width: Int, scope: Scope, consistency: Consistency](addr: UnsafePointer[Scalar[dtype], origin, address_space=AddressSpace.GLOBAL], value: SIMD[dtype, simd_width])

Simplified multimem_st that automatically calculates optimal packing.

This wrapper automatically determines the optimal width and count parameters based on the requested simd_width and data type, using 32-bit word packing for efficiency.

Constraints:

  • Only supported on SM90+ GPUs.
  • dtype must be 4 bytes or smaller (float32, float16, bfloat16, etc.). For float64, use the explicit width/count overload.
  • simd_width must be 1, 2, 4, or 8.
  • Total bit width (count * width * size_of[dtype] * 8) must be 32, 64, or 128 bits.
  • dtype must be a floating-point type.
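
One plausible reading of "32-bit word packing" is that width is the number of elements that fit in one 32-bit word and count is the number of such words. The sketch below works through that arithmetic on the host. It is an assumption about the wrapper's behavior, not its actual implementation, and the helper names packed_width and packed_count are hypothetical.

```mojo
# Hypothetical helpers for illustration only; the real wrapper may differ.

fn packed_width(dtype_bytes: Int) -> Int:
    # Elements that fit in one 32-bit (4-byte) word.
    return 4 // dtype_bytes


fn packed_count(simd_width: Int, dtype_bytes: Int) -> Int:
    # Number of 32-bit words needed to hold simd_width elements.
    return simd_width // packed_width(dtype_bytes)


fn main():
    # bfloat16 or float16 (2 bytes), simd_width=8.
    print(packed_width(2), packed_count(8, 2))
    # float32 (4 bytes), simd_width=4.
    print(packed_width(4), packed_count(4, 4))
    # float16 (2 bytes), simd_width=2: exactly one 32-bit word.
    print(packed_width(2), packed_count(2, 2))
```

Under this reading, eight 2-byte elements become width 2 and count 4 (128 bits), and four float32 elements become width 1 and count 4 (128 bits). The sketch does not cover requests smaller than one word, such as a single float16, where the integer division yields a count of 0.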

Parameters:

  • dtype (DType): Data type of the elements to store (must be a floating point type).
  • simd_width (Int): Total number of elements to store (must be 1, 2, 4, or 8).
  • scope (Scope): Memory scope for the operation.
  • consistency (Consistency): Memory consistency model to use.

Args:

  • addr (UnsafePointer): Multimem address in global address space pointing to multiple locations.
  • value (SIMD): SIMD vector containing simd_width elements to store.
