
Mojo function

copy

copy[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), thread_scope: ThreadScope = ThreadScope(0), row_major: Bool = False](dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])

Synchronously copy data from local memory (registers) to SRAM (shared memory).

This function performs a synchronous copy operation from register memory to shared memory in a GPU context, distributing the workload across multiple threads for parallel execution. It's particularly useful for transferring processed data from registers to shared memory for inter-thread communication.

Constraints:

  • Destination tensor must be in SHARED address space.
  • Source tensor must be in LOCAL address space.
  • For optimal performance, the thread layout should match the memory access patterns of the tensors.

Parameters:

  • thread_layout: Layout defining how threads are organized for the operation. This determines how the workload is distributed among threads.
  • swizzle: Optional swizzling function used to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts.
  • thread_scope: Whether operations are performed at BLOCK or WARP level. BLOCK scope involves all threads in a thread block, while WARP scope restricts operations to threads within the same warp. Defaults to ThreadScope.BLOCK.
  • row_major: Whether to use row-major ordering for the copy operation. This is particularly relevant when prefetching from DRAM to SRAM via registers on AMD GPUs. Defaults to False.
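For illustration, here is a hedged sketch of a call that sets the first three parameters explicitly, reusing the shared_data and register_data tensors from the example further down. The 8x8 thread layout, the Swizzle(2, 0, 3) pattern, the WARP scope, and the import paths for Swizzle and ThreadScope are assumptions made for this sketch, not requirements of the API.

from layout import Layout
from layout.swizzle import Swizzle  # assumed import path
from layout.layout_tensor import ThreadScope, copy  # assumed import path

# Distribute the copy over an 8x8 thread arrangement, swizzle the
# shared-memory indices to reduce bank conflicts, and restrict the
# operation to the current warp. All concrete values are illustrative.
copy[
    thread_layout=Layout.row_major(8, 8),
    swizzle=Swizzle(2, 0, 3),
    thread_scope=ThreadScope.WARP,
](shared_data, register_data)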

Args:

  • dst: The destination tensor, which must be in shared memory (SRAM).
  • src: The source tensor, which must be in local memory (registers).

Example:

from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy  # assumed import path for copy

# Allocate a 16x16 tile in registers and a matching 16x16 tile in shared memory.
var register_data = LayoutTensor[
    DType.float32, Layout.row_major(16, 16), MutableAnyOrigin,
    address_space=AddressSpace.LOCAL,
].stack_allocation()
var shared_data = LayoutTensor[
    DType.float32, Layout.row_major(16, 16), MutableAnyOrigin,
    address_space=AddressSpace.SHARED,
].stack_allocation()

# Process data in registers
# ...

# Copy the processed data to shared memory for inter-thread communication,
# distributing the work across an 8x8 arrangement of threads.
copy[Layout.row_major(8, 8)](shared_data, register_data)

Performance:

  • Distributes the copy workload across multiple threads for parallel execution.
  • Can use swizzling to optimize memory access patterns and reduce bank conflicts.
  • Optimized for transferring data from registers to shared memory.
  • On AMD GPUs, the row_major parameter can be used to match the memory access pattern used during prefetching from DRAM to registers.
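
A sketch of that AMD prefetch pattern, reusing register_data and shared_data from the example above (the 8x8 thread layout is an illustrative choice):

# Assume register_data has been filled from global memory (DRAM) in
# row-major order as part of a prefetch stage.
# ...

# Write the prefetched tile to shared memory, preserving the row-major
# access order used during the DRAM-to-register load.
copy[Layout.row_major(8, 8), row_major=True](shared_data, register_data)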

Notes:

  • The destination tensor must be in SHARED address space (SRAM).
  • The source tensor must be in LOCAL address space (registers).
  • This function is particularly useful in GPU kernels for sharing processed data between threads in the same block.
  • The row_major parameter is specifically designed for AMD GPUs when using a prefetching pattern from DRAM to SRAM via registers.