Mojo function

copy

copy[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), thread_scope: ThreadScope = ThreadScope(0), row_major: Bool = False](dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])

Synchronously copy data from local memory (registers) to SRAM (shared memory).

The copy runs on the GPU and is distributed across the threads described by thread_layout, so the transfer from registers to shared memory proceeds in parallel. It is typically used to publish data processed in registers to shared memory, where other threads can read it.

Constraints:

Destination tensor must be in SHARED address space.
Source tensor must be in LOCAL address space.
For optimal performance, the thread layout should match the memory access patterns of the tensors.

Parameters:

thread_layout: Layout defining how threads are organized for the operation. This determines how the workload is distributed among threads.
swizzle: Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts.
thread_scope: Defines whether operations are performed at BLOCK or WARP level. BLOCK scope involves all threads in a thread block, while WARP scope restricts operations to threads within the same warp. Defaults to ThreadScope.BLOCK.
row_major: Whether to use row-major ordering for the copy operation. This is particularly relevant when prefetching from DRAM to SRAM via registers on AMD GPUs. Defaults to False.

Args:

dst: The destination tensor, which must be in shared memory (SRAM).
src: The source tensor, which must be in local memory (registers).

Example:

from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.layout_tensor import copy  # assumed module path for this overload

var register_data = LayoutTensor[DType.float32, Layout((16, 16)),
    address_space=AddressSpace.LOCAL]()
var shared_data = LayoutTensor[DType.float32, Layout((16, 16)),
    address_space=AddressSpace.SHARED]()

# Process data in registers
# ...

# Copy processed data to shared memory for inter-thread communication
copy[Layout((8, 8))](shared_data, register_data)

Performance:

Distributes the copy workload across multiple threads for parallel execution.
Can use swizzling to optimize memory access patterns and reduce bank conflicts.
Optimized for transferring data from registers to shared memory.
On AMD GPUs, the row_major parameter can be used to match the memory access pattern used during prefetching from DRAM to registers.
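
The bank-conflict point can be made concrete with a small, library-free sketch. It does not use the library's Swizzle type. The function bank_of and the XOR scheme are hypothetical, and the sketch assumes 32 four-byte banks and a row-major float32 tile that is 32 columns wide. It only shows why permuting destination indices can spread a column-wise write across banks.

```mojo
# Illustrative only: a hand-written XOR swizzle, NOT the library's Swizzle type.
# Assumptions: 32 banks of 4 bytes each, and a row-major float32 tile with
# 32 columns, so element (row, col) sits at word offset row * 32 + col.


fn bank_of(row: Int, col: Int, swizzled: Bool) -> Int:
    var num_banks = 32
    var tile_cols = 32
    # With swizzling, XOR the column with the row before linearizing.
    var c = (col ^ row) if swizzled else col
    return (row * tile_cols + c) % num_banks


fn main():
    # Eight threads each write column 0 of a different row.
    for row in range(8):
        print(
            "row", row,
            "plain bank", bank_of(row, 0, False),
            "swizzled bank", bank_of(row, 0, True),
        )
```

Under these assumptions, every unswizzled write lands in bank 0 and would be serialized, while the swizzled writes land in banks 0 through 7. The swizzle pattern that suits a real kernel depends on the tile shape and element size.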

Notes:

The destination tensor must be in SHARED address space (SRAM).
The source tensor must be in LOCAL address space (registers).
This function is particularly useful in GPU kernels for sharing processed data between threads in the same block.
The row_major parameter is specifically designed for AMD GPUs when using a prefetching pattern from DRAM to SRAM via registers.