Mojo function
copy
copy[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), thread_scope: ThreadScope = ThreadScope(0), row_major: Bool = False](dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])
Synchronously copy data from local memory (registers) to SRAM (shared memory).
This function performs a synchronous copy operation from register memory to shared memory in a GPU context, distributing the workload across multiple threads for parallel execution. It's particularly useful for transferring processed data from registers to shared memory for inter-thread communication.
Constraints:
- Destination tensor must be in SHARED address space.
- Source tensor must be in LOCAL address space.
- For optimal performance, the thread layout should match the memory access patterns of the tensors.
Parameters:
thread_layout: Layout defining how threads are organized for the
operation. This determines how the workload is distributed among
threads.
swizzle: Optional swizzling function to rearrange the destination
indices, which can improve memory access patterns and reduce bank
conflicts.
thread_scope: Defines whether operations are performed at BLOCK or
WARP level. BLOCK scope involves all threads in a thread block,
while WARP scope restricts operations to threads within the same
warp. Defaults to ThreadScope.BLOCK.
row_major: Whether to use row-major ordering for the copy operation.
This is particularly relevant when prefetching from DRAM to SRAM
via registers on AMD GPUs. Defaults to False.
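To make the role of thread_layout concrete, the sketch below works out how a 16x16 tile could be divided among an 8x8 thread layout. It is plain index arithmetic and does not call the layout library. The helper names are hypothetical, and the interleaved mapping (thread coordinates taken as row and column modulo the thread layout shape, flattened row-major) is an assumption for illustration, not a statement of how copy distributes elements internally.

```mojo
# Illustrative index arithmetic only; does not use the layout package.
# Assumptions: interleaved element ownership and row-major thread numbering.

fn elements_per_thread(
    tile_rows: Int, tile_cols: Int, thread_rows: Int, thread_cols: Int
) -> Int:
    # Each thread handles an equal share of the tile.
    return (tile_rows // thread_rows) * (tile_cols // thread_cols)


fn owner_thread(row: Int, col: Int, thread_rows: Int, thread_cols: Int) -> Int:
    # Thread coordinates are (row % thread_rows, col % thread_cols),
    # flattened here in row-major order (an assumption).
    return (row % thread_rows) * thread_cols + (col % thread_cols)


def main():
    # A 16x16 tile handled by an 8x8 thread layout (64 threads).
    print(elements_per_thread(16, 16, 8, 8))  # prints 4
    print(owner_thread(0, 0, 8, 8))           # prints 0
    print(owner_thread(9, 10, 8, 8))          # prints 10
```

Under these assumptions each of the 64 threads writes 4 elements, which is why a thread layout that evenly divides the tile shape keeps the work balanced.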
Args:
dst: The destination tensor, which must be in shared memory (SRAM).
src: The source tensor, which must be in local memory (registers).
Example:
from layout import LayoutTensor, Layout

var register_data = LayoutTensor[DType.float32, Layout((16, 16)),
    address_space=AddressSpace.LOCAL]()
var shared_data = LayoutTensor[DType.float32, Layout((16, 16)),
    address_space=AddressSpace.SHARED]()

# Process data in registers
# ...

# Copy processed data to shared memory for inter-thread communication
copy[Layout((8, 8))](shared_data, register_data)
Performance:
- Distributes the copy workload across multiple threads for parallel execution.
- Can use swizzling to optimize memory access patterns and reduce bank conflicts.
- Optimized for transferring data from registers to shared memory.
- On AMD GPUs, the row_major parameter can be used to match the memory access pattern used during prefetching from DRAM to registers.
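The effect of swizzling on bank conflicts can be illustrated with a small, self-contained sketch. It assumes a 32x32 tile of 4-byte elements stored row-major in shared memory with 32 banks, and uses a simple XOR of the column index with the row index. Both the bank model and the XOR scheme are assumptions chosen for illustration; the helper names are hypothetical and this is not the library's Swizzle implementation.

```mojo
# Illustrative model of shared-memory banks; does not use the layout package.
# Assumptions: 32 banks, one 4-byte element per bank slot, 32x32 row-major tile.

fn bank_of(row: Int, col: Int, swizzled: Bool) -> Int:
    var c = col
    if swizzled:
        # XOR the column with the row so each row shifts its columns differently.
        c = col ^ (row % 32)
    return (row * 32 + c) % 32


fn distinct_banks(col: Int, swizzled: Bool) -> Int:
    # Count how many different banks are touched when 32 threads
    # each access one row of the same logical column.
    var seen = 0  # bitmask with one bit per bank
    for row in range(32):
        seen |= 1 << bank_of(row, col, swizzled)
    var count = 0
    for b in range(32):
        count += (seen >> b) & 1
    return count


def main():
    print(distinct_banks(0, False))  # prints 1: every access hits the same bank
    print(distinct_banks(0, True))   # prints 32: accesses spread over all banks
```

In this model a column access without swizzling serializes on a single bank, while the XOR remapping spreads the same accesses across all 32 banks.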
Notes:
- The destination tensor must be in SHARED address space (SRAM).
- The source tensor must be in LOCAL address space (registers).
- This function is particularly useful in GPU kernels for sharing processed data between threads in the same block.
- The row_major parameter is specifically designed for AMD GPUs when using a prefetching pattern from DRAM to SRAM via registers.