Mojo function
copy_dram_to_local
copy_dram_to_local[src_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src_base: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment])
Efficiently copy data from global memory (DRAM) to registers for AMD GPUs.
This function implements an optimized memory transfer operation specifically for AMD GPU architectures. It utilizes the hardware's buffer_load intrinsic to efficiently transfer data from global memory to registers while handling bounds checking. The function distributes the copy operation across multiple threads for maximum throughput.
Constraints:
- Only supported on AMD GPUs.
- The destination element layout size must match the SIMD width.
- Source fragments must be rank 2 with known dimensions.
Notes:
- The offset calculation method significantly impacts performance. The current implementation optimizes for throughput over flexibility.
- This function is particularly useful for prefetching data into registers before performing computations, reducing memory access latency (see the usage sketch after the argument list below).
Parameters:
- src_thread_layout (Layout): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads.
- thread_scope (ThreadScope): Defines whether operations are performed at BLOCK or WARP level. BLOCK scope involves all threads in a thread block, while WARP scope restricts operations to threads within the same warp. Defaults to ThreadScope.BLOCK.
Args:
- dst (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The destination tensor in register memory (LOCAL address space).
- src (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The source tensor in global memory (DRAM) to be copied.
- src_base (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The original global memory tensor from which src is derived. This is used to construct the buffer descriptor required by AMD's buffer_load intrinsic.
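The following is a minimal kernel-side sketch of this overload, not taken from this page: the import paths, matrix and tile sizes (M, K, BM, BK), SIMD width, and the 16x4 thread layout are all assumptions chosen to satisfy the constraints above, and must be adapted to the actual kernel configuration.

```mojo
from gpu import block_idx
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.layout_tensor import ThreadScope, copy_dram_to_local

alias dtype = DType.float32
alias M = 1024
alias K = 1024
alias BM = 64          # block tile rows (assumed)
alias BK = 16          # block tile columns (assumed)
alias simd_width = 4   # must match the destination element layout size


fn prefetch_a_tile(
    a: LayoutTensor[dtype, Layout.row_major(M, K), MutableAnyOrigin],
):
    # Per-thread register fragment: 4x4 elements, viewed as 4 vectors of
    # width `simd_width` via vectorize below.
    var a_reg = LayoutTensor[
        dtype,
        Layout.row_major(4, simd_width),
        MutableAnyOrigin,
        address_space = AddressSpace.LOCAL,
    ].stack_allocation()

    # The BM x BK sub-tile of global memory handled by this thread block.
    var a_tile = a.tile[BM, BK](block_idx.y, block_idx.x)

    # 16x4 = 64 threads cooperate; each receives a 4x1 slice of the
    # vectorized source fragment. The full tensor `a` is passed as
    # `src_base` so the implementation can build the buffer descriptor
    # required by buffer_load.
    copy_dram_to_local[
        src_thread_layout = Layout.row_major(16, 4),
        thread_scope = ThreadScope.BLOCK,
    ](
        a_reg.vectorize[1, simd_width](),
        a_tile.vectorize[1, simd_width](),
        a,
    )
```

With these assumed sizes, the vectorized source fragment is 64x4 vectors distributed over 64 threads, so each thread fills a 4x1 vectorized register tile, and the element layout size of the destination equals the SIMD width as the constraints require.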
copy_dram_to_local[src_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src_iter: LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_bitwidth=layout_bitwidth, masked=masked], bounds: Int)
Efficiently copy data from global memory (DRAM) to registers for AMD GPUs.
This function implements an optimized memory transfer operation specifically for AMD GPU architectures. It utilizes the hardware's buffer_load intrinsic to efficiently transfer data from global memory to registers while handling bounds checking. The function distributes the copy operation across multiple threads for maximum throughput.
Constraints:
- Only supported on AMD GPUs.
- The destination element layout size must match the SIMD width.
- Source fragments must be rank 2 with known dimensions.
Notes:
- The offset calculation method significantly impacts performance. The current implementation optimizes for throughput over flexibility.
- This function is particularly useful for prefetching data into registers before performing computations, reducing memory access latency (see the usage sketch after the argument list below).
Parameters:
- src_thread_layout (Layout): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads.
- thread_scope (ThreadScope): Defines whether operations are performed at BLOCK or WARP level. BLOCK scope involves all threads in a thread block, while WARP scope restricts operations to threads within the same warp. Defaults to ThreadScope.BLOCK.
Args:
- dst (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The destination tensor in register memory (LOCAL address space).
- src_iter (LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_bitwidth=layout_bitwidth, masked=masked]): The iterator over the source tensor in global memory (DRAM).
- bounds (Int): The bounds of the buffer, computed relative to the base pointer of src_iter.
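A similar sketch for this iterator-based overload follows. As before, the import paths, tile sizes, thread layout, and the tiled_iterator helper and interpretation of bounds as an element count measured from the iterator's base pointer are assumptions for illustration only.

```mojo
from gpu import block_idx
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.layout_tensor import ThreadScope, copy_dram_to_local

alias dtype = DType.float32
alias K = 1024
alias N = 1024
alias BK = 16          # tile rows per iteration (assumed)
alias BN = 64          # tile columns per block (assumed)
alias simd_width = 4   # must match the destination element layout size


fn prefetch_b_tile(
    b: LayoutTensor[dtype, Layout.row_major(K, N), MutableAnyOrigin],
):
    # Per-thread register fragment for one BK x BN tile.
    var b_reg = LayoutTensor[
        dtype,
        Layout.row_major(4, simd_width),
        MutableAnyOrigin,
        address_space = AddressSpace.LOCAL,
    ].stack_allocation()

    # Iterator over BK x BN tiles of `b`, advancing along axis 0 (the K
    # dimension), starting at tile (0, block_idx.x).
    var b_iter = b.tiled_iterator[BK, BN, axis=0](0, block_idx.x)

    # The bounds argument limits the buffer_load to the valid extent of `b`
    # relative to the iterator's base pointer; since the iterator starts at
    # the beginning of `b`, its full element count is assumed here.
    copy_dram_to_local[
        src_thread_layout = Layout.row_major(4, 16),
        thread_scope = ThreadScope.BLOCK,
    ](
        b_reg.vectorize[1, simd_width](),
        b_iter,
        b.size(),
    )
```

This variant is convenient when the source tile is produced by an iterator in a pipelined loop; the explicit bounds let the hardware clamp out-of-range lanes instead of requiring per-element checks in software.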