Mojo function

copy_dram_to_local

copy_dram_to_local[src_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src_base: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment])

Efficiently copy data from global memory (DRAM) to registers for AMD GPUs.

This function implements an optimized memory transfer operation specifically for AMD GPU architectures. It uses the hardware's buffer_load intrinsic to transfer data from global memory to registers while handling bounds checking, and it distributes the copy across multiple threads for maximum throughput.

Constraints:

  • Only supported on AMD GPUs.
  • The destination element layout size must match the SIMD width.
  • Source fragments must be rank 2 with known dimensions.

Notes:

  • The offset calculation method significantly impacts performance. The current implementation optimizes for throughput over flexibility.
  • This function is particularly useful for prefetching data into registers before performing computations, reducing memory access latency.

Parameters:

  • src_thread_layout (Layout): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads.
  • thread_scope (ThreadScope): Defines whether operations are performed at BLOCK or WARP level. BLOCK scope involves all threads in a thread block, while WARP scope restricts operations to threads within the same warp. Defaults to ThreadScope.BLOCK.

Args:

  • dst (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The destination tensor in register memory (LOCAL address space).
  • src (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The source tensor in global memory (DRAM) to be copied.
  • src_base (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The original global memory tensor from which src is derived. This is used to construct the buffer descriptor required by AMD's buffer_load intrinsic.
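A minimal usage sketch of this overload is shown below. The import paths, tile and thread-layout shapes, and the tile/vectorize/stack_allocation helper calls are illustrative assumptions, not a verified kernel; only the copy_dram_to_local call itself follows the signature documented above.

```mojo
# Hedged sketch: tile sizes, thread layout, and the helper calls used to
# build the operands are assumptions for illustration.
from gpu import block_idx
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.layout_tensor import ThreadScope, copy_dram_to_local

alias dtype = DType.float32
alias simd_width = 4                        # assumed vector width
alias load_layout = Layout.row_major(8, 4)  # 32 threads cover an 8x4 grid

fn load_a_tile(a: LayoutTensor[dtype, Layout.row_major(1024, 1024), MutableAnyOrigin]):
    # Per-thread register fragment. After vectorization its element layout
    # must match simd_width (see the constraints above).
    var a_reg = LayoutTensor[
        dtype,
        Layout.row_major(4, simd_width),
        MutableAnyOrigin,
        address_space = AddressSpace.LOCAL,
    ].stack_allocation()

    # Global-memory tile this thread block loads (assumed 32x16 tiling).
    var a_tile = a.tile[32, 16](block_idx.y, 0)

    # Distribute the tile over threads and issue buffer_load-backed copies.
    # Passing `a` as src_base lets the function build the AMD buffer
    # descriptor for the full tensor.
    copy_dram_to_local[
        src_thread_layout=load_layout,
        thread_scope = ThreadScope.BLOCK,
    ](
        a_reg.vectorize[1, simd_width](),
        a_tile.vectorize[1, simd_width](),
        a,
    )
```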

copy_dram_to_local[src_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src_iter: LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_bitwidth=layout_bitwidth, masked=masked], bounds: Int)

Efficiently copy data from global memory (DRAM) to registers for AMD GPUs.

This function implements an optimized memory transfer operation specifically for AMD GPU architectures. It uses the hardware's buffer_load intrinsic to transfer data from global memory to registers while handling bounds checking, and it distributes the copy across multiple threads for maximum throughput.

Constraints:

  • Only supported on AMD GPUs.
  • The destination element layout size must match the SIMD width.
  • Source fragments must be rank 2 with known dimensions.

Notes:

  • The offset calculation method significantly impacts performance. The current implementation optimizes for throughput over flexibility.
  • This function is particularly useful for prefetching data into registers before performing computations, reducing memory access latency.

Parameters:

  • src_thread_layout (Layout): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads.
  • thread_scope (ThreadScope): Defines whether operations are performed at BLOCK or WARP level. BLOCK scope involves all threads in a thread block, while WARP scope restricts operations to threads within the same warp. Defaults to ThreadScope.BLOCK.

Args:

  • dst (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The destination tensor in register memory (LOCAL address space).
  • src_iter (LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_bitwidth=layout_bitwidth, masked=masked]): The iterator over the source tensor in global memory (DRAM).
  • bounds (Int): The bounds of the buffer, based on the pointer of src_iter.
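A hedged sketch of the iterator-based overload follows. The tiled_iterator helper, the tile and thread-layout shapes, and the choice of bounds (here taken as the total element count of the source tensor) are assumptions for illustration; only the copy_dram_to_local call shape follows the signature documented above.

```mojo
# Hedged sketch for the iterator-based overload. Tile sizes, the
# tiled_iterator call, and the bounds value are illustrative assumptions.
from gpu import block_idx
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from layout.layout_tensor import ThreadScope, copy_dram_to_local

alias dtype = DType.float32
alias simd_width = 4
alias load_layout = Layout.row_major(8, 4)

fn load_a_tile_iter(a: LayoutTensor[dtype, Layout.row_major(1024, 1024), MutableAnyOrigin]):
    # Per-thread register fragment, as in the previous sketch.
    var a_reg = LayoutTensor[
        dtype,
        Layout.row_major(4, simd_width),
        MutableAnyOrigin,
        address_space = AddressSpace.LOCAL,
    ].stack_allocation()

    # Iterator over 32x16 tiles of `a`, advancing along the column axis
    # (assumed tiling).
    var a_iter = a.tiled_iterator[32, 16, axis=1](block_idx.y, 0)

    # `bounds` limits buffer_load accesses relative to the iterator's base
    # pointer; here the whole tensor is assumed to be addressable.
    copy_dram_to_local[
        src_thread_layout=load_layout,
        thread_scope = ThreadScope.BLOCK,
    ](
        a_reg.vectorize[1, simd_width](),
        a_iter,
        a.size(),
    )
```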