Mojo function

copy_local_to_dram

copy_local_to_dram[dst_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment])

Efficiently copy data from registers (LOCAL) to global memory (DRAM).

This function transfers data from register memory to global memory. The copy is divided among the participating threads according to dst_thread_layout, so each thread writes only its own portion of the destination, and writes are bounds-checked for safety.

Constraints:

  • The source tensor must be in LOCAL address space (registers).
  • The destination tensor must be in GENERIC or GLOBAL address space (DRAM).
  • Both tensors must have compatible data types.

Parameters:

  • dst_thread_layout (Layout): The layout used to distribute the destination tensor across threads. This determines how the workload is divided among participating threads.
  • thread_scope (ThreadScope): Defines whether operations are performed at BLOCK or WARP level. BLOCK scope involves all threads in a thread block, while WARP scope restricts operations to threads within the same warp. Defaults to ThreadScope.BLOCK, which the signature shows as ThreadScope(0).

Args:

  • dst (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The destination tensor in global memory (DRAM).
  • src (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The source tensor in register memory (LOCAL) to be copied.
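The way a thread layout divides the destination among threads can be sketched in plain Python. This is a conceptual model only: it does not call the Mojo API, and the names copy_local_to_dram_model, worker_index, valid_rows, and valid_cols are hypothetical. It assumes a row-major thread layout with interleaved ownership (each thread's elements are spaced one thread-layout extent apart) and assumes that WARP scope indexes threads by their lane within the warp; neither detail is stated on this page.

```python
def copy_local_to_dram_model(dst, fragments, thread_rows, thread_cols,
                             valid_rows, valid_cols):
    """Plain-Python model of a distributed register-to-global copy.

    dst         : 2D list standing in for the destination tile in global memory
    fragments   : fragments[t] is the 2D register fragment held by thread t
    thread_rows,
    thread_cols : shape of an assumed row-major thread layout
    valid_rows,
    valid_cols  : extent of dst that may be written (models the bounds check)
    """
    for thread_id, fragment in enumerate(fragments):
        # Row-major position of this thread inside the thread layout.
        t_row, t_col = divmod(thread_id, thread_cols)
        for i, fragment_row in enumerate(fragment):
            for j, value in enumerate(fragment_row):
                # Assumption: interleaved ownership, so consecutive elements
                # of one thread are thread_rows / thread_cols apart in dst.
                row = t_row + i * thread_rows
                col = t_col + j * thread_cols
                # Bounds check: out-of-range elements are simply not written.
                if row < valid_rows and col < valid_cols:
                    dst[row][col] = value
    return dst


def worker_index(thread_idx_x, scope, warp_size=32):
    """Assumed mapping from a thread index to its index within the scope."""
    if scope == "BLOCK":
        return thread_idx_x            # every thread in the block participates
    return thread_idx_x % warp_size    # WARP: position (lane) inside the warp


# A 4x4 destination, a 2x2 thread layout (4 threads), 2x2 fragment per thread.
# Thread t holds the value t in every register so ownership is visible.
dst = [[-1] * 4 for _ in range(4)]
fragments = [[[t, t], [t, t]] for t in range(4)]
copy_local_to_dram_model(dst, fragments, 2, 2, valid_rows=3, valid_cols=4)
for row in dst:
    print(row)
```

In this model the last row of dst keeps its initial value because valid_rows is 3, which illustrates how a bounds-checked copy leaves out-of-range elements untouched.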

copy_local_to_dram[dst_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment], dst_base: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment])

Efficiently copy data from registers (LOCAL) to global memory (DRAM) on AMD GPUs.

This overload is specific to AMD GPU architectures. It uses the hardware's buffer_store intrinsic to transfer data from registers to global memory while handling bounds checking, and it distributes the copy across multiple threads for throughput.

Constraints:

  • Only supported on AMD GPUs.
  • Destination tensor must be in GLOBAL address space.
  • Source tensor must be in LOCAL address space.
  • Data types must match between source and destination tensors.

Notes:

  • This function is particularly useful for writing computed results from registers back to global memory with minimal latency.
  • The offset calculation is optimized for performance rather than flexibility.

Parameters:

  • dst_thread_layout (Layout): The layout used to distribute the destination tensor across threads. This determines how the workload is divided among participating threads.
  • thread_scope (ThreadScope): Defines whether operations are performed at BLOCK or WARP level. BLOCK scope involves all threads in a thread block, while WARP scope restricts operations to threads within the same warp. Defaults to ThreadScope.BLOCK, which the signature shows as ThreadScope(0).

Args:

  • dst (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The destination tensor in global memory (DRAM).
  • src (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The source tensor in register memory (LOCAL address space) to be copied.
  • dst_base (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_bitwidth=layout_bitwidth, masked=masked, alignment=alignment]): The original global memory tensor from which dst is derived. This is used to construct the buffer descriptor required by AMD's buffer_store intrinsic.
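The role of dst_base can be illustrated with a small plain-Python model. It is a sketch under stated assumptions, not the Mojo or AMD implementation: BufferDescriptor, buffer_store, and store_tile are hypothetical names, the descriptor is reduced to a backing list plus an element count, and a store whose offset falls outside that count is assumed to be dropped, in the way range-checked buffer stores are generally described.

```python
class BufferDescriptor:
    """Hypothetical stand-in for a buffer descriptor built from dst_base."""

    def __init__(self, storage):
        self.storage = storage            # models the memory behind dst_base
        self.num_records = len(storage)   # range limit used for bounds checks


def buffer_store(desc, offset, value):
    """Model of a range-checked store: out-of-range writes are dropped."""
    if 0 <= offset < desc.num_records:
        desc.storage[offset] = value
        return True
    return False


def store_tile(desc, tile_row, tile_col, tile, base_cols):
    """Write a tile (the role of dst) at its position inside the base tensor.

    Offsets are computed relative to the start of the base tensor, which is
    why the base tensor is needed in addition to the tile itself.
    """
    written = 0
    for i, tile_values in enumerate(tile):
        for j, value in enumerate(tile_values):
            offset = (tile_row + i) * base_cols + (tile_col + j)
            if buffer_store(desc, offset, value):
                written += 1
    return written


# A 3x4 base tensor stored row-major; a 2x2 tile placed at row 2, column 2
# hangs over the bottom edge, so its second row is out of range.
storage = [0] * 12
desc = BufferDescriptor(storage)
written = store_tile(desc, 2, 2, [[1, 2], [3, 4]], base_cols=4)
print(written, storage)
```

Only the two elements of the tile that fall inside the base tensor are written; the overhanging row is discarded by the range check rather than by per-element branching in the caller.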