Mojo function
batched_copy_dram_to_local
batched_copy_dram_to_local[src_thread_layout: Layout, num_threads: Int = src_thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0), block_dim_count: Int = 1](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])
Copies data from global memory (DRAM) to registers (LOCAL) in a batched manner.
It uses the global_load_dwordx4/global_load_dwordx2 instructions to load data from global memory into registers in batches.
Notes:
- This function is primarily intended for warp-specialization patterns.
Parameters:
- src_thread_layout (Layout): The layout used to distribute threads for coalesced loads.
- num_threads (Int): Total number of threads participating in the copy operation. Defaults to the size of src_thread_layout.
- thread_scope (ThreadScope): Defines whether operations are performed at BLOCK or WARP level. BLOCK scope involves all threads in a thread block, while WARP scope restricts operations to threads within the same warp. Defaults to ThreadScope.BLOCK.
- block_dim_count (Int): The number of dimensions in the thread block.
Args:
- dst (LayoutTensor): The destination tensor in register memory (LOCAL address space).
- src (LayoutTensor): The source tensor in global memory (DRAM) to be copied.
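For context, here is a minimal sketch of how this function might be invoked inside a GPU kernel. The import paths, the tile shape, and the surrounding kernel scaffolding are assumptions for illustration, not taken from this page:

```mojo
from layout import Layout, LayoutTensor
# Assumed import path for the copy function; verify against your Mojo version.
from layout.layout_tensor import batched_copy_dram_to_local

# Hypothetical kernel fragment: each thread in a 4x32 thread layout
# cooperatively stages its portion of a global-memory tile into
# per-thread registers (LOCAL address space).
alias thread_layout = Layout.row_major(4, 32)

fn stage_tile(reg_tile: LayoutTensor, src_tile: LayoutTensor):
    # reg_tile is assumed to live in AddressSpace.LOCAL (registers),
    # src_tile in AddressSpace.GLOBAL (DRAM). Loads are issued as
    # batched global_load_dwordx4/dwordx2 instructions.
    batched_copy_dram_to_local[src_thread_layout=thread_layout](
        reg_tile, src_tile
    )
```

With the defaults, all threads in the block participate (ThreadScope.BLOCK); pass thread_scope=ThreadScope.WARP to restrict the copy to a single warp, as in warp-specialized kernels.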