Mojo function
batched_copy_dram_to_local
batched_copy_dram_to_local[src_thread_layout: Layout, num_threads: Int = src_thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0), block_dim_count: Int = 1](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])
Copies data from global memory (DRAM) to registers (LOCAL) in a batched manner.
It uses the global_load_dwordx4/global_load_dwordx2 instructions to load data from global memory into registers in batches.
Notes:
- This function is primarily intended for warp-specialization patterns.
Parameters:
- src_thread_layout (Layout): The layout used to distribute threads for coalesced loads.
- num_threads (Int): Total number of threads participating in the copy operation. Defaults to the size of src_thread_layout.
- thread_scope (ThreadScope): Defines whether operations are performed at BLOCK or WARP level. BLOCK scope involves all threads in a thread block, while WARP scope restricts operations to threads within the same warp. Defaults to ThreadScope.BLOCK.
- block_dim_count (Int): The number of dimensions in the thread block.
Args:
- dst (LayoutTensor): The destination tensor in register memory (LOCAL address space).
- src (LayoutTensor): The source tensor in global memory (DRAM) to be copied.
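For context, here is a minimal sketch of how this function might be invoked inside a GPU kernel. The import paths, the tile shape, and the surrounding kernel scaffolding are assumptions for illustration, not taken from this page:

```mojo
from layout import Layout, LayoutTensor
# Assumed import path for the copy function; verify against your Mojo version.
from layout.layout_tensor import batched_copy_dram_to_local

# Hypothetical kernel fragment: each thread in a 4x32 thread layout
# cooperatively stages its portion of a global-memory tile into
# per-thread registers (LOCAL address space).
alias thread_layout = Layout.row_major(4, 32)

fn stage_tile(reg_tile: LayoutTensor, src_tile: LayoutTensor):
    # reg_tile is assumed to live in AddressSpace.LOCAL (registers),
    # src_tile in AddressSpace.GLOBAL (DRAM). Loads are issued as
    # batched global_load_dwordx4/dwordx2 instructions.
    batched_copy_dram_to_local[src_thread_layout=thread_layout](
        reg_tile, src_tile
    )
```

With the defaults, all threads in the block participate (ThreadScope.BLOCK); pass thread_scope=ThreadScope.WARP to restrict the copy to a single warp, as in warp-specialized kernels.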