Mojo function

copy_local_to_local

copy_local_to_local(dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])

Synchronously copy data between local memory (register) tensors with type conversion.

This function performs a synchronous copy between register tensors in a GPU context and converts float32 values to a half-precision format (bfloat16 or float16) as it copies. It has a fast path for 2D layouts with a contiguous inner dimension, which are common in matrix multiplication.

Example:

from gpu.memory import AddressSpace  # import path assumed; needed for AddressSpace.LOCAL
from layout import LayoutTensor, Layout
from layout.layout_tensor import copy_local_to_local

var src_reg = LayoutTensor[DType.float32, Layout((16, 8)),
                            address_space=AddressSpace.LOCAL]()
var dst_reg = LayoutTensor[DType.bfloat16, Layout((16, 8)),
                            address_space=AddressSpace.LOCAL]()

# Process data in float32 registers
# ...

# Convert and copy to bfloat16 registers
copy_local_to_local(dst_reg, src_reg)

Performance:

Uses a specialized fast path for 2D tensors with a contiguous inner dimension (stride[1] == 1), the layout pattern used in matrix multiplication.
For MMA (Matrix Multiply-Accumulate) operations, efficiently handles the conversion between output fragments and input fragments with different layouts.
Falls back to element-wise copy for general cases.

Notes:

Both source and destination tensors must be in LOCAL address space (registers).
This function currently supports only copies from float32 to a half-precision format (bfloat16 or float16).
For 2D tensors with stride[1] == 1, a specialized fast path is used that's optimized for matrix multiplication patterns.
This function is particularly useful in GPU kernels for converting between different precision formats while keeping data in registers.
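
The precision conversion described in these notes can be seen in isolation on a plain SIMD value. The following is a minimal sketch, not the implementation of copy_local_to_local: it does not call the function or use LayoutTensor, and it assumes only the standard library's SIMD.cast method. The four input values are arbitrary stand-ins for one register fragment.

```mojo
def main():
    # Four float32 values standing in for a small register fragment
    # (illustrative values, not taken from the documentation).
    var src = SIMD[DType.float32, 4](1.0, 2.5, 3.14159274, 1000.123)

    # Element-wise narrowing conversion from float32 to bfloat16.
    # bfloat16 keeps the float32 exponent range but fewer mantissa bits,
    # so values that need more precision are rounded.
    var dst = src.cast[DType.bfloat16]()

    print(src)
    print(dst)
```

Comparing the two printed vectors shows which inputs survive the narrowing exactly and which are rounded. That loss is the trade-off to weigh before converting accumulator values to half precision in registers.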

Constraints:

Destination tensor must be in LOCAL address space.
Source tensor must be in LOCAL address space.
Destination tensor must have a half-precision floating-point data type.
Source tensor must have float32 data type.
Both tensors must have the same total size.

Args:

dst (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]): The destination tensor, which must be in local memory (registers) and have a half-precision floating-point data type (bfloat16 or float16).
src (LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]): The source tensor, which must be in local memory (registers) and have float32 data type.