Mojo function
ld_matrix
ld_matrix[type: DType, //, simd_width: Int, *, transpose: Bool = False](ptr: UnsafePointer[SIMD[type, 1], address_space=AddressSpace(3)]) -> SIMD[type, simd_width]
Loads a matrix from shared memory into registers in a format suitable for tensor core operations.
This function performs a warp-synchronized load from shared memory to registers, formatting the data to be directly usable by tensor core Matrix Multiply-Accumulate (MMA) instructions.
Note:
- All threads in a warp must execute this operation together.
- For transposed loads, only half precision (float16) is supported.
- The register width is fixed at 4 bytes (32 bits).
- Supported configurations:
  - x1: One 32-bit register per thread.
  - x2: Two 32-bit registers per thread.
  - x4: Four 32-bit registers per thread.
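As a rough illustration of how `simd_width` relates to the x1/x2/x4 configurations listed above, the sketch below counts the 32-bit registers each thread would hold. The helper `num_registers` and its `bytes_per_element` argument are hypothetical names introduced for illustration and are not part of the `ld_matrix` API. The mapping itself is an assumption derived from the fixed 4-byte register width.

```mojo
# Hypothetical helper, not part of the ld_matrix API.
# Assumes each register holds 4 bytes, as stated in the note above.
fn num_registers(simd_width: Int, bytes_per_element: Int) -> Int:
    # Number of elements that fit in one 32-bit register (2 for float16).
    var elements_per_register = 4 // bytes_per_element
    return simd_width // elements_per_register


fn main():
    # float16 occupies 2 bytes per element.
    print(num_registers(2, 2))  # x1 configuration
    print(num_registers(4, 2))  # x2 configuration
    print(num_registers(8, 2))  # x4 configuration
```

Under this assumption, a `simd_width` of 8 with float16 elements would correspond to the x4 configuration.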
Example:
```mojo
# Load 8x8 matrix of float16 values
var data = ld_matrix[DType.float16, 8](ptr)

# Load transposed matrix
var transposed = ld_matrix[DType.float16, 8, transpose=True](ptr)
```
Parameters:
- type (DType): The data type of the matrix elements (e.g. float16, float32).
- simd_width (Int): The width of the SIMD vector to load.
- transpose (Bool): Whether to transpose the matrix during load (only supported for half precision).
Args:
- ptr (UnsafePointer[SIMD[type, 1], address_space=AddressSpace(3)]): Pointer to shared memory containing the source matrix data.
Returns:
SIMD vector containing the loaded matrix data, properly formatted for MMA operations.