Mojo function
load_matrix_b_amd_rdna16x16x16
load_matrix_b_amd_rdna16x16x16(b_ptr: UnsafePointer[Float16], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[DType.float16, 16]
Loads 16×16×16 matrix B tile for RDNA (Wave32) architecture.
This function is optimized for AMD RDNA GPUs (Radeon RX 7000 series), which use Wave32 execution mode. Each thread loads 16 contiguous FP16 elements using an access pattern appropriate for WMMA instructions.
Notes: The concrete return type (SIMD[DType.float16, 16]) avoids type ambiguity and padding overhead. This function is specific to the RDNA architecture; for CDNA, use load_matrix_b_amd_cdna16x16x16(), which returns a 4-element SIMD vector.
Args:
- b_ptr (UnsafePointer): Pointer to matrix B data in memory.
- tile_row (Int): Starting row index of the tile.
- tile_col (Int): Starting column index of the tile.
- ldm (Int): Leading dimension of matrix B (stride between rows).
Returns:
SIMD: SIMD vector containing 16 FP16 values for this thread.
load_matrix_b_amd_rdna16x16x16(b_ptr: UnsafePointer[BFloat16], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[DType.bfloat16, 16]
Loads 16×16×16 matrix B tile for RDNA (Wave32) architecture.
This function is optimized for AMD RDNA GPUs (Radeon RX 7000 series), which use Wave32 execution mode. Each thread loads 16 contiguous BF16 elements using an access pattern appropriate for WMMA instructions.
Notes: The concrete return type (SIMD[DType.bfloat16, 16]) avoids type ambiguity and padding overhead. This function is specific to the RDNA architecture; for CDNA, use load_matrix_b_amd_cdna16x16x16(), which returns a 4-element SIMD vector.
Args:
- b_ptr (UnsafePointer): Pointer to matrix B data in memory.
- tile_row (Int): Starting row index of the tile.
- tile_col (Int): Starting column index of the tile.
- ldm (Int): Leading dimension of matrix B (stride between rows).
Returns:
SIMD: SIMD vector containing 16 BF16 values for this thread.
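Example (illustrative only): the sketch below shows how tile_row, tile_col, and ldm are conventionally combined to address elements of a 16×16 tile. It is written in plain Python as index arithmetic, not as a call into the Mojo API. It assumes a row-major matrix B with ldm measured in elements rather than bytes. The helper names tile_element_offset and tile_offsets are hypothetical, and the sketch does not reproduce the per-thread lane mapping used by the actual function, which this page does not specify.

```python
# Hypothetical helpers: not part of the Mojo API documented above.
# Assumption: B is stored row-major and ldm is the row stride in elements.

def tile_element_offset(tile_row, tile_col, ldm, r, c):
    """Linear element offset of tile-local position (r, c) from b_ptr."""
    return (tile_row + r) * ldm + (tile_col + c)


def tile_offsets(tile_row, tile_col, ldm, rows=16, cols=16):
    """Offsets of every element in a rows x cols tile, as a nested list."""
    return [
        [tile_element_offset(tile_row, tile_col, ldm, r, c) for c in range(cols)]
        for r in range(rows)
    ]


# A tile starting at row 16, column 32 of a matrix with 64 elements per row.
offsets = tile_offsets(tile_row=16, tile_col=32, ldm=64)
print(offsets[0][0])    # first element of the tile: 16 * 64 + 32 = 1056
print(offsets[1][0])    # one row down: offset grows by ldm, giving 1120
print(offsets[15][15])  # last element of the tile: 31 * 64 + 47 = 2031
```

Under these assumptions, moving one row down within the tile advances the offset by ldm, while moving one column across advances it by 1. This is why ldm must be the row stride of the full matrix rather than the tile width.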