Mojo function

gemv_split_k

gemv_split_k[c_type: DType, a_type: DType, b_type: DType, c_layout: Layout, a_layout: Layout, b_layout: Layout, simd_width: Scalar[DType.uint], tile_m: Scalar[DType.uint], tile_n: Scalar[DType.uint], num_threads: Scalar[DType.uint], elementwise_lambda_fn: Optional[elementwise_epilogue_type] = None, s_type: DType = get_accum_type[c_type](), check_bounds: Bool = True, pdl_level: PDLLevel = PDLLevel()](output: LayoutTensor[c_type, c_layout, MutAnyOrigin], act: LayoutTensor[a_type, a_layout, ImmutAnyOrigin], weight: LayoutTensor[b_type, b_layout, ImmutAnyOrigin], m: Int, n: Int, k: Int)

GEMV with tiling along the K dimension (split-K). The kernel assumes the B (weight) matrix is transposed, i.e. stored row-major as N x K, and multiplies a vector (1 x K) by that matrix (N x K), contracting over K to produce a 1 x N output. The implementation can handle M > 1, but it is only optimal for very small M. We use it for M = 1 only.
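The arithmetic described above can be modelled with a small pure-Python reference. This is only a sketch of the math, not the Mojo kernel: the function name `gemv_split_k_reference` and the `tile_k` parameter are hypothetical (the real signature exposes `tile_m`, `tile_n`, and `simd_width` instead), and the sketch says nothing about how the kernel maps work onto GPU threads. It assumes `act` is M x K and `weight` is N x K, both as row-major nested lists.

```python
def gemv_split_k_reference(act, weight, m, n, k, tile_k=4):
    """Reference model: output[row][col] = sum_i act[row][i] * weight[col][i].

    The K dimension is walked in chunks of `tile_k` (a hypothetical
    parameter used only for illustration); each chunk yields a partial
    dot product, and the partials are reduced at the end.
    """
    output = [[0.0] * n for _ in range(m)]
    for row in range(m):
        for col in range(n):
            partials = []
            for k0 in range(0, k, tile_k):
                k1 = min(k0 + tile_k, k)  # last chunk may be shorter
                partials.append(
                    sum(act[row][i] * weight[col][i] for i in range(k0, k1))
                )
            # Reduce the per-chunk partial sums into the final element.
            output[row][col] = sum(partials)
    return output


# M = 1 case: a 1 x 5 activation vector times a 3 x 5 (N x K) weight matrix.
act = [[1.0, 2.0, 3.0, 4.0, 5.0]]
weight = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 2.0],
]
result = gemv_split_k_reference(act, weight, m=1, n=3, k=5, tile_k=2)
print(result)  # [[1.0, 15.0, 10.0]]
```

Because `weight` is stored as N x K, each output element reads one contiguous row of `weight`, which is why the transposed layout suits this access pattern. The chunk size only changes how the sum is grouped, not the result (up to floating-point rounding).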