Mojo function

write_bf16x2_row_to_smem_chunked

write_bf16x2_row_to_smem_chunked[layout: Layout, *, out_dtype: DType, in_dtype: DType, config: MLA_SM100_Decode_Config, local_tile_size: Int, chunk_size: Int = 16, scale_needed: Bool = False](shared_mem: UnsafePointer[Scalar[out_dtype], MutAnyOrigin, address_space=AddressSpace.SHARED], local_mem: LayoutTensor[in_dtype, layout, MutAnyOrigin, address_space=AddressSpace.LOCAL], col_start: Int, row_start: Int, scale: Scalar[in_dtype] = 1)

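Writes a row of values from local_mem to shared_mem in chunks of chunk_size elements (default 16), optionally applying scale when scale_needed is True. Processing one chunk at a time reduces register pressure.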
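
The chunking idea can be modeled on the host. The sketch below is a plain-Python illustration, not the Mojo kernel. The helper names (f32_to_bf16_bits, write_row_chunked), the row_stride parameter, the row-major layout, and the packing order (even element in the low 16 bits) are all assumptions made for illustration; the real function's shared-memory layout is not documented here.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Round a float32 value to bfloat16 (round-to-nearest-even), return 16 bits.

    NaN is not handled specially in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Rounding bias: 0x7FFF plus the lowest bit that survives truncation.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF


def write_row_chunked(smem, row, col_start, row_start, row_stride,
                      chunk_size=16, scale=None):
    """Hypothetical model of a chunked, optionally scaled bf16x2 row write.

    smem       -- flat list of 32-bit words, each holding two bf16 values
    row        -- list of floats standing in for the per-thread local values
    row_stride -- elements per shared-memory row (assumed row-major layout)
    scale      -- multiply each value by this before conversion, if not None
    """
    assert chunk_size % 2 == 0 and len(row) % chunk_size == 0
    assert col_start % 2 == 0 and row_stride % 2 == 0
    for base in range(0, len(row), chunk_size):
        # Only chunk_size values are "live" at a time, which is the point
        # of chunking: fewer temporaries held at once.
        chunk = row[base:base + chunk_size]
        if scale is not None:
            chunk = [v * scale for v in chunk]
        for i in range(0, chunk_size, 2):
            lo = f32_to_bf16_bits(chunk[i])
            hi = f32_to_bf16_bits(chunk[i + 1])
            elem = row_start * row_stride + col_start + base + i
            smem[elem // 2] = lo | (hi << 16)  # pack two bf16 into one word


# Usage: write a 32-element row into row 1 of a 2 x 64 element buffer.
smem = [0] * 64
write_row_chunked(smem, [1.0, 2.0] * 16, col_start=0, row_start=1,
                  row_stride=64, chunk_size=16, scale=0.5)
```

In this model the scale is applied per chunk, just before conversion, so the scaled intermediates never exist for the whole row at once.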