Mojo function

mha_decoding_single_batch_pipelined

mha_decoding_single_batch_pipelined[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, BM: UInt, BN: UInt, BK: UInt, WM: UInt, WN: UInt, depth: UInt, num_heads: UInt, num_threads: UInt, num_pipeline_stages: UInt, group: UInt = 1, use_score_mod: Bool = False, decoding_warp_split_k: Bool = False, sink: Bool = False](q_ptr: LegacyUnsafePointer[Scalar[q_type]], k: k_t, v: v_t, output_ptr: LegacyUnsafePointer[Scalar[output_type]], exp_sum_ptr: LegacyUnsafePointer[Scalar[get_accum_type[q_type]()]], qk_max_ptr: LegacyUnsafePointer[Scalar[get_accum_type[q_type]()]], scale: Float32, num_keys: UInt, num_partitions: UInt, max_cache_valid_length: UInt, sink_weights: OptionalReg[LayoutTensor[q_type, Layout.row_major(-1), MutAnyOrigin]], mask: mask_t, score_mod: score_mod_t, batch_idx: Int)

Flash attention v2 algorithm.