Mojo function
flash_attention_split_kv
```mojo
flash_attention_split_kv[
    type: DType, rank: Int, mask_rank: Int, //,
    input_k_fn: fn[Int, Int](Index[$1]) capturing -> SIMD[type, $0],
    input_v_fn: fn[Int, Int](Index[$1]) capturing -> SIMD[type, $0],
    input_k_cache_fn: fn[Int, Int](Index[$1]) capturing -> SIMD[type, $0],
    input_v_cache_fn: fn[Int, Int](Index[$1]) capturing -> SIMD[type, $0],
    input_mask_fn: fn[Int, Int](Index[$1]) capturing -> SIMD[type, $0]
](
    q: NDBuffer[type, rank, origin, shape, strides],
    k_shape: Index[rank],
    v_shape: Index[rank],
    k_cache_shape: Index[(rank + 1)],
    v_cache_shape: Index[(rank + 1)],
    mask_shape: Index[mask_rank],
    output: NDBuffer[type, rank, origin, shape, strides],
    scale: SIMD[float32, 1]
)
```
Variant of flash attention that takes the previous KV cache (`input_{k,v}_cache_fn`) and the current KV tensors (`input_k_fn` and `input_v_fn`) as separate arguments. This works around the fact that fusion can't currently look through a concat. Instead, the kernel performs an in-place concat fusion by wrapping the input lambdas (`input_{k,v}_cache_fn_wrapper`) so that previous-sequence KV elements are read from the KV cache and current KV elements are read from the tensors `k` and `v`.
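
To make the routing concrete, here is a minimal Python sketch (not the Mojo kernel itself) of the index mapping behind this concat fusion: sequence positions below the cached length read from the KV cache, and the remaining positions read from the current K tensor. The names `make_k_load_fn` and `prev_seq_len` are illustrative assumptions, not part of this API.

```python
import numpy as np

def make_k_load_fn(k_cache: np.ndarray, k_new: np.ndarray, prev_seq_len: int):
    """Sketch of a wrapped K load function.

    k_cache: [batch, prev_seq_len, heads, depth] -- previous-sequence keys.
    k_new:   [batch, new_seq_len, heads, depth]  -- keys for the current step.
    """
    def load_k(b: int, s: int, h: int, d: int) -> float:
        if s < prev_seq_len:
            # Previous-sequence element: read from the KV cache.
            return float(k_cache[b, s, h, d])
        # Current element: read from the new K tensor, shifted by the cache length.
        return float(k_new[b, s - prev_seq_len, h, d])
    return load_k

# Usage: the attention loop calls load_k as if K were the concatenation of the
# cache and k_new along the sequence dimension, without materializing it.
k_cache = np.zeros((1, 4, 2, 8), dtype=np.float32)
k_new = np.ones((1, 2, 2, 8), dtype=np.float32)
load_k = make_k_load_fn(k_cache, k_new, prev_seq_len=4)
assert load_k(0, 2, 1, 3) == 0.0  # position 2 comes from the cache
assert load_k(0, 5, 1, 3) == 1.0  # position 5 is element 1 of the current K tensor
```

The V side follows the same pattern, which is how the kernel avoids allocating and writing a concatenated KV tensor before running attention.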