Mojo function
flare_mla_decoding
flare_mla_decoding[rank: Int, cache_t: KVCacheT, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else 4), UInt(1), FlashAttentionAlgorithm()), ragged: Bool = False, decoding_warp_split_k: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: cache_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: NDBuffer[uint32, 1, origin, shape, strides], scale: SIMD[float32, 1], ctx: DeviceContext, q_max_seq_len: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))
MLA decoding kernel that is only called from the optimized compute graph.
The Q input has a shape of [seq_len, num_heads, depth]. The K input has a shape of [seq_len, 1, depth]. The V tensor is derived by reusing K, where V = K[:, :, :depth_v].
Specifically, for DeepSeek V2/3, depth = 576 and depth_v = 512.
Because V is a slice of K, this kernel computes attention without needing to load V a second time. It handles decoding requests only, so q_max_seq_len = 1.
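The shape relationship above (V taken as the first depth_v columns of K, one query token per request) can be illustrated with a small reference computation. This is a minimal sketch in plain Python, not the GPU kernel: the function name `mla_decode_reference` and the nested-list layout are assumptions made for illustration, and masking and score modification are omitted.

```python
import math

def mla_decode_reference(q, k, depth_v, scale):
    """Reference attention for one decoding token (hypothetical helper).

    q: [num_heads][depth]  query for a single token
    k: [seq_len][depth]    the single shared K head
    Returns [num_heads][depth_v].
    """
    out = []
    for q_h in q:
        # Scores use the full depth of each K row.
        scores = [scale * sum(a * b for a, b in zip(q_h, k_row)) for k_row in k]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        denom = sum(weights)
        # V = K[:, :depth_v]: the same rows are reused, only the first
        # depth_v columns contribute to the output.
        out.append([
            sum(w * k_row[d] for w, k_row in zip(weights, k)) / denom
            for d in range(depth_v)
        ])
    return out

# Toy sizes: depth = 3, depth_v = 2 (DeepSeek V2/3 would use 576 and 512).
q = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
k = [[1.0, 2.0, 3.0], [0.0, 4.0, 5.0]]
out = mla_decode_reference(q, k, depth_v=2, scale=1.0)
```

The second head has an all-zero query, so its softmax weights are uniform and its output is the column-wise mean of the first depth_v columns of K.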
This kernel handles batches whose sequences have different valid lengths (that is, their lengths before padding). These lengths are passed in the valid_length argument.
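To make the role of per-sequence valid lengths concrete, the sketch below attends only over the first valid_length[b] rows of a padded K for each batch entry, so padding rows never influence the result. It is an illustration in plain Python under assumptions: the helper `decode_batch_reference`, the padded nested-list layout, and the single-head simplification are hypothetical, and the actual kernel reads K from a KV cache and may lay out ragged inputs differently.

```python
import math

def decode_batch_reference(q, k, valid_length, depth_v, scale):
    """Per-batch decoding attention that ignores padding (hypothetical helper).

    q: [batch][depth]            one head, one token per request
    k: [batch][max_len][depth]   padded to a common max_len
    valid_length: [batch]        number of real rows in each k[b]
    Returns [batch][depth_v].
    """
    outputs = []
    for q_b, k_b, n in zip(q, k, valid_length):
        rows = k_b[:n]  # keep only the rows before the padding
        scores = [scale * sum(a * b for a, b in zip(q_b, r)) for r in rows]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        denom = sum(weights)
        # V is again the first depth_v columns of the same K rows.
        outputs.append([
            sum(w * r[d] for w, r in zip(weights, rows)) / denom
            for d in range(depth_v)
        ])
    return outputs

q = [[1.0, 1.0], [1.0, 1.0]]
k = [
    [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]],        # 3 valid rows
    [[2.0, 0.0], [99.0, 99.0], [99.0, 99.0]],    # 1 valid row, 2 padding rows
]
# scale = 0.0 makes the softmax uniform, so each output is a plain mean
# over the valid rows only.
out = decode_batch_reference(q, k, valid_length=[3, 1], depth_v=1, scale=0.0)
```

With these inputs the padding values of 99.0 in the second batch entry have no effect on its output.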
flare_mla_decoding[rank: Int, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else 4), UInt(1), FlashAttentionAlgorithm()), decoding_warp_split_k: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: NDBuffer[type, rank, origin, shape, strides], mask_functor: mask_t, score_mod_functor: score_mod_t, scale: SIMD[float32, 1], ctx: DeviceContext, num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))