Mojo function
generic_fused_qk_rope_bshd_continuous_batch
generic_fused_qk_rope_bshd_continuous_batch[type: DType, //, *, interleaved: Bool, target: StringSlice[StaticConstantOrigin]](q_proj: NDBuffer[type, 4, origin, shape, strides], kv_collection: ContinuousBatchingKVCacheCollection[type_, kv_params_], freqs_cis: NDBuffer[type, 2, origin, shape, strides], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 4, origin, shape, strides], context: DeviceContextPtr = DeviceContextPtr())
Applies a fused rotary position embedding (RoPE) to the Q and K projections.
Our Llama model uses a manually fused QKV projection with mo.opaque types. Due to a limitation in custom op definitions, a custom kernel can't declare both a tensor and an opaque type as outputs, so the QKV projection declares only Q_proj as its output. If the QKV projection kernel were immediately followed by a separate RoPE kernel applied to K, there would be a race condition, because the graph definition gives the graph compiler no dependency between the two kernels. This function therefore fuses the RoPE applied to Q_proj with the RoPE applied to K_proj. Because the fused kernel takes Q_proj as an input, the K_proj RoPE executes only after the QKV projection completes.
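To make the rotation concrete, here is a minimal pure-Python sketch of RoPE applied to one Q vector and one K vector in a single call. It is not the Mojo kernel: the helper names (`rope_rotate`, `fused_qk_rope`, `freqs_for_position`), the base `theta = 10000.0`, and the reading of `interleaved` (adjacent pairs when true, first-half/second-half pairs when false) are assumptions made for illustration, and the sketch ignores the BSHD layout, batching, and the KV cache entirely.

```python
import math


def rope_rotate(vec, cos_sin, interleaved=True):
    """Rotate one head vector pair by pair, using per-pair (cos, sin) values."""
    half = len(vec) // 2
    out = list(vec)
    for i in range(half):
        # Assumed pairing: (2i, 2i+1) when interleaved, else (i, i + half).
        a, b = (2 * i, 2 * i + 1) if interleaved else (i, i + half)
        c, s = cos_sin[i]
        out[a] = vec[a] * c - vec[b] * s
        out[b] = vec[a] * s + vec[b] * c
    return out


def fused_qk_rope(q, k, cos_sin, interleaved=True):
    """Apply the same rotation to a Q vector and a K vector in one call."""
    return (
        rope_rotate(q, cos_sin, interleaved),
        rope_rotate(k, cos_sin, interleaved),
    )


def freqs_for_position(pos, head_dim, theta=10000.0):
    """Hypothetical stand-in for one row of freqs_cis: (cos, sin) per pair."""
    pairs = []
    for i in range(head_dim // 2):
        angle = pos * theta ** (-2.0 * i / head_dim)
        pairs.append((math.cos(angle), math.sin(angle)))
    return pairs


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, -1.0, 2.0, 0.0]
    q_rot, k_rot = fused_qk_rope(q, k, freqs_for_position(3, len(q)))
    print(q_rot)
    print(k_rot)
```

Handling Q and K in one function mirrors the point of the fused kernel: the K rotation cannot be scheduled separately from the Q rotation, so it inherits the dependency on Q_proj.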