
Fused-RoPE Attention with q_offset and k_offset #701

Open
qiyuxinlin opened this issue Dec 26, 2024 · 1 comment

Comments

@qiyuxinlin

As mentioned in your first article, I have a requirement to modify the KV cache. The best way to do this is with your Fused-RoPE Attention, but it can currently only be applied when the positions of Q and K are sequential. Studying your code, I noticed that prefill.cuh already contains interfaces for passing the positions of Q and K, and when I run them with a single batch they work correctly. I would like to ask whether these two interfaces are left unexposed because of any particular bug.
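For clarity, below is a minimal PyTorch sketch (not FlashInfer's API) of what "RoPE with explicit q_offset / k_offset" means: rotary embedding applied to Q and K at arbitrary, possibly non-sequential absolute positions before attention, rather than assuming positions 0..seq_len-1. The function name apply_rope and the interleaved even/odd layout are illustrative assumptions.

```python
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor, rope_theta: float = 1e4) -> torch.Tensor:
    """Apply rotary position embedding to x at the given absolute positions.

    x:         [seq_len, num_heads, head_dim] with head_dim even
    positions: [seq_len], absolute position of each token (need not be contiguous)
    """
    head_dim = x.shape[-1]
    # Per-dimension rotation frequencies (interleaved RoPE layout, assumed here).
    inv_freq = 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = positions.float()[:, None] * inv_freq[None, :]          # [seq_len, head_dim/2]
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]    # broadcast over heads
    x1, x2 = x[..., 0::2].float(), x[..., 1::2].float()
    out = torch.empty_like(x, dtype=torch.float32)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out.to(x.dtype)

# Queries start at an arbitrary offset (q_offset) and the keys sit at their own,
# non-sequential positions (k_offset), e.g. after the KV cache has been rearranged.
q = torch.randn(4, 8, 64)
k = torch.randn(16, 8, 64)
q_pos = torch.arange(100, 104)      # q_offset = 100
k_pos = torch.arange(0, 16) * 2     # non-contiguous key positions
q_rot, k_rot = apply_rope(q, q_pos), apply_rope(k, k_pos)
```

The fused kernel would perform this rotation inside attention itself; the point of the request is only to expose the per-query and per-key position arguments that make the positions non-sequential.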

@yzh119
Collaborator

yzh119 commented Dec 26, 2024

whether these two interfaces are left unexposed because of any particular bug.

It's only because I haven't had time to work on that... MLC-LLM uses the C++ APIs, but we haven't exposed them in Python.
We welcome contributions from the community :)

Added to roadmap: #675
