PagedAttention
PagedAttention is a memory management technique designed to improve GPU memory utilization during large language model (LLM) serving. Inspired by virtual memory and paging in operating systems, PagedAttention divides each sequence's key-value (KV) cache into fixed-size blocks that need not be stored contiguously in memory. A per-sequence block table maps logical blocks to physical ones, so memory is allocated on demand as a sequence grows instead of being reserved up front for the maximum possible length. This reduces the memory lost to fragmentation and over-reservation, which lets a server fit more concurrent requests or longer contexts into the same GPU memory. The technique was introduced in the paper "Efficient Memory Management for Large Language Model Serving with PagedAttention" (Kwon et al., 2023).
Also written as "paged attention."
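
The block-table idea can be illustrated with a small sketch. This is a toy model of the bookkeeping only, not the implementation from the paper or from any serving library: the names BlockAllocator, Sequence, and BLOCK_SIZE are hypothetical, the block size of 4 tokens is arbitrary, and no actual key or value tensors are stored.

```python
BLOCK_SIZE = 4  # tokens per KV block (arbitrary value for illustration)


class BlockAllocator:
    """Hands out physical block ids from a fixed-size pool (hypothetical helper)."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        return self.free_blocks.pop()

    def release(self, block_id):
        self.free_blocks.append(block_id)


class Sequence:
    """Tracks where one request's KV entries live, block by block."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is claimed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def locate(self, token_index):
        # Translate a token position into (physical block id, offset in block).
        if not 0 <= token_index < self.num_tokens:
            raise IndexError(token_index)
        return self.block_table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE

    def release_all(self):
        # Return every block to the pool once the request finishes.
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table = []
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
a, b = Sequence(allocator), Sequence(allocator)
for _ in range(4):
    a.append_token()   # fills a's first block
b.append_token()       # b claims the next free block
a.append_token()       # a's fifth token needs a second block
print(a.block_table)   # [7, 5]: logically adjacent, physically non-contiguous
print(a.locate(4))     # (5, 0)
```

In this sketch, two sequences that grow in an interleaved order end up with non-adjacent physical blocks, yet each one still sees a contiguous logical cache through its block table. At most one partially filled block per sequence is left unused, which is the source of the memory savings described above.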
