AI & ML Papers
🔥 Efficient Memory Management for Large Language Model Serving with PagedAttention
📅 Published on Sep 12, 2023
🔗 Links:
• arXiv: https://arxiv.org/abs/2309.06180
• PDF: https://arxiv.org/pdf/2309.06180
• GitHub: https://github.com/vllm-project/vllm ⭐ 79.0k
🤖 Models citing this paper:
• https://huggingface.co/theonlyengine/Flash-attention1
• https://huggingface.co/enfinity7B/apac
📊 Datasets citing this paper:
• https://huggingface.co/datasets/TheBlueScrubs/TheBlueScrubs-v1
🚀 Spaces citing this paper:
• https://huggingface.co/spaces/Vrushali777/vllm-inference-benchmark
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#LargeLanguageModels #EfficientMemoryManagement #PagedAttention #LanguageModelServing #KeyValueCacheOptimization
💡 The paper tackles memory management for large language model (LLM) serving, where high throughput depends on batching many requests at once. In existing systems the key-value (KV) cache for each request is large and grows and shrinks dynamically, so much of it is wasted through fragmentation and redundant duplication, which limits batch size. The authors propose PagedAttention, an attention algorithm inspired by classical virtual memory and paging techniques in operating systems. On top of it they build vLLM, an LLM serving system that achieves near-zero waste in KV cache memory and flexible sharing of the cache within and across requests. In their evaluation, vLLM improves the throughput of popular LLMs by 2-4× at the same level of latency compared with state-of-the-art systems, and the gain is larger with longer sequences, larger models, and more complex decoding algorithms.
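🧩 To make the paging idea concrete, here is a minimal Python sketch of a block table with reference-counted sharing. It is a toy illustration, not vLLM's actual code or API: the names BlockAllocator, Sequence, block_size, and wasted_slots are hypothetical, and it only tracks block bookkeeping, not real key/value tensors.

```python
from dataclasses import dataclass, field


class BlockAllocator:
    """Hands out fixed-size physical KV blocks and tracks reference counts."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.refcount = {}  # physical block id -> number of sequences using it

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        block = self.free_blocks.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> None:
        self.refcount[block] += 1

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free_blocks.append(block)


@dataclass
class Sequence:
    """One request; block_table maps logical block index -> physical block id."""

    block_size: int
    num_tokens: int = 0
    block_table: list = field(default_factory=list)

    def append_token(self, alloc: BlockAllocator) -> None:
        if self.num_tokens % self.block_size == 0:
            # Last block is full (or there is none yet): allocate on demand.
            self.block_table.append(alloc.allocate())
        else:
            last = self.block_table[-1]
            if alloc.refcount[last] > 1:
                # Copy-on-write: the partly filled block is shared, so this
                # sequence gets a private block (a real system would also
                # copy the cached keys and values into it).
                alloc.release(last)
                self.block_table[-1] = alloc.allocate()
        self.num_tokens += 1

    def fork(self, alloc: BlockAllocator) -> "Sequence":
        # A new sequence with the same prefix shares every existing block.
        for block in self.block_table:
            alloc.share(block)
        return Sequence(self.block_size, self.num_tokens, list(self.block_table))

    def free(self, alloc: BlockAllocator) -> None:
        for block in self.block_table:
            alloc.release(block)
        self.block_table.clear()
        self.num_tokens = 0

    def wasted_slots(self) -> int:
        # Unused slots exist only in the last, partly filled block.
        return len(self.block_table) * self.block_size - self.num_tokens


alloc = BlockAllocator(num_blocks=8)
parent = Sequence(block_size=4)
for _ in range(6):
    parent.append_token(alloc)
child = parent.fork(alloc)   # shares both of the parent's blocks
child.append_token(alloc)    # triggers copy-on-write on the last block
```

Because blocks are allocated only when the previous one fills up, the unused memory per sequence in this sketch is bounded by one block, and forked sequences pay for new memory only where they diverge.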