AI & ML Papers
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
🔥 Efficient Memory Management for Large Language Model Serving with PagedAttention

💡 Serving large language models at high throughput depends on managing memory efficiently. Existing systems struggle with the key-value (KV) cache, which is huge and grows and shrinks dynamically, so much of it is wasted through fragmentation and redundant duplication. The authors propose PagedAttention, an attention algorithm inspired by classical virtual memory and paging in operating systems. On top of it they build vLLM, an LLM serving system that achieves near-zero waste in KV cache memory and flexible sharing of the cache within and across requests. In their evaluation, vLLM improves the throughput of popular LLMs by 2-4× at the same level of latency compared with state-of-the-art systems. The gain is larger with longer sequences, larger models, and more complex decoding algorithms.
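
🧩 To make the paging idea concrete, here is a minimal Python sketch of a block table with reference-counted sharing and copy-on-write. It is a toy illustration, not vLLM's actual code: `BlockAllocator`, `Sequence`, `BLOCK_SIZE` and every method name are hypothetical, and no real tensors or attention kernels are involved. It only shows how allocating fixed-size blocks on demand caps waste at less than one block per sequence, and how two sequences can share the blocks holding a common prefix.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (hypothetical value, kept small for illustration)


class BlockAllocator:
    """Hands out fixed-size physical blocks and tracks how many sequences use each."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> None:
        self.refcount[block] += 1

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


@dataclass
class Sequence:
    allocator: BlockAllocator
    block_table: list[int] = field(default_factory=list)  # logical index -> physical block
    num_tokens: int = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # No block yet, or the last one is full: map a new block on demand.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.refcount[last] > 1:
                # Copy-on-write: the last block is shared, so take a private one.
                # (A real system would also copy the cached keys and values.)
                private = self.allocator.allocate()
                self.allocator.release(last)
                self.block_table[-1] = private
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # The child reuses every physical block of the parent; nothing is copied yet.
        for block in self.block_table:
            self.allocator.share(block)
        return Sequence(self.allocator, list(self.block_table), self.num_tokens)

    def release_all(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table.clear()
        self.num_tokens = 0

    def wasted_slots(self) -> int:
        # Only the tail of the last block can be unused.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens
```

In this sketch, a 6-token sequence occupies two blocks and wastes two slots. Forking it (for example, to sample a second completion from the same prompt) allocates nothing until the child writes into the shared last block, at which point only that one block is replaced with a private copy.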


📅 Published on Sep 12, 2023

🔗 Links:
• arXiv: https://arxiv.org/abs/2309.06180
• PDF: https://arxiv.org/pdf/2309.06180
• GitHub: https://github.com/vllm-project/vllm (79.0k stars)

🤖 Models citing this paper:
https://huggingface.co/theonlyengine/Flash-attention1
https://huggingface.co/enfinity7B/apac

📊 Datasets citing this paper:
https://huggingface.co/datasets/TheBlueScrubs/TheBlueScrubs-v1

🚀 Spaces citing this paper:
https://huggingface.co/spaces/Vrushali777/vllm-inference-benchmark

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LargeLanguageModels #EfficientMemoryManagement #PagedAttention #LanguageModelServing #KeyValueCacheOptimization