AI & ML Papers

🔥 LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

💡 The paper presents LMCache, an efficient key-value (KV) cache layer for enterprise-scale large language model (LLM) inference. KV caches have traditionally been stored in GPU memory, which limits their reuse across queries and across inference engines. As the total KV cache stored by users grows rapidly and exceeds GPU memory capacity, caches need to move off the GPU.

LMCache extracts the KV caches generated by modern LLM engines, stores them outside GPU memory, and shares them across engines and queries. It supports both cache offloading and prefill-decode disaggregation, which enables cache transfer across engines and GPUs. The key contributions are highly optimized KV cache data movement, a modular cache connector that decouples LMCache from the evolution of inference engines, and a control API for flexible cache orchestration across different layers.
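The idea of keeping KV caches outside the GPU and reusing them by prefix can be sketched roughly as follows. This is a toy illustration, not LMCache's actual API: `ToyKVStore`, `chunk_keys`, and the chunk size of 4 tokens are hypothetical names and values chosen for the example, and the "KV data" is just placeholder bytes.

```python
import hashlib
from typing import Dict, List

CHUNK = 4  # tokens per chunk; hypothetical value for illustration


def chunk_keys(tokens: List[int], chunk: int = CHUNK) -> List[str]:
    """Return one key per full chunk. Each key hashes the entire prefix
    up to and including that chunk, so equal keys imply equal prefixes."""
    keys = []
    h = hashlib.sha256()
    for end in range(chunk, len(tokens) + 1, chunk):
        h.update(repr(tokens[end - chunk:end]).encode())
        keys.append(h.copy().hexdigest())
    return keys


class ToyKVStore:
    """Stand-in for a CPU, disk, or remote tier holding KV chunks off-GPU."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def store(self, tokens: List[int], kv_chunks: List[bytes]) -> None:
        # Save each chunk's (placeholder) KV data under its prefix key.
        for key, kv in zip(chunk_keys(tokens), kv_chunks):
            self._data.setdefault(key, kv)

    def lookup(self, tokens: List[int]) -> int:
        """Return how many leading tokens already have stored KV data."""
        hit = 0
        for key in chunk_keys(tokens):
            if key not in self._data:
                break
            hit += CHUNK
        return hit


store = ToyKVStore()
store.store(list(range(12)), [b"kv0", b"kv1", b"kv2"])

# A second query sharing the first 8 tokens can skip prefill for them.
reused = store.lookup(list(range(8)) + [99, 98, 97, 96])
```

Because the store lives outside any single engine, a second engine instance holding a reference to the same store could call `lookup` and reuse the same chunks, which is the sharing behavior the summary describes.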

In the evaluation, combining LMCache with an LLM inference engine improves throughput by up to 15x. Enterprise adoption also yields practical insights, such as the benefit of fetching KV caches from remote storage and the effect of context truncation on the prefix cache hit ratio. LMCache is open source.
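One plausible way context truncation can affect the prefix cache hit ratio is shown in the small sketch below. It is an illustrative assumption, not a result or method taken from the paper: the helper names, the token counts, and the "drop the oldest tokens" truncation policy are all hypothetical.

```python
from typing import List


def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Length of the shared leading run of tokens between a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def truncate_left(tokens: List[int], max_len: int) -> List[int]:
    """Keep only the most recent max_len tokens (drops the oldest ones)."""
    if max_len <= 0:
        return []
    return tokens[-max_len:] if len(tokens) > max_len else tokens


# Round 1: 100 tokens of history whose KV cache is already stored.
cached = list(range(100))
# Round 2: the same history plus 20 new tokens.
next_turn = cached + list(range(100, 120))

# Without truncation, the cached history is a prefix of the new request.
full_hit_ratio = common_prefix_len(cached, next_turn) / len(next_turn)

# With a 100-token limit, the oldest 20 tokens are dropped, the request
# now starts at token 20, and it no longer shares a prefix with the cache.
truncated = truncate_left(next_turn, 100)
truncated_hit_ratio = common_prefix_len(cached, truncated) / len(truncated)
```

In this toy setup the untruncated request reuses 100 of 120 tokens, while the truncated one reuses none, because prefix matching requires the request to start with the same tokens as the cached entry.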

📅 Published on Oct 8, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2510.09665
• PDF: https://arxiv.org/pdf/2510.09665
• Project Page: https://huggingface.co/collections/dvps/dvps-scientific-watch

🤖 Models citing this paper:
• https://huggingface.co/enfinity7B/apac

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LargeLanguageModels #LLMInference #KVCacheOptimization #EnterpriseScaleAI #GPUAcceleratedInference
