AI & ML Papers

🔥 OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

💡 The paper introduces OScaR, a framework for compressing the Key-Value (KV) cache in large language models. The KV cache is a major memory bottleneck for efficient deployment. Existing per-channel quantization is limited by Token Norm Imbalance: quantization parameters are shared across tokens with different norms, which amplifies errors. To address this, OScaR combines Canalized Rotation and Omni-Token Scaling to reduce the impact of Token Norm Imbalance and make compression more accurate.

The method first applies Canalized Rotation to mitigate the sequence-dimensional variance caused by Token Norm Imbalance, then applies Omni-Token Scaling to further reduce the quantization error. An optimized system design and CUDA kernels keep the approach lightweight.
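The Token Norm Imbalance effect described above can be reproduced in a toy setting. The sketch below is not OScaR's Canalized Rotation or Omni-Token Scaling; it is a minimal NumPy illustration, on synthetic data, of why sharing per-channel quantization parameters across tokens with very different norms hurts. It uses plain per-token norm rescaling as a stand-in remedy. The function names, tensor shapes, and the 20x norm gap are all assumptions chosen for illustration.

```python
import numpy as np


def quantize_per_channel(x: np.ndarray, bits: int = 2) -> np.ndarray:
    """Quantize then dequantize x of shape (tokens, channels).

    One (scale, zero-point) pair per channel is shared by every token,
    which is the setting where Token Norm Imbalance causes trouble.
    """
    levels = 2 ** bits - 1
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    codes = np.clip(np.round((x - lo) / scale), 0, levels)
    return codes * scale + lo


def mean_token_error(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Average relative reconstruction error, computed per token."""
    num = np.linalg.norm(x - x_hat, axis=1)
    den = np.linalg.norm(x, axis=1)
    return float(np.mean(num / den))


# Synthetic "key cache": 256 tokens, 64 channels (hypothetical sizes).
rng = np.random.default_rng(0)
keys = rng.standard_normal((256, 64))
keys[:4] *= 20.0  # a few tokens with much larger norm than the rest

# Baseline: per-channel INT2 directly on the imbalanced tensor.
baseline = quantize_per_channel(keys)

# Stand-in remedy (an assumption, not the paper's method): equalize token
# norms before quantizing, then restore them after dequantizing.
norms = np.linalg.norm(keys, axis=1, keepdims=True)
rescaled = quantize_per_channel(keys / norms) * norms

err_plain = mean_token_error(keys, baseline)
err_scaled = mean_token_error(keys, rescaled)
print(f"per-channel INT2, raw tokens:      {err_plain:.3f}")
print(f"per-channel INT2, rescaled tokens: {err_scaled:.3f}")
```

In the baseline, the few high-norm tokens stretch every channel's quantization range, so ordinary tokens collapse onto coarse levels. Equalizing norms first keeps the shared range tight, at the cost of storing one extra scalar per token.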

The paper evaluates OScaR on text-only, multi-modal, and omni-modal large language models and reports that it consistently outperforms existing methods, reaching near-lossless performance under INT2 quantization. Compared to the baseline, OScaR achieves a 3.0x decoding speedup, a 5.3x smaller memory footprint, and 4.1x higher throughput. The authors present it as a robust, low-complexity, and universal framework for KV cache compression, and the code is publicly available.
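For a rough sense of why low-bit KV caches save memory, the back-of-the-envelope sketch below compares an FP16 cache with an INT2 cache that stores a scale and zero-point per group of values. The model dimensions, group size, and parameter size are hypothetical and not taken from the paper. The 5.3x figure above is a measured result and is not derived here.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits,
                   group_size=None, param_bytes=4):
    """Estimate KV cache size in bytes for one sequence.

    group_size: number of values sharing one (scale, zero-point) pair;
                None means no quantization metadata (e.g. plain FP16).
    param_bytes: bytes per group for metadata (assumed: two FP16 values).
    """
    elements = 2 * layers * kv_heads * head_dim * seq_len  # keys + values
    payload = elements * bits // 8
    overhead = 0 if group_size is None else (elements // group_size) * param_bytes
    return payload + overhead


# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=4096)

fp16 = kv_cache_bytes(**shape, bits=16)
int2 = kv_cache_bytes(**shape, bits=2, group_size=128)
ratio = fp16 / int2

print(f"FP16 cache: {fp16 / 2**20:.0f} MiB")
print(f"INT2 cache: {int2 / 2**20:.0f} MiB")
print(f"reduction:  {ratio:.2f}x")
```

The nominal 8x gain from 16 bits to 2 bits shrinks once quantization metadata is counted, which is one reason real-world reductions fall below the ideal ratio.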

📅 Published on May 19

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.19660
• PDF: https://arxiv.org/pdf/2605.19660
• Project Page: https://iridescent-gcrace.github.io/OScaR/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LLMCompression #KeyValueCacheQuantization #ExtremeQuantizationTechniques #TokenNormImbalance #EfficientLLMDeployment
