🔥 KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks
📅 Published on Jun 2
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.03458
• PDF: https://arxiv.org/pdf/2606.03458
• Project Page: https://github.com/huawei-csl/KVarN
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#KVCacheQuantization #AutoregressiveDecoding #LargeLanguageModels #ErrorAccumulationMitigation #QuantizationMethodsForReasoningTasks
💡 The paper introduces KVarN, a KV-cache quantization method for large language models that reduces error accumulation during autoregressive decoding. Test-time scaling improves reasoning in large language models, but long-horizon decoding becomes memory-bottlenecked as the KV-cache grows. Existing KV-cache quantization methods are not effective in this setting because they are evaluated under prefill-like conditions, where errors behave differently than in autoregressive decoding. During autoregressive decoding, quantization errors accumulate across timesteps, primarily because of incorrect token scales.
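To make the memory bottleneck concrete, here is a back-of-the-envelope sketch of KV-cache size at different bit widths. The model dimensions (32 layers, 8 KV heads, head dimension 128) and the 32,768-token horizon are hypothetical values chosen for illustration, not figures from the paper, and the estimate ignores the small overhead of stored scales.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits):
    """Approximate KV-cache size in bytes for one sequence.

    Counts one K and one V vector per layer, per KV head, per token.
    Quantization metadata (scales, zero points) is ignored.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len  # 2 = K and V
    return elements * bits // 8


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    shape = dict(seq_len=32_768, n_layers=32, n_kv_heads=8, head_dim=128)
    for bits in (16, 2):
        gib = kv_cache_bytes(bits=bits, **shape) / 2**30
        print(f"{bits:>2}-bit KV-cache: {gib:.1f} GiB")
```

Under these assumptions the cache shrinks from 4 GiB at 16-bit to 0.5 GiB at 2-bit per sequence, which is why low-bit quantization matters for long reasoning traces.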
KVarN addresses this by applying a Hadamard rotation followed by a dual-scaling variance normalization across both axes of the K and V matrices. This combination fixes outlying token-scale errors and substantially reduces error accumulation. The method is calibration-free: it requires no additional calibration step.
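The two steps described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not specify how the dual scaling is computed, so the alternating row/column RMS normalization, the iteration count, and the per-token min-max quantizer are all assumptions, and the names `hadamard`, `dual_scale_normalize`, `quantize_uniform`, and `kvarn_like_roundtrip` are hypothetical.

```python
import numpy as np


def hadamard(n):
    """Orthonormal n x n Hadamard matrix (Sylvester construction, n a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def dual_scale_normalize(X, iters=3, eps=1e-8):
    """Alternately rescale rows (tokens) and columns (channels) to unit RMS.

    Assumed scheme. Returns (Xn, row, col) such that
    X == row[:, None] * Xn * col[None, :] up to floating-point error.
    """
    Xn = X.astype(np.float64).copy()
    row = np.ones(X.shape[0])
    col = np.ones(X.shape[1])
    for _ in range(iters):
        r = np.sqrt(np.mean(Xn**2, axis=1)) + eps  # per-token scale
        Xn /= r[:, None]
        row *= r
        c = np.sqrt(np.mean(Xn**2, axis=0)) + eps  # per-channel scale
        Xn /= c[None, :]
        col *= c
    return Xn, row, col


def quantize_uniform(X, bits=2):
    """Per-token min-max uniform quantization, returned in dequantized form."""
    levels = 2**bits - 1
    lo = X.min(axis=1, keepdims=True)
    hi = X.max(axis=1, keepdims=True)
    step = np.maximum(hi - lo, 1e-12) / levels
    codes = np.round((X - lo) / step)  # integers in [0, levels]
    return codes * step + lo


def kvarn_like_roundtrip(K, bits=2):
    """Rotate, normalize on both axes, quantize, then undo scaling and rotation."""
    H = hadamard(K.shape[1])
    Xn, row, col = dual_scale_normalize(K @ H)
    Xq = quantize_uniform(Xn, bits)
    return (row[:, None] * Xq * col[None, :]) @ H.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = rng.standard_normal((32, 64))
    K[5] *= 50.0  # one token with an outlying scale
    for name, K_hat in [("plain 2-bit", quantize_uniform(K, 2)),
                        ("rotate + dual-scale 2-bit", kvarn_like_roundtrip(K, 2))]:
        rel = np.linalg.norm(K_hat - K) / np.linalg.norm(K)
        print(f"{name}: relative error {rel:.3f}")
```

Because the Hadamard matrix is orthonormal, the rotation is undone exactly by its transpose, and the row and column scales are stored separately so the normalization is also invertible; the only lossy step in this sketch is the rounding inside `quantize_uniform`.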
At 2-bit precision, KVarN establishes a new state of the art for KV-cache quantization on generative benchmarks, including MATH500, AIME24, and HumanEval, outperforming existing methods at that bit width. The method is also available for use in large language models, offering a practical way to limit error accumulation in autoregressive decoding and to improve the efficiency of these models on reasoning tasks.