AI & ML Papers
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
🔥 Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

💡 The paper introduces Orthrus, a dual-architecture framework that combines autoregressive large language models with diffusion models to generate tokens in parallel while preserving exact inference fidelity. Standard autoregressive decoding is sequential, which is a fundamental bottleneck for high-throughput inference. Diffusion language models address this with parallel generation, but they suffer from performance degradation, high training costs, and a lack of convergence guarantees.

Orthrus augments a frozen large language model with a lightweight trainable module, creating a parallel diffusion view alongside the standard autoregressive view. Both views attend to the same high-fidelity key-value (KV) cache: the autoregressive head prefills the context to build accurate KV representations, and the diffusion head generates tokens in parallel. An exact consensus mechanism between the two views guarantees lossless inference.
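
The post does not spell out how the consensus step works, so the following is only a minimal sketch under a stated assumption: consensus is modeled as greedy draft-and-verify, where a block proposed in parallel is accepted up to the first token the autoregressive view disagrees with. The functions `ar_next`, `draft_block`, and `consensus_decode` are hypothetical toy stand-ins, not the authors' code or API.

```python
from typing import List

VOCAB = 50  # toy vocabulary size (assumption for this sketch)


def ar_next(context: List[int]) -> int:
    """Toy stand-in for the frozen autoregressive view (greedy next token)."""
    return (sum(context) * 31 + len(context)) % VOCAB


def draft_block(context: List[int], k: int) -> List[int]:
    """Toy stand-in for the parallel diffusion view: proposes k tokens at once.

    It mostly agrees with ar_next but is deliberately wrong at some positions,
    so the verification path below is actually exercised.
    """
    out: List[int] = []
    ctx = list(context)
    for _ in range(k):
        tok = ar_next(ctx)
        if len(ctx) % 5 == 4:
            tok = (tok + 1) % VOCAB  # injected disagreement
        out.append(tok)
        ctx.append(tok)
    return out


def consensus_decode(prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Accept drafted tokens only while the autoregressive view agrees."""
    if k < 1:
        raise ValueError("k must be at least 1")
    seq = list(prompt)
    target = len(prompt) + n_new
    while len(seq) < target:
        accepted: List[int] = []
        # A real system would verify the whole block in one batched forward
        # pass over the shared KV cache; this loop is only for clarity.
        for tok in draft_block(seq, k):
            expected = ar_next(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # fall back to the AR token, stop
                break
            accepted.append(tok)
        seq.extend(accepted)
    return seq[:target]


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Plain sequential greedy decoding, used as the reference output."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(ar_next(seq))
    return seq
```

Under this assumption, every emitted token is one the autoregressive view would have chosen, so the output matches sequential greedy decoding exactly. Any speedup depends on how many drafted tokens are accepted per step.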

The results show a speedup of up to 7.8× with only a constant memory overhead for the cache and minimal added parameters. Because the two views share one KV cache and must agree through the consensus mechanism, parallel generation does not cost inference fidelity. Overall, Orthrus offers a simple and efficient answer to slow sequential decoding in autoregressive large language models, and it has the potential to be integrated seamlessly into existing transformer architectures.


📅 Published on May 12

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.12825
• PDF: https://arxiv.org/pdf/2605.12825

🤖 Models citing this paper:
https://huggingface.co/chiennv/Orthrus-Qwen3-8B
https://huggingface.co/chiennv/Orthrus-Qwen3-4B
https://huggingface.co/chiennv/Orthrus-Qwen3-1.7B

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#DiffusionLanguageModels #ParallelTokenGeneration #AutoregressiveDecoding #DualViewDiffusion #LargeLanguageModels
🔥 KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

💡 The paper introduces KVarN, a method for quantizing the KV cache in large language models that reduces error accumulation during autoregressive decoding. Test-time scaling improves reasoning in large language models, but long-horizon decoding becomes memory-bottlenecked as the KV cache grows. Existing KV-cache quantization methods fall short in this setting because they are evaluated under prefill-like conditions, where errors behave differently. During autoregressive decoding, quantization errors accumulate across timesteps, primarily because of incorrect token scales.

KVarN applies a Hadamard rotation followed by a dual-scaling variance normalization across both axes of the K and V matrices. Together, these two steps correct outlying token-scale errors and substantially reduce error accumulation. The method is calibration-free: it requires no additional calibration step.
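
To make the two steps concrete, here is a rough NumPy sketch of one way to read them. It is not the authors' implementation: the single-pass row-then-column normalization, the global min/max quantization grid, and the names `hadamard` and `quantize_dequantize` are all assumptions made for illustration.

```python
import numpy as np


def hadamard(d: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via the Sylvester construction."""
    if d < 1 or d & (d - 1):
        raise ValueError("d must be a power of two")
    H = np.ones((1, 1))
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)


def quantize_dequantize(X: np.ndarray, bits: int = 2, eps: float = 1e-8):
    """Rotate, normalize variance along both axes, quantize, then invert.

    X has shape (tokens, channels). Returns (reconstruction, integer codes).
    """
    H = hadamard(X.shape[1])
    Z = X @ H                                   # rotation spreads channel outliers

    row = Z.std(axis=1, keepdims=True) + eps    # per-token scale
    Z = Z / row
    col = Z.std(axis=0, keepdims=True) + eps    # per-channel scale
    Z = Z / col

    # Uniform grid over the normalized values (assumption: one global grid).
    levels = 2 ** bits - 1
    lo, hi = Z.min(), Z.max()
    step = (hi - lo) / levels
    if step == 0:
        step = 1.0
    codes = np.round((Z - lo) / step).astype(np.int64)  # values in [0, levels]

    # Undo the grid, both scales, and the rotation (H is orthonormal).
    Z_hat = codes * step + lo
    X_hat = (Z_hat * col * row) @ H.T
    return X_hat, codes


rng = np.random.default_rng(0)
K = rng.normal(size=(16, 8))
K[3] *= 25.0                                    # one token with an outlying scale
K_hat, codes = quantize_dequantize(K, bits=2)
```

In this sketch the token with the outlying scale is brought back to unit variance before quantization, so it no longer stretches the grid shared by the other tokens. That is the intuition behind scaling along both axes.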

The results show that KVarN sets a new state of the art for KV-cache quantization at 2-bit precision on generative benchmarks, including MATH500, AIME24, and HumanEval. In other words, it outperforms existing methods at that bit width. The method is also available for use in large language models, making it a practical remedy for error accumulation in autoregressive decoding and a way to improve the efficiency of these models on reasoning tasks.


📅 Published on Jun 2

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.03458
• PDF: https://arxiv.org/pdf/2606.03458
• Project Page: https://github.com/huawei-csl/KVarN

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#KVCacheQuantization #AutoregressiveDecoding #LargeLanguageModels #ErrorAccumulationMitigation #QuantizationMethodsForReasoningTasks