AI & ML Papers
🔥 Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
📅 Published on May 12
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.12825
• PDF: https://arxiv.org/pdf/2605.12825
🤖 Models citing this paper:
• https://huggingface.co/chiennv/Orthrus-Qwen3-8B
• https://huggingface.co/chiennv/Orthrus-Qwen3-4B
• https://huggingface.co/chiennv/Orthrus-Qwen3-1.7B
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#DiffusionLanguageModels #ParallelTokenGeneration #AutoregressiveDecoding #DualViewDiffusion #LargeLanguageModels
💡 The paper introduces Orthrus, a dual-architecture framework that combines the strengths of autoregressive large language models and diffusion models to generate tokens in parallel while preserving exact inference fidelity. Standard autoregressive decoding emits one token at a time, and this sequential dependency is a fundamental bottleneck for high-throughput inference. Diffusion language models address it with parallel generation, but they suffer from performance degradation, high training costs, and a lack of convergence guarantees.
Orthrus resolves this by augmenting a frozen large language model with a lightweight trainable module, which adds a parallel diffusion view alongside the standard autoregressive view. Both views attend to the same high-fidelity key-value (KV) cache: the autoregressive head prefills the context to build accurate KV representations, and the diffusion head generates tokens in parallel. An exact consensus mechanism between the two views guarantees lossless inference.
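The post does not spell out how the consensus step works, so the sketch below is only one plausible reading, not the paper's implementation: a drafting view proposes a block of tokens, the autoregressive view checks them, and only tokens that match what greedy autoregressive decoding would have produced are kept. All names here (`consensus_decode`, `ar_next`, `draft_block`, `toy_ar`, `toy_draft`) are hypothetical, the "models" are toy functions over integers, and the shared key-value cache is not modeled at all.

```python
from typing import Callable, List, Sequence

Token = int


def greedy_decode(prompt: Sequence[Token],
                  ar_next: Callable[[Sequence[Token]], Token],
                  max_new_tokens: int) -> List[Token]:
    """Reference: plain sequential decoding, one token per step."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(ar_next(out))
    return out[len(prompt):]


def consensus_decode(prompt: Sequence[Token],
                     ar_next: Callable[[Sequence[Token]], Token],
                     draft_block: Callable[[Sequence[Token], int], List[Token]],
                     block_size: int,
                     max_new_tokens: int) -> List[Token]:
    """Hypothetical draft-then-verify loop (an assumption, not the paper's code).

    Every emitted token is the one the autoregressive view picks, so the
    output matches greedy_decode exactly. The draft only decides how many
    tokens are accepted per round.
    """
    out = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(out) < target_len:
        k = min(block_size, target_len - len(out))
        proposal = draft_block(out, k)[:k]
        if not proposal:
            # No draft available: fall back to a single sequential step.
            out.append(ar_next(out))
            continue
        for tok in proposal:
            # A real system would score the whole block in one forward pass;
            # this toy version calls the autoregressive view position by position.
            expected = ar_next(out)
            out.append(expected)
            if tok != expected:
                # First disagreement: keep the corrected token, drop the rest.
                break
    return out[len(prompt):]


def toy_ar(ctx: Sequence[Token]) -> Token:
    """Toy stand-in for the autoregressive view."""
    return (ctx[-1] * 3 + 1) % 11


def toy_draft(ctx: Sequence[Token], k: int) -> List[Token]:
    """Toy stand-in for the parallel view: wrong at the third position of each block."""
    cur, block = list(ctx), []
    for i in range(k):
        tok = toy_ar(cur)
        if i == 2:
            tok = (tok + 1) % 11  # deliberate draft error
        block.append(tok)
        cur.append(tok)
    return block


if __name__ == "__main__":
    fast = consensus_decode([1], toy_ar, toy_draft, block_size=4, max_new_tokens=10)
    slow = greedy_decode([1], toy_ar, max_new_tokens=10)
    print(fast == slow)
```

The point of the sketch is the invariant rather than the speed: because a drafted token is only kept when the autoregressive view agrees with it, draft errors cost extra rounds but never change the output.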
The results show a speedup of up to 7.8× with only a constant cache memory overhead and minimal added parameters. Sharing key-value caches between the two views and reconciling them through the consensus mechanism is what lets the framework generate tokens in parallel without giving up exact inference fidelity. Overall, Orthrus offers a simple and efficient remedy for slow sequential decoding in autoregressive large language models, and it has the potential to integrate seamlessly into existing transformer architectures.