AI & ML Papers
🔥 DFlash: Block Diffusion for Flash Speculative Decoding
📅 Published on Feb 5
🔗 Links:
• arXiv: https://arxiv.org/abs/2602.06036
• PDF: https://arxiv.org/pdf/2602.06036
• Project Page: https://z-lab.ai/projects/dflash/
• GitHub: https://github.com/z-lab/dflash ⭐ 3.1k
🤖 Models citing this paper:
• https://huggingface.co/z-lab/Qwen3.6-27B-DFlash
• https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash
• https://huggingface.co/z-lab/Qwen3.5-27B-DFlash
🚀 Spaces citing this paper:
• https://huggingface.co/spaces/Jackrong/qwen36-eval
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#SpeculativeDecoding #BlockDiffusionModels #LargeLanguageModels #ParallelDecodingTechniques #FlashSpeculativeDecoding
💡 The paper introduces DFlash, a speculative decoding framework that speeds up large language model inference while preserving output quality. Autoregressive LLMs decode one token at a time, which leads to high latency and poor GPU utilization. Speculative decoding addresses this by having a fast draft model propose tokens that the target model then verifies in parallel. Existing speculative decoding methods, however, still draft sequentially, which limits their speedup.
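A minimal sketch of the draft-then-verify loop described above, using greedy verification. This is not the DFlash implementation: `toy_target`, `toy_draft_block`, and all other names are hypothetical stand-ins, and the verification loop is written token by token for readability, whereas a real system scores the whole draft in one batched target forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]
DraftBlock = Callable[[List[Token], int], List[Token]]


def verify_greedy(target_next: NextToken, prefix: List[Token], draft: List[Token]) -> List[Token]:
    """Accept the longest draft prefix the target agrees with, plus one target token."""
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, drop the rest
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # whole draft accepted: target adds one bonus token
    return accepted


def speculative_decode(target_next: NextToken, draft_block: DraftBlock,
                       prompt: List[Token], max_new: int, block_size: int) -> List[Token]:
    """Repeat draft -> verify cycles until max_new tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        draft = draft_block(out, block_size)
        out.extend(verify_greedy(target_next, out, draft))
    return out[: len(prompt) + max_new]


def greedy_decode(target_next: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Plain one-token-at-a-time decoding, used as the reference output."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target_next(out))
    return out


# Hypothetical toy models: the target counts upward modulo 10,
# and the draft guesses the same sequence but gets its last token wrong.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft_block(ctx: List[Token], k: int) -> List[Token]:
    block = [(ctx[-1] + i + 1) % 10 for i in range(k)]
    block[-1] = (block[-1] + 5) % 10  # deliberate draft error
    return block
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding with the target, which is what "lossless" means in this setting.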
To address this, the authors propose a lightweight block diffusion model for parallel drafting. The draft model generates a block of draft tokens in a single forward pass and is conditioned on context features extracted from the target model. The result is efficient drafting with high-quality drafts and higher acceptance rates.
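To see why single-pass drafting matters, here is a rough back-of-the-envelope cost model. It is a simplification under stated assumptions, not a formula from the paper: all costs are measured in units of one target decoding step, verifying a block is assumed to cost about one target step, and every number passed in below is hypothetical.

```python
def sequential_draft_cost(block_size: int, step_cost: float) -> float:
    """An autoregressive drafter pays one draft step per proposed token."""
    return block_size * step_cost


def parallel_draft_cost(pass_cost: float) -> float:
    """A block drafter pays one forward pass for the whole block."""
    return pass_cost


def cycle_speedup(accepted_per_cycle: float, draft_cost: float, verify_cost: float = 1.0) -> float:
    """Tokens produced per cycle divided by the cycle's cost in target-step units.

    Plain autoregressive decoding produces 1 token per unit of cost,
    so this ratio is the estimated speedup over it.
    """
    return accepted_per_cycle / (draft_cost + verify_cost)


# Hypothetical numbers, chosen only for illustration.
seq = cycle_speedup(accepted_per_cycle=4.0, draft_cost=sequential_draft_cost(8, 0.1))
par = cycle_speedup(accepted_per_cycle=4.0, draft_cost=parallel_draft_cost(0.1))
```

Under this model, with the same number of accepted tokens per cycle, the sequential drafter's cost grows with block size while the single-pass drafter's does not, so `par` comes out larger than `seq`.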
The experiments show that DFlash achieves over 6x lossless acceleration over autoregressive decoding across a range of models and tasks, which is up to 2.5x higher speedup than the state-of-the-art speculative decoding method. By cutting inference cost without sacrificing output quality, DFlash makes large language models more suitable for practical applications.