AI & ML Papers

🔥 UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

💡 The paper introduces UniVidX, a unified multimodal framework for versatile video generation built on video diffusion model (VDM) priors. Existing methods train a separate model for each task, which limits how well they can model correlations across modalities. UniVidX instead formulates pixel-aligned tasks as conditional generation in a shared multimodal space, so it can adapt to modality-specific distributions while preserving the native priors of the video diffusion model.

The framework consists of three key designs: Stochastic Condition Masking, Decoupled Gated LoRA, and Cross-Modal Self-Attention. Stochastic Condition Masking enables omni-directional conditional generation by randomly partitioning modalities into clean conditions and noisy targets during training. Decoupled Gated LoRA preserves the strong priors of the video diffusion model by introducing per-modality LoRAs that are activated when a modality serves as the generation target. Cross-Modal Self-Attention facilitates information exchange and inter-modal alignment by sharing keys and values across modalities while keeping modality-specific queries.
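To make the first two designs more concrete, here is a minimal sketch of how a training step might split modalities into conditions and targets, and how that split could drive the per-modality LoRA gates. It is not the authors' code: the function name `partition_modalities`, the modality names, and the uniform sampling of the target subset are all assumptions made for illustration. The summary only says that the partition is random and that a modality's LoRA is active when it is a generation target.

```python
import random


def partition_modalities(modalities, rng):
    """Randomly split modalities into clean conditions and noisy targets.

    Assumptions (not from the paper summary): at least one modality is
    always a target, and the target count is sampled uniformly.
    """
    names = list(modalities)
    # Pick a random non-empty subset to act as noisy generation targets.
    num_targets = rng.randint(1, len(names))
    target_set = set(rng.sample(names, num_targets))
    # Everything else stays clean and serves as conditioning input.
    conditions = [m for m in names if m not in target_set]
    targets = [m for m in names if m in target_set]
    # Hypothetical gate: a modality's LoRA is on only when it is a target.
    lora_gates = {m: (m in target_set) for m in names}
    return conditions, targets, lora_gates


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative modality names; the paper's exact set may differ.
    modalities = ["rgb", "albedo", "normal"]
    for step in range(3):
        conditions, targets, gates = partition_modalities(modalities, rng)
        print(f"step {step}: conditions={conditions} targets={targets}")
        print(f"         active LoRAs={[m for m, on in gates.items() if on]}")
```

Because any modality can land on either side of the split, one model sees every conditioning direction during training, which is the sense in which the summary calls the generation omni-directional.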

The authors instantiate UniVidX in two domains: UniVid-Intrinsic for RGB videos and intrinsic maps, and UniVid-Alpha for blended RGB videos and their constituent RGBA layers. Both models achieve performance competitive with state-of-the-art methods across distinct tasks and generalize robustly to in-the-wild scenarios, even though they are trained on fewer than 1,000 videos. Overall, UniVidX replaces separate per-task models with a single framework that models correlations across modalities.

📅 Published on May 1

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.00658
• PDF: https://arxiv.org/pdf/2605.00658
• Project Page: https://houyuanchen111.github.io/UniVidX.github.io/
• GitHub: https://github.com/houyuanchen111/UniVidX ⭐ 93

🤖 Models citing this paper:
• https://huggingface.co/houyuanchen/UniVidX

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalVideoGeneration #VideoDiffusionModels #ConditionalGeneration #CrossModalLearning #MultimodalFusionArchitectures

